diff --git a/.github/ISSUE_TEMPLATE/feature_request.md b/.github/ISSUE_TEMPLATE/feature_request.md index bbcbbe7d..f6413025 100644 --- a/.github/ISSUE_TEMPLATE/feature_request.md +++ b/.github/ISSUE_TEMPLATE/feature_request.md @@ -7,14 +7,14 @@ assignees: '' --- -**Is your feature request related to a problem? Please describe.** -A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] +**Describe your feature request** -**Describe the solution you'd like** -A clear and concise description of what you want to happen. -**Describe alternatives you've considered** -A clear and concise description of any alternative solutions or features you've considered. +**Describe the reference code or paper** + + +**Describe the possible solution** + **Additional context** Add any other context or screenshots about the feature request here. diff --git a/.gitignore b/.gitignore index a0745fe1..b1f2836b 100644 --- a/.gitignore +++ b/.gitignore @@ -128,13 +128,27 @@ dmypy.json # Pyre type checker .pyre/ -# segmentation -./segmentation/output -./segmentation/log +# Paddle checkpoints *.pdparams *.pdopt -./segmentation/pytorch_2_paddle.py -./segmentation/readme.txt -setr.py.bak - +# Segmentation +/semantic_segmentation/output +/semantic_segmentation/log +/semantic_segmentation/data +/semantic_segmentation/tmp +/semantic_segmentation/pytorch_2_paddle.py +/semantic_segmentation/config.ini +/semantic_segmentation/job.sh +/semantic_segmentation/run.sh + +# Image Classification +output/ + +# macOS +.DS_Store +*/.DS_Store + +# Editor Config +.idea/ +.vscode/ diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 359f5483..0492b0f2 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -1,12 +1,15 @@ +English | [简体中文](./CONTRIBUTING_cn.md) + # Contribute Code -You encourage and appreciate researchers and developers to contribute to project **PPViT**. +You encourage and appreciate researchers and developers to contribute to project **PaddleViT**. +To contribute to PaddlePaddle, you have to agree with the [PaddleViT Contributor License Agreement](https://cla-assistant.io/BR-IDL/PaddleViT). This document explains our workflow and working style. ## Workflow -PPViT uses this [Git branching model](http://nvie.com/posts/a-successful-git-branching-model/). You can follow the listed steps for common contributions. +PaddleViT uses this [Git branching model](http://nvie.com/posts/a-successful-git-branching-model/). You can follow the listed steps for common contributions. ### 1. Fork the repo @@ -52,7 +55,7 @@ PPViT uses this [Git branching model](http://nvie.com/posts/a-successful-git-bra An experienced Git user pulls from the official repo often -- daily or even hourly, so they notice conflicts with others work early, and it's easier to resolve smaller conflicts. ```bash - $ git remote add upstream https://github.com/xperzy/PPViT + $ git remote add upstream https://github.com/BR-IDL/PaddleViT $ git pull upstream develop ``` diff --git a/CONTRIBUTING_cn.md b/CONTRIBUTING_cn.md new file mode 100644 index 00000000..f2f8b818 --- /dev/null +++ b/CONTRIBUTING_cn.md @@ -0,0 +1,124 @@ +简体中文 | [English](./CONTRIBUTING.md) + +## 贡献代码 + +鼓励并感谢为**PaddleViT**项目提供贡献的研究人员和开发人员。 + +您需要同意[PaddleViT 参与者许可协议](https://cla-assistant.io/BR-IDL/PaddleViT),方可以参与PaddlePaddle贡献。 + +本文档描述了我们的工作流程和代码风格。 + +## 工作流程 + +PaddleViT 使用这个[Git 分支模型](http://nvie.com/posts/a-successful-git-branching-model/). 您可以按照以下步骤提交代码并参与贡献. + +### 1. Fork + + 请从您的fork中提交 `Pull Requests` . 
+ + 只需要前往我们的 GitHub repo 页面并点击 ["Fork"](https://help.github.com/articles/fork-a-repo/) 按钮. + +### 2. 克隆 (Clone) + + 将您的fork复制到本地: + + ```bash + $ git clone https://github.com/your-github-account/PPViT + $ cd PPViT + ``` + +### 3. 创建本地 `feature` 分支 + + 对于日常工作例如添加新功能或修复错误,请在编码之前基于`develop`分支创建一个 `feature` 分支: + + ```bash + $ git checkout develop + $ git checkout -b feature + ``` + 其中`feature` 可以替换为你正在处理的功能的名称. + +### 4. 提交 (Commit) + + `during and after` 您的更改,将代码提交到本地存储库. + + ```shell + $ git add -A + $ git commit -m “message” + ``` + +### 5. 测试 + + - 我们鼓励编写`unittest` 来测试你编写的类与方法的实现. + - 在开始合并之前,请在相关数据集上测试模型的性能。 + +### 6. 保持本地仓库最新 (Keep Pulling) + 在准备发起Pull Request之前,需要同步原仓库中最新的代码。 + + 有经验的Git用户会经常从官方存储库中pull数据--每天甚至每小时,因此他们会尽早注意到与其他人的工作冲突,并且更容易解决较小的冲突。 + + ```bash + $ git remote add upstream https://github.com/BR-IDL/PaddleViT + $ git pull upstream develop + ``` + +### 7. Push 以及 file a `Pull Request` + + 1. **Push** 您的本地工作到您的fork仓库中: + + ```bash + $ git push origin my-cool-stuff + ``` + > 其中,`my-cool-stuff`是您的分支名称 + + push操作允许您创建一个pull request,请求此 [official repo](https://github.com/BR-IDL/PaddleViT) 将您的更改拉入到官方库中. + + 2. 想要创建一个`Pull Request`, 请按照 [这些步骤](https://help.github.com/articles/creating-a-pull-request/). + + 如果您的更改是`fixing an issue`, 请在pull request的描述部分写下["Fixes "](https://help.github.com/articles/closing-issues-using-keywords/). 当合并您的 pull request时,Github将关闭该问题. + + 请记住为您的pull request指定审阅者. 如果您不知道正确的选择,请遵循Github的推荐. + +### 8. 删除本地和远程 `feature` 分支 + + 成功合并到`develop`分支后,删除您的`feature` 分支。 + 为了保持您的本地工作区和fork简洁,您可能想要删除合并的分支: + + ```bash + $ git push origin :my-cool-stuff + $ git checkout develop + $ git pull upstream develop + $ git branch -d my-cool-stuff + ``` + +## 代码审查 + +- 请随时通过 IM 或电子邮件来 ping 您的审阅者以发送您的pull request. + +- 请回答审阅者的每一条评论. 如果您要关注评论,请写“完成”;否则请给出理由。 + +- 如果您不希望您的审阅者被电子邮件通知淹没,可以通过 [批量](https://help.github.com/articles/reviewing-proposed-changes-in-a-pull-request/) 回复评论. + +- 减少非必要的提交. 存在一些开发人员经常提交,建议通过运行 `git commit --amend` 代替 `git commit`,将一系列小的变动附加到一个提交中. + +## Coding Standard + +### Code Style + +我们的Python代码遵循 [PEP8 language guide](https://zh-google-styleguide.readthedocs.io/en/latest/google-python-styleguide/python_language_rules/) 以及 [PEP8 style guide](https://zh-google-styleguide.readthedocs.io/en/latest/google-python-styleguide/python_style_rules/). + +### Use Pylint + +[Pylint](http://pylint.pycqa.org/en/latest/) 是一款 Python代码分析工具,可以分析Python代码中的错误,发现不符合编程标准或存在潜在问题的代码。 + +### Comments and Annotations + +为了让其他人更容易使用并生成在线文件,请在每个类方法的每个函数中包含文档的描述字符串。 + +### 单元测试 + +请记得添加相关的单元测试 + +- 对于 Python 代码, 请使用 [Python's standard `unittest` package](http://pythontesting.net/framework/unittest/unittest-introduction/). 
+ +尝试对每个类方法的每个函数都进行单元测试。 + diff --git a/PaddleViT.png b/PaddleViT.png index 99c80984..fd3ed508 100644 Binary files a/PaddleViT.png and b/PaddleViT.png differ diff --git a/README.md b/README.md index 1300a6a3..9cac6a5f 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,10 @@ +English | [简体中文](./README_cn.md) + # PaddlePaddle Vision Transformers # [![GitHub](https://img.shields.io/github/license/BR-IDL/PaddleViT?color=blue)](./LICENSE) +[![CodeFactor](https://www.codefactor.io/repository/github/br-idl/paddlevit/badge)](https://www.codefactor.io/repository/github/br-idl/paddlevit) +[![CLA assistant](https://cla-assistant.io/readme/badge/BR-IDL/PaddleViT)](https://cla-assistant.io/BR-IDL/PaddleViT) [![GitHub Repo stars](https://img.shields.io/github/stars/BR-IDL/PaddleViT?style=social)](https://github.com/BR-IDL/PaddleViT/stargazers) @@ -14,7 +18,7 @@ :robot: PaddleViT provides models and tools for multiple vision tasks, such as classifications, object detection, semantic segmentation, GAN, and more. Each model architecture is defined in standalone python module and can be modified to enable quick research experiments. At the same time, pretrained weights can be downloaded and used to finetune on your own datasets. PaddleViT also integrates popular tools and modules for custimized dataset, data preprocessing, performance metrics, DDP and more. -:robot: PaddleViT is backed by popular deep learning framework [PaddlePaddle](https://www.paddlepaddle.org/), we also provide tutorials and projects on [Paddle AI Studio](https://aistudio.baidu.com/aistudio/index). It's intuitive and straightforward to get started for new users. +:robot: PaddleViT is backed by popular deep learning framework [PaddlePaddle](https://www.paddlepaddle.org/), we also provide tutorials and projects on [Paddle AI Studio](https://aistudio.baidu.com/aistudio/course/introduce/25102). It's intuitive and straightforward to get started for new users. ## Quick Links ## @@ -26,9 +30,31 @@ PaddleViT implements model architectures and tools for multiple vision tasks, go We also provide tutorials: - Notebooks (coming soon) -- Online Course (coming soon) +- [Online Course](https://aistudio.baidu.com/aistudio/course/introduce/25102): on Paddle AIStudio (in Chinese) + +## Features ## +1. **State-of-the-art** + - State-of-the-art transformer models for multiple CV tasks + - State-of-the-art data processing and training methods + - We keep pushing it forward. + +2. **Easy-to-use tools** + - Easy configs for model variants + - Modular design for utility functions and tools + - Low barrier for educators and practitioners + - Unified framework for all the models + +3. **Easily customizable to your needs** + - Examples for each model to reproduce the results + - Model implementations are exposed for you to customize + - Model files can be used independently for quick experiments +4. **High Performance** + - DDP (multiprocess training/validation where each process runs on a single GPU). + - Mixed-precision support (AMP) + + ## Model architectures ## ### Image Classification (Transformers) ### @@ -43,31 +69,44 @@ We also provide tutorials: 8. **[Shuffle Transformer](./image_classification/Shuffle_Transformer)** (from Tencent), released with paper [Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer](https://arxiv.org/abs/2106.03650), by Zilong Huang, Youcheng Ben, Guozhong Luo, Pei Cheng, Gang Yu, Bin Fu. 9. 
**[T2T-ViT](./image_classification/T2T_ViT)** (from NUS and YITU), released with paper [Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet ](https://arxiv.org/abs/2101.11986), by Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zihang Jiang, Francis EH Tay, Jiashi Feng, Shuicheng Yan. - -#### Coming Soon: #### -1. **[CrossViT]()** (from IBM), released with paper [CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification](https://arxiv.org/abs/2103.14899), by Chun-Fu Chen, Quanfu Fan, Rameswar Panda. -2. **[Focal Transformer]()** (from Microsoft), released with paper [Focal Self-attention for Local-Global Interactions in Vision Transformers](https://arxiv.org/abs/2107.00641), by Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan and Jianfeng Gao. -3. **[HaloNet]()**, (from Google), released with paper [Scaling Local Self-Attention for Parameter Efficient Visual Backbones](https://arxiv.org/abs/2103.12731), by Ashish Vaswani, Prajit Ramachandran, Aravind Srinivas, Niki Parmar, Blake Hechtman, Jonathon Shlens. - - -### Image Classification (MLPs) ### +10. **[CrossViT](./image_classification/CrossViT)** (from IBM), released with paper [CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification](https://arxiv.org/abs/2103.14899), by Chun-Fu Chen, Quanfu Fan, Rameswar Panda. +11. **[BEiT](./image_classification/BEiT)** (from Microsoft Research), released with paper [BEiT: BERT Pre-Training of Image Transformers](https://arxiv.org/abs/2106.08254), by Hangbo Bao, Li Dong, Furu Wei. +12. **[Focal Transformer](./image_classification/Focal_Transformer)** (from Microsoft), released with paper [Focal Self-attention for Local-Global Interactions in Vision Transformers](https://arxiv.org/abs/2107.00641), by Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan and Jianfeng Gao. +13. **[Mobile-ViT](./image_classification/MobileViT)** (from Apple), released with paper [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178), by Sachin Mehta, Mohammad Rastegari. +14. **[ViP](./image_classification/ViP)** (from National University of Singapore), released with [Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition](https://arxiv.org/abs/2106.12368), by Qibin Hou and Zihang Jiang and Li Yuan and Ming-Ming Cheng and Shuicheng Yan and Jiashi Feng. +15. **[XCiT](./image_classification/XCiT)** (from Facebook/Inria/Sorbonne), released with paper [XCiT: Cross-Covariance Image Transformers](https://arxiv.org/abs/2106.09681), by Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, Hervé Jegou. +16. **[PiT](./image_classification/PiT)** (from NAVER/Sogan University), released with paper [Rethinking Spatial Dimensions of Vision Transformers](https://arxiv.org/abs/2103.16302), by Byeongho Heo, Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Junsuk Choe, Seong Joon Oh. +17. **[HaloNet](./image_classification/HaloNet)**, (from Google), released with paper [Scaling Local Self-Attention for Parameter Efficient Visual Backbones](https://arxiv.org/abs/2103.12731), by Ashish Vaswani, Prajit Ramachandran, Aravind Srinivas, Niki Parmar, Blake Hechtman, Jonathon Shlens. +18. 
**[PoolFormer](./image_classification/PoolFormer)**, (from Sea AI Lab/NUS), released with paper [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418), by Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, Shuicheng Yan. +19. **[BoTNet](./image_classification/BoTNet)**, (from UC Berkeley/Google), released with paper [Bottleneck Transformers for Visual Recognition](https://arxiv.org/abs/2101.11605), by Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, Ashish Vaswani. +20. **[CvT](./image_classification/CvT)** (from McGill/Microsoft), released with paper [CvT: Introducing Convolutions to Vision Transformers](https://arxiv.org/abs/2103.15808), by Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, Lei Zhang +21. **[HvT](./image_classification/HVT)** (from Monash University), released with paper [Scalable Vision Transformers with Hierarchical Pooling](https://arxiv.org/abs/2103.10619), by Zizheng Pan, Bohan Zhuang, Jing Liu, Haoyu He, Jianfei Cai. + + +### Image Classification (MLP & others) ### 1. **[MLP-Mixer](./image_classification/MLP-Mixer)** (from Google), released with paper [MLP-Mixer: An all-MLP Architecture for Vision](https://arxiv.org/abs/2105.01601), by Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy 2. **[ResMLP](./image_classification/ResMLP)** (from Facebook/Sorbonne/Inria/Valeo), released with paper [ResMLP: Feedforward networks for image classification with data-efficient training](https://arxiv.org/abs/2105.03404), by Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Gautier Izacard, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, Hervé Jégou. 3. **[gMLP](./image_classification/gMLP)** (from Google), released with paper [Pay Attention to MLPs](https://arxiv.org/abs/2105.08050), by Hanxiao Liu, Zihang Dai, David R. So, Quoc V. Le. +4. **[FF Only](./image_classification/FF_Only)** (from Oxford), released with paper [Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet](https://arxiv.org/abs/2105.02723), by Luke Melas-Kyriazi. +5. **[RepMLP](./image_classification/RepMLP)** (from BNRist/Tsinghua/MEGVII/Aberystwyth), released with paper [RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition](https://arxiv.org/abs/2105.01883), by Xiaohan Ding, Chunlong Xia, Xiangyu Zhang, Xiaojie Chu, Jungong Han, Guiguang Ding. +6. **[CycleMLP](./image_classification/CycleMLP)** (from HKU/SenseTime), released with paper [CycleMLP: A MLP-like Architecture for Dense Prediction](https://arxiv.org/abs/2107.10224), by Shoufa Chen, Enze Xie, Chongjian Ge, Ding Liang, Ping Luo. +7. **[ConvMixer](./image_classification/ConvMixer)** (from Anonymous), released with [Patches Are All You Need?](https://openreview.net/forum?id=TVHS5Y4dNvM), by Anonymous. +8. **[ConvMLP](./image_classification/ConvMLP)** (from UO/UIUC/PAIR), released with [ConvMLP: Hierarchical Convolutional MLPs for Vision](https://arxiv.org/abs/2109.04454), by Jiachen Li, Ali Hassani, Steven Walton, Humphrey Shi. - +#### *Coming Soon:* #### +1. 
**[DynamicViT]()** (from Tsinghua/UCLA/UW), released with paper [DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification](https://arxiv.org/abs/2106.02034), by Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, Cho-Jui Hsieh. ### Detection ### 1. **[DETR](./object_detection/DETR)** (from Facebook), released with paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872), by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko. +2. **[Swin Transformer](./object_detection/Swin)** (from Microsoft), released with paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030), by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo. +3. **[PVTv2](./object_detection/PVTv2)** (from NJU/HKU/NJUST/IIAI/SenseTime), released with paper [PVTv2: Improved Baselines with Pyramid Vision Transformer](https://arxiv.org/abs/2106.13797), by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao. #### Coming Soon: #### -1. **[Swin Transformer]()** (from Microsoft), released with paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030), by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo. -2. **[PVTv2]()** (from NJU/HKU/NJUST/IIAI/SenseTime), released with paper [PVTv2: Improved Baselines with Pyramid Vision Transformer](https://arxiv.org/abs/2106.13797), by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao. -3. **[Focal Transformer]()** (from Microsoft), released with paper [Focal Self-attention for Local-Global Interactions in Vision Transformers](https://arxiv.org/abs/2107.00641), by Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan and Jianfeng Gao. -4. **[UP-DETR]()** (from Tencent), released with paper [UP-DETR: Unsupervised Pre-training for Object Detection with Transformers](https://arxiv.org/abs/2011.09094), by Zhigang Dai, Bolun Cai, Yugeng Lin, Junying Chen. +1. **[Focal Transformer]()** (from Microsoft), released with paper [Focal Self-attention for Local-Global Interactions in Vision Transformers](https://arxiv.org/abs/2107.00641), by Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan and Jianfeng Gao. +2. **[UP-DETR]()** (from Tencent), released with paper [UP-DETR: Unsupervised Pre-training for Object Detection with Transformers](https://arxiv.org/abs/2011.09094), by Zhigang Dai, Bolun Cai, Yugeng Lin, Junying Chen. @@ -80,12 +119,12 @@ We also provide tutorials: 2. **[Segmenter](./semantic_segmentation)** (from Inria), realeased with paper [Segmenter: Transformer for Semantic Segmentation](https://arxiv.org/pdf/2105.05633.pdf), by Robin Strudel, Ricardo Garcia, Ivan Laptev, Cordelia Schmid. 3. **[Trans2seg](./semantic_segmentation)** (from HKU/Sensetime/NJU), released with paper [Segmenting Transparent Object in the Wild with Transformer](https://arxiv.org/pdf/2101.08461.pdf), by Enze Xie, Wenjia Wang, Wenhai Wang, Peize Sun, Hang Xu, Ding Liang, Ping Luo. 4. **[SegFormer](./semantic_segmentation)** (from HKU/NJU/NVIDIA/Caltech), released with paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203), by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo. +5. 
**[CSwin Transformer]()** (from USTC and Microsoft), released with paper [CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows](https://arxiv.org/abs/2107.00652), by Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, Baining Guo. #### Coming Soon: #### 1. **[FTN]()** (from Baidu), released with paper [Fully Transformer Networks for Semantic Image Segmentation](https://arxiv.org/pdf/2106.04108.pdf), by Sitong Wu, Tianyi Wu, Fangjian Lin, Shengwei Tian, Guodong Guo. 2. **[Shuffle Transformer]()** (from Tencent), released with paper [Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer](https://arxiv.org/abs/2106.03650), by Zilong Huang, Youcheng Ben, Guozhong Luo, Pei Cheng, Gang Yu, Bin Fu 3. **[Focal Transformer]()** (from Microsoft), released with paper [Focal Self-attention for Local-Global Interactions in Vision Transformers](https://arxiv.org/abs/2107.00641), by Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan and Jianfeng Gao. -4. **[CSwin Transformer]()** (from USTC and Microsoft), released with paper [CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows ](https://arxiv.org/abs/2107.00652), by Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, Baining Guo. @@ -103,6 +142,7 @@ We also provide tutorials: * Python 3.6/3.7 * PaddlePaddle 2.1.0+ * CUDA10.2+ +> Note: It is recommended to install the latest version of PaddlePaddle to avoid some CUDA errors for PaddleViT training. For PaddlePaddle, please refer to this [link](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html) for stable version installation and this [link](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/develop/install/pip/linux-pip.html#gpu) for develop version installation. ### Installation 1. Create a conda virtual environment and activate it. ```shell @@ -118,11 +158,18 @@ 3. 
Install dependency packages * General dependencies: ``` - pip install yacs, yaml + pip install yacs pyyaml ``` * Packages for Segmentation: ``` - pip install cityscapesScripts, detail + pip install cityscapesScripts + ``` + Install `detail` package: + ```shell + git clone https://github.com/ccvl/detail-api + cd detail-api/PythonAPI + make + make install ``` * Packages for GAN: ``` @@ -134,63 +181,208 @@ We also provide tutorials: ``` +## Results (Model Zoo) ## +### Image Classification ### +| Model | Acc@1 | Acc@5 | #Params | FLOPs | Image Size | Crop pct | Interp | Link | +|-------------------------------|-------|-------|---------|--------|------------|----------|---------------|--------------| +| vit_base_patch32_224 | 80.68 | 95.61 | 88.2M | 4.4G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1DPEhEuu9sDdcmOPukQbR7ZcHq2bxx9cr/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1ppOLj5SWlJmA-NjoLCoYIw)(ubyr) | +| vit_base_patch32_384 | 83.35 | 96.84 | 88.2M | 12.7G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1nCOSwrDiFBFmTkLEThYwjL9SfyzkKoaf/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1jxnL00ocpmdiPM4fOu4lpg)(3c2f) | +| vit_base_patch16_224 | 84.58 | 97.30 | 86.4M | 17.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/13D9FqU4ISsGxWXURgKW9eLOBV-pYPr-L/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1ms3o2fHMQpIoVqnEHitRtA)(qv4n) | +| vit_base_patch16_384 | 85.99 | 98.00 | 86.4M | 49.8G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1kWKaAgneDx0QsECxtf7EnUdUZej6vSFT/view?usp=sharing)/[baidu](https://pan.baidu.com/s/15ggLdiL98RPcz__SXorrXA)(wsum) | +| vit_large_patch16_224 | 85.81 | 97.82 | 304.1M | 59.9G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1jgwtmtp_cDWEhZE-FuWhs7lCdpqhAMft/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1HRxUJAwEiKgrWnJSjHyU0A)(1bgk) | +| vit_large_patch16_384 | 87.08 | 98.30 | 304.1M | 175.9G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1zfw5mdiIm-mPxxQddBFxt0xX-IR-PF2U/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1KvxfIpMeitgXAUZGr5HV8A)(5t91) | +| vit_large_patch32_384 | 81.51 | 96.09 | 306.5M | 44.4G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1Py1EX3E35jL7DComW-29Usg9788BB26j/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1W8sUs0pObOGpohP4vsT05w)(ieg3) | +| | | | | | | | | | +| swin_t_224 | 81.37 | 95.54 | 28.3M | 4.4G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1v_wzWv3TaQ0RKkKwRQwuDPzwpOb_jGEs/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1tbc751RVh3fIRsrLzrmeOw)(h2ac) | +| swin_s_224 | 83.21 | 96.32 | 49.6M | 8.6G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1lrODzr8zIOU9sBrH2x3zolMOS4mv4o7x/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1rlXL0tjLWbWnkIt_2Ne8Jw)(ydyx) | +| swin_b_224 | 83.60 | 96.46 | 87.7M | 15.3G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1hjEVODThNEDAlIqkg8C1KzUh3KsVNu6R/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1ucSHBiuiG2sHAmR1N1JENQ)(h4y6) | +| swin_b_384 | 84.48 | 96.89 | 87.7M | 45.5G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1szLgwhB6WJu02Me6Uyz94egk8SqKlNsd/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1t0oXbqKNwpUAMJV7VTzcNw)(7nym) | +| swin_b_224_22kto1k | 85.27 | 97.56 | 87.7M | 15.3G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1FhdlheMUlJzrZ7EQobpGRxd3jt3aQniU/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1KBocL_M6YNW1ZsK-GYFiNw)(6ur8) | +| 
swin_b_384_22kto1k | 86.43 | 98.07 | 87.7M | 45.5G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1zVwIrJmtuBSiSVQhUeblRQzCKx-yWNCA/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1NziwdsEJtmjfGCeUFgtZXA)(9squ) | +| swin_l_224_22kto1k | 86.32 | 97.90 | 196.4M | 34.3G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1yo7rkxKbQ4izy2pY5oQ5QAnkyv7zKcch/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1GsUJbSkGxlGsBYsayyKjVg)(nd2f) | +| swin_l_384_22kto1k | 87.14 | 98.23 | 196.4M | 100.9G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1-6DEvkb-FMz72MyKtq9vSPKYBqINxoKK/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1JLdS0aTl3I37oDzGKLFSqA)(5g5e) | +| | | | | | | | | | +| deit_tiny_distilled_224 | 74.52 | 91.90 | 5.9M | 1.1G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1fku9-11O_gQI7UpZTjagVeND-pcHbV0C/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1hAQ_85wWkqQ7sIGO1CmO9g)(rhda) | +| deit_small_distilled_224 | 81.17 | 95.41 | 22.4M | 4.3G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1RIeWTdf5o6pwkjqN4NbW91GZSOCalI5t/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1wCVrukvwxISAGGjorPw3iw)(pv28) | +| deit_base_distilled_224 | 83.32 | 96.49 | 87.2M | 17.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/12_x6-NN3Jde2BFUih4OM9NlTwe9-Xlkw/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1ZnmAWgT6ewe7Vl3Xw_csuA)(5f2g) | +| deit_base_distilled_384 | 85.43 | 97.33 | 87.2M | 49.9G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1i5H_zjSdHfM-Znv89DHTv9ChykWrIt8I/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1PQsQIci4VCHY7l2tCzMklg)(qgj2) | +| | | | | | | | | | +| volo_d1_224 | 84.12 | 96.78 | 26.6M | 6.6G | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1kNNtTh7MUWJpFSDe_7IoYTOpsZk5QSR9/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1EKlKl2oHi_24eaiES67Bgw)(xaim) | +| volo_d1_384 | 85.24 | 97.21 | 26.6M | 19.5G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1fku9-11O_gQI7UpZTjagVeND-pcHbV0C/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1qZWoFA7J89i2aujPItEdDQ)(rr7p) | +| volo_d2_224 | 85.11 | 97.19 | 58.6M | 13.7G | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1KjKzGpyPKq6ekmeEwttHlvOnQXqHK1we/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1JCK0iaYtiOZA6kn7e0wzUQ)(d82f) | +| volo_d2_384 | 86.04 | 97.57 | 58.6M | 40.7G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1uLLbvwNK8N0y6Wrq_Bo8vyBGSVhehVmq/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1e7H5aa6miGpCTCgpK0rm0w)(9cf3) | +| volo_d3_224 | 85.41 | 97.26 | 86.2M | 19.8G | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1OtOX7C29fJ20ESKQnYGevp4euxhmXKAT/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1vhARtV2wfI6EFf0Ap71xwg)(a5a4) | +| volo_d3_448 | 86.50 | 97.71 | 86.2M | 80.3G | 448 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1lHlYhra1NNp0dp4NWaQ9SMNNmw-AxBNZ/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1Q6KiQw4Vu1GPm5RF9_eycg)(uudu) | +| volo_d4_224 | 85.89 | 97.54 | 192.8M | 42.9G | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/16oXN7xuy-mkpfeD-loIVOK95PfptHhpX/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1PE83ZLd5evkKmHJ1V2KDsg)(vcf2) | +| volo_d4_448 | 86.70 | 97.85 | 192.8M | 172.5G | 448 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1N9-1OhPewA5TBR9CX5oA10obDS8e4Cfa/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1QoJ2Sqe1SK9hxbmV4uZiyg)(nd4n) | +| 
volo_d5_224 | 86.08 | 97.58 | 295.3M | 70.6G | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1fcrvOGbAmKUhqJT-pU3MVJZQJIe4Qina/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1nqDcXMW00v9PKr3RQI-g1w)(ymdg) | +| volo_d5_448 | 86.92 | 97.88 | 295.3M | 283.8G | 448 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1aFXEkpfLhmQlDQHUYCuFL8SobhxUzrZX/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1K4FBv6fnyMGcAXhyyybhgw)(qfcc) | +| volo_d5_512 | 87.05 | 97.97 | 295.3M | 371.3G | 512 | 1.15 | bicubic | [google](https://drive.google.com/file/d/1CS4-nv2c9FqOjMz7gdW5i9pguI79S6zk/view?usp=sharing)/[baidu](https://pan.baidu.com/s/16Wseyiqvv0MQJV8wwFDfSA)(353h) | +| | | | | | | | | | +| cswin_tiny_224 | 82.81 | 96.30 | 22.3M | 4.2G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1l-JY0u7NGyD6SjkyiyNnDx3wFFT1nAYO/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1L5FqU7ImWAhQHAlSilqVAw)(4q3h) | +| cswin_small_224 | 83.60 | 96.58 | 34.6M | 6.5G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/10eEBk3wvJdQ8Dy58LvQ11Wk1K2UfPy-E/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1FiaNiWyAuWu1IBsUFLUaAw)(gt1a) | +| cswin_base_224 | 84.23 | 96.91 | 77.4M | 14.6G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1YufKh3DKol4-HrF-I22uiorXSZDIXJmZ/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1koy8hXyGwvgAfUxdlkWofg)(wj8p) | +| cswin_base_384 | 85.51 | 97.48 | 77.4M | 43.1G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1qCaFItzFoTYBo-4UbGzL6M5qVDGmJt4y/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1WNkY7o_vP9KJ8cd5c7n2sQ)(rkf5) | +| cswin_large_224 | 86.52 | 97.99 | 173.3M | 32.5G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1V1hteGK27t1nI84Ac7jdWfydBLLo7Fxt/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1KgIX6btML6kPiPGkIzvyVA)(b5fs) | +| cswin_large_384 | 87.49 | 98.35 | 173.3M | 96.1G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1LRN_6qUz71yP-OAOpN4Lscb8fkUytMic/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1eCIpegPj1HIbJccPMaAsew)(6235) | +| | | | | | | | | | +| cait_xxs24_224 | 78.38 | 94.32 | 11.9M | 2.2G | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1LKsQUr824oY4E42QeUEaFt41I8xHNseR/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1YIaBLopKIK5_p7NlgWHpGA)(j9m8) | +| cait_xxs36_224 | 79.75 | 94.88 | 17.2M | 33.1G | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1zZx4aQJPJElEjN5yejUNsocPsgnd_3tS/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1pdyFreRRXUn0yPel00-62Q)(nebg) | +| cait_xxs24_384 | 80.97 | 95.64 | 11.9M | 6.8G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1J27ipknh_kwqYwR0qOqE9Pj3_bTcTx95/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1uYSDzROqCVT7UdShRiiDYg)(2j95) | +| cait_xxs36_384 | 82.20 | 96.15 | 17.2M | 10.1G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/13IvgI3QrJDixZouvvLWVkPY0J6j0VYwL/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1GafA8B6T3h_vtmNNq2HYKg)(wx5d) | +| cait_s24_224 | 83.45 | 96.57 | 46.8M | 8.7G | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1sdCxEw328yfPJArf6Zwrvok-91gh7PhS/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1BPsAMEcrjtnbOnVDQwZJYw)(m4pn) | +| cait_xs24_384 | 84.06 | 96.89 | 26.5M | 15.1G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1zKL6cZwqmvuRMci-17FlKk-lA-W4RVte/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1w10DPJvK8EwhOCm-tZUpww)(scsv) | +| cait_s24_384 | 85.05 | 97.34 | 46.8M | 
26.5G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1klqBDhJDgw28omaOpgzInMmfeuDa7NAi/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1-aNO6c7Ipm9x1hJY6N6G2g)(dnp7) | +| cait_s36_384 | 85.45 | 97.48 | 68.1M | 39.5G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1m-55HryznHbiUxG38J2rAa01BYcjxsRZ/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1-uWg-JHLEKeMukFFctoufg)(e3ui) | +| cait_m36_384 | 86.06 | 97.73 | 270.7M | 156.2G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1WJjaGiONX80KBHB3YN8mNeusPs3uDhR2/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1aZ9bEU5AycmmfmHAqZIaLA)(r4hu) | +| cait_m48_448 | 86.49 | 97.75 | 355.8M | 287.3G | 448 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1lJSP__dVERBNFnp7im-1xM3s_lqEe82-/view?usp=sharing)/[baidu](https://pan.baidu.com/s/179MA3MkG2qxFle0K944Gkg)(imk5) | +| | | | | | | | | | +| pvtv2_b0 | 70.47 | 90.16 | 3.7M | 0.6G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1wkx4un6y7V87Rp_ZlD4_pV63QRst-1AE/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1mab4dOtBB-HsdzFJYrvgjA)(dxgb) | +| pvtv2_b1 | 78.70 | 94.49 | 14.0M | 2.1G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/11hqLxL2MTSnKPb-gp2eMZLAzT6q2UsmG/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1Ur0s4SEOxVqggmgq6AM-sQ)(2e5m) | +| pvtv2_b2 | 82.02 | 95.99 | 25.4M | 4.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1-KY6NbS3Y3gCaPaUam0v_Xlk1fT-N1Mz/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1FWx0QB7_8_ikrPIOlL7ung)(are2) | +| pvtv2_b2_linear | 82.06 | 96.04 | 22.6M | 3.9G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1hC8wE_XanMPi0_y9apEBKzNc4acZW5Uy/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1IAhiiaJPe-Lg1Qjxp2p30w)(a4c8) | +| pvtv2_b3 | 83.14 | 96.47 | 45.2M | 6.8G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/16yYV8x7aKssGYmdE-YP99GMg4NKGR5j1/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1ge0rBsCqIcpIjrVxsrFhnw)(nc21) | +| pvtv2_b4 | 83.61 | 96.69 | 62.6M | 10.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1gvPdvDeq0VchOUuriTnnGUKh0N2lj-fA/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1VMSD_Kr_hduCZ5dxmDbLoA)(tthf) | +| pvtv2_b5 | 83.77 | 96.61 | 82.0M | 11.5G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1OHaHiHN_AjsGYBN2gxFcQCDhBbTvZ02g/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1ey4agxI2Nb0F6iaaX3zAbA)(9v6n) | +| | | | | | | | | | +| shuffle_vit_tiny | 82.39 | 96.05 | 28.5M | 4.6G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1ffJ-tG_CGVXztPEPQMaT_lUoc4hxFy__/view?usp=sharing)/[baidu](https://pan.baidu.com/s/19DhlLIFyPGOWtyq_c83ZGQ)(8a1i) | +| shuffle_vit_small | 83.53 | 96.57 | 50.1M | 8.8G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1du9H0SKr0QH9GQjhWDOXOnhpSVpfbb8X/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1rM2J8BVwxQ3kRZoHngwNZA)(xwh3) | +| shuffle_vit_base | 83.95 | 96.91 | 88.4M | 15.5G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1sYh808AyTG3-_qv6nfN6gCmyagsNAE6q/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1fks_IYDdnXdAkCFuYHW_Nw)(1gsr) | +| | | | | | | | | | +| t2t_vit_7 | 71.68 | 90.89 | 4.3M | 1.0G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1YkuPs1ku7B_udydOf_ls1LQvpJDg_c_j/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1jVNsz37gatLCDaOoU3NaMA)(1hpa) | +| t2t_vit_10 | 75.15 | 92.80 | 5.8M | 1.3G | 224 | 0.9 | bicubic | 
[google](https://drive.google.com/file/d/1H--55RxliMDlOCekn7FpKrHDGsUkyrJZ/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1nbdb4PFMq4nsIp8HrNxLQg)(ixug) | +| t2t_vit_12 | 76.48 | 93.49 | 6.9M | 1.5G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1stnIwOwaescaEcztaF1QjI4NK4jaqN7P/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1DcMzq9WeSwrS3epv6jKJXw)(qpbb) | +| t2t_vit_14 | 81.50 | 95.67 | 21.5M | 4.4G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1HSvN3Csgsy7SJbxJYbkzjUx9guftkfZ1/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1wcfh22uopBv7pS7rKcH_iw)(c2u8) | +| t2t_vit_19 | 81.93 | 95.74 | 39.1M | 7.8G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1eFnhaL6I33pHCQw2BaEE0Oet9CnjmUf_/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1Hpyc5hBYo1zqoXWpryegnw)(4in3) | +| t2t_vit_24 | 82.28 | 95.89 | 64.0M | 12.8G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1Z7nZCHeFp0AhIkGYcMAFkKdkGN0yXtpv/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1Hpyc5hBYo1zqoXWpryegnw)(4in3) | +| t2t_vit_t_14 | 81.69 | 95.85 | 21.5M | 4.4G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/16li4voStt_B8eWDXqJt7s20OT_Z8L263/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1Hpyc5hBYo1zqoXWpryegnw)(4in3) | +| t2t_vit_t_19 | 82.44 | 96.08 | 39.1M | 7.9G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1Ty-42SYOu15Nk8Uo6VRTJ7J0JV_6t7zJ/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1YdQd6l8tj5xMCWvcHWm7sg)(mier) | +| t2t_vit_t_24 | 82.55 | 96.07 | 64.0M | 12.9G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1cvvXrGr2buB8Np2WlVL7n_F1_CnI1qow/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1BMU3KX_TRmPxQ1jN5cmWhg)(6vxc) | +| t2t_vit_14_384 | 83.34 | 96.50 | 21.5M | 13.0G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1Yuso8WD7Q8Lu_9I8dTvAvkcXXtPSkmnm/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1AOMhyVRF9zPqJe-lTrd7pw)(r685) | +| | | | | | | | | | +| cross_vit_tiny_224 | 73.20 | 91.90 | 6.9M | 1.3G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1ILTVwQtetcb_hdRjki2ZbR26p-8j5LUp/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1byeUsM34_gFL0jVr5P5GAw)(scvb) | +| cross_vit_small_224 | 81.01 | 95.33 | 26.7M | 5.2G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1ViOJiwbOxTbk1V2Go7PlCbDbWPbjWPJH/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1I9CrpdPU_D5LniqIVBoIPQ)(32us) | +| cross_vit_base_224 | 82.12 | 95.87 | 104.7M | 20.2G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1vTorkc63O4JE9cYUMHBRxFMDOFoC-iK7/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1TR_aBHQ2n1J0RgHFoVh_bw)(jj2q) | +| cross_vit_9_224 | 73.78 | 91.93 | 8.5M | 1.6G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1UCX9_mJSx2kDAmEd_xDXyd4e6-Mg3RPf/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1M8r5vqMHJ-rFwBoW1uL2qQ)(mjcb) | +| cross_vit_15_224 | 81.51 | 95.72 | 27.4M | 5.2G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1HwkLWdz6A3Nz-dVbw4ZUcCkxUbPXgHwM/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1wiO_Gjk4fvSq08Ud8xKwVw)(n55b) | +| cross_vit_18_224 | 82.29 | 96.00 | 43.1M | 8.3G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1C4b_a_6ia8NCEXSUEMDdCEFzedr0RB_m/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1w7VJ7DNqq6APuY7PdlKEjA)(xese) | +| cross_vit_9_dagger_224 | 76.92 | 93.61 | 8.7M | 1.7G | 224 | 0.875 | bicubic | 
[google](https://drive.google.com/file/d/1_cXQ0M8Hr9UyugZk07DrsBl8dwwCA6br/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1F1tRSaG4EfCV_WiTEwXxBw)(58ah) | +| cross_vit_15_dagger_224 | 82.23 | 95.93 | 28.1M | 5.6G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1cCgBoozh2WFtSz42LwEUUPPyC5KmkAFg/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1xJ4P2zy3r9RcNFSMtzvZgg)(qwup) | +| cross_vit_18_dagger_224 | 82.51 | 96.03 | 44.1M | 8.7G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1sdAbWxKL5k3QIo1zdgHzasIOtpy_Ogpw/view?usp=sharing)/[baidu](https://pan.baidu.com/s/15qYHgt0iRxdhtXoC_ct2Jg)(qtw4) | +| cross_vit_15_dagger_384 | 83.75 | 96.75 | 28.1M | 16.4G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/12LQjYbs9-LyrY1YeRt46x9BTB3NJuhpJ/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1d-BAm03azLP_CyEHF3c7ZQ)(w71e) | +| cross_vit_18_dagger_384 | 84.17 | 96.82 | 44.1M | 25.8G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1CeGwB6Tv0oL8QtL0d7Ar-d02Lg_PqACr/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1l_6PTldZ3IDB7XWgjM6LhA)(99b6) | +| | | | | | | | | | +| beit_base_patch16_224_pt22k | 85.21 | 97.66 | 87M | 12.7G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1lq5NeQRDHkIQi7U61OidaLhNsXTWfh_Z/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1pjblqaESqfXVrpgo58oR6Q)(fshn) | +| beit_base_patch16_384_pt22k | 86.81 | 98.14 | 87M | 37.3G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1wn2NS7kUdlERkzWEDeyZKmcRbmWL7TR2/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1WVbNjxuIUh514pKAgZZEzg)(arvc) | +| beit_large_patch16_224_pt22k | 87.48 | 98.30 | 304M | 45.0G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/11OR1FKxzfafqT7GzTW225nIQjxmGSbCm/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1bvhERVXN2TyRcRJFzg7sIA)(2ya2) | +| beit_large_patch16_384_pt22k | 88.40 | 98.60 | 304M | 131.7G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/10EraafYS8CRpEshxClOmE2S1eFCULF1Y/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1H76G2CGLY3YmmYt4-suoRA)(qtrn) | +| beit_large_patch16_512_pt22k | 88.60 | 98.66 | 304M | 234.0G | 512 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1xIIocftsB1PcDHZttPqLdrJ-G4Tyfrs-/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1WtTVK_Wvg-izaF0M6Gzw-Q)(567v) | +| | | | | | | | | | +| Focal-T | 82.03 | 95.86 | 28.9M | 4.9G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1HzZJbYH_eIo94h0wLUhqTyJ6AYthNKRh/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1JCr2qIA-SZvTqbTO-m2OwA)(i8c2) | +| Focal-T (use conv) | 82.70 | 96.14 | 30.8M | 4.9G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1PS0-gdXHGl95LqH5k5DG62AH6D3i7v0D/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1tVztox4bVJuJEjkD1fLaHQ)(smrk) | +| Focal-S | 83.55 | 96.29 | 51.1M | 9.4G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1HnVAYsI_hmiomyS4Ax3ccPE7gk4mlTU8/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1b7uugAY9RhrgTkUwYcvvow)(dwd8) | +| Focal-S (use conv) | 83.85 | 96.47 | 53.1M | 9.4G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1vcHjYiGNMayoSTPoM8z39XRH6h89TB9V/view?usp=sharing)/[baidu](https://pan.baidu.com/s/174a2aZzCEt3teLuAnIzMtA)(nr7n) | +| Focal-B | 83.98 | 96.48 | 89.8M | 16.4G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1bNMegxetWpwZNcmDEC3MHCal6SNXSgWR/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1piBslNhxWR78aQJIdoZjEw)(8akn) | +| Focal-B 
(use conv) | 84.18 | 96.61 | 93.3M | 16.4G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1-J2gDnKrvZGtasvsAYozrbMXR2LtIJ43/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1GTLfnTlt6I6drPdfSWB1Iw)(5nfi) | +| | | | | | | | | | +| mobilevit_xxs | 70.31| 89.68 | 1.32M | 0.44G | 256 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1l3L-_TxS3QisRUIb8ohcv318vrnrHnWA/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1KFZ5G834_-XXN33W67k8eg)(axpc) | +| mobilevit_xs | 74.47| 92.02 | 2.33M | 0.95G | 256 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1oRMA4pNs2Ba0LYDbPufC842tO4OFcgwq/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1IP8S-S6ZAkiL0OEsiBWNkw)(hfhm) | +| mobilevit_s | 76.74| 93.08 | 5.59M | 1.88G | 256 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1ibkhsswGYWvZwIRjwfgNA4-Oo2stKi0m/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1-rI6hiCHZaI7os2siFASNg)(34bg) | +| mobilevit_s $\dag$ | 77.83| 93.83 | 5.59M | 1.88G | 256 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1BztBJ5jzmqgDWfQk-FB_ywDWqyZYu2yG/view?usp=sharing)/[baidu](https://pan.baidu.com/s/19YepMAO-sveBOLA4aSjIEQ?pwd=92ic)(92ic) | +| | | | | | | | | | +| vip_s7 | 81.50 | 95.76 | 25.1M | 7.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/16bZkqzbnN08_o15k3MzbegK8SBwfQAHF/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1uY0FsNPYaM8cr3ZCdAoVkQ)(mh9b) | +| vip_m7 | 82.75 | 96.05 | 55.3M | 16.4G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/11lvT2OXW0CVGPZdF9dNjY_uaEIMYrmNu/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1j3V0Q40iSqOY15bTKlFFRw)(hvm8) | +| vip_l7 | 83.18 | 96.37 | 87.8M | 24.5G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1bK08JorLPMjYUep_TnFPKGs0e1j0UBKJ/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1I5hnv3wHWEaG3vpDqaNL-w)(tjvh) | +| | | | | | | | | | +| xcit_nano_12_p16_224_dist | 72.32 | 90.86 | 0.6G | 3.1M | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/14FsYtm48JB-rQFF9CanJsJaPESniWD7q/view?usp=sharing)/[baidu](https://pan.baidu.com/s/15kdY4vzwU2QiBSU5127AYA)(7qvz) | +| xcit_nano_12_p16_384_dist | 75.46 | 92.70 | 1.6G | 3.1M | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1zR-hFQryocF9muG-erzcxFuJme5y_e9f/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1449qtQzEMg6lqdtClyiCRQ)(1y2j) | +| xcit_large_24_p16_224_dist | 84.92 | 97.13 | 35.9G | 189.1M | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1lAtko_KwOagjwaFvUkeXirVClXCV8gt-/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1Gs401mXqG1bifi1hBdXtig)(kfv8) | +| xcit_large_24_p16_384_dist | 85.76 | 97.54 | 105.5G | 189.1M | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/15djnKz_-eooncvyZp_UTwOiHIm1Hxo_G/view?usp=sharing)/[baidu](https://pan.baidu.com/s/14583hbtIVbZ_2ifZepQItQ)(ffq3) | +| xcit_nano_12_p8_224_dist | 76.33 | 93.10 | 2.2G | 3.0M | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1XxRNjskLvSVp6lvhlsnylq6g7vd_5MsI/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1DZJxuahFJyz-rEEsCqhhrA)(jjs7) | +| xcit_nano_12_p8_384_dist | 77.82 | 94.04 | 6.3G | 3.0M | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1P3ln8JqLzMKbJAhCanRbu7i5NMPVFNec/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1ECY9-PVDMNSup8NMQiqBrw)(dmc1) | +| xcit_large_24_p8_224_dist | 85.40 | 97.40 | 141.4G | 188.9M | 224 | 1.0 | bicubic | 
[google](https://drive.google.com/file/d/14ZoDxEez5NKVNAsbgjTPisjOQEAA30Wy/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1D_zyvjzIVFp6iqx1s7IEbA)(y7gw) | +| xcit_large_24_p8_384_dist | 85.99 | 97.69 | 415.5G | 188.9M | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1stcUwwFNJ38mdaFsNXq24CBMmDenJ_e4/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1lwbBk7GFuqnnP_iU2OuDRw)(9xww) | +| | | | | | | | | | +| pit_ti | 72.91 | 91.40 | 4.8M | 0.5G | 224 | 0.9 | bicubic |[google](https://drive.google.com/file/d/1bbeqzlR_CFB8CAyTUN52p2q6ii8rt0AW/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1Yrq5Q16MolPYHQsT_9P1mw)(ydmi) | +| pit_ti_distill | 74.54 | 92.10 | 5.1M | 0.5G | 224 | 0.9 | bicubic |[google](https://drive.google.com/file/d/1m4L0OVI0sYh8vCv37WhqCumRSHJaizqX/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1RIM9NGq6pwfNN7GJ5WZg2w)(7k4s) | +| pit_xs | 78.18 | 94.16 | 10.5M | 1.1G | 224 | 0.9 | bicubic |[google](https://drive.google.com/file/d/1qoMQ-pmqLRQmvAwZurIbpvgMK8MOEgqJ/view?usp=sharing)/[baidu](https://pan.baidu.com/s/15d7ep05vI2UoKvL09Zf_wg)(gytu) | +| pit_xs_distill | 79.31 | 94.36 | 10.9M | 1.1G | 224 | 0.9 | bicubic |[google](https://drive.google.com/file/d/1EfHOIiTJOR-nRWE5AsnJMsPCncPHEgl8/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1DqlgVF7U5qHfGD3QJAad4A)(ie7s) | +| pit_s | 81.08 | 95.33 | 23.4M | 2.4G | 224 | 0.9 | bicubic |[google](https://drive.google.com/file/d/1TDSybTrwQpcFf9PgCIhGX1t-f_oak66W/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1Vk-W1INskQq7J5Qs4yphCg)(kt1n) | +| pit_s_distill | 81.99 | 95.79 | 24.0M | 2.5G | 224 | 0.9 | bicubic |[google](https://drive.google.com/file/d/1U3VPP6We1vIaX-M3sZuHmFhCQBI9g_dL/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1L7rdWmMW8tiGkduqmak9Fw)(hhyc) | +| pit_b | 82.44 | 95.71 | 73.5M | 10.6G | 224 | 0.9 | bicubic |[google](https://drive.google.com/file/d/1-NBZ9-83nZ52jQ4DNZAIj8Xv6oh54nx-/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1XRDPY4OxFlDfl8RMQ56rEg)(uh2v) | +| pit_b_distill | 84.14 | 96.86 | 74.5M | 10.7G | 224 | 0.9 | bicubic |[google](https://drive.google.com/file/d/12Yi4eWDQxArhgQb96RXkNWjRoCsDyNo9/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1vJOUGXPtvC0abg-jnS4Krw)(3e6g) | +| | | | | | | | | | +| halonet26t | 79.10 | 94.31 | 12.5M | 3.2G | 256 | 0.95 | bicubic |[google](https://drive.google.com/file/d/1F_a1brftXXnPM39c30NYe32La9YZQ0mW/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1FSlSTuYMpwPJpi4Yz2nCTA)(ednv) | +| halonet50ts | 81.65 | 95.61 | 22.8M | 5.1G | 256 | 0.94 | bicubic |[google](https://drive.google.com/file/d/12t85kJcPA377XePw6smch--ELMBo6p0Y/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1X4LM-sqoTKG7CrM5BNjcdA)(3j9e) | +| | | | | | | | | | +| poolformer_s12 | 77.24 | 93.51 | 11.9M | 1.8G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/15EBfTTU6coLCsDNiLgAWYiWeMpp3uYH4/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1n6TUxQGlssTu4lyLrBOXEw)(zcv4) | +| poolformer_s24 | 80.33 | 95.05 | 21.3M | 3.4G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1JxqJluDpp1wwe7XtpTi1aWaVvlq0Q3xF/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1d2uyHB5R6ZWPzXWhdtm6fw)(nedr) | +| poolformer_s36 | 81.43 | 95.45 | 30.8M | 5.0G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1ka3VeupDRFBSzzrcw4wHXKGqoKv6sB_Y/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1de6ZJkmYEmVI7zKUCMB_xw)(fvpm) | +| poolformer_m36 | 82.11 | 95.69 | 56.1M | 8.9G | 224 | 0.95 | bicubic | 
[google](https://drive.google.com/file/d/1LTZ8wNRb_GSrJ9H3qt5-iGiGlwa4dGAK/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1qNTYLw4vyuoH1EKDXEcSvw)(whfp) | +| poolformer_m48 | 82.46 | 95.96 | 73.4M | 11.8G | 224 | 0.95 | bicubic | [google](https://drive.google.com/file/d/1YhXEVjWtI4bZB_Qwama8G4RBanq2K15L/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1VJXANTseTUEA0E6HYf-XyA)(374f) | +| | | | | | | | | | +| botnet50 | 77.38 | 93.56 | 20.9M | 5.3G | 224 | 0.875 | bicubic |[google](https://drive.google.com/file/d/1S4nxgRkElT3K4lMx2JclPevmP3YUHNLw/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1CW40ShBJQYeFgdBIZZLSjg)(wh13) +| | | | | | | | | | +| CvT-13-224 | 81.59 | 95.67 | 20M | 4.5G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1r0fnHn1bRPmN0mi8RwAPXmD4utDyOxEf/view?usp=sharing)/[baidu](https://pan.baidu.com/s/13xNwCGpdJ5MVUi369OGl5Q)(vev9) | +| CvT-21-224 | 82.46 | 96.00 | 32M | 7.1G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/18s7nRfvcmNdbRuEpTQe02AQE3Y9UWVQC/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1mOjbMNoQb7X3VJD3LV0Hhg)(t2rv) | +| CvT-13-384 | 83.00 | 96.36 | 20M | 16.3G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1J0YYPUsiXSqyExBPtOPrOLL9c16syllg/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1upITRr5lNHLjbBJtIr-jdg)(wswt) | +| CvT-21-384 | 83.27 | 96.16 | 32M | 24.9G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1tpXv_yYXtvyArlYi7AFcHUOqemhyMWHW/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1hXKi3Kb7mNxPFVmR6cdkMg)(hcem) | +| CvT-13-384-22k | 83.26 | 97.09 | 20M | 16.3G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/18djrvq422u1pGLPxNfWAp6d17F7C5lbP/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1YYv5rKPmroxKCnzkesUr0g)(c7m9) | +| CvT-21-384-22k | 84.91 | 97.62 | 32M | 24.9G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1NVXd7vxVoRpL-21GN7nGn0-Ut0L0Owp8/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1N3xNU6XFHb1CdEOrnjKuoA)(9jxe) | +| CvT-w24-384-22k | 87.58 | 98.47 | 277M | 193.2G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1M3bg46N4SGtupK8FcvAOE0jltOwP5yja/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1MNJurm8juHRGG9SAw3IOkg)(bbj2) | +| | | | | | | | | | +| HVT-Ti-1 | 69.45 | 89.28 | 5.7M | 0.6G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/11BW-qLBMu_1TDAavlrAbfVlXB53dgm42/view?usp=sharing)/[baidu](https://pan.baidu.com/s/16rZvJqL-UVuWFsCDuxFDqg?pwd=egds)(egds) | +| HVT-S-0 | 80.30 | 95.15 | 22.0M | 4.6G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1GlJ2j2QVFye1tAQoUJlgKTR_KELq3mSa/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1L-tjDxkQx00jg7BsDClabA?pwd=hj7a)(hj7a) | +| HVT-S-1 | 78.06 | 93.84 | 22.1M | 2.4G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/16H33zNIpNrHBP1YhCq4zmLjRYQJ0XEmX/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1quOsgVuxTcauISQ3SehysQ?pwd=tva8)(tva8) | +| HVT-S-2 | 77.41 | 93.48 | 22.1M | 1.9G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1U14LA7SXJtFep_SdUCjAV-cDOQ9A_OFk/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1nooWTBzaXyBtEgadn9VDmw?pwd=bajp)(bajp) | +| HVT-S-3 | 76.30 | 92.88 | 22.1M | 1.6G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1m1CjOcZfPMLDRyX4QBgMhHV1m6rtu44v/view?usp=sharing)/[baidu](https://pan.baidu.com/s/15sAOmQN6Hx0GLelYDuMQXw?pwd=rjch)(rjch) | +| HVT-S-4 | 75.21 | 92.34 | 22.1M | 1.6G | 224 | 0.875 | bicubic | 
[google](https://drive.google.com/file/d/14comGo9lO12dUeGGL52MuIJWZPSit7I0/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1o31hMRWR7FTCjUk7_fAOgA?pwd=ki4j)(ki4j) | +| | | | | | | | | | +| | | | | | | | | | +| mlp_mixer_b16_224 | 76.60 | 92.23 | 60.0M | 12.7G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1ZcQEH92sEPvYuDc6eYZgssK5UjYomzUD/view?usp=sharing)/[baidu](https://pan.baidu.com/s/12nZaWGMOXwrCMOIBfUuUMA)(xh8x) | +| mlp_mixer_l16_224 | 72.06 | 87.67 | 208.2M | 44.9G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1mkmvqo5K7JuvqGm92a-AdycXIcsv1rdg/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1AmSVpwCaGR9Vjsj_boL7GA)(8q7r) | +| | | | | | | | | | +| resmlp_24_224 | 79.38 | 94.55 | 30.0M | 6.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/15A5q1XSXBz-y1AcXhy_XaDymLLj2s2Tn/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1nLAvyG53REdwYNCLmp4yBA)(jdcx) | +| resmlp_36_224 | 79.77 | 94.89 | 44.7M | 9.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1WrhVm-7EKnLmPU18Xm0C7uIqrg-RwqZL/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1QD4EWmM9b2u1r8LsnV6rUA)(33w3) | +| resmlp_big_24_224 | 81.04 | 95.02 | 129.1M | 100.7G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1KLlFuzYb17tC5Mmue3dfyr2L_q4xHTZi/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1oXU6CR0z7O0XNwu_UdZv_w)(r9kb) | +| resmlp_12_distilled_224 | 77.95 | 93.56 | 15.3M | 3.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1cDMpAtCB0pPv6F-VUwvgwAaYtmP8IfRw/view?usp=sharing)/[baidu](https://pan.baidu.com/s/15kJeZ_V1MMjTX9f1DBCgnw)(ghyp) | +| resmlp_24_distilled_224 | 80.76 | 95.22 | 30.0M | 6.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/15d892ExqR1sIAjEn-cWGlljX54C3vihA/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1NgQtSwuAwsVVOB8U6N4Aqw)(sxnx) | +| resmlp_36_distilled_224 | 81.15 | 95.48 | 44.7M | 9.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1Laqz1oDg-kPh6eb6bekQqnE0m-JXeiep/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1p1xGOJbMzH_RWEj36ruQiw)(vt85) | +| resmlp_big_24_distilled_224 | 83.59 | 96.65 | 129.1M | 100.7G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/199q0MN_BlQh9-HbB28RdxHj1ApMTHow-/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1yUrfbqW8vLODDiRV5WWkhQ)(4jk5) | +| resmlp_big_24_22k_224 | 84.40 | 97.11 | 129.1M | 100.7G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1zATKq1ruAI_kX49iqJOl-qomjm9il1LC/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1VrnRMbzzZBmLiR45YwICmA)(ve7i) | +| | | | | | | | | | +| gmlp_s16_224 | 79.64 | 94.63 | 19.4M | 4.5G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1TLypFly7aW0oXzEHfeDSz2Va4RHPRqe5/view?usp=sharing)/[baidu](https://pan.baidu.com/s/13UUz1eGIKyqyhtwedKLUMA)(bcth) | +| | | | | | | | | | +| ff_only_tiny (linear_tiny) | 61.28 | 84.06 | | | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/14bPRCwuY_nT852fBZxb9wzXzbPWNfbCG/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1nNE4Hh1Nrzl7FEiyaZutDA)(mjgd) | +| ff_only_base (linear_base) | 74.82 | 91.71 | | | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1DHUg4oCi41ELazPCvYxCFeShPXE4wU3p/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1l-h6Cq4B8kZRvHKDTzhhUg)(m1jc) | +| | | | | | | | | | +| repmlp_res50_light_224 | 77.01 | 93.46 | 87.1M | 3.3G | 224 | 0.875 | bicubic | 
[google](https://drive.google.com/file/d/16bCFa-nc_-tPVol-UCczrrDO_bCFf2uM/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1bzmpS6qJJTsOq3SQE7IOyg)(b4fg) | +| | | | | | | | | | +| cyclemlp_b1 | 78.85 | 94.60 | 15.1M | | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/10WQenRy9lfOJF4xEHc9Mekp4zHRh0mJ_/view?usp=sharing)/[baidu](https://pan.baidu.com/s/11UQp1RkWBsZFOqit_uU80w)(mnbr) | +| cyclemlp_b2 | 81.58 | 95.81 | 26.8M | | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1dtQHCwtxNh9jgiHivN5iYpHe7uKRUjhk/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1Js-Oq5vyiB7oPagn43cn3Q)(jwj9) | +| cyclemlp_b3 | 82.42 | 96.07 | 38.3M | | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/11kMq112tAwVE5llJIepIIixz74AjaJhU/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1b7cau1yPxqATA8X7t2DXkw)(v2fy) | +| cyclemlp_b4 | 82.96 | 96.33 | 51.8M | | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1vwJ0eD9Ic-NvLvCz1zEAmn7RxBMtd_v2/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1P3TlnXRFGWj9nVP5xBGGWQ)(fnqd) | +| cyclemlp_b5 | 83.25 | 96.44 | 75.7M | | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/12_I4cfOBfp7kC0RvmnMXFqrSxww6plRW/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1-Cka1tNqGUQutkAP3VZXzQ)(s55c) | +| | | | | | | | | | +| convmixer_1024_20 | 76.94 | 93.35 | 24.5M | 9.5G | 224 | 0.96 | bicubic | [google](https://drive.google.com/file/d/1R7zUSl6_6NFFdNOe8tTfoR9VYQtGfD7F/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1DgGA3qYu4deH4woAkvjaBw)(qpn9) | +| convmixer_768_32 | 80.16 | 95.08 | 21.2M | 20.8G | 224 | 0.96 | bicubic | [google](https://drive.google.com/file/d/196Lg_Eet-hRj733BYASj22g51wdyaW2a/view?usp=sharing)/[baidu](https://pan.baidu.com/s/17CbRNzY2Sy_Cu7cxNAkWmQ)(m5s5) | +| convmixer_1536_20 | 81.37 | 95.62 | 51.8M | 72.4G | 224 | 0.96 | bicubic | [google](https://drive.google.com/file/d/1-LlAlADiu0SXDQmE34GN2GBhqI-RYRqO/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1R-gSzhzQNfkuZVxsaE4vEw)(xqty) | +| | | | | | | | | | +| convmlp_s | 76.76 | 93.40 | 9.0M | 2.4G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1D8kWVfQxOyyktqDixaZoGXB3wVspzjlc/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1WseHYALFB4Of3Dajmlt45g)(3jz3) | +| convmlp_m | 79.03 | 94.53 | 17.4M | 4.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1TqVlKHq-WRdT9KDoUpW3vNJTIRZvix_m/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1koipCAffG6REUyLYk0rGAQ)(vyp1) | +| convmlp_l | 80.15 | 95.00 | 42.7M | 10.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1KXxYogDh6lD3QGRtFBoX5agfz81RDN3l/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1f1aEeVoySzImI89gkjcaOA)(ne5x) | +| | | | | | | | | | + + + + + + -### Docker Install ### -(coming soon) -## Results (Ported Weights) ## -### Image Classification ### -| Model | Acc@1 | Acc@5 | Image Size | Crop_pct | Interpolation | Link | -|--------------------------------|-------|-------|------------|----------|---------------|--------------| -| vit_base_patch16_224 | 84.58 | 97.30 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/13D9FqU4ISsGxWXURgKW9eLOBV-pYPr-L/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1ms3o2fHMQpIoVqnEHitRtA)(qv4n) | -| vit_base_patch16_384 | 85.99 | 98.00 | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1kWKaAgneDx0QsECxtf7EnUdUZej6vSFT/view?usp=sharing)/[baidu](https://pan.baidu.com/s/15ggLdiL98RPcz__SXorrXA)(wsum) | -| vit_large_patch16_224 | 85.81 | 97.82 | 224 | 0.875 | 
bicubic | [google](https://drive.google.com/file/d/1jgwtmtp_cDWEhZE-FuWhs7lCdpqhAMft/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1HRxUJAwEiKgrWnJSjHyU0A)(1bgk) | -| swin_base_patch4_window7_224 | 85.27 | 97.56 | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1yjZFJoJeDFIfsxh9x10XGqCb8s2-Gtbp/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1AseY3CKmJvlxoSwXnxHEwA)(wyck) | -| swin_base_patch4_window12_384 | 86.43 | 98.07 | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1ThmGsTDZ8217-Zuo9o5EGLfzw8AI6N0w/view?usp=sharing)/[baidu](https://pan.baidu.com/s/10E3F9jqBeBTcasIvJ8iMzg)(4a95) | -| swin_large_patch4_window12_384 | 87.14 | 98.23 | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1f30Mt80g5yLfEiViT4-kMLpyDjTUTV5B/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1w5w8QNfg0zY3nSfGs-Tx3A)(j71u) | -| pvtv2_b0 | 70.47 | 90.16 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1wkx4un6y7V87Rp_ZlD4_pV63QRst-1AE/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1mab4dOtBB-HsdzFJYrvgjA)(dxgb) | -| pvtv2_b1 | 78.70 | 94.49 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/11hqLxL2MTSnKPb-gp2eMZLAzT6q2UsmG/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1Ur0s4SEOxVqggmgq6AM-sQ)(2e5m) | -| pvtv2_b2 | 82.02 | 95.99 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1-KY6NbS3Y3gCaPaUam0v_Xlk1fT-N1Mz/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1FWx0QB7_8_ikrPIOlL7ung)(are2) | -| pvtv2_b3 | 83.14 | 96.47 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/16yYV8x7aKssGYmdE-YP99GMg4NKGR5j1/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1ge0rBsCqIcpIjrVxsrFhnw)(nc21) | -| pvtv2_b4 | 83.61 | 96.69 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1gvPdvDeq0VchOUuriTnnGUKh0N2lj-fA/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1VMSD_Kr_hduCZ5dxmDbLoA)(tthf) | -| pvtv2_b5 | 83.77 | 96.61 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1OHaHiHN_AjsGYBN2gxFcQCDhBbTvZ02g/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1ey4agxI2Nb0F6iaaX3zAbA)(9v6n) | -| pvtv2_b2_linear | 82.06 | 96.04 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1hC8wE_XanMPi0_y9apEBKzNc4acZW5Uy/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1IAhiiaJPe-Lg1Qjxp2p30w)(a4c8) | -| mlp_mixer_b16_224 | 76.60 | 92.23 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1ZcQEH92sEPvYuDc6eYZgssK5UjYomzUD/view?usp=sharing)/[baidu](https://pan.baidu.com/s/12nZaWGMOXwrCMOIBfUuUMA)(xh8x) | -| mlp_mixer_l16_224 | 72.06 | 87.67 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1mkmvqo5K7JuvqGm92a-AdycXIcsv1rdg/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1AmSVpwCaGR9Vjsj_boL7GA)(8q7r) | -| resmlp_24_224 | 79.38 | 94.55 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/15A5q1XSXBz-y1AcXhy_XaDymLLj2s2Tn/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1nLAvyG53REdwYNCLmp4yBA)(jdcx) | -| resmlp_36_224 | 79.77 | 94.89 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1WrhVm-7EKnLmPU18Xm0C7uIqrg-RwqZL/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1QD4EWmM9b2u1r8LsnV6rUA)(33w3) | -| resmlp_big_24_224 | 81.04 | 95.02 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1KLlFuzYb17tC5Mmue3dfyr2L_q4xHTZi/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1oXU6CR0z7O0XNwu_UdZv_w)(r9kb) | -| resmlp_big_24_distilled_224 | 83.59 | 96.65 | 224 | 0.875 | bicubic | 
[google](https://drive.google.com/file/d/199q0MN_BlQh9-HbB28RdxHj1ApMTHow-/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1yUrfbqW8vLODDiRV5WWkhQ)(4jk5) | -| gmlp_s16_224 | 79.64 | 94.63 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1TLypFly7aW0oXzEHfeDSz2Va4RHPRqe5/view?usp=sharing)/[baidu](https://pan.baidu.com/s/13UUz1eGIKyqyhtwedKLUMA)(bcth) | -| volo_d5_224_86.10 | 86.08 | 97.58 | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1GBOBPCBJYZfWybK-Xp0Otn0N4NXpct0G/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1t9gPLRAOkdXaG55fVADQZg)(td49) | -| volo_d5_512_87.07 | 87.05 | 97.97 | 512 | 1.15 | bicubic | [google](https://drive.google.com/file/d/1Phf_wHsjRZ1QrZ8oFrqsYuhDr4TXrVkc/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1X-WjpNqvWva2M977jgHosg)(irik) | -| cait_xxs24_224 | 78.38 | 94.32 | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1LKsQUr824oY4E42QeUEaFt41I8xHNseR/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1YIaBLopKIK5_p7NlgWHpGA)(j9m8) | -| cait_s24_384 | 85.05 | 97.34 | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1GU0esukDvMg3u40FZB_5GiB6qpShjvGh/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1qvhNckJjcEf5HyVn8LuEeA)(qb86) | -| cait_m48_448 | 86.49 | 97.75 | 448 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1lJSP__dVERBNFnp7im-1xM3s_lqEe82-/view?usp=sharing)/[baidu](https://pan.baidu.com/s/179MA3MkG2qxFle0K944Gkg)(imk5) | -| deit_base_distilled_patch16_224| 83.32 | 96.49 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/12_x6-NN3Jde2BFUih4OM9NlTwe9-Xlkw/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1ZnmAWgT6ewe7Vl3Xw_csuA)(5f2g) | -| deit_base_distilled_patch16_384| 85.43 | 97.33 | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1i5H_zjSdHfM-Znv89DHTv9ChykWrIt8I/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1PQsQIci4VCHY7l2tCzMklg)(qgj2) | -| shuffle_vit_tiny_patch4_window7| 82.39 | 96.05 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1ffJ-tG_CGVXztPEPQMaT_lUoc4hxFy__/view?usp=sharing)/[baidu](https://pan.baidu.com/s/19DhlLIFyPGOWtyq_c83ZGQ)(8a1i) | -| shuffle_vit_small_patch4_window7| 83.53 | 96.57 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1du9H0SKr0QH9GQjhWDOXOnhpSVpfbb8X/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1rM2J8BVwxQ3kRZoHngwNZA)(xwh3) | -| shuffle_vit_base_patch4_window7| 83.95 | 96.91 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1sYh808AyTG3-_qv6nfN6gCmyagsNAE6q/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1fks_IYDdnXdAkCFuYHW_Nw)(1gsr) | -| cswin_tiny_224 | 82.81 | 96.30 | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1l-JY0u7NGyD6SjkyiyNnDx3wFFT1nAYO/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1L5FqU7ImWAhQHAlSilqVAw)(4q3h) | -| cswin_small_224 | 83.60 | 96.58 | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/10eEBk3wvJdQ8Dy58LvQ11Wk1K2UfPy-E/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1FiaNiWyAuWu1IBsUFLUaAw)(gt1a) | -| cswin_base_224 | 84.23 | 96.91 | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1YufKh3DKol4-HrF-I22uiorXSZDIXJmZ/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1koy8hXyGwvgAfUxdlkWofg)(wj8p) | -| cswin_large_224 | 86.52 | 97.99 | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1V1hteGK27t1nI84Ac7jdWfydBLLo7Fxt/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1KgIX6btML6kPiPGkIzvyVA)(b5fs) | -| cswin_base_384 | 85.51 | 97.48 | 384 | 1.0 | bicubic | 
[google](https://drive.google.com/file/d/1qCaFItzFoTYBo-4UbGzL6M5qVDGmJt4y/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1WNkY7o_vP9KJ8cd5c7n2sQ)(rkf5) | -| cswin_large_384 | 87.49 | 98.35 | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1LRN_6qUz71yP-OAOpN4Lscb8fkUytMic/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1eCIpegPj1HIbJccPMaAsew)(6235) | -| t2t_vit_7 | 71.68 | 90.89 | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1YkuPs1ku7B_udydOf_ls1LQvpJDg_c_j/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1jVNsz37gatLCDaOoU3NaMA)(1hpa) | -| t2t_vit_10 | 75.15 | 92.80 | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1H--55RxliMDlOCekn7FpKrHDGsUkyrJZ/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1nbdb4PFMq4nsIp8HrNxLQg)(ixug) | -| t2t_vit_12 | 76.48 | 93.49 | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1stnIwOwaescaEcztaF1QjI4NK4jaqN7P/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1DcMzq9WeSwrS3epv6jKJXw)(qpbb) | -| t2t_vit_14 | 81.50 | 95.67 | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1HSvN3Csgsy7SJbxJYbkzjUx9guftkfZ1/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1wcfh22uopBv7pS7rKcH_iw)(c2u8) | -| t2t_vit_19 | 81.93 | 95.74 | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1eFnhaL6I33pHCQw2BaEE0Oet9CnjmUf_/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1Hpyc5hBYo1zqoXWpryegnw)(4in3) | -| t2t_vit_24 | 82.28 | 95.89 | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1Z7nZCHeFp0AhIkGYcMAFkKdkGN0yXtpv/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1Hpyc5hBYo1zqoXWpryegnw)(4in3) | -| t2t_vit_t_14 | 81.69 | 95.85 | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/16li4voStt_B8eWDXqJt7s20OT_Z8L263/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1Hpyc5hBYo1zqoXWpryegnw)(4in3) | -| t2t_vit_t_19 | 82.44 | 96.08 | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1Ty-42SYOu15Nk8Uo6VRTJ7J0JV_6t7zJ/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1YdQd6l8tj5xMCWvcHWm7sg)(mier) | -| t2t_vit_t_24 | 82.55 | 96.07 | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1cvvXrGr2buB8Np2WlVL7n_F1_CnI1qow/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1BMU3KX_TRmPxQ1jN5cmWhg)(6vxc) | -| t2t_vit_14_384 | 83.34 | 96.50 | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1Yuso8WD7Q8Lu_9I8dTvAvkcXXtPSkmnm/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1AOMhyVRF9zPqJe-lTrd7pw)(r685) | @@ -200,6 +392,16 @@ We also provide tutorials: |-------|-----------|---------|-------------------------------------------------------------------------------------------------------------------------------------------------------------| | DETR | ResNet50 | 42.0 | [google](https://drive.google.com/file/d/1ruIKCqfh_MMqzq_F4L2Bv-femDMjS_ix/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1J6lB1mezd6_eVW3jnmohZA)(n5gk) | | DETR | ResNet101 | 43.5 | [google](https://drive.google.com/file/d/11HCyDJKZLX33_fRGp4bCg1I14vrIKYW5/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1_msuuAwFMNbAlMpgUq89Og)(bxz2) | +| Mask R-CNN | Swin-T 1x | 43.7 | [google](https://drive.google.com/file/d/1OpbCH5HuIlxwakNz4PzrAlJF3CxkLSYp/view?usp=sharing)/[baidu](https://pan.baidu.com/s/18HALSo2RHMBsX-Gbsi-YOw)(qev7) | +| Mask R-CNN | Swin-T 3x | 46.0 | [google](https://drive.google.com/file/d/1oREwIk1ORhSsJcs4Y-Cfd0XrSEfPFP3-/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1tw607oogDWQ7Iz91ItfuGQ)(m8fg) | +| Mask R-CNN | Swin-S 3x | 48.4 | 
[google](https://drive.google.com/file/d/1ZPWkz0zMzHJycHd6_s2hWDHIsW8SdZcK/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1ubC5_CKSq0ExQSINohukVg)(hdw5) | +| Mask R-CNN | pvtv2_b0 | 38.3 | [google](https://drive.google.com/file/d/1wA324LkFtGezHJovSZ4luVqSxVt9woFc/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1q67ZIDSHn9Y-HU_WoQr8OQ)(3kqb) | +| Mask R-CNN | pvtv2_b1 | 41.8 | [google](https://drive.google.com/file/d/1alNaSmR4TSXsPpGoUZr2QQf5phYQjIzN/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1aSkuDiNpxdnFWE1Wn1SWNw)(k5aq) | +| Mask R-CNN | pvtv2_b2 | 45.2 | [google](https://drive.google.com/file/d/1tg6B5OEV4OWLsDxTCjsWgxgaSgIh4cID/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1DLwxCZVZizb5HKih7RFw2w)(jh8b) | +| Mask R-CNN | pvtv2_b2_linear | 44.1 | [google](https://drive.google.com/file/d/1b26vxK3QVGx5ovqKir77NyY6YPgAWAEj/view?usp=sharing)/[baidu](https://pan.baidu.com/s/16T-Nyo_Jm2yDq4aoXpdnbg)(8ipt) | +| Mask R-CNN | pvtv2_b3 | 46.9 | [google](https://drive.google.com/file/d/1H6ZUCixCaYe1AvlBkuqYoxzz4b-icJ3u/view?usp=sharing)/[baidu](https://pan.baidu.com/s/16QVsjUOXijo5d9cO3FZ39A)(je4y) | +| Mask R-CNN | pvtv2_b4 | 47.5 | [google](https://drive.google.com/file/d/1pXQNpn0BoKqiuVaGtJL18eWG6XmdlBOL/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1yhX7mpmb2wbRvWZFnUloBQ)(n3ay) | +| Mask R-CNN | pvtv2_b5 | 47.4 | [google](https://drive.google.com/file/d/12vOyw6pUfK1NdOWBF758aAZuaf-rZLvx/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1-gasQk9PqLMkrWXw4aX41g)(jzq1) | ### Semantic Segmentation ### #### Pascal Context #### @@ -243,11 +445,14 @@ We also provide tutorials: | UperNet | Swin_Tiny | 16 | 160k | 44.90 | 45.37 | - |[baidu](https://pan.baidu.com/s/1S8JR4ILw0u4I-DzU4MaeVQ)(lkhg) | [config](semantic_segmentation/configs/upernet_swin/upernet_swin_tiny_patch4_windown7_512x512_160k_ade20k.yaml) | | UperNet | Swin_Small | 16 | 160k | 47.88 | 48.90 | - |[baidu](https://pan.baidu.com/s/17RKeSpuWqONVptQZ3B4kEA)(vvy1) | [config](semantic_segmentation/configs/upernet_swin/upernet_swin_small_patch4_windown7_512x512_160k_ade20k.yaml) | | UperNet | Swin_Base | 16 | 160k | 48.59 | 49.04 | - |[baidu](https://pan.baidu.com/s/1bM15KHNsb0oSPblQwhxbgw)(y040) | [config](semantic_segmentation/configs/upernet_swin/upernet_swin_base_patch4_windown7_512x512_160k_ade20k.yaml) | +| UperNet | CSwin_Tiny | 16 | 160k | 49.46 | |[baidu](https://pan.baidu.com/s/1ol_gykZjgAFbJ3PkqQ2j0Q)(l1cp) | [baidu](https://pan.baidu.com/s/1gLePNLybtrax9yCQ2fcIPg)(y1eq) | [config](seman}tic_segmentation/configs/upernet_cswin/upernet_cswin_tiny_patch4_512x512_160k_ade20k.yaml) | +| UperNet | CSwin_Small | 16 | 160k | 50.88 | | [baidu](https://pan.baidu.com/s/1mSd_JdNS4DtyVNYxqVobBw)(6vwk) | [baidu](https://pan.baidu.com/s/1a_vhHoib0-BcRwTnnSVGWA)(fz2e) | [config](semantic_segmentation/configs/upernet_cswin/upernet_cswin_small_patch4_512x512_160k_ade20k.yaml) | +| UperNet | CSwin_Base | 16 | 160k | 50.64 | | [baidu](https://pan.baidu.com/s/1suO0jX_Tw56CVm3UhByOWg)(0ys7) | [baidu](https://pan.baidu.com/s/1Ym-RUooqizgUDEm5jWyrhA)(83w3) | [config](semantic_segmentation/configs/upernet_cswin/upernet_cswin_base_patch4_512x512_160k_ade20k.yaml) | #### Trans10kV2 #### |Model | Backbone | Batch_size | Iteration | mIoU (ss) | mIoU (ms+flip) | Backbone_checkpoint | Model_checkpoint | ConfigFile | |-----------|-----------|------------|-----------|-----------|----------------|-----------------------------------------------|-----------------------------------------------------------------------|------------| -|Trans2seg_Medium 
| Resnet50c | 16 | 80k | 72.25 | - | [google](https://drive.google.com/file/d/1C6nMg6DgQ73wzF21UwDVxmkcRTeKngnK/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1hs0tbSGIeMLLGMq05NN--w)(4dd5) | [google](https://drive.google.com/file/d/1zGEBEN27CQMgZBYqqAg_agJE6CPLOpYW/view?usp=sharing)/[baidu](https://pan.baidu.com/s/102GUBeoEPMqMEqF3smgyCA)(qcb0) | [config](semantic_segmentation/configs/trans2seg/Trans2Seg_medium_512x512_80k_trans10kv2_bs_16.yaml)| +|Trans2seg_Medium | Resnet50c | 16 | 16k | 75.97 | - | [google](https://drive.google.com/file/d/1C6nMg6DgQ73wzF21UwDVxmkcRTeKngnK/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1hs0tbSGIeMLLGMq05NN--w)(4dd5) | [google](https://drive.google.com/file/d/1C6nMg6DgQ73wzF21UwDVxmkcRTeKngnK/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1wdOUD6S8QGqD6S-98Yb37w)(w25r) | [config](semantic_segmentation/configs/trans2seg/Trans2Seg_medium_512x512_16k_trans10kv2_bs_16.yaml)| ### GAN ### | Model | FID | Image Size | Crop_pct | Interpolation | Model | @@ -260,7 +465,7 @@ We also provide tutorials: ## Quick Demo for Image Classification -To use the model with pretrained weights, go to the specific subfolder e.g., `/image_classification/ViT/`, then download the `.pdparam` weight file and change related file paths in the following python scripts. The model config files are located in `。、configs/`. +To use the model with pretrained weights, go to the specific subfolder e.g., `/image_classification/ViT/`, then download the `.pdparam` weight file and change related file paths in the following python scripts. The model config files are located in `./configs`. Assume the downloaded weight file is stored in `./vit_base_patch16_224.pdparams`, to use the `vit_base_patch16_224` model in python: ```python @@ -270,8 +475,8 @@ from visual_transformer import build_vit as build_model config = get_config('./configs/vit_base_patch16_224.yaml') # build model model = build_model(config) -# load pretrained weights, .pdparams is NOT needed -model_state_dict = paddle.load('./vit_base_patch16_224') +# load pretrained weights +model_state_dict = paddle.load('./vit_base_patch16_224.pdparams') model.set_dict(model_state_dict) ``` > :robot: See the README file in each model folder for detailed usages. @@ -286,12 +491,12 @@ or ```shell CUDA_VISIBLE_DEVICES=0 \ python main_single_gpu.py \ - -cfg='./configs/vit_base_patch16_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/vit_base_patch16_224.yaml \ + -dataset=imagenet2012 \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/val \ -eval \ - -pretrained='./vit_base_patch16_224' + -pretrained=/path/to/pretrained/model/vit_base_patch16_224 # .pdparams is NOT needed ```
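With the weights loaded as in the snippet above, a single image can be pushed through the model once the preprocessing matches the crop_pct / interpolation settings listed in the model tables. The sketch below is illustrative rather than taken from the repo: `demo.jpg` is a placeholder path and the ImageNet mean/std values are assumed defaults, so check the model folder's config and transform code for the exact preprocessing.

```python
# Illustrative sketch only: 'demo.jpg' and the mean/std constants are assumptions;
# get_config / build_model come from the loading snippet above.
import paddle
import paddle.nn.functional as F
from PIL import Image
from paddle.vision import transforms

from config import get_config
from visual_transformer import build_vit as build_model

config = get_config('./configs/vit_base_patch16_224.yaml')
model = build_model(config)
model.set_dict(paddle.load('./vit_base_patch16_224.pdparams'))
model.eval()

# 224 / 0.875 = 256: resize the short side, then center-crop, per the table settings
preprocess = transforms.Compose([
    transforms.Resize(256, interpolation='bicubic'),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open('demo.jpg').convert('RGB')).unsqueeze(0)  # [1, 3, 224, 224]
with paddle.no_grad():
    probs = F.softmax(model(img), axis=-1)
print('top-1 class:', probs.argmax(axis=-1).item(), 'prob:', probs.max().item())
```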
@@ -308,12 +513,12 @@ or ```shell CUDA_VISIBLE_DEVICES=0,1,2,3 \ python main_multi_gpu.py \ - -cfg='./configs/vit_base_patch16_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/vit_base_patch16_224.yaml \ + -dataset=imagenet2012 \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/val \ -eval \ - -pretrained='./vit_base_patch16_224' + -pretrained=/path/to/pretrained/model/vit_base_patch16_224 # .pdparams is NOT needed ```
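Under the hood, a multi-GPU evaluation script of this kind typically spawns one worker per visible GPU, shards the validation set with a `DistributedBatchSampler`, and sums the per-rank counts with `all_reduce`. The sketch below only illustrates that general pattern with a stand-in linear model and random data; it is not the repo's `main_multi_gpu.py`.

```python
# Minimal, self-contained sketch of the multi-GPU evaluation pattern
# (stand-in model and random data, NOT the repo's main_multi_gpu.py).
import paddle
import paddle.distributed as dist
from paddle.io import DataLoader, DistributedBatchSampler, TensorDataset


def evaluate():
    dist.init_parallel_env()                          # one process per visible GPU
    model = paddle.DataParallel(paddle.nn.Linear(8, 2))
    model.eval()

    # Stand-in data; in practice this would be the ImageNet validation set
    data = TensorDataset([paddle.randn([64, 8]), paddle.randint(0, 2, [64, 1])])
    sampler = DistributedBatchSampler(data, batch_size=16)  # shards the data by rank
    loader = DataLoader(data, batch_sampler=sampler)

    correct, total = paddle.zeros([1]), paddle.zeros([1])
    with paddle.no_grad():
        for x, y in loader:
            pred = model(x).argmax(axis=-1, keepdim=True)
            correct += (pred == y).astype('float32').sum()
            total += y.shape[0]

    dist.all_reduce(correct)                          # sum counts over all ranks
    dist.all_reduce(total)
    if dist.get_rank() == 0:
        print('acc@1:', (correct / total).item())


if __name__ == '__main__':
    dist.spawn(evaluate, nprocs=-1)                   # use every visible GPU
```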
@@ -328,10 +533,10 @@ or ```shell CUDA_VISIBLE_DEVICES=0 \ python main_single_gpu.py \ - -cfg='./configs/vit_base_patch16_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/vit_base_patch16_224.yaml \ + -dataset=imagenet2012 \ -batch_size=32 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/train ``` @@ -349,38 +554,16 @@ or ```shell CUDA_VISIBLE_DEVICES=0,1,2,3 \ python main_multi_gpu.py \ - -cfg='./configs/vit_base_patch16_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/vit_base_patch16_224.yaml \ + -dataset=imagenet2012 \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/train ``` -## Features ## -1. State-of-the-art - - State-of-the-art transformer models for multiple CV tasks - - State-of-the-art data processings and training methods - - We keep pushing it forward. - -2. Easy-to-use tools - - Easy configs for model vairants - - Modular design for utiliy functions and tools - - Low barrier for educators and practitioners - - Unified framework for all the models - -3. Easily customizable to your needs - - Examples for each model to reproduce the results - - Model implementations are exposed for you to customize - - Model files can be used independently for quick experiments - -4. High Performance - - DDP with a single GPU per process. - - Mixed-precision support (coming soon) - - ## Contributing ## * We encourage and appreciate your contribution to **PaddleViT** project, please refer to our workflow and work styles by [CONTRIBUTING.md](./CONTRIBUTING.md) diff --git a/README_cn.md b/README_cn.md new file mode 100644 index 00000000..cbb488d0 --- /dev/null +++ b/README_cn.md @@ -0,0 +1,577 @@ +简体中文 | [English](./README.md) + +# PaddlePaddle Vision Transformers # + +[![GitHub](https://img.shields.io/github/license/BR-IDL/PaddleViT?color=blue)](./LICENSE) +[![CodeFactor](https://www.codefactor.io/repository/github/br-idl/paddlevit/badge)](https://www.codefactor.io/repository/github/br-idl/paddlevit) +[![CLA assistant](https://cla-assistant.io/readme/badge/BR-IDL/PaddleViT)](https://cla-assistant.io/BR-IDL/PaddleViT) +[![GitHub Repo stars](https://img.shields.io/github/stars/BR-IDL/PaddleViT?style=social)](https://github.com/BR-IDL/PaddleViT/stargazers) + + +

+ +

+ +## State-of-the-art Visual Transformer and MLP Models for PaddlePaddle ## + +:robot: PaddlePaddle Visual Transformers (`PaddleViT` 或 `PPViT`) 为开发者提供视觉领域的高性能Transformer模型实现。 我们的主要实现基于Visual Transformers, Visual Attentions, 以及 MLPs等视觉模型算法。 此外,PaddleViT集成了PaddlePaddle 2.1+中常用的layers, utilities, optimizers, schedulers, 数据增强, 以及训练/评估脚本等。我们持续关注SOTA的ViT和MLP模型算法,并提供完整训练、测试代码。PaddleViT的核心任务是**为用户提供方便易用的CV领域前沿算法**。 + +:robot: PaddleViT 为多项视觉任务提供模型和工具,例如图像分类,目标检测,语义分割,GAN等。每个模型架构均在独立的Python模块中定义,以便于用户能够快速的开展研究和进行实验。同时,我们也提供了模型的预训练权重文件,以便您加载并使用自己的数据集进行微调。PaddleViT还集成了常用的工具和模块,用于自定义数据集、数据预处理,性能评估以及分布式训练等。 + +:robot: PaddleViT 基于深度学习框架 [PaddlePaddle](https://www.paddlepaddle.org/)进行开发, 我们同时在[Paddle AI Studio](https://aistudio.baidu.com/aistudio/index)上提供了项目教程(coming soon). 对于新用户能够简单易操作。 + + +## 视觉任务 ## +PaddleViT 提供了多项视觉任务的模型和工具,请访问以下链接以获取详细信息: +- [PaddleViT-Cls](./image_classification) 用于 图像分类 +- [PaddleViT-Det](./object_detection/DETR) 用于 目标检测 +- [PaddleViT-Seg](./semantic_segmentation) 用于 语义分割 +- [PaddleViT-GAN](./gan) 用于 生成对抗模型 + +我们同时提供对应教程: +- Notebooks (即将更新) +- Online Course (即将更新) + +## 项目特性 ## +1. **SOTA模型的完整实现** + - 提供多项CV任务的SOTA Transformer 模型 + - 提供高性能的数据处理和训练方法 + - 持续推出最新的SOTA算法的实现 + +2. **易于使用的工具** + - 通过简单配置即可实现对模型变体的实现 + - 将实用功能与工具进行模块化设计 + - 对于教育者和从业者的使用低门槛 + - 所有模型以统一框架实现 + +3. **符合用户的自定义需求** + - 提供每个模型的实现的最佳实践 + - 提供方便用户调整自定义配置的模型实现 + - 模型文件可以独立使用以便于用户快速复现算法 + +4. **高性能** + - 支持DDP (多进程训练/评估,其中每个进程在单个GPU上运行) + - 支持混合精度 support (AMP)训练策略 + + + +## ViT模型算法 ## + +### 图像分类 (Transformers) ### +1. **[ViT](./image_classification/ViT)** (from Google), released with paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929), by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. +2. **[DeiT](./image_classification/DeiT)** (from Facebook and Sorbonne), released with paper [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877), by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou. +3. **[Swin Transformer](./image_classification/SwinTransformer)** (from Microsoft), released with paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030), by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo. +4. **[VOLO](./image_classification/VOLO)** (from Sea AI Lab and NUS), released with paper [VOLO: Vision Outlooker for Visual Recognition](https://arxiv.org/abs/2106.13112), by Li Yuan, Qibin Hou, Zihang Jiang, Jiashi Feng, Shuicheng Yan. +5. **[CSwin Transformer](./image_classification/CSwin)** (from USTC and Microsoft), released with paper [CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows +](https://arxiv.org/abs/2107.00652), by Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, Baining Guo. +6. **[CaiT](./image_classification/CaiT)** (from Facebook and Sorbonne), released with paper [Going deeper with Image Transformers](https://arxiv.org/abs/2103.17239), by Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, Hervé Jégou. +7. 
**[PVTv2](./image_classification/PVTv2)** (from NJU/HKU/NJUST/IIAI/SenseTime), released with paper [PVTv2: Improved Baselines with Pyramid Vision Transformer](https://arxiv.org/abs/2106.13797), by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao. +8. **[Shuffle Transformer](./image_classification/Shuffle_Transformer)** (from Tencent), released with paper [Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer](https://arxiv.org/abs/2106.03650), by Zilong Huang, Youcheng Ben, Guozhong Luo, Pei Cheng, Gang Yu, Bin Fu. +9. **[T2T-ViT](./image_classification/T2T_ViT)** (from NUS and YITU), released with paper [Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet +](https://arxiv.org/abs/2101.11986), by Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zihang Jiang, Francis EH Tay, Jiashi Feng, Shuicheng Yan. +10. **[CrossViT](./image_classification/CrossViT)** (from IBM), released with paper [CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification](https://arxiv.org/abs/2103.14899), by Chun-Fu Chen, Quanfu Fan, Rameswar Panda. +11. **[BEiT](./image_classification/BEiT)** (from Microsoft Research), released with paper [BEiT: BERT Pre-Training of Image Transformers](https://arxiv.org/abs/2106.08254), by Hangbo Bao, Li Dong, Furu Wei. +12. **[Focal Transformer](./image_classification/Focal_Transformer)** (from Microsoft), released with paper [Focal Self-attention for Local-Global Interactions in Vision Transformers](https://arxiv.org/abs/2107.00641), by Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan and Jianfeng Gao. +13. **[Mobile-ViT](./image_classification/MobileViT)** (from Apple), released with paper [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178), by Sachin Mehta, Mohammad Rastegari. +14. **[ViP](./image_classification/ViP)** (from National University of Singapore), released with [Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition](https://arxiv.org/abs/2106.12368), by Qibin Hou and Zihang Jiang and Li Yuan and Ming-Ming Cheng and Shuicheng Yan and Jiashi Feng. +15. **[XCiT](./image_classification/XCiT)** (from Facebook/Inria/Sorbonne), released with paper [XCiT: Cross-Covariance Image Transformers](https://arxiv.org/abs/2106.09681), by Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, Hervé Jegou. +16. **[PiT](./image_classification/PiT)** (from NAVER/Sogan University), released with paper [Rethinking Spatial Dimensions of Vision Transformers](https://arxiv.org/abs/2103.16302), by Byeongho Heo, Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Junsuk Choe, Seong Joon Oh. +17. **[HaloNet](./image_classification/HaloNet)**, (from Google), released with paper [Scaling Local Self-Attention for Parameter Efficient Visual Backbones](https://arxiv.org/abs/2103.12731), by Ashish Vaswani, Prajit Ramachandran, Aravind Srinivas, Niki Parmar, Blake Hechtman, Jonathon Shlens. +18. **[PoolFormer](./image_classification/PoolFormer)**, (from Sea AI Lab/NUS), released with paper [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418), by Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, Shuicheng Yan. +19. 
**[BoTNet](./image_classification/BoTNet)**, (from UC Berkeley/Google), released with paper [Bottleneck Transformers for Visual Recognition](https://arxiv.org/abs/2101.11605), by Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, Ashish Vaswani. +20. **[CvT](./image_classification/CvT)** (from McGill/Microsoft), released with paper [CvT: Introducing Convolutions to Vision Transformers](https://arxiv.org/abs/2103.15808), by Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, Lei Zhang +21. **[HvT](./image_classification/HVT)** (from Monash University), released with paper [Scalable Vision Transformers with Hierarchical Pooling](https://arxiv.org/abs/2103.10619), by Zizheng Pan, Bohan Zhuang, Jing Liu, Haoyu He, Jianfei Cai. + + + +### 图像分类 (MLP & others) ### +1. **[MLP-Mixer](./image_classification/MLP-Mixer)** (from Google), released with paper [MLP-Mixer: An all-MLP Architecture for Vision](https://arxiv.org/abs/2105.01601), by Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy +2. **[ResMLP](./image_classification/ResMLP)** (from Facebook/Sorbonne/Inria/Valeo), released with paper [ResMLP: Feedforward networks for image classification with data-efficient training](https://arxiv.org/abs/2105.03404), by Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Gautier Izacard, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, Hervé Jégou. +3. **[gMLP](./image_classification/gMLP)** (from Google), released with paper [Pay Attention to MLPs](https://arxiv.org/abs/2105.08050), by Hanxiao Liu, Zihang Dai, David R. So, Quoc V. Le. +4. **[FF Only](./image_classification/FF_Only)** (from Oxford), released with paper [Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet](https://arxiv.org/abs/2105.02723), by Luke Melas-Kyriazi. +5. **[RepMLP](./image_classification/RepMLP)** (from BNRist/Tsinghua/MEGVII/Aberystwyth), released with paper [RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition](https://arxiv.org/abs/2105.01883), by Xiaohan Ding, Chunlong Xia, Xiangyu Zhang, Xiaojie Chu, Jungong Han, Guiguang Ding. +6. **[CycleMLP](./image_classification/CycleMLP)** (from HKU/SenseTime), released with paper [CycleMLP: A MLP-like Architecture for Dense Prediction](https://arxiv.org/abs/2107.10224), by Shoufa Chen, Enze Xie, Chongjian Ge, Ding Liang, Ping Luo. +7. **[ConvMixer](./image_classification/ConvMixer)** (from Anonymous), released with [Patches Are All You Need?](https://openreview.net/forum?id=TVHS5Y4dNvM), by Anonymous. +8. **[ConvMLP](./image_classification/ConvMLP)** (from UO/UIUC/PAIR), released with [ConvMLP: Hierarchical Convolutional MLPs for Vision](https://arxiv.org/abs/2109.04454), by Jiachen Li, Ali Hassani, Steven Walton, Humphrey Shi. + + +#### 即将更新: #### +1. **[DynamicViT]()** (from Tsinghua/UCLA/UW), released with paper [DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification](https://arxiv.org/abs/2106.02034), by Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, Cho-Jui Hsieh. + + + +### 目标检测 ### +1. 
**[DETR](./object_detection/DETR)** (from Facebook), released with paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872), by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko.
+2. **[Swin Transformer](./object_detection/Swin)** (from Microsoft), released with paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030), by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
+3. **[PVTv2](./object_detection/PVTv2)** (from NJU/HKU/NJUST/IIAI/SenseTime), released with paper [PVTv2: Improved Baselines with Pyramid Vision Transformer](https://arxiv.org/abs/2106.13797), by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao.
+
+#### 即将更新: ####
+1. **[Focal Transformer]()** (from Microsoft), released with paper [Focal Self-attention for Local-Global Interactions in Vision Transformers](https://arxiv.org/abs/2107.00641), by Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan and Jianfeng Gao.
+2. **[UP-DETR]()** (from Tencent), released with paper [UP-DETR: Unsupervised Pre-training for Object Detection with Transformers](https://arxiv.org/abs/2011.09094), by Zhigang Dai, Bolun Cai, Yugeng Lin, Junying Chen.
+
+
+
+
+### 目标分割 ###
+#### 现有模型: ####
+1. **[SETR](./semantic_segmentation)** (from Fudan/Oxford/Surrey/Tencent/Facebook), released with paper [Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers](https://arxiv.org/abs/2012.15840), by Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H.S. Torr, Li Zhang.
+2. **[DPT](./semantic_segmentation)** (from Intel), released with paper [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413), by René Ranftl, Alexey Bochkovskiy, Vladlen Koltun.
+3. **[Swin Transformer](./semantic_segmentation)** (from Microsoft), released with paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030), by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
+4. **[Segmenter](./semantic_segmentation)** (from Inria), released with paper [Segmenter: Transformer for Semantic Segmentation](https://arxiv.org/pdf/2105.05633.pdf), by Robin Strudel, Ricardo Garcia, Ivan Laptev, Cordelia Schmid.
+5. **[Trans2seg](./semantic_segmentation)** (from HKU/Sensetime/NJU), released with paper [Segmenting Transparent Object in the Wild with Transformer](https://arxiv.org/pdf/2101.08461.pdf), by Enze Xie, Wenjia Wang, Wenhai Wang, Peize Sun, Hang Xu, Ding Liang, Ping Luo.
+6. **[SegFormer](./semantic_segmentation)** (from HKU/NJU/NVIDIA/Caltech), released with paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203), by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
+7. **[CSwin Transformer]()** (from USTC and Microsoft), released with paper [CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows](https://arxiv.org/abs/2107.00652), by Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, Baining Guo.
+
+#### 即将更新: ####
+1. **[FTN]()** (from Baidu), released with paper [Fully Transformer Networks for Semantic Image Segmentation](https://arxiv.org/pdf/2106.04108.pdf), by Sitong Wu, Tianyi Wu, Fangjian Lin, Shengwei Tian, Guodong Guo.
+2. **[Shuffle Transformer]()** (from Tencent), released with paper [Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer](https://arxiv.org/abs/2106.03650), by Zilong Huang, Youcheng Ben, Guozhong Luo, Pei Cheng, Gang Yu, Bin Fu.
+3. **[Focal Transformer]()** (from Microsoft), released with paper [Focal Self-attention for Local-Global Interactions in Vision Transformers](https://arxiv.org/abs/2107.00641), by Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan and Jianfeng Gao.
+
+
+### GAN ###
+1. **[TransGAN](./gan/transGAN)** (from Seoul National University and NUUA), released with paper [TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up](https://arxiv.org/abs/2102.07074), by Yifan Jiang, Shiyu Chang, Zhangyang Wang.
+2. **[Styleformer](./gan/Styleformer)** (from Facebook and Sorbonne), released with paper [Styleformer: Transformer based Generative Adversarial Networks with Style Vector](https://arxiv.org/abs/2106.07023), by Jeeseung Park, Younggeun Kim.
+#### 即将更新: ####
+1. **[ViTGAN]()** (from UCSD/Google), released with paper [ViTGAN: Training GANs with Vision Transformers](https://arxiv.org/pdf/2107.04589), by Kwonjoon Lee, Huiwen Chang, Lu Jiang, Han Zhang, Zhuowen Tu, Ce Liu.
+
+
+
+## 安装
+### 准备
+* Linux/MacOS/Windows
+* Python 3.6/3.7
+* PaddlePaddle 2.1.0+
+* CUDA10.2+
+> 注意: 建议安装最新版本的 PaddlePaddle 以避免训练PaddleViT时出现一些 CUDA 错误。 PaddlePaddle稳定版安装请参考[链接](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html), PaddlePaddle开发版安装请参考[链接](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/develop/install/pip/linux-pip.html#gpu).
+### 安装
+1. 创建Conda虚拟环境并激活.
+   ```shell
+   conda create -n paddlevit python=3.7 -y
+   conda activate paddlevit
+   ```
+2. 按照官方说明安装 PaddlePaddle, e.g.,
+   ```shell
+   conda install paddlepaddle-gpu==2.1.2 cudatoolkit=10.2 --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/
+   ```
+   > 注意: 请根据您的环境更改 paddlepaddle 版本 和 cuda 版本.
+
+3. 安装依赖项.
+   * 通用的依赖项:
+     ```
+     pip install yacs pyyaml
+     ```
+   * 分割需要的依赖项:
+     ```
+     pip install cityscapesScripts
+     ```
+     安装 `detail` package:
+     ```shell
+     git clone https://github.com/ccvl/detail-api
+     cd detail-api/PythonAPI
+     make
+     make install
+     ```
+   * GAN需要的依赖项:
+     ```
+     pip install lmdb
+     ```
+4. 
从GitHub克隆项目 + ``` + git clone https://github.com/BR-IDL/PaddleViT.git + ``` + + +## 预训练模型和下载 (Model Zoo) ## +### 图像分类 ### +| Model | Acc@1 | Acc@5 | #Params | FLOPs | Image Size | Crop pct | Interp | Link | +|-------------------------------|-------|-------|---------|--------|------------|----------|---------------|--------------| +| vit_base_patch32_224 | 80.68 | 95.61 | 88.2M | 4.4G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1DPEhEuu9sDdcmOPukQbR7ZcHq2bxx9cr/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1ppOLj5SWlJmA-NjoLCoYIw)(ubyr) | +| vit_base_patch32_384 | 83.35 | 96.84 | 88.2M | 12.7G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1nCOSwrDiFBFmTkLEThYwjL9SfyzkKoaf/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1jxnL00ocpmdiPM4fOu4lpg)(3c2f) | +| vit_base_patch16_224 | 84.58 | 97.30 | 86.4M | 17.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/13D9FqU4ISsGxWXURgKW9eLOBV-pYPr-L/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1ms3o2fHMQpIoVqnEHitRtA)(qv4n) | +| vit_base_patch16_384 | 85.99 | 98.00 | 86.4M | 49.8G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1kWKaAgneDx0QsECxtf7EnUdUZej6vSFT/view?usp=sharing)/[baidu](https://pan.baidu.com/s/15ggLdiL98RPcz__SXorrXA)(wsum) | +| vit_large_patch16_224 | 85.81 | 97.82 | 304.1M | 59.9G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1jgwtmtp_cDWEhZE-FuWhs7lCdpqhAMft/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1HRxUJAwEiKgrWnJSjHyU0A)(1bgk) | +| vit_large_patch16_384 | 87.08 | 98.30 | 304.1M | 175.9G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1zfw5mdiIm-mPxxQddBFxt0xX-IR-PF2U/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1KvxfIpMeitgXAUZGr5HV8A)(5t91) | +| vit_large_patch32_384 | 81.51 | 96.09 | 306.5M | 44.4G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1Py1EX3E35jL7DComW-29Usg9788BB26j/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1W8sUs0pObOGpohP4vsT05w)(ieg3) | +| | | | | | | | | | +| swin_t_224 | 81.37 | 95.54 | 28.3M | 4.4G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1v_wzWv3TaQ0RKkKwRQwuDPzwpOb_jGEs/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1tbc751RVh3fIRsrLzrmeOw)(h2ac) | +| swin_s_224 | 83.21 | 96.32 | 49.6M | 8.6G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1lrODzr8zIOU9sBrH2x3zolMOS4mv4o7x/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1rlXL0tjLWbWnkIt_2Ne8Jw)(ydyx) | +| swin_b_224 | 83.60 | 96.46 | 87.7M | 15.3G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1hjEVODThNEDAlIqkg8C1KzUh3KsVNu6R/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1ucSHBiuiG2sHAmR1N1JENQ)(h4y6) | +| swin_b_384 | 84.48 | 96.89 | 87.7M | 45.5G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1szLgwhB6WJu02Me6Uyz94egk8SqKlNsd/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1t0oXbqKNwpUAMJV7VTzcNw)(7nym) | +| swin_b_224_22kto1k | 85.27 | 97.56 | 87.7M | 15.3G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1FhdlheMUlJzrZ7EQobpGRxd3jt3aQniU/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1KBocL_M6YNW1ZsK-GYFiNw)(6ur8) | +| swin_b_384_22kto1k | 86.43 | 98.07 | 87.7M | 45.5G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1zVwIrJmtuBSiSVQhUeblRQzCKx-yWNCA/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1NziwdsEJtmjfGCeUFgtZXA)(9squ) | +| swin_l_224_22kto1k | 86.32 | 97.90 | 196.4M | 34.3G | 224 | 0.9 | bicubic | 
[google](https://drive.google.com/file/d/1yo7rkxKbQ4izy2pY5oQ5QAnkyv7zKcch/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1GsUJbSkGxlGsBYsayyKjVg)(nd2f) | +| swin_l_384_22kto1k | 87.14 | 98.23 | 196.4M | 100.9G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1-6DEvkb-FMz72MyKtq9vSPKYBqINxoKK/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1JLdS0aTl3I37oDzGKLFSqA)(5g5e) | +| | | | | | | | | | +| deit_tiny_distilled_224 | 74.52 | 91.90 | 5.9M | 1.1G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1fku9-11O_gQI7UpZTjagVeND-pcHbV0C/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1hAQ_85wWkqQ7sIGO1CmO9g)(rhda) | +| deit_small_distilled_224 | 81.17 | 95.41 | 22.4M | 4.3G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1RIeWTdf5o6pwkjqN4NbW91GZSOCalI5t/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1wCVrukvwxISAGGjorPw3iw)(pv28) | +| deit_base_distilled_224 | 83.32 | 96.49 | 87.2M | 17.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/12_x6-NN3Jde2BFUih4OM9NlTwe9-Xlkw/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1ZnmAWgT6ewe7Vl3Xw_csuA)(5f2g) | +| deit_base_distilled_384 | 85.43 | 97.33 | 87.2M | 49.9G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1i5H_zjSdHfM-Znv89DHTv9ChykWrIt8I/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1PQsQIci4VCHY7l2tCzMklg)(qgj2) | +| | | | | | | | | | +| volo_d1_224 | 84.12 | 96.78 | 26.6M | 6.6G | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1kNNtTh7MUWJpFSDe_7IoYTOpsZk5QSR9/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1EKlKl2oHi_24eaiES67Bgw)(xaim) | +| volo_d1_384 | 85.24 | 97.21 | 26.6M | 19.5G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1fku9-11O_gQI7UpZTjagVeND-pcHbV0C/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1qZWoFA7J89i2aujPItEdDQ)(rr7p) | +| volo_d2_224 | 85.11 | 97.19 | 58.6M | 13.7G | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1KjKzGpyPKq6ekmeEwttHlvOnQXqHK1we/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1JCK0iaYtiOZA6kn7e0wzUQ)(d82f) | +| volo_d2_384 | 86.04 | 97.57 | 58.6M | 40.7G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1uLLbvwNK8N0y6Wrq_Bo8vyBGSVhehVmq/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1e7H5aa6miGpCTCgpK0rm0w)(9cf3) | +| volo_d3_224 | 85.41 | 97.26 | 86.2M | 19.8G | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1OtOX7C29fJ20ESKQnYGevp4euxhmXKAT/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1vhARtV2wfI6EFf0Ap71xwg)(a5a4) | +| volo_d3_448 | 86.50 | 97.71 | 86.2M | 80.3G | 448 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1lHlYhra1NNp0dp4NWaQ9SMNNmw-AxBNZ/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1Q6KiQw4Vu1GPm5RF9_eycg)(uudu) | +| volo_d4_224 | 85.89 | 97.54 | 192.8M | 42.9G | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/16oXN7xuy-mkpfeD-loIVOK95PfptHhpX/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1PE83ZLd5evkKmHJ1V2KDsg)(vcf2) | +| volo_d4_448 | 86.70 | 97.85 | 192.8M | 172.5G | 448 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1N9-1OhPewA5TBR9CX5oA10obDS8e4Cfa/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1QoJ2Sqe1SK9hxbmV4uZiyg)(nd4n) | +| volo_d5_224 | 86.08 | 97.58 | 295.3M | 70.6G | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1fcrvOGbAmKUhqJT-pU3MVJZQJIe4Qina/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1nqDcXMW00v9PKr3RQI-g1w)(ymdg) | +| volo_d5_448 | 86.92 | 97.88 | 295.3M | 283.8G | 448 | 1.0 | bicubic | 
[google](https://drive.google.com/file/d/1aFXEkpfLhmQlDQHUYCuFL8SobhxUzrZX/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1K4FBv6fnyMGcAXhyyybhgw)(qfcc) | +| volo_d5_512 | 87.05 | 97.97 | 295.3M | 371.3G | 512 | 1.15 | bicubic | [google](https://drive.google.com/file/d/1CS4-nv2c9FqOjMz7gdW5i9pguI79S6zk/view?usp=sharing)/[baidu](https://pan.baidu.com/s/16Wseyiqvv0MQJV8wwFDfSA)(353h) | +| | | | | | | | | | +| cswin_tiny_224 | 82.81 | 96.30 | 22.3M | 4.2G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1l-JY0u7NGyD6SjkyiyNnDx3wFFT1nAYO/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1L5FqU7ImWAhQHAlSilqVAw)(4q3h) | +| cswin_small_224 | 83.60 | 96.58 | 34.6M | 6.5G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/10eEBk3wvJdQ8Dy58LvQ11Wk1K2UfPy-E/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1FiaNiWyAuWu1IBsUFLUaAw)(gt1a) | +| cswin_base_224 | 84.23 | 96.91 | 77.4M | 14.6G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1YufKh3DKol4-HrF-I22uiorXSZDIXJmZ/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1koy8hXyGwvgAfUxdlkWofg)(wj8p) | +| cswin_base_384 | 85.51 | 97.48 | 77.4M | 43.1G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1qCaFItzFoTYBo-4UbGzL6M5qVDGmJt4y/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1WNkY7o_vP9KJ8cd5c7n2sQ)(rkf5) | +| cswin_large_224 | 86.52 | 97.99 | 173.3M | 32.5G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1V1hteGK27t1nI84Ac7jdWfydBLLo7Fxt/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1KgIX6btML6kPiPGkIzvyVA)(b5fs) | +| cswin_large_384 | 87.49 | 98.35 | 173.3M | 96.1G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1LRN_6qUz71yP-OAOpN4Lscb8fkUytMic/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1eCIpegPj1HIbJccPMaAsew)(6235) | +| | | | | | | | | | +| cait_xxs24_224 | 78.38 | 94.32 | 11.9M | 2.2G | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1LKsQUr824oY4E42QeUEaFt41I8xHNseR/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1YIaBLopKIK5_p7NlgWHpGA)(j9m8) | +| cait_xxs36_224 | 79.75 | 94.88 | 17.2M | 33.1G | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1zZx4aQJPJElEjN5yejUNsocPsgnd_3tS/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1pdyFreRRXUn0yPel00-62Q)(nebg) | +| cait_xxs24_384 | 80.97 | 95.64 | 11.9M | 6.8G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1J27ipknh_kwqYwR0qOqE9Pj3_bTcTx95/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1uYSDzROqCVT7UdShRiiDYg)(2j95) | +| cait_xxs36_384 | 82.20 | 96.15 | 17.2M | 10.1G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/13IvgI3QrJDixZouvvLWVkPY0J6j0VYwL/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1GafA8B6T3h_vtmNNq2HYKg)(wx5d) | +| cait_s24_224 | 83.45 | 96.57 | 46.8M | 8.7G | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1sdCxEw328yfPJArf6Zwrvok-91gh7PhS/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1BPsAMEcrjtnbOnVDQwZJYw)(m4pn) | +| cait_xs24_384 | 84.06 | 96.89 | 26.5M | 15.1G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1zKL6cZwqmvuRMci-17FlKk-lA-W4RVte/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1w10DPJvK8EwhOCm-tZUpww)(scsv) | +| cait_s24_384 | 85.05 | 97.34 | 46.8M | 26.5G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1klqBDhJDgw28omaOpgzInMmfeuDa7NAi/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1-aNO6c7Ipm9x1hJY6N6G2g)(dnp7) | +| cait_s36_384 | 85.45 | 97.48 | 68.1M | 39.5G | 384 | 1.0 | bicubic | 
[google](https://drive.google.com/file/d/1m-55HryznHbiUxG38J2rAa01BYcjxsRZ/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1-uWg-JHLEKeMukFFctoufg)(e3ui) | +| cait_m36_384 | 86.06 | 97.73 | 270.7M | 156.2G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1WJjaGiONX80KBHB3YN8mNeusPs3uDhR2/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1aZ9bEU5AycmmfmHAqZIaLA)(r4hu) | +| cait_m48_448 | 86.49 | 97.75 | 355.8M | 287.3G | 448 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1lJSP__dVERBNFnp7im-1xM3s_lqEe82-/view?usp=sharing)/[baidu](https://pan.baidu.com/s/179MA3MkG2qxFle0K944Gkg)(imk5) | +| | | | | | | | | | +| pvtv2_b0 | 70.47 | 90.16 | 3.7M | 0.6G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1wkx4un6y7V87Rp_ZlD4_pV63QRst-1AE/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1mab4dOtBB-HsdzFJYrvgjA)(dxgb) | +| pvtv2_b1 | 78.70 | 94.49 | 14.0M | 2.1G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/11hqLxL2MTSnKPb-gp2eMZLAzT6q2UsmG/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1Ur0s4SEOxVqggmgq6AM-sQ)(2e5m) | +| pvtv2_b2 | 82.02 | 95.99 | 25.4M | 4.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1-KY6NbS3Y3gCaPaUam0v_Xlk1fT-N1Mz/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1FWx0QB7_8_ikrPIOlL7ung)(are2) | +| pvtv2_b2_linear | 82.06 | 96.04 | 22.6M | 3.9G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1hC8wE_XanMPi0_y9apEBKzNc4acZW5Uy/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1IAhiiaJPe-Lg1Qjxp2p30w)(a4c8) | +| pvtv2_b3 | 83.14 | 96.47 | 45.2M | 6.8G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/16yYV8x7aKssGYmdE-YP99GMg4NKGR5j1/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1ge0rBsCqIcpIjrVxsrFhnw)(nc21) | +| pvtv2_b4 | 83.61 | 96.69 | 62.6M | 10.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1gvPdvDeq0VchOUuriTnnGUKh0N2lj-fA/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1VMSD_Kr_hduCZ5dxmDbLoA)(tthf) | +| pvtv2_b5 | 83.77 | 96.61 | 82.0M | 11.5G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1OHaHiHN_AjsGYBN2gxFcQCDhBbTvZ02g/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1ey4agxI2Nb0F6iaaX3zAbA)(9v6n) | +| | | | | | | | | | +| shuffle_vit_tiny | 82.39 | 96.05 | 28.5M | 4.6G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1ffJ-tG_CGVXztPEPQMaT_lUoc4hxFy__/view?usp=sharing)/[baidu](https://pan.baidu.com/s/19DhlLIFyPGOWtyq_c83ZGQ)(8a1i) | +| shuffle_vit_small | 83.53 | 96.57 | 50.1M | 8.8G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1du9H0SKr0QH9GQjhWDOXOnhpSVpfbb8X/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1rM2J8BVwxQ3kRZoHngwNZA)(xwh3) | +| shuffle_vit_base | 83.95 | 96.91 | 88.4M | 15.5G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1sYh808AyTG3-_qv6nfN6gCmyagsNAE6q/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1fks_IYDdnXdAkCFuYHW_Nw)(1gsr) | +| | | | | | | | | | +| t2t_vit_7 | 71.68 | 90.89 | 4.3M | 1.0G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1YkuPs1ku7B_udydOf_ls1LQvpJDg_c_j/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1jVNsz37gatLCDaOoU3NaMA)(1hpa) | +| t2t_vit_10 | 75.15 | 92.80 | 5.8M | 1.3G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1H--55RxliMDlOCekn7FpKrHDGsUkyrJZ/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1nbdb4PFMq4nsIp8HrNxLQg)(ixug) | +| t2t_vit_12 | 76.48 | 93.49 | 6.9M | 1.5G | 224 | 0.9 | bicubic | 
[google](https://drive.google.com/file/d/1stnIwOwaescaEcztaF1QjI4NK4jaqN7P/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1DcMzq9WeSwrS3epv6jKJXw)(qpbb) | +| t2t_vit_14 | 81.50 | 95.67 | 21.5M | 4.4G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1HSvN3Csgsy7SJbxJYbkzjUx9guftkfZ1/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1wcfh22uopBv7pS7rKcH_iw)(c2u8) | +| t2t_vit_19 | 81.93 | 95.74 | 39.1M | 7.8G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1eFnhaL6I33pHCQw2BaEE0Oet9CnjmUf_/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1Hpyc5hBYo1zqoXWpryegnw)(4in3) | +| t2t_vit_24 | 82.28 | 95.89 | 64.0M | 12.8G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1Z7nZCHeFp0AhIkGYcMAFkKdkGN0yXtpv/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1Hpyc5hBYo1zqoXWpryegnw)(4in3) | +| t2t_vit_t_14 | 81.69 | 95.85 | 21.5M | 4.4G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/16li4voStt_B8eWDXqJt7s20OT_Z8L263/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1Hpyc5hBYo1zqoXWpryegnw)(4in3) | +| t2t_vit_t_19 | 82.44 | 96.08 | 39.1M | 7.9G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1Ty-42SYOu15Nk8Uo6VRTJ7J0JV_6t7zJ/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1YdQd6l8tj5xMCWvcHWm7sg)(mier) | +| t2t_vit_t_24 | 82.55 | 96.07 | 64.0M | 12.9G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1cvvXrGr2buB8Np2WlVL7n_F1_CnI1qow/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1BMU3KX_TRmPxQ1jN5cmWhg)(6vxc) | +| t2t_vit_14_384 | 83.34 | 96.50 | 21.5M | 13.0G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1Yuso8WD7Q8Lu_9I8dTvAvkcXXtPSkmnm/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1AOMhyVRF9zPqJe-lTrd7pw)(r685) | +| | | | | | | | | | +| cross_vit_tiny_224 | 73.20 | 91.90 | 6.9M | 1.3G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1ILTVwQtetcb_hdRjki2ZbR26p-8j5LUp/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1byeUsM34_gFL0jVr5P5GAw)(scvb) | +| cross_vit_small_224 | 81.01 | 95.33 | 26.7M | 5.2G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1ViOJiwbOxTbk1V2Go7PlCbDbWPbjWPJH/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1I9CrpdPU_D5LniqIVBoIPQ)(32us) | +| cross_vit_base_224 | 82.12 | 95.87 | 104.7M | 20.2G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1vTorkc63O4JE9cYUMHBRxFMDOFoC-iK7/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1TR_aBHQ2n1J0RgHFoVh_bw)(jj2q) | +| cross_vit_9_224 | 73.78 | 91.93 | 8.5M | 1.6G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1UCX9_mJSx2kDAmEd_xDXyd4e6-Mg3RPf/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1M8r5vqMHJ-rFwBoW1uL2qQ)(mjcb) | +| cross_vit_15_224 | 81.51 | 95.72 | 27.4M | 5.2G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1HwkLWdz6A3Nz-dVbw4ZUcCkxUbPXgHwM/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1wiO_Gjk4fvSq08Ud8xKwVw)(n55b) | +| cross_vit_18_224 | 82.29 | 96.00 | 43.1M | 8.3G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1C4b_a_6ia8NCEXSUEMDdCEFzedr0RB_m/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1w7VJ7DNqq6APuY7PdlKEjA)(xese) | +| cross_vit_9_dagger_224 | 76.92 | 93.61 | 8.7M | 1.7G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1_cXQ0M8Hr9UyugZk07DrsBl8dwwCA6br/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1F1tRSaG4EfCV_WiTEwXxBw)(58ah) | +| cross_vit_15_dagger_224 | 82.23 | 95.93 | 28.1M | 5.6G | 224 | 0.875 | bicubic | 
[google](https://drive.google.com/file/d/1cCgBoozh2WFtSz42LwEUUPPyC5KmkAFg/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1xJ4P2zy3r9RcNFSMtzvZgg)(qwup) | +| cross_vit_18_dagger_224 | 82.51 | 96.03 | 44.1M | 8.7G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1sdAbWxKL5k3QIo1zdgHzasIOtpy_Ogpw/view?usp=sharing)/[baidu](https://pan.baidu.com/s/15qYHgt0iRxdhtXoC_ct2Jg)(qtw4) | +| cross_vit_15_dagger_384 | 83.75 | 96.75 | 28.1M | 16.4G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/12LQjYbs9-LyrY1YeRt46x9BTB3NJuhpJ/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1d-BAm03azLP_CyEHF3c7ZQ)(w71e) | +| cross_vit_18_dagger_384 | 84.17 | 96.82 | 44.1M | 25.8G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1CeGwB6Tv0oL8QtL0d7Ar-d02Lg_PqACr/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1l_6PTldZ3IDB7XWgjM6LhA)(99b6) | +| | | | | | | | | | +| beit_base_patch16_224_pt22k | 85.21 | 97.66 | 87M | 12.7G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1lq5NeQRDHkIQi7U61OidaLhNsXTWfh_Z/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1pjblqaESqfXVrpgo58oR6Q)(fshn) | +| beit_base_patch16_384_pt22k | 86.81 | 98.14 | 87M | 37.3G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1wn2NS7kUdlERkzWEDeyZKmcRbmWL7TR2/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1WVbNjxuIUh514pKAgZZEzg)(arvc) | +| beit_large_patch16_224_pt22k | 87.48 | 98.30 | 304M | 45.0G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/11OR1FKxzfafqT7GzTW225nIQjxmGSbCm/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1bvhERVXN2TyRcRJFzg7sIA)(2ya2) | +| beit_large_patch16_384_pt22k | 88.40 | 98.60 | 304M | 131.7G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/10EraafYS8CRpEshxClOmE2S1eFCULF1Y/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1H76G2CGLY3YmmYt4-suoRA)(qtrn) | +| beit_large_patch16_512_pt22k | 88.60 | 98.66 | 304M | 234.0G | 512 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1xIIocftsB1PcDHZttPqLdrJ-G4Tyfrs-/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1WtTVK_Wvg-izaF0M6Gzw-Q)(567v) | +| | | | | | | | | | +| Focal-T | 82.03 | 95.86 | 28.9M | 4.9G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1HzZJbYH_eIo94h0wLUhqTyJ6AYthNKRh/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1JCr2qIA-SZvTqbTO-m2OwA)(i8c2) | +| Focal-T (use conv) | 82.70 | 96.14 | 30.8M | 4.9G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1PS0-gdXHGl95LqH5k5DG62AH6D3i7v0D/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1tVztox4bVJuJEjkD1fLaHQ)(smrk) | +| Focal-S | 83.55 | 96.29 | 51.1M | 9.4G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1HnVAYsI_hmiomyS4Ax3ccPE7gk4mlTU8/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1b7uugAY9RhrgTkUwYcvvow)(dwd8) | +| Focal-S (use conv) | 83.85 | 96.47 | 53.1M | 9.4G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1vcHjYiGNMayoSTPoM8z39XRH6h89TB9V/view?usp=sharing)/[baidu](https://pan.baidu.com/s/174a2aZzCEt3teLuAnIzMtA)(nr7n) | +| Focal-B | 83.98 | 96.48 | 89.8M | 16.4G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1bNMegxetWpwZNcmDEC3MHCal6SNXSgWR/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1piBslNhxWR78aQJIdoZjEw)(8akn) | +| Focal-B (use conv) | 84.18 | 96.61 | 93.3M | 16.4G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1-J2gDnKrvZGtasvsAYozrbMXR2LtIJ43/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1GTLfnTlt6I6drPdfSWB1Iw)(5nfi) | +| | | | | | | | 
| | +| mobilevit_xxs | 70.31| 89.68 | 1.32M | 0.44G | 256 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1l3L-_TxS3QisRUIb8ohcv318vrnrHnWA/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1KFZ5G834_-XXN33W67k8eg)(axpc) | +| mobilevit_xs | 74.47| 92.02 | 2.33M | 0.95G | 256 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1oRMA4pNs2Ba0LYDbPufC842tO4OFcgwq/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1IP8S-S6ZAkiL0OEsiBWNkw)(hfhm) | +| mobilevit_s | 76.74| 93.08 | 5.59M | 1.88G | 256 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1ibkhsswGYWvZwIRjwfgNA4-Oo2stKi0m/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1-rI6hiCHZaI7os2siFASNg)(34bg) | +| mobilevit_s $\dag$ | 77.83| 93.83 | 5.59M | 1.88G | 256 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1BztBJ5jzmqgDWfQk-FB_ywDWqyZYu2yG/view?usp=sharing)/[baidu](https://pan.baidu.com/s/19YepMAO-sveBOLA4aSjIEQ?pwd=92ic)(92ic) | +| | | | | | | | | | +| vip_s7 | 81.50 | 95.76 | 25.1M | 7.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/16bZkqzbnN08_o15k3MzbegK8SBwfQAHF/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1uY0FsNPYaM8cr3ZCdAoVkQ)(mh9b) | +| vip_m7 | 82.75 | 96.05 | 55.3M | 16.4G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/11lvT2OXW0CVGPZdF9dNjY_uaEIMYrmNu/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1j3V0Q40iSqOY15bTKlFFRw)(hvm8) | +| vip_l7 | 83.18 | 96.37 | 87.8M | 24.5G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1bK08JorLPMjYUep_TnFPKGs0e1j0UBKJ/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1I5hnv3wHWEaG3vpDqaNL-w)(tjvh) | +| | | | | | | | | | +| xcit_nano_12_p16_224_dist | 72.32 | 90.86 | 0.6G | 3.1M | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/14FsYtm48JB-rQFF9CanJsJaPESniWD7q/view?usp=sharing)/[baidu](https://pan.baidu.com/s/15kdY4vzwU2QiBSU5127AYA)(7qvz) | +| xcit_nano_12_p16_384_dist | 75.46 | 92.70 | 1.6G | 3.1M | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1zR-hFQryocF9muG-erzcxFuJme5y_e9f/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1449qtQzEMg6lqdtClyiCRQ)(1y2j) | +| xcit_large_24_p16_224_dist | 84.92 | 97.13 | 35.9G | 189.1M | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1lAtko_KwOagjwaFvUkeXirVClXCV8gt-/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1Gs401mXqG1bifi1hBdXtig)(kfv8) | +| xcit_large_24_p16_384_dist | 85.76 | 97.54 | 105.5G | 189.1M | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/15djnKz_-eooncvyZp_UTwOiHIm1Hxo_G/view?usp=sharing)/[baidu](https://pan.baidu.com/s/14583hbtIVbZ_2ifZepQItQ)(ffq3) | +| xcit_nano_12_p8_224_dist | 76.33 | 93.10 | 2.2G | 3.0M | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1XxRNjskLvSVp6lvhlsnylq6g7vd_5MsI/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1DZJxuahFJyz-rEEsCqhhrA)(jjs7) | +| xcit_nano_12_p8_384_dist | 77.82 | 94.04 | 6.3G | 3.0M | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1P3ln8JqLzMKbJAhCanRbu7i5NMPVFNec/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1ECY9-PVDMNSup8NMQiqBrw)(dmc1) | +| xcit_large_24_p8_224_dist | 85.40 | 97.40 | 141.4G | 188.9M | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/14ZoDxEez5NKVNAsbgjTPisjOQEAA30Wy/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1D_zyvjzIVFp6iqx1s7IEbA)(y7gw) | +| xcit_large_24_p8_384_dist | 85.99 | 97.69 | 415.5G | 188.9M | 384 | 1.0 | bicubic | 
[google](https://drive.google.com/file/d/1stcUwwFNJ38mdaFsNXq24CBMmDenJ_e4/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1lwbBk7GFuqnnP_iU2OuDRw)(9xww) | +| | | | | | | | | | +| pit_ti | 72.91 | 91.40 | 4.8M | 0.5G | 224 | 0.9 | bicubic |[google](https://drive.google.com/file/d/1bbeqzlR_CFB8CAyTUN52p2q6ii8rt0AW/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1Yrq5Q16MolPYHQsT_9P1mw)(ydmi) | +| pit_ti_distill | 74.54 | 92.10 | 5.1M | 0.5G | 224 | 0.9 | bicubic |[google](https://drive.google.com/file/d/1m4L0OVI0sYh8vCv37WhqCumRSHJaizqX/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1RIM9NGq6pwfNN7GJ5WZg2w)(7k4s) | +| pit_xs | 78.18 | 94.16 | 10.5M | 1.1G | 224 | 0.9 | bicubic |[google](https://drive.google.com/file/d/1qoMQ-pmqLRQmvAwZurIbpvgMK8MOEgqJ/view?usp=sharing)/[baidu](https://pan.baidu.com/s/15d7ep05vI2UoKvL09Zf_wg)(gytu) | +| pit_xs_distill | 79.31 | 94.36 | 10.9M | 1.1G | 224 | 0.9 | bicubic |[google](https://drive.google.com/file/d/1EfHOIiTJOR-nRWE5AsnJMsPCncPHEgl8/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1DqlgVF7U5qHfGD3QJAad4A)(ie7s) | +| pit_s | 81.08 | 95.33 | 23.4M | 2.4G | 224 | 0.9 | bicubic |[google](https://drive.google.com/file/d/1TDSybTrwQpcFf9PgCIhGX1t-f_oak66W/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1Vk-W1INskQq7J5Qs4yphCg)(kt1n) | +| pit_s_distill | 81.99 | 95.79 | 24.0M | 2.5G | 224 | 0.9 | bicubic |[google](https://drive.google.com/file/d/1U3VPP6We1vIaX-M3sZuHmFhCQBI9g_dL/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1L7rdWmMW8tiGkduqmak9Fw)(hhyc) | +| pit_b | 82.44 | 95.71 | 73.5M | 10.6G | 224 | 0.9 | bicubic |[google](https://drive.google.com/file/d/1-NBZ9-83nZ52jQ4DNZAIj8Xv6oh54nx-/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1XRDPY4OxFlDfl8RMQ56rEg)(uh2v) | +| pit_b_distill | 84.14 | 96.86 | 74.5M | 10.7G | 224 | 0.9 | bicubic |[google](https://drive.google.com/file/d/12Yi4eWDQxArhgQb96RXkNWjRoCsDyNo9/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1vJOUGXPtvC0abg-jnS4Krw)(3e6g) | +| | | | | | | | | | +| halonet26t | 79.10 | 94.31 | 12.5M | 3.2G | 256 | 0.95 | bicubic |[google](https://drive.google.com/file/d/1F_a1brftXXnPM39c30NYe32La9YZQ0mW/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1FSlSTuYMpwPJpi4Yz2nCTA)(ednv) | +| halonet50ts | 81.65 | 95.61 | 22.8M | 5.1G | 256 | 0.94 | bicubic |[google](https://drive.google.com/file/d/12t85kJcPA377XePw6smch--ELMBo6p0Y/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1X4LM-sqoTKG7CrM5BNjcdA)(3j9e) | +| | | | | | | | | | +| poolformer_s12 | 77.24 | 93.51 | 11.9M | 1.8G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/15EBfTTU6coLCsDNiLgAWYiWeMpp3uYH4/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1n6TUxQGlssTu4lyLrBOXEw)(zcv4) | +| poolformer_s24 | 80.33 | 95.05 | 21.3M | 3.4G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1JxqJluDpp1wwe7XtpTi1aWaVvlq0Q3xF/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1d2uyHB5R6ZWPzXWhdtm6fw)(nedr) | +| poolformer_s36 | 81.43 | 95.45 | 30.8M | 5.0G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1ka3VeupDRFBSzzrcw4wHXKGqoKv6sB_Y/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1de6ZJkmYEmVI7zKUCMB_xw)(fvpm) | +| poolformer_m36 | 82.11 | 95.69 | 56.1M | 8.9G | 224 | 0.95 | bicubic | [google](https://drive.google.com/file/d/1LTZ8wNRb_GSrJ9H3qt5-iGiGlwa4dGAK/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1qNTYLw4vyuoH1EKDXEcSvw)(whfp) | +| poolformer_m48 | 82.46 | 95.96 | 73.4M | 11.8G | 224 | 0.95 | bicubic | 
[google](https://drive.google.com/file/d/1YhXEVjWtI4bZB_Qwama8G4RBanq2K15L/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1VJXANTseTUEA0E6HYf-XyA)(374f) | +| | | | | | | | | | +| botnet50 | 77.38 | 93.56 | 20.9M | 5.3G | 224 | 0.875 | bicubic |[google](https://drive.google.com/file/d/1S4nxgRkElT3K4lMx2JclPevmP3YUHNLw/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1CW40ShBJQYeFgdBIZZLSjg)(wh13) +| | | | | | | | | | +| CvT-13-224 | 81.59 | 95.67 | 20M | 4.5G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1r0fnHn1bRPmN0mi8RwAPXmD4utDyOxEf/view?usp=sharing)/[baidu](https://pan.baidu.com/s/13xNwCGpdJ5MVUi369OGl5Q)(vev9) | +| CvT-21-224 | 82.46 | 96.00 | 32M | 7.1G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/18s7nRfvcmNdbRuEpTQe02AQE3Y9UWVQC/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1mOjbMNoQb7X3VJD3LV0Hhg)(t2rv) | +| CvT-13-384 | 83.00 | 96.36 | 20M | 16.3G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1J0YYPUsiXSqyExBPtOPrOLL9c16syllg/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1upITRr5lNHLjbBJtIr-jdg)(wswt) | +| CvT-21-384 | 83.27 | 96.16 | 32M | 24.9G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1tpXv_yYXtvyArlYi7AFcHUOqemhyMWHW/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1hXKi3Kb7mNxPFVmR6cdkMg)(hcem) | +| CvT-13-384-22k | 83.26 | 97.09 | 20M | 16.3G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/18djrvq422u1pGLPxNfWAp6d17F7C5lbP/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1YYv5rKPmroxKCnzkesUr0g)(c7m9) | +| CvT-21-384-22k | 84.91 | 97.62 | 32M | 24.9G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1NVXd7vxVoRpL-21GN7nGn0-Ut0L0Owp8/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1N3xNU6XFHb1CdEOrnjKuoA)(9jxe) | +| CvT-w24-384-22k | 87.58 | 98.47 | 277M | 193.2G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1M3bg46N4SGtupK8FcvAOE0jltOwP5yja/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1MNJurm8juHRGG9SAw3IOkg)(bbj2) | +| | | | | | | | | | +| HVT-Ti-1 | 69.45 | 89.28 | 5.7M | 0.6G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/11BW-qLBMu_1TDAavlrAbfVlXB53dgm42/view?usp=sharing)/[baidu](https://pan.baidu.com/s/16rZvJqL-UVuWFsCDuxFDqg?pwd=egds)(egds) | +| HVT-S-0 | 80.30 | 95.15 | 22.0M | 4.6G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1GlJ2j2QVFye1tAQoUJlgKTR_KELq3mSa/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1L-tjDxkQx00jg7BsDClabA?pwd=hj7a)(hj7a) | +| HVT-S-1 | 78.06 | 93.84 | 22.1M | 2.4G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/16H33zNIpNrHBP1YhCq4zmLjRYQJ0XEmX/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1quOsgVuxTcauISQ3SehysQ?pwd=tva8)(tva8) | +| HVT-S-2 | 77.41 | 93.48 | 22.1M | 1.9G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1U14LA7SXJtFep_SdUCjAV-cDOQ9A_OFk/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1nooWTBzaXyBtEgadn9VDmw?pwd=bajp)(bajp) | +| HVT-S-3 | 76.30 | 92.88 | 22.1M | 1.6G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1m1CjOcZfPMLDRyX4QBgMhHV1m6rtu44v/view?usp=sharing)/[baidu](https://pan.baidu.com/s/15sAOmQN6Hx0GLelYDuMQXw?pwd=rjch)(rjch) | +| HVT-S-4 | 75.21 | 92.34 | 22.1M | 1.6G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/14comGo9lO12dUeGGL52MuIJWZPSit7I0/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1o31hMRWR7FTCjUk7_fAOgA?pwd=ki4j)(ki4j) | +| | | | | | | | | | +| | | | | | | | | | +| mlp_mixer_b16_224 | 76.60 | 92.23 | 60.0M | 12.7G | 224 
| 0.875 | bicubic | [google](https://drive.google.com/file/d/1ZcQEH92sEPvYuDc6eYZgssK5UjYomzUD/view?usp=sharing)/[baidu](https://pan.baidu.com/s/12nZaWGMOXwrCMOIBfUuUMA)(xh8x) | +| mlp_mixer_l16_224 | 72.06 | 87.67 | 208.2M | 44.9G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1mkmvqo5K7JuvqGm92a-AdycXIcsv1rdg/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1AmSVpwCaGR9Vjsj_boL7GA)(8q7r) | +| | | | | | | | | | +| resmlp_24_224 | 79.38 | 94.55 | 30.0M | 6.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/15A5q1XSXBz-y1AcXhy_XaDymLLj2s2Tn/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1nLAvyG53REdwYNCLmp4yBA)(jdcx) | +| resmlp_36_224 | 79.77 | 94.89 | 44.7M | 9.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1WrhVm-7EKnLmPU18Xm0C7uIqrg-RwqZL/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1QD4EWmM9b2u1r8LsnV6rUA)(33w3) | +| resmlp_big_24_224 | 81.04 | 95.02 | 129.1M | 100.7G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1KLlFuzYb17tC5Mmue3dfyr2L_q4xHTZi/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1oXU6CR0z7O0XNwu_UdZv_w)(r9kb) | +| resmlp_12_distilled_224 | 77.95 | 93.56 | 15.3M | 3.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1cDMpAtCB0pPv6F-VUwvgwAaYtmP8IfRw/view?usp=sharing)/[baidu](https://pan.baidu.com/s/15kJeZ_V1MMjTX9f1DBCgnw)(ghyp) | +| resmlp_24_distilled_224 | 80.76 | 95.22 | 30.0M | 6.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/15d892ExqR1sIAjEn-cWGlljX54C3vihA/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1NgQtSwuAwsVVOB8U6N4Aqw)(sxnx) | +| resmlp_36_distilled_224 | 81.15 | 95.48 | 44.7M | 9.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1Laqz1oDg-kPh6eb6bekQqnE0m-JXeiep/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1p1xGOJbMzH_RWEj36ruQiw)(vt85) | +| resmlp_big_24_distilled_224 | 83.59 | 96.65 | 129.1M | 100.7G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/199q0MN_BlQh9-HbB28RdxHj1ApMTHow-/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1yUrfbqW8vLODDiRV5WWkhQ)(4jk5) | +| resmlp_big_24_22k_224 | 84.40 | 97.11 | 129.1M | 100.7G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1zATKq1ruAI_kX49iqJOl-qomjm9il1LC/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1VrnRMbzzZBmLiR45YwICmA)(ve7i) | +| | | | | | | | | | +| gmlp_s16_224 | 79.64 | 94.63 | 19.4M | 4.5G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1TLypFly7aW0oXzEHfeDSz2Va4RHPRqe5/view?usp=sharing)/[baidu](https://pan.baidu.com/s/13UUz1eGIKyqyhtwedKLUMA)(bcth) | +| | | | | | | | | | +| ff_only_tiny (linear_tiny) | 61.28 | 84.06 | | | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/14bPRCwuY_nT852fBZxb9wzXzbPWNfbCG/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1nNE4Hh1Nrzl7FEiyaZutDA)(mjgd) | +| ff_only_base (linear_base) | 74.82 | 91.71 | | | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1DHUg4oCi41ELazPCvYxCFeShPXE4wU3p/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1l-h6Cq4B8kZRvHKDTzhhUg)(m1jc) | +| | | | | | | | | | +| repmlp_res50_light_224 | 77.01 | 93.46 | 87.1M | 3.3G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/16bCFa-nc_-tPVol-UCczrrDO_bCFf2uM/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1bzmpS6qJJTsOq3SQE7IOyg)(b4fg) | +| | | | | | | | | | +| cyclemlp_b1 | 78.85 | 94.60 | 15.1M | | 224 | 0.9 | bicubic | 
[google](https://drive.google.com/file/d/10WQenRy9lfOJF4xEHc9Mekp4zHRh0mJ_/view?usp=sharing)/[baidu](https://pan.baidu.com/s/11UQp1RkWBsZFOqit_uU80w)(mnbr) | +| cyclemlp_b2 | 81.58 | 95.81 | 26.8M | | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1dtQHCwtxNh9jgiHivN5iYpHe7uKRUjhk/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1Js-Oq5vyiB7oPagn43cn3Q)(jwj9) | +| cyclemlp_b3 | 82.42 | 96.07 | 38.3M | | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/11kMq112tAwVE5llJIepIIixz74AjaJhU/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1b7cau1yPxqATA8X7t2DXkw)(v2fy) | +| cyclemlp_b4 | 82.96 | 96.33 | 51.8M | | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1vwJ0eD9Ic-NvLvCz1zEAmn7RxBMtd_v2/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1P3TlnXRFGWj9nVP5xBGGWQ)(fnqd) | +| cyclemlp_b5 | 83.25 | 96.44 | 75.7M | | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/12_I4cfOBfp7kC0RvmnMXFqrSxww6plRW/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1-Cka1tNqGUQutkAP3VZXzQ)(s55c) | +| | | | | | | | | | +| convmixer_1024_20 | 76.94 | 93.35 | 24.5M | 9.5G | 224 | 0.96 | bicubic | [google](https://drive.google.com/file/d/1R7zUSl6_6NFFdNOe8tTfoR9VYQtGfD7F/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1DgGA3qYu4deH4woAkvjaBw)(qpn9) | +| convmixer_768_32 | 80.16 | 95.08 | 21.2M | 20.8G | 224 | 0.96 | bicubic | [google](https://drive.google.com/file/d/196Lg_Eet-hRj733BYASj22g51wdyaW2a/view?usp=sharing)/[baidu](https://pan.baidu.com/s/17CbRNzY2Sy_Cu7cxNAkWmQ)(m5s5) | +| convmixer_1536_20 | 81.37 | 95.62 | 51.8M | 72.4G | 224 | 0.96 | bicubic | [google](https://drive.google.com/file/d/1-LlAlADiu0SXDQmE34GN2GBhqI-RYRqO/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1R-gSzhzQNfkuZVxsaE4vEw)(xqty) | +| | | | | | | | | | +| convmlp_s | 76.76 | 93.40 | 9.0M | 2.4G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1D8kWVfQxOyyktqDixaZoGXB3wVspzjlc/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1WseHYALFB4Of3Dajmlt45g)(3jz3) | +| convmlp_m | 79.03 | 94.53 | 17.4M | 4.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1TqVlKHq-WRdT9KDoUpW3vNJTIRZvix_m/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1koipCAffG6REUyLYk0rGAQ)(vyp1) | +| convmlp_l | 80.15 | 95.00 | 42.7M | 10.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1KXxYogDh6lD3QGRtFBoX5agfz81RDN3l/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1f1aEeVoySzImI89gkjcaOA)(ne5x) | +| | | | | | | | | | + + + + + + + + + + + + + + + +### 目标检测 ### +| Model | backbone | box_mAP | Model | +|-------|-----------|---------|-------------------------------------------------------------------------------------------------------------------------------------------------------------| +| DETR | ResNet50 | 42.0 | [google](https://drive.google.com/file/d/1ruIKCqfh_MMqzq_F4L2Bv-femDMjS_ix/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1J6lB1mezd6_eVW3jnmohZA)(n5gk) | +| DETR | ResNet101 | 43.5 | [google](https://drive.google.com/file/d/11HCyDJKZLX33_fRGp4bCg1I14vrIKYW5/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1_msuuAwFMNbAlMpgUq89Og)(bxz2) | +| Mask R-CNN | Swin-T 1x | 43.7 | [google](https://drive.google.com/file/d/1OpbCH5HuIlxwakNz4PzrAlJF3CxkLSYp/view?usp=sharing)/[baidu](https://pan.baidu.com/s/18HALSo2RHMBsX-Gbsi-YOw)(qev7) | +| Mask R-CNN | Swin-T 3x | 46.0 | [google](https://drive.google.com/file/d/1oREwIk1ORhSsJcs4Y-Cfd0XrSEfPFP3-/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1tw607oogDWQ7Iz91ItfuGQ)(m8fg) | +| Mask R-CNN 
| Swin-S 3x | 48.4 | [google](https://drive.google.com/file/d/1ZPWkz0zMzHJycHd6_s2hWDHIsW8SdZcK/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1ubC5_CKSq0ExQSINohukVg)(hdw5) | +| Mask R-CNN | pvtv2_b0 | 38.3 | [google](https://drive.google.com/file/d/1wA324LkFtGezHJovSZ4luVqSxVt9woFc/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1q67ZIDSHn9Y-HU_WoQr8OQ)(3kqb) | +| Mask R-CNN | pvtv2_b1 | 41.8 | [google](https://drive.google.com/file/d/1alNaSmR4TSXsPpGoUZr2QQf5phYQjIzN/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1aSkuDiNpxdnFWE1Wn1SWNw)(k5aq) | +| Mask R-CNN | pvtv2_b2 | 45.2 | [google](https://drive.google.com/file/d/1tg6B5OEV4OWLsDxTCjsWgxgaSgIh4cID/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1DLwxCZVZizb5HKih7RFw2w)(jh8b) | +| Mask R-CNN | pvtv2_b2_linear | 44.1 | [google](https://drive.google.com/file/d/1b26vxK3QVGx5ovqKir77NyY6YPgAWAEj/view?usp=sharing)/[baidu](https://pan.baidu.com/s/16T-Nyo_Jm2yDq4aoXpdnbg)(8ipt) | +| Mask R-CNN | pvtv2_b3 | 46.9 | [google](https://drive.google.com/file/d/1H6ZUCixCaYe1AvlBkuqYoxzz4b-icJ3u/view?usp=sharing)/[baidu](https://pan.baidu.com/s/16QVsjUOXijo5d9cO3FZ39A)(je4y) | +| Mask R-CNN | pvtv2_b4 | 47.5 | [google](https://drive.google.com/file/d/1pXQNpn0BoKqiuVaGtJL18eWG6XmdlBOL/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1yhX7mpmb2wbRvWZFnUloBQ)(n3ay) | +| Mask R-CNN | pvtv2_b5 | 47.4 | [google](https://drive.google.com/file/d/12vOyw6pUfK1NdOWBF758aAZuaf-rZLvx/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1-gasQk9PqLMkrWXw4aX41g)(jzq1) | + +### 目标分割 ### +#### Pascal Context #### +|Model | Backbone | Batch_size | mIoU (ss) | mIoU (ms+flip) | Backbone_checkpoint | Model_checkpoint | ConfigFile | +|-----------|-----------|------------|-----------|----------------|-----------------------------------------------|-----------------------------------------------------------------------|------------| +|SETR_Naive | ViT_large | 16 | 52.06 | 52.57 | [google](https://drive.google.com/file/d/1TPgh7Po6ayYb1DksJeZp60LGnNyznr-r/view?usp=sharing)/[baidu](https://pan.baidu.com/s/18WSi8Jp3tCZgv_Vr3V1i7A)(owoj) | [google](https://drive.google.com/file/d/1AUyBLeoAcMH0P_QGer8tdeU44muTUOCA/view?usp=sharing)/[baidu](https://pan.baidu.com/s/11XgmgYG071n_9fSGUcPpDQ)(xdb8) | [config](semantic_segmentation/configs/setr/SETR_Naive_Large_480x480_80k_pascal_context_bs_16.yaml) | +|SETR_PUP | ViT_large | 16 | 53.90 | 54.53 | [google](https://drive.google.com/file/d/1TPgh7Po6ayYb1DksJeZp60LGnNyznr-r/view?usp=sharing)/[baidu](https://pan.baidu.com/s/18WSi8Jp3tCZgv_Vr3V1i7A)(owoj) | [google](https://drive.google.com/file/d/1IY-yBIrDPg5CigQ18-X2AX6Oq3rvWeXL/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1v6ll68fDNCuXUIJT2Cxo-A)(6sji) | [config](semantic_segmentation/configs/setr/SETR_PUP_Large_480x480_80k_pascal_context_bs_16.yaml) | +|SETR_MLA | ViT_Large | 8 | 54.39 | 55.16 | [google](https://drive.google.com/file/d/1TPgh7Po6ayYb1DksJeZp60LGnNyznr-r/view?usp=sharing)/[baidu](https://pan.baidu.com/s/18WSi8Jp3tCZgv_Vr3V1i7A)(owoj) | [google](https://drive.google.com/file/d/1utU2h0TrtuGzRX5RMGroudiDcz0z6UmV/view)/[baidu](https://pan.baidu.com/s/1Eg0eyUQXc-Mg5fg0T3RADA)(wora)| [config](semantic_segmentation/configs/setr/SETR_MLA_Large_480x480_80k_pascal_context_bs_8.yaml) | +|SETR_MLA | ViT_large | 16 | 55.01 | 55.87 | [google](https://drive.google.com/file/d/1TPgh7Po6ayYb1DksJeZp60LGnNyznr-r/view?usp=sharing)/[baidu](https://pan.baidu.com/s/18WSi8Jp3tCZgv_Vr3V1i7A)(owoj) | 
[google](https://drive.google.com/file/d/1SOXB7sAyysNhI8szaBqtF8ZoxSaPNvtl/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1jskpqYbazKY1CKK3iVxAYA)(76h2) | [config](semantic_segmentation/configs/setr/SETR_MLA_Large_480x480_80k_pascal_context_bs_16.yaml) | + +#### Cityscapes #### +|Model | Backbone | Batch_size | Iteration | mIoU (ss) | mIoU (ms+flip) | Backbone_checkpoint | Model_checkpoint | ConfigFile | +|-----------|-----------|------------|-----------|-----------|----------------|-----------------------------------------------|-----------------------------------------------------------------------|------------| +|SETR_Naive | ViT_Large | 8 | 40k | 76.71 | 79.03 | [google](https://drive.google.com/file/d/1TPgh7Po6ayYb1DksJeZp60LGnNyznr-r/view?usp=sharing)/[baidu](https://pan.baidu.com/s/18WSi8Jp3tCZgv_Vr3V1i7A)(owoj) | [google](https://drive.google.com/file/d/1QialLNMmvWW8oi7uAHhJZI3HSOavV4qj/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1F3IB31QVlsohqW8cRNphqw)(g7ro) | [config](semantic_segmentation/configs/setr/SETR_Naive_Large_768x768_40k_cityscapes_bs_8.yaml)| +|SETR_Naive | ViT_Large | 8 | 80k | 77.31 | 79.43 | [google](https://drive.google.com/file/d/1TPgh7Po6ayYb1DksJeZp60LGnNyznr-r/view?usp=sharing)/[baidu](https://pan.baidu.com/s/18WSi8Jp3tCZgv_Vr3V1i7A)(owoj) | [google](https://drive.google.com/file/d/1RJeSGoDaOP-fM4p1_5CJxS5ku_yDXXLV/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1XbHPBfaHS56HlaMJmdJf1A)(wn6q) | [config](semantic_segmentation/configs/setr/SETR_Naive_Large_768x768_80k_cityscapes_bs_8.yaml)| +|SETR_PUP | ViT_Large | 8 | 40k | 77.92 | 79.63 | [google](https://drive.google.com/file/d/1TPgh7Po6ayYb1DksJeZp60LGnNyznr-r/view?usp=sharing)/[baidu](https://pan.baidu.com/s/18WSi8Jp3tCZgv_Vr3V1i7A)(owoj) | [google](https://drive.google.com/file/d/12rMFMOaOYSsWd3f1hkrqRc1ThNT8K8NG/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1H8b3valvQ2oLU9ZohZl_6Q)(zmoi) | [config](semantic_segmentation/configs/setr/SETR_PUP_Large_768x768_40k_cityscapes_bs_8.yaml)| +|SETR_PUP | ViT_Large | 8 | 80k | 78.81 | 80.43 | [google](https://drive.google.com/file/d/1TPgh7Po6ayYb1DksJeZp60LGnNyznr-r/view?usp=sharing)/[baidu](https://pan.baidu.com/s/18WSi8Jp3tCZgv_Vr3V1i7A)(owoj) | [baidu](https://pan.baidu.com/s/1tkMhRzO0XHqKYM0lojE3_g)(f793) | [config](semantic_segmentation/configs/setr/SETR_PUP_Large_768x768_80k_cityscapes_bs_8.yaml)| +|SETR_MLA | ViT_Large | 8 | 40k | 76.70 | 78.96 | [google](https://drive.google.com/file/d/1TPgh7Po6ayYb1DksJeZp60LGnNyznr-r/view?usp=sharing)/[baidu](https://pan.baidu.com/s/18WSi8Jp3tCZgv_Vr3V1i7A)(owoj) | [baidu](https://pan.baidu.com/s/1sUug5cMKSo6mO7BEI4EV_w)(qaiw) | [config](semantic_segmentation/configs/setr/SETR_MLA_Large_768x768_40k_cityscapes_bs_8.yaml)| +|SETR_MLA | ViT_Large | 8 | 80k | 77.26 | 79.27 | [google](https://drive.google.com/file/d/1TPgh7Po6ayYb1DksJeZp60LGnNyznr-r/view?usp=sharing)/[baidu](https://pan.baidu.com/s/18WSi8Jp3tCZgv_Vr3V1i7A)(owoj) | [baidu](https://pan.baidu.com/s/1IqPZ6urdQb_0pbdJW2i3ow)(6bgj) | [config](semantic_segmentation/configs/setr/SETR_MLA_Large_768x768_80k_cityscapes_bs_8.yaml)| + + +#### ADE20K #### +|Model | Backbone | Batch_size | Iteration | mIoU (ss) | mIoU (ms+flip) | Backbone_checkpoint | Model_checkpoint | ConfigFile | +|-----------|-----------|------------|-----------|-----------|----------------|-----------------------------------------------|-----------------------------------------------------------------------|------------| +|SETR_Naive | ViT_Large | 16 | 160k | 47.57 | 48.12 | 
[google](https://drive.google.com/file/d/1TPgh7Po6ayYb1DksJeZp60LGnNyznr-r/view?usp=sharing)/[baidu](https://pan.baidu.com/s/18WSi8Jp3tCZgv_Vr3V1i7A)(owoj) | [baidu](https://pan.baidu.com/s/1_AY6BMluNn71UiMNZbnKqQ)(lugq) | [config](semantic_segmentation/configs/setr/SETR_Naive_Large_512x512_160k_ade20k_bs_16.yaml)| +|SETR_PUP | ViT_Large | 16 | 160k | 49.12 | 49.51 | [google](https://drive.google.com/file/d/1TPgh7Po6ayYb1DksJeZp60LGnNyznr-r/view?usp=sharing)/[baidu](https://pan.baidu.com/s/18WSi8Jp3tCZgv_Vr3V1i7A)(owoj) | [baidu](https://pan.baidu.com/s/1N83rG0EZSksMGZT3njaspg)(udgs) | [config](semantic_segmentation/configs/setr/SETR_PUP_Large_512x512_160k_ade20k_bs_16.yaml)| +|SETR_MLA | ViT_Large | 8 | 160k | 47.80 | 49.34 | [google](https://drive.google.com/file/d/1TPgh7Po6ayYb1DksJeZp60LGnNyznr-r/view?usp=sharing)/[baidu](https://pan.baidu.com/s/18WSi8Jp3tCZgv_Vr3V1i7A)(owoj) | [baidu](https://pan.baidu.com/s/1L83sdXWL4XT02dvH2WFzCA)(mrrv) | [config](semantic_segmentation/configs/setr/SETR_MLA_Large_512x512_160k_ade20k_bs_8.yaml)| +|DPT | ViT_Large | 16 | 160k | 47.21 | - | [google](https://drive.google.com/file/d/1TPgh7Po6ayYb1DksJeZp60LGnNyznr-r/view?usp=sharing)/[baidu](https://pan.baidu.com/s/18WSi8Jp3tCZgv_Vr3V1i7A)(owoj) |[baidu](https://pan.baidu.com/s/1PCSC1Kvcg291gqp6h5pDCg)(ts7h) | [config](semantic_segmentation/configs/dpt/DPT_Large_480x480_160k_ade20k_bs_16.yaml) +|Segmenter | ViT_Tiny | 16 | 160k | 38.45 | - | TODO |[baidu](https://pan.baidu.com/s/1nZptBc-IY_3PFramXSlovQ)(1k97) | [config](semantic_segmentation/configs/segmenter/segmenter_Tiny_512x512_160k_ade20k_bs_16.yaml) +|Segmenter | ViT_Small | 16 | 160k | 46.07 | - | TODO |[baidu](https://pan.baidu.com/s/1gKE-GEu7gX6dJsgtlvrmWg)(i8nv) | [config](semantic_segmentation/configs/segmenter/segmenter_small_512x512_160k_ade20k_bs_16.yaml) +|Segmenter | ViT_Base | 16 | 160k | 49.08 | - | TODO |[baidu](https://pan.baidu.com/s/1qb7HEtKW0kBSP6iv-r_Hjg)(hxrl) | [config](semantic_segmentation/configs/segmenter/segmenter_Base_512x512_160k_ade20k_bs_16.yaml) | +|Segmenter | ViT_Large | 16 | 160k | 51.82 | - | TODO |[baidu](https://pan.baidu.com/s/121FOwpsYue7Z2Rg3ZlxnKg)(wdz6) | [config](semantic_segmentation/configs/segmenter/segmenter_Tiny_512x512_160k_ade20k_bs_16.yaml) +|Segmenter_Linear | DeiT_Base | 16 | 160k | 47.34 | - | TODO |[baidu](https://pan.baidu.com/s/1Hk_zcXUIt_h5sKiAjG2Pog)(5dpv) | [config](semantic_segmentation/configs/segmenter/segmenter_Base_distilled_512x512_160k_ade20k_bs_16.yaml) +|Segmenter | DeiT_Base | 16 | 160k | 49.27 | - | TODO |[baidu](https://pan.baidu.com/s/1-TBUuvcBKNgetSJr0CsAHA)(3kim) | [config](semantic_segmentation/configs/segmenter/segmenter_Base_distilled_512x512_160k_ade20k_bs_16.yaml) | +|Segformer | MIT-B0 | 16 | 160k | 38.37 | - | TODO |[baidu](https://pan.baidu.com/s/1WOD9jGjQRLnwKrRYzgBong)(ges9) | [config](semantic_segmentation/configs/segformer/segformer_mit-b0_512x512_160k_ade20k.yaml) | +|Segformer | MIT-B1 | 16 | 160k | 42.20 | - | TODO |[baidu](https://pan.baidu.com/s/1aiSBXMd8nP82XK7sSZ05gg)(t4n4) | [config](semantic_segmentation/configs/segmenter/segformer_mit-b1_512x512_160k_ade20k.yaml) | +|Segformer | MIT-B2 | 16 | 160k | 46.38 | - | TODO |[baidu](https://pan.baidu.com/s/1wFFh-K5t46YktkfoWUOTAg)(h5ar) | [config](semantic_segmentation/configs/segmenter/segformer_mit-b2_512x512_160k_ade20k.yaml) | +|Segformer | MIT-B3 | 16 | 160k | 48.35 | - | TODO |[baidu](https://pan.baidu.com/s/1IwBnDeLNyKgs-xjhlaB9ug)(g9n4) | 
[config](semantic_segmentation/configs/segmenter/segformer_mit-b3_512x512_160k_ade20k.yaml) | +|Segformer | MIT-B4 | 16 | 160k | 49.01 | - | TODO |[baidu](https://pan.baidu.com/s/1a25fCVlwJ-1TUh9HQfx7YA)(e4xw) | [config](semantic_segmentation/configs/segmenter/segformer_mit-b4_512x512_160k_ade20k.yaml) | +|Segformer | MIT-B5 | 16 | 160k | 49.73 | - | TODO |[baidu](https://pan.baidu.com/s/15kXXxKEjjtJv-BmrPnSTOw)(uczo) | [config](semantic_segmentation/configs/segmenter/segformer_mit-b5_512x512_160k_ade20k.yaml) | +| UperNet | Swin_Tiny | 16 | 160k | 44.90 | 45.37 | - |[baidu](https://pan.baidu.com/s/1S8JR4ILw0u4I-DzU4MaeVQ)(lkhg) | [config](semantic_segmentation/configs/upernet_swin/upernet_swin_tiny_patch4_windown7_512x512_160k_ade20k.yaml) | +| UperNet | Swin_Small | 16 | 160k | 47.88 | 48.90 | - |[baidu](https://pan.baidu.com/s/17RKeSpuWqONVptQZ3B4kEA)(vvy1) | [config](semantic_segmentation/configs/upernet_swin/upernet_swin_small_patch4_windown7_512x512_160k_ade20k.yaml) | +| UperNet | Swin_Base | 16 | 160k | 48.59 | 49.04 | - |[baidu](https://pan.baidu.com/s/1bM15KHNsb0oSPblQwhxbgw)(y040) | [config](semantic_segmentation/configs/upernet_swin/upernet_swin_base_patch4_windown7_512x512_160k_ade20k.yaml) | +| UperNet | CSwin_Tiny | 16 | 160k | 49.46 | |[baidu](https://pan.baidu.com/s/1ol_gykZjgAFbJ3PkqQ2j0Q)(l1cp) | [baidu](https://pan.baidu.com/s/1gLePNLybtrax9yCQ2fcIPg)(y1eq) | [config](seman}tic_segmentation/configs/upernet_cswin/upernet_cswin_tiny_patch4_512x512_160k_ade20k.yaml) | +| UperNet | CSwin_Small | 16 | 160k | 50.88 | | [baidu](https://pan.baidu.com/s/1mSd_JdNS4DtyVNYxqVobBw)(6vwk) | [baidu](https://pan.baidu.com/s/1a_vhHoib0-BcRwTnnSVGWA)(fz2e) | [config](semantic_segmentation/configs/upernet_cswin/upernet_cswin_small_patch4_512x512_160k_ade20k.yaml) | +| UperNet | CSwin_Base | 16 | 160k | 50.64 | | [baidu](https://pan.baidu.com/s/1suO0jX_Tw56CVm3UhByOWg)(0ys7) | [baidu](https://pan.baidu.com/s/1Ym-RUooqizgUDEm5jWyrhA)(83w3) | [config](semantic_segmentation/configs/upernet_cswin/upernet_cswin_base_patch4_512x512_160k_ade20k.yaml) | + +#### Trans10kV2 #### +|Model | Backbone | Batch_size | Iteration | mIoU (ss) | mIoU (ms+flip) | Backbone_checkpoint | Model_checkpoint | ConfigFile | +|-----------|-----------|------------|-----------|-----------|----------------|-----------------------------------------------|-----------------------------------------------------------------------|------------| +|Trans2seg_Medium | Resnet50c | 16 | 16k | 75.97 | - | [google](https://drive.google.com/file/d/1C6nMg6DgQ73wzF21UwDVxmkcRTeKngnK/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1hs0tbSGIeMLLGMq05NN--w)(4dd5) | [google](https://drive.google.com/file/d/1C6nMg6DgQ73wzF21UwDVxmkcRTeKngnK/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1wdOUD6S8QGqD6S-98Yb37w)(w25r) | [config](semantic_segmentation/configs/trans2seg/Trans2Seg_medium_512x512_16k_trans10kv2_bs_16.yaml)| + +### GAN ### +| Model | FID | Image Size | Crop_pct | Interpolation | Model | +|--------------------------------|-----|------------|----------|---------------|--------------| +| styleformer_cifar10 |2.73 | 32 | 1.0 | lanczos |[google](https://drive.google.com/file/d/1iW76QmwbYz6GeAPQn8vKvsG0GvFdhV4T/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1Ax7BNEr1T19vgVjXG3rW7g)(ztky) | +| styleformer_stl10 |15.65| 48 | 1.0 | lanczos |[google](https://drive.google.com/file/d/15p785y9eP1TeoqUcHPbwFPh98WNof7nw/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1rSORxMYAiGkLQZ4zTA2jcg)(i973)| +| styleformer_celeba |3.32 | 64 | 1.0 
| lanczos |[google](https://drive.google.com/file/d/1_YauwZN1osvINCboVk2VJMscrf-8KlQc/view?usp=sharing)/[baidu](https://pan.baidu.com/s/16NetcPxLQF9C_Zlp1SpkLw)(fh5s) |
+| styleformer_lsun | 9.68 | 128 | 1.0 | lanczos |[google](https://drive.google.com/file/d/1i5kNzWK04ippFSmrmcAPMItkO0OFukTd/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1jTS9ExAMz5H2lhue4NMV2A)(158t)|
+> *使用**fid50k_full**指标在 Cifar10, STL10, Celeba 以及 LSUNchurch 数据集上评估结果.
+
+
+## 图像分类的快速示例
+如果需要使用模型预训练权重,请先转到对应子文件夹,例如 `/image_classification/ViT/`, 然后下载 `.pdparams` 权重文件并在python脚本中更改相关文件路径。模型的配置文件位于 `./configs/`.
+
+假设下载的预训练权重文件存储在`./vit_base_patch16_224.pdparams`, 在python中使用`vit_base_patch16_224`模型:
+```python
+import paddle
+from config import get_config
+from visual_transformer import build_vit as build_model
+# config files in ./configs/
+config = get_config('./configs/vit_base_patch16_224.yaml')
+# build model
+model = build_model(config)
+# load pretrained weights
+model_state_dict = paddle.load('./vit_base_patch16_224.pdparams')
+model.set_dict(model_state_dict)
+```
+> :robot: 详细用法请参见每个模型对应文件夹中的README文件.
+
+
+### 评估 ###
+如果要在单GPU上评估ViT模型在ImageNet2012数据集上的性能,请在命令行运行以下脚本:
+```shell
+sh run_eval.sh
+```
+or
+```shell
+CUDA_VISIBLE_DEVICES=0 \
+python main_single_gpu.py \
+    -cfg=./configs/vit_base_patch16_224.yaml \
+    -dataset=imagenet2012 \
+    -batch_size=16 \
+    -data_path=/path/to/dataset/imagenet/val \
+    -eval \
+    -pretrained=/path/to/pretrained/model/vit_base_patch16_224 # .pdparams is NOT needed
+```
+
+ + +使用多GPU运行评估 + + + +```shell +sh run_eval_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg=./configs/vit_base_patch16_224.yaml \ + -dataset=imagenet2012 \ + -batch_size=16 \ + -data_path=/path/to/dataset/imagenet/val \ + -eval \ + -pretrained=/path/to/pretrained/model/vit_base_patch16_224 # .pdparams is NOT needed +``` + +
+ + +### 训练 ### +如果使用单GPU在ImageNet2012数据集训练ViT模型,请使用命令行运行以下脚本: +```shell +sh run_train.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg=./configs/vit_base_patch16_224.yaml \ + -dataset=imagenet2012 \ + -batch_size=32 \ + -data_path=/path/to/dataset/imagenet/train +``` + + +
+ + +使用多GPU运行训练: + + + +```shell +sh run_train_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg=./configs/vit_base_patch16_224.yaml \ + -dataset=imagenet2012 \ + -batch_size=16 \ + -data_path=/path/to/dataset/imagenet/train +``` + +
+ + + +## 贡献 ## +* 我们鼓励并感谢您对 **PaddleViT** 项目的贡献, 请查看[CONTRIBUTING.md](./CONTRIBUTING.md)以参考我们的工作流程和代码风格. + + +## 许可 ## +* 此 repo 遵循 Apache-2.0 许可. + +## 联系 ## +如果您有任何问题, 请在我们的Github上创建一个[issue](https://github.com/BR-IDL/PaddleViT/issues). + diff --git a/docs/coco_dataset_1.png b/docs/coco_dataset_1.png new file mode 100644 index 00000000..5960b265 Binary files /dev/null and b/docs/coco_dataset_1.png differ diff --git a/docs/coco_dataset_2.png b/docs/coco_dataset_2.png new file mode 100644 index 00000000..da0891a7 Binary files /dev/null and b/docs/coco_dataset_2.png differ diff --git a/docs/paddlevit-amp-cn.md b/docs/paddlevit-amp-cn.md new file mode 100644 index 00000000..332643f6 --- /dev/null +++ b/docs/paddlevit-amp-cn.md @@ -0,0 +1,45 @@ +简体中文 | [English](./paddlevit-amp.md) + +# PaddleViT:如何使用自动混合精度(AMP)训练 ? + +## Introduction: +PaddleViT对于单gpu和多gpu均支持AMP训练。简而言之,自动混合精度(AMP)训练是指在训练模型期间同时使用全精度(FP32)和半精度(FP16)的过程。目的在于保持准确性的同时也能加快训练速度。关于Paddle AMP的更多教程可以参考[here](https://www.paddlepaddle.org.cn/documentation/docs/en/guides/01_paddle2.0_introduction/basic_concept/amp_en.html)以及NVIDIA官方网站[here](https://developer.nvidia.com/automatic-mixed-precision). + +> 注意: 只有 Nvidia Ampere, Volta 以及 Turing GPUs 支持 FP16 计算. + + +## PaddleViT AMP training: +PaddleViT提供了面向视觉任务的amp训练的简单易实现的方式。例如,类似于图像分类任务中的标准训练脚本,添加输入参数 `-amp` 即可切换到amp训练模式。 + +对于 single-GPU 训练: +```shell +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ +-cfg='./configs/vit_base_patch16_224.yaml' \ +-dataset='imagenet2012' \ +-batch_size=64 \ +-data_path='/dataset/imagenet' \ +-amp +``` + +对于 multi-GPU 训练: + +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ +python main_multi_gpu.py \ +-cfg='./configs/vit_base_patch16_224.yaml' \ +-dataset='imagenet2012' \ +-batch_size=64 \ +-data_path='/dataset/imagenet' \ +-amp +``` + +## Benchmark +我们在单GPU (Nvidia V100)上分别测试使用 `amp`和不使用 `amp`两种情况下的ViT基础模型的训练速度,结果如下表所示: + +| | 1st | 2nd | 3rd | 4th | 5th | 6th | 7th | Average | Speedup | +|---------|-----|-----|-----|-----|-----|-----|-----|---------|---------| +| AMP off | 78s | 78s | 79s | 78s | 78s | 79s | 78s | 78.29s | - | +| AMP on | 42s | 41s | 41s | 41s | 41s | 41s | 41s | 41.14s | 1.903 | + +在上表中,每一项代表100次训练迭代的训练时间(以秒为单位)。可以看出,`amp off` 和 `amp on` 的平均训练时间分别为 78.29 s/100iter 和 41.14 s/100iter,训练速度提升约**1.9**倍。 diff --git a/docs/paddlevit-amp.md b/docs/paddlevit-amp.md new file mode 100644 index 00000000..518c7b9f --- /dev/null +++ b/docs/paddlevit-amp.md @@ -0,0 +1,43 @@ +English | [简体中文](./paddlevit-amp-cn.md) + +# PaddleViT: How to use Automatic Mixed Precision (AMP) Training? + +## Introduction: +PaddleViT supports AMP training for both single-gpu and multi-gpu settings. Briefly, automatic mixed precision (AMP) training is the process of using both full precision (a.k.a. FP32) and half precision (a.k.a FP16) during the model training. The aim is to speed up training while maintaining the accuracy. More information can be found in the Paddle AMP tutorial [here](https://www.paddlepaddle.org.cn/documentation/docs/en/guides/01_paddle2.0_introduction/basic_concept/amp_en.html) and NVIDIA official website [here](https://developer.nvidia.com/automatic-mixed-precision). + +> Note: only Nvidia Ampere, Volta and Turing GPUs are supported FP16 computing. + +## PaddleViT AMP training: +PaddleViT provides very simple implementations to enable amp training for vision tasks. For example, similar to the standard training script in image classification, adding the input argument `-amp` will switch to the amp training mode. 
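+
+Internally, the `-amp` flag essentially wraps the training step in Paddle's AMP context together with a loss scaler; the command-line usage is shown right after this sketch. The snippet below is only a minimal sketch of that pattern (not the exact PaddleViT training loop); `model`, `criterion`, `optimizer` and `dataloader` are placeholders:
+
+```python
+import paddle
+
+# Placeholders: model, criterion, optimizer and dataloader are assumed to exist
+scaler = paddle.amp.GradScaler(init_loss_scaling=1024)
+
+for images, labels in dataloader:
+    with paddle.amp.auto_cast():             # run eligible ops in FP16
+        outputs = model(images)
+        loss = criterion(outputs, labels)
+    scaled_loss = scaler.scale(loss)         # scale loss to avoid FP16 underflow
+    scaled_loss.backward()
+    scaler.minimize(optimizer, scaled_loss)  # unscale grads and update params
+    optimizer.clear_grad()
+```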
+ +For single-GPU training: +```shell +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ +-cfg='./configs/vit_base_patch16_224.yaml' \ +-dataset='imagenet2012' \ +-batch_size=64 \ +-data_path='/dataset/imagenet' \ +-amp +``` + +For multi-GPU training: + +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ +python main_multi_gpu.py \ +-cfg='./configs/vit_base_patch16_224.yaml' \ +-dataset='imagenet2012' \ +-batch_size=64 \ +-data_path='/dataset/imagenet' \ +-amp +``` + +## Benchmark +We test the training speed on a single GPU (Nvidia V100) for the ViT base model with and without `amp`, and the results are shown in the following table: +| | 1st | 2nd | 3rd | 4th | 5th | 6th | 7th | Average | Speedup | +|---------|-----|-----|-----|-----|-----|-----|-----|---------|---------| +| AMP off | 78s | 78s | 79s | 78s | 78s | 79s | 78s | 78.29s | - | +| AMP on | 42s | 41s | 41s | 41s | 41s | 41s | 41s | 41.14s | 1.903 | + +In the above table, each item represents the training time in seconds for 100 training iterations. It can be see that the average training time is 78.29s/100iter, and 41.14s/100iter, for `amp off` and `amp on`, respectively. The training speed is increased by about **1.9** times. \ No newline at end of file diff --git a/docs/paddlevit-coco-cn.md b/docs/paddlevit-coco-cn.md new file mode 100644 index 00000000..c1b2df11 --- /dev/null +++ b/docs/paddlevit-coco-cn.md @@ -0,0 +1,370 @@ +简体中文 | [English](./paddlevit-coco.md) + +# PaddleViT 教程: 用于目标检测的COCO数据集 +[COCO dataset](https://cocodataset.org/#home) 是计算机视觉领域中最流行的数据集之一,用于对各种视觉任务进行基准测试,例如目标检测、分割、关键点检测等。在本教程中,我们介绍了加载和处理COCO数据集以及进行目标检测的详细PaddleViT实现。我们将完成从使用`pycocotools`实现`CocoDataset`,到在`transforms`中应用于训练和评估的增强细节的实现。 + +本教程是开源项目[PaddleViT](../../)的一部分。 + +## Installation +需要安装pycocotools: + +* pycocotools + ```shell + pip install pycocotools + ``` +## Download: +COCO数据集可以在[COCO official website](https://cocodataset.org/#download)下载。 + +请注意,对于目标检测,我们使用自2017年保持不变的`COCO2017`数据集。 + +在数据集中,有`118K`张图像用于训练,`5K`张图像用于验证。下载数据集后,目录中内容如下: + +``` +COCO dataset folder +├── annotations +│ ├── captions_train2017.json +│ ├── captions_val2017.json +│ ├── instances_train2017.json +│ ├── instances_val2017.json +│ ├── person_keypoints_train2017.json +│ └── person_keypoints_val2017.json +├── train2017 +│ ├── 000000000009.jpg +│ ├── 000000000025.jpg +│ ├── 000000000030.jpg +│ ├── 000000000034.jpg +| ... +└── val2017 + ├── 000000000139.jpg + ├── 000000000285.jpg + ├── 000000000632.jpg + ├── 000000000724.jpg + ... +``` +## COCO Annotations: +本节中我们介绍了COCO标注的基础信息,在大多数情况下, COCO API 可以用于帮助我们从复杂的json注释文件中轻松访问数据和标签。更多详情请参考[官方文档](https://cocodataset.org/#format-data)。 + + +例如`instances_train2017.json`的数据结构如下: + +```json +{ + "info": { + "description": "COCO 2017 Dataset", + "url": "http://cocodataset.org", + "version": "1.0", + ... + }, + "licenses": { + { + "url": "http://creativecommons.org/licenses/by-nc-sa/2.0/", + "id": 1, + "name": "Attribution-NonCommercial-ShareAlike License" + }, + ... + }, + "images": [ + { + "license": 4, + "file_name": "000000397133.jpg", + "coco_url": "http://images.cocodataset.org/val2017/000000397133.jpg", + "height": 427, + "width": 640, + "date_captured": "2013-11-14 17:02:52", + "flickr_url": "http://farm7.staticflickr.com/6116/6255196340_da26cf2c9e_z.jpg", + "id": 397133 + }, + ... + ], + "annotations": [ + { + "segmentation": RLE or [polygon], + "area": float, + "iscrowd": 0 or 1, + "image_id": int, + "bbox": [x, y, width, height], + "category_id": int, + "id": int + }, + ... 
+ ], + "categories": [ + { + "supercategory": str, + "id": int, + "name": str + }, + ... + ] +} +``` + +### `images` +`images`字段包含训练集的图像信息,如`filename`, `width`, `height`, 以及 `id`。其中, `id`对于每个图像都是唯一的,用于索引数据集中的图像数据。 + +### `categories` +`categories`字段包含 class/label 名称作为字符串,每一个类别都分配了唯一的类别`id`以便于访问. + +### `annotations` +`annotations`字段包含所有的**object instances**, 每一个实例都标有一系列注释。 + +> 注意:目标实例的数量通常大于图像的水昂,因为一张图像中通常有多个目标。 + +每一个`annotation`都有以下字段: +#### -`id` + * int, 实例 id, 每个注释都有唯一的id. +#### -`image_id` + * int, 用于标识当前目标属于哪一张图像. +#### -`category_id` + * int, 用于识别类别. +#### -`bbox` + * [x, y, width, height], 边界框坐标. + + 格式为 **[box top-left corner x, box top-left corner y, box width, box height]**. 请注意,[0,0]坐标是图像的左上角。 + +#### -`iscrowd` + * 0 or 1, `iscrowd=1` 用于标记一大群人。 +#### -`segmentation` + * `RLE` or `[polygon]`, if `iscrowd=0`, return `[polygon]`. + + `[polygon]`是目标掩码的一组点,用于单个目标。格式为 `[x0, y0, x1, y1, x2, y2, ...]`. + + `RLE(Run Length Encoding)` 用于一组目标, `RLE` 格式为: + ```json + segmentation: + { + "counts": [179, 27, 392 ...], + "size": [ + 426, + 640, + ] + } + ``` + RLE是一种用于表示每个像素属于前景还是背景的编码方式。 `size`存储图像的长度和高度。 `counts`连续存储前景或背景中的像素数量。 + + 例如,我们有以下图像和掩码: + ![img](./coco_dataset_1.png) + RLE编码对属于背景的像素数进行计数(从左上角开始,逐行),直到遇到前景像素,将这个数字存储在`counts`中,然后计算前景像素的数量并存储在`counts`中。 + + ![img](./coco_dataset_2.png) + + + +> 更加直观的视频说明详见[here](https://www.youtube.com/watch?v=h6s61a_pqfM). + +在大多数情况下,我们在为模型创建训练数据集时无需担心注释格式。 `COCO API`为我们提供了一系列的api函数,方便我们获取任务的图像数据和目标标签。 + +## PaddleViT: COCODataset +COCO数据集有一个名为`pycocotools`的 python API,供用户轻松加载和使用COCO数据集进行检测、分割和其他cv任务。 在本节中,我们将基于 `pycocotools` 实现COCO检测数据集的PaddleViT实现,并用于训练和验证。 + + +### `CocoDataset` Class +`CocoDataset` 类由 `paddle.io.Dataset` 类实现, 并需要两个函数 `__getitem__` 与 `__len__` , 即: +```python +class CocoDetection(paddle.io.Dataset): + def __init__(self, image_folder, anno_file, transforms, return_mask): + super().__init__() + ... + def __getitem__(self, idx): + ... + def __len__(self): + ... +``` + +#### `__init__` method +在类的初始化方法中: +1. 通过调用pycocotools api加载coco数据集的anno文件。 +2. 获取图像id并删除没有注释的图像。 +3. 通过init参数设置数据转换(预处理器)。 +4. 定义标签转换方法。(详情见下节) + +```python +from pycocotools.coco import COCO +... +class CocoDataset(): + def __init__(self): + super().__init__() + # step1 + self.coco = COCO(anno_file) + # step2 + ids = list(sorted(self.coco.imgs.keys())) + self.ids = self._remove_images_without_annotations(ids) + # step3 + self._transforms = transforms + # step4 + self.prepare = ConvertCocoPolysToMasks(return_masks) + self.root = img_folder +``` + + +#### `__getitem__` method +`__getitem__`方法将索引作为输入,并输出包含单张图像及其目标标签的`(image, target)` 对。在coco检测中,这个目标是一个类似于以下形式的 `dict` : +``` +target = {'image_id': image_id, 'annotations': target} +``` +> `image_id` 是在coco注释中相同的图像id. + +> `target` 是键值对的字典,例如 `bbox` 和 `mask`. (英文版单词拼写错误) + +`__getitem__` 方法定义: +1. 使用COCO API加载指定的图像及其标签 +2. 转换标签(如将掩码从多边形转换为掩码数组) +3. 输入数据的预处理转换 + +```python +def __getitem__(self, idx): + image_id = self.ids[idx] + image = self._load_image(image_id) + target = self._load_target(image_id) + target = {'image_id': image_id, 'annotations': target} + + image, target = self.prepare(image, target) + if self._transform is not None: + image, target = self._transform(image, target) + return image, target +``` + + + +#### `__len__` method +返回数据集中的样本数,与`ids`长度相同: + +```python +def __len__(self): + return len(self.ids) +``` + + +#### `_load_image`, `_load_target` methods +`PIL.Image` 和 `COCO API` 用于根据给定索引获取图像数据和原始目标标签. 
+```python +def _load_image(self, idx): + """ Return PIL Image (RGB) according to COCO image id""" + path = self.coco.loadImgs(idx)[0]['file_name'] + return Image.open(os.path.join(self.root, path)).convert('RGB') + +def _load_target(self, idx): + """ Return image annos according to COCO image id""" + return self.coco.loadAnns(self.coco.getAnnIds(idx)) +``` + +### `ConvertCocoPolysToMasks` Class +该类定义了以图像和标签为输入并输出图像数组和处理后的标签。 +This class defines class calls that takes image and label as input and outputs image array and processed labels. + +专门对于目标标签的处理: +1. 去掉`iscrowd=1`的图像; +2. 将`[x1, y1, x2, y2]`中的包围框转换为numpy数组类型,然后根据包围框裁剪图像; +3. 将类标签转换为numpy数组; +4. 如果返回掩码(对于分割任务),使用coco api将多边形数据转换为掩码数组; +5. 如果返回关键点(用于关键点检测),则将关键点加载到数组中; +6. 消除面积为0的包围框; +7. 将处理后的标签保存在`target`字典中。 + +> 注意:我们使用numpy数组而不是paddle张量,因为当前paddlepaddle可能会在使用GPU张量的数据压缩中引起错误。 + +详细的实现可以在源代码中找到[here](https://github.com/BR-IDL/PaddleViT/blob/5ba4761845f06f66ba3a89907e0769e1850bcab2/object_detection/DETR/coco.py#L117). + +### `Transforms` Module +在转换模块(`transforms.py`)中定义了多种数据压缩方法。 定义我们自己的模块而不是使用paddle视觉转换的原因是,每个数据变换都必须应用于图像数据集其目标标签,例如bbox和掩码。假设在训练期间对图像数据应用类随机裁剪操作,则该图像中的bbox必需应用相同的裁剪。 + +#### Validation transforms +DETR 的验证转换具有以下操作: +* `RandomResize()`: 将图像和标签调整为具有相同比例的特定大小。 +* `ToTensor()`: 将图像数据转换为 `paddle.Tensor` +* `Normalize()`: 均值$-mean$和$/std$ + +#### Training transforms +DETR的训练转换具有以下操作: + +* `RandomHorizontalFlip()` 随机水平翻转数据。 +* `RandomSelect()` 随机选择两个子操作之一: (1) 一个单个 `RandomResize` 步骤; (2) 一个 三步骤操作: `RandomReize`, `RandomSizeCrop`, 以及 `RandomResize` +* `ToTensor()`: 将图像数据转换为 `paddle.Tensor` +* `Normalize()`: 图像数据标准化, $-mean$ 和 $/std$ + +#### `RandomHorizontalFlip()` +此变换需要初始化参数中的概率用来控制是否应用反转的随机性。 + +``` +class RandomHorizontalFlip(): + def __init__(self, p=0.5): + self.p = p + + def __call__(self, image, target): + if random.random() < self.p: + return hflip(image, target) + return image, target +``` + +`hflip` 方法定义了图像和目标(包含包围框和盐吗的真实标注值的字典)的水平翻转操作。 + +#### `RandomSelect()` +`RandomSelect()`有一个prob值控制选择它的两个子操作之一的随机性。 +```python +class RandomSelect(): + """ Random select one the transforms to apply with probablity p""" + def __init__(self, transforms1, transforms2, p=0.5): + self.transforms1 = transforms1 + self.transforms2 = transforms2 + self.p = p + + def __call__(self, image, target): + if random.random() > self.p: + return self.transforms1(image, target) + return self.transforms2(image, target) + ``` + +两个转换操作在DETR训练中使用: + - `RandomResize()` + - `RandomResize()` + `RandomSizeCrop()` + `RandomResize()` + + + +#### `RandomResize()` +`RandomResize`有两个参数:`sizes` 和 `max_size`. 该方法随机选择`sizes`中的一个值作为图像短边的目标尺寸,同时保持图像的比例不变。但是,如果图像的长边大于`max_size`(当使用所选尺寸作为短边时),则将图像的长边设置为`max_size`,而较短的尺寸需要重新计算以保持图像长宽比例不变。 + +必须在bbox和掩码使用相同的尺寸调整操作。 通过乘以高度和宽度的比例可以转换包围框。可以通过插值和二值化来转换掩码以获得缩放掩码(如果 values > 0.5则设置为1,否则设置为0)。 + +#### `RandomSizeCrop()` +`RandomSizeCrop` 将`min_size`和`max_size` 作为输入,然后将裁减图像中的随机区域作为输出。输出区域的尺寸为 `[randint(min_size, max_size), randint(min_size, max_size)]`. + +`RandomSizeCrop` 分为三个步骤实现: +* STEP1: 给定 `min_size`, `max_size` 和原始图像尺寸,生成随机图像宽度和图像高度。 +* STEP2: 给定裁剪后的图像大小,随机选择图像内裁减区域的位置。这个区域可以用 `[top, left, height, width]`表示. +* STEP3: 给定裁剪区域,裁剪图像和目标的标签,例如 包围框和掩码. 
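+
+上述三个步骤中的前两步可以写成如下的示意代码(仅为说明用的草图,并非 PaddleViT 的实际实现;STEP3 中对图像和标签的裁剪由下文介绍的 `crop` 方法完成):
+
+```python
+import random
+
+def random_size_crop_region(image, min_size, max_size):
+    """按 STEP1 和 STEP2,为给定的 PIL 图像随机生成裁剪区域 [top, left, height, width]."""
+    # STEP1: 随机生成裁剪宽度和高度,且不超过原图尺寸
+    w = random.randint(min_size, min(image.width, max_size))
+    h = random.randint(min_size, min(image.height, max_size))
+    # STEP2: 在图像内部随机选择裁剪区域的位置
+    top = random.randint(0, image.height - h)
+    left = random.randint(0, image.width - w)
+    return [top, left, h, w]
+```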
+
+具体来说,我们实现了一个`crop`方法,其输入为 (1) `[top, left, height, width]` 形式的裁剪区域,(2) 原始图像 以及 (3) 目标标签,然后返回裁剪后的图像和裁剪后的标签。请注意,在裁剪之后,原始包围框或者掩码也会被裁剪,甚至可能在裁剪后的图像中完全不可见,因此,我们必须从目标标签中消除那些无效的框和掩码。
+
+#### `ToTensor()`
+`ToTensor` 将图像数据从PIL.Image转换为paddle.Tensor, 返回图像张量和相应的标签,通过以下方式可以实现:
+```python
+import paddle.vision.transforms as T
+class ToTensor:
+    def __call__(self, image, target):
+        return T.to_tensor(image), target
+```
+
+#### `Normalize()`
+在 `Normalize`方法中, 除了数据归一化(-mean & /std), 我们还将包围框从 `[x0, y0, x1, y1]` 转换为 `[cx, cy, w, h]`, 并根据图像尺寸归一化为相对坐标. 实现方式如下:
+```python
+class Normalize():
+    def __init__(self, mean, std):
+        self.mean = mean
+        self.std = std
+
+    def __call__(self, image, target=None):
+        # -mean, / std
+        image = T.functional.normalize(image, mean=self.mean, std=self.std)
+        if target is None:
+            return image, None
+        target = target.copy()
+        # from xyxy -> cxcywh -> relative coords
+        h, w = image.shape[-2:]
+        if 'boxes' in target and target['boxes'].shape[0] != 0:
+            boxes = target['boxes']
+            boxes = box_xyxy_to_cxcywh_numpy(boxes)
+            boxes = boxes / np.array([w, h, w, h], dtype='float32')
+            target['boxes'] = boxes
+
+        return image, target
+```
diff --git a/docs/paddlevit-coco.md b/docs/paddlevit-coco.md
new file mode 100644
index 00000000..328c2e3b
--- /dev/null
+++ b/docs/paddlevit-coco.md
@@ -0,0 +1,359 @@
+English | [简体中文](./paddlevit-coco-cn.md)
+
+# PaddleViT Tutorial: COCO Dataset for Object Detection
+[COCO dataset](https://cocodataset.org/#home) is one of the most popular datasets in the computer vision community for benchmarking a variety of vision tasks such as object detection, segmentation, and keypoint detection. In this tutorial, we present the detailed PaddleViT implementation of loading and processing the COCO dataset for object detection. We will go through the whole procedure, from implementing our `CocoDataset` by utilizing `pycocotools`, to the augmentation details that are applied for both training and evaluation in `transforms`.
+
+This tutorial is part of the open source project [PaddleViT](../../).
+
+
+## Installation
+It is required to install the following package:
+* pycocotools
+  ```shell
+  pip install pycocotools
+  ```
+## Download:
+The COCO dataset can be downloaded from the [COCO official website](https://cocodataset.org/#download).
+
+Note that for object detection, we are using the so-called `COCO2017` dataset, which has stayed unchanged since 2017.
+
+In the dataset, there are 118K (`118,287`) images used for training and 5K (`5,000`) images for validation. Once the dataset is downloaded, you should have the following directories:
+
+```
+COCO dataset folder
+├── annotations
+│   ├── captions_train2017.json
+│   ├── captions_val2017.json
+│   ├── instances_train2017.json
+│   ├── instances_val2017.json
+│   ├── person_keypoints_train2017.json
+│   └── person_keypoints_val2017.json
+├── train2017
+│   ├── 000000000009.jpg
+│   ├── 000000000025.jpg
+│   ├── 000000000030.jpg
+│   ├── 000000000034.jpg
+|   ...
+└── val2017
+    ├── 000000000139.jpg
+    ├── 000000000285.jpg
+    ├── 000000000632.jpg
+    ├── 000000000724.jpg
+    ...
+```
+## COCO Annotations:
+We present the basics of the COCO annotations in this section; in most cases the COCO API is used to help us easily access the data and labels from the complex json annotation files. For more details, please refer to the [official documentation](https://cocodataset.org/#format-data). 
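+
+Since the COCO API hides most of this complexity, a few lines of `pycocotools` are usually enough to query these files. The snippet below is a small self-contained sketch (the annotation path is a placeholder and the code is not part of PaddleViT itself):
+
+```python
+from pycocotools.coco import COCO
+
+# Placeholder path to a downloaded annotation file
+coco = COCO('annotations/instances_val2017.json')
+
+# Pick one image id, then read its metadata and object annotations
+image_id = coco.getImgIds()[0]
+image_info = coco.loadImgs(image_id)[0]       # file_name, width, height, ...
+annotations = coco.loadAnns(coco.getAnnIds(imgIds=image_id))
+
+print(image_info['file_name'], len(annotations), 'objects')
+for ann in annotations:
+    category = coco.loadCats(ann['category_id'])[0]['name']
+    print(category, ann['bbox'])              # bbox is [x, y, width, height]
+```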
+ +For example, the data structure of `instances_train2017.json` is as follows: +```json +{ + "info": { + "description": "COCO 2017 Dataset", + "url": "http://cocodataset.org", + "version": "1.0", + ... + }, + "licenses": { + { + "url": "http://creativecommons.org/licenses/by-nc-sa/2.0/", + "id": 1, + "name": "Attribution-NonCommercial-ShareAlike License" + }, + ... + }, + "images": [ + { + "license": 4, + "file_name": "000000397133.jpg", + "coco_url": "http://images.cocodataset.org/val2017/000000397133.jpg", + "height": 427, + "width": 640, + "date_captured": "2013-11-14 17:02:52", + "flickr_url": "http://farm7.staticflickr.com/6116/6255196340_da26cf2c9e_z.jpg", + "id": 397133 + }, + ... + ], + "annotations": [ + { + "segmentation": RLE or [polygon], + "area": float, + "iscrowd": 0 or 1, + "image_id": int, + "bbox": [x, y, width, height], + "category_id": int, + "id": int + }, + ... + ], + "categories": [ + { + "supercategory": str, + "id": int, + "name": str + }, + ... + ] +} +``` + +### `images` +`images` field contains the image information of the training set, such as `filename`, `width`, `height`, and `id`. The `id` is unique for each image and is used to index the image data from the dataset. + +### `categories` +`categories` field contains the class/label names as strings. Each category is assigned a unique category `id` for easy access. + +### `annotations` +`annotations` field contains all the **object instances**, each labeled with a series of annotations. +> Note that the number of object instances is usually much larger than the number of images, since there are often multple objects in one image. + +Each `annotation` has the following fields: +#### -`id` + * int, the instance id, each annotation has one unique id. +#### -`image_id` + * int, is used to indentify which image the current object belongs to. +#### -`category_id` + * int, is used to identify the category +#### -`bbox` + * [x, y, width, height], the bounding box coordinates. + + The format is **[box top-left corner x, box top-left corner y, box width, box height]**. Note that the [0, 0] coordinates is the top-left corner of the image. +#### -`iscrowd` + * 0 or 1, `iscrowd=1` is used to label large groups of people. +#### -`segmentation` + * `RLE` or `[polygon]`, if `iscrowd=0`, return `[polygon]`. + + The `[polygon]` is a set of points for the object mask, used for single object. The format is `[x0, y0, x1, y1, x2, y2, ...]`. + + The `RLE(Run Length Encoding)` is , used for group of objects. `RLE` fields looks like: + ```json + segmentation: + { + "counts": [179, 27, 392 ...], + "size": [ + 426, + 640, + ] + } + ``` + RLE is an encoding scheme used to represent if each pixel belongs to either foreground or background. The `size` field stores the width and height of the image. The `counts` fields stores the number of pixels in either foreground or background continuously. + + For example, we have the following image and the target mask: + ![img](./coco_dataset_1.png) + The RLE encoding counts the number of pixels belongs to background (from top-left corner, row by row), until meets the foreground pixel. This number is stored in `counts`, then the number of foreground pixels are counted and stored in the `counts`: + ![img](./coco_dataset_2.png) + + + +> An intuitive illustration of coco annotation can be found in YouTube video [here](https://www.youtube.com/watch?v=h6s61a_pqfM). + +In most cases, we do not need to worry about the annotation format when we are creating the training dataset for our model. 
The `COCO API` provides a series of api functions for us to easily obtain the image data and the target labels for our task. + +## PaddleViT: COCODataset +COCO dataset has a public python API named `pycocotools` for users to easily load and use COCO dataset for detection, segmentation and other cv tasks. In this section, we will go through the PaddleViT implementation of COCO detection dataset by utilizing the `pycocotools` for both training and validation. + +### `CocoDataset` Class +`CocoDataset` class is created which implements `paddle.io.Dataset` class, two method `__getitem__` and `__len__` are required, i.e.: +```python +class CocoDetection(paddle.io.Dataset): + def __init__(self, image_folder, anno_file, transforms, return_mask): + super().__init__() + ... + def __getitem__(self, idx): + ... + def __len__(self): + ... +``` + +#### `__init__` method +In the class init method: +1. Load the anno file of coco dataset, by calling pycocotools api. +2. Obtain image ids and remove those without annotations. +3. Set data transforms (preprocessor) by init argument (we discuss this part in next section). +4. Define labeling conversion methods. (details in next sections) +```python +from pycocotools.coco import COCO +... +class CocoDataset(): + def __init__(self): + super().__init__() + # step1 + self.coco = COCO(anno_file) + # step2 + ids = list(sorted(self.coco.imgs.keys())) + self.ids = self._remove_images_without_annotations(ids) + # step3 + self._transforms = transforms + # step4 + self.prepare = ConvertCocoPolysToMasks(return_masks) + self.root = img_folder +``` + + +#### `__getitem__` method +`__getitem__` method takes an index as input and outputs an `(image, target)` pair which contains a single image data and its target labels. In coco detection, this target is a `dict` similar to: +``` +target = {'image_id': image_id, 'annotations': target} +``` +`image_id` is the same image id in coco annotations. +`targe` is a dict of keys-value pairs such as `bbox` and `mask`. + +The `__getitem__` method defines: +1. loads the specified image and its labels using COCO API +2. convert the labels (such as convert the mask from polygon to mask array) +3. feed into the transforms for data preprocessing + +```python +def __getitem__(self, idx): + image_id = self.ids[idx] + image = self._load_image(image_id) + target = self._load_target(image_id) + target = {'image_id': image_id, 'annotations': target} + + image, target = self.prepare(image, target) + if self._transform is not None: + image, target = self._transform(image, target) + return image, target +``` + + + +#### `__len__` method +Return the number of samples in the dataset, which is the same as the length of `ids`: +```python +def __len__(self): + return len(self.ids) +``` + + +#### `_load_image`, `_load_target` methods +`PIL.Image` and `COCO API` is used to obtain image data and the original target labels, given the index. +```python +def _load_image(self, idx): + """ Return PIL Image (RGB) according to COCO image id""" + path = self.coco.loadImgs(idx)[0]['file_name'] + return Image.open(os.path.join(self.root, path)).convert('RGB') + +def _load_target(self, idx): + """ Return image annos according to COCO image id""" + return self.coco.loadAnns(self.coco.getAnnIds(idx)) +``` + +### `ConvertCocoPolysToMasks` Class +This class defines class calls that takes image and label as input and outputs image array and processed labels. + +Specifically for the target labels: +1. Eliminate the images that `iscrowd=1` +2. 
Convert the bboxes in `[x1, y1, x2, y2]` as type numpy ndarray, then clip the bbox inside the image +3. Convert the class labels to numpy ndarray +4. If returns mask (for segmentation), convert the polygon data into mask array by using coco api. +5. If returns keypoints (for keypoint detection), load the keypoints into ndarray. +6. Eliminate the boxes which areas are 0 +7. Save the processed labels in `target` dict. + +> Note: we are using numpy ndarray instead of paddle tensor because current paddlepaddle may raise errors in data propressing using GPU tensors. + +Detailed implementations are available in the source code [here](https://github.com/BR-IDL/PaddleViT/blob/5ba4761845f06f66ba3a89907e0769e1850bcab2/object_detection/DETR/coco.py#L117). + +### `Transforms` Module +Multiple data propressing methods are defined in transforms module (`transforms.py`). The reason of defining our own module instead of using paddle vision transforms is that each data transform must be applied on both the image data and its target labels such as bbox and mask. Assume a random crop op is applied on image data during the training, the bboxes in this image must apply the same cropping. + +#### Validation transforms +Validation transforms for DETR has the following ops: +* `RandomResize()`: resize the image and labels to certrain size with same aspect ratio. +* `ToTensor()`: convert the image data into `paddle.Tensor` +* `Normalize()`: $-mean$ and $/std$ + +#### Training transforms +Training transforms for DETR has the following ops: + +* `RandomHorizontalFlip()` that randomly flip the data horizontally. +* `RandomSelect()` that randomly selects one of two its sub operations: (1) a single `RandomResize` step; (2) a 3-step op: `RandomReize`, `RandomSizeCrop`, and `RandomResize` +* `ToTensor()`: convert the image data into `paddle.Tensor` +* `Normalize()`: image data normalization, $-mean$ and $/std$ + +#### `RandomHorizontalFlip()` +This transform takes a probability as init argument controls the randomness of applying flip or not. +``` +class RandomHorizontalFlip(): + def __init__(self, p=0.5): + self.p = p + + def __call__(self, image, target): + if random.random() < self.p: + return hflip(image, target) + return image, target +``` +The `hflip` method defines the horizontal flip operation for both the image and the target (dict contains the ground truth such as bounding boxes and masks). + +#### `RandomSelect()` +`RandomSelect()` has a prob value controls the randomness of which one of its two sub op is selected. +```python +class RandomSelect(): + """ Random select one the transforms to apply with probablity p""" + def __init__(self, transforms1, transforms2, p=0.5): + self.transforms1 = transforms1 + self.transforms2 = transforms2 + self.p = p + + def __call__(self, image, target): + if random.random() > self.p: + return self.transforms1(image, target) + return self.transforms2(image, target) + ``` + + Two transform ops are used in DETR training: + - `RandomResize()` + - `RandomResize()` + `RandomSizeCrop()` + `RandomResize()` + + + +#### `RandomResize()` +`RandomResize` takes two arguments: `sizes` and `max_size`. The method randomly select one of the value in `sizes` as the target size of the **shorter side** of the image, while keep the aspect ratio unchanged. 
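A minimal sketch of this size computation is shown below (names are illustrative rather than the exact PaddleViT code, and the `max_size` cap is left out here because it is described in the next paragraph):

```python
import random

def get_resize_hw(image_h, image_w, sizes):
    """Randomly pick a target size for the shorter side and scale the longer
    side to keep the aspect ratio (the max_size cap is omitted in this sketch)."""
    size = random.choice(sizes)  # e.g. sizes = [480, 512, 544, 576, 608]
    if image_h <= image_w:
        new_h = size
        new_w = int(round(size * image_w / image_h))
    else:
        new_w = size
        new_h = int(round(size * image_h / image_w))
    return new_h, new_w
```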
However, if the **longer side** of the image is larger then the `max_size` (when using the selected size as the shorter side), the longer side of the image is set as the `max_size` while the shorter size is re-calculated (not the selected size) to keep the image aspect ratio unchanged. + +The same resize op must be applied to the bboxes and masks. The boxes can be converted by multiplying the height and width scale ratios. The masks can be converted by an interpolation and binarization (where values > 0.5 are set to 1 otherwise 0) to get the scaled masks. + + +#### `RandomSizeCrop()` +`RandomSizeCrop` takes `min_size` and `max_size` as inputs, then a random region within the image is cropped as the output. This region is of size `[randint(min_size, max_size), randint(min_size, max_size)]`. + +`RandomSizeCrop` is implemented in 3 steps: +* STEP1: Generate random image width and image height given `min_size`, `max_size` and original image size. +* STEP2: Given the cropped image size, randomly select the position of the crop region within the image. This region can be represented by `[top, left, height, width]`. +* STEP3: Given the cropped region, crop the image and the target labels, such as bboxes and masks. + +Specifically, we implement a `crop` method with its inputs (1) cropped region in `[top, left, height, width]`, (2) original image and (3) target labels, and returns the cropped image and cropped labels. Note that after the crop, the original boxes or masks will also be cropped or even unseen in the cropped image, therefore we have to eliminate those invalid boxes and masks from the target labels. + +#### `ToTensor()` +`ToTensor` converts the image data from PIL.Image to paddle.Tensor, returns the image tensor and the corresponding labels. It can be easily implemented like: +```python +import paddle.vision.transforms as T +class ToTensor: + def __call__(self, image, target): + return T.to_tensor(image), target +``` + +#### `Normalize()` +In `Normalize` method, besides the data nomalization (-mean & /std), we also normalize the bboxes from `[x0, y0, x1, y1]` to `[cx, cy, w, h]`, and rescale to the relative coordinates according to the image size. Specifically: +```python +class Normalize(): + def __init__(self, mean, std): + self.mean = mean + self.std = std + + def __call__(self, image, target=None): + # -mean, / std + image = T.functional.normalize(image, mean=self.mean, std=self.std) + if target is None: + return image, None + target = target.copy() + # from xyxy -> cxcywh -> relative coords + h, w = image.shape[-2:] + if 'boxes' in target and target['boxes'].shape[0] != 0: + boxes = target['boxes'] + boxes = box_xyxy_to_cxcywh_numpy(boxes) + boxes = boxes / np.array([w, h, w, h], dtype='float32') + target['boxes'] = boxes + + return image, target +``` diff --git a/docs/paddlevit-config-cn.md b/docs/paddlevit-config-cn.md new file mode 100644 index 00000000..19c2a40e --- /dev/null +++ b/docs/paddlevit-config-cn.md @@ -0,0 +1,132 @@ +简体中文 | [English](./paddlevit-config.md) + +## PaddleViT: 如何使用 config ? +> 示例代码: [here](../image_classification/ViT/config.py) + +本文档介绍了**PaddleViT** 项目中使用`config` 的基础知识。 + +PPViT `config`中使用的核心模块是[yacs](https://github.com/rbgirshick/yacs) (0.1.8+). 与其他项目相似,PPViT `config`支持从[yaml](https://yaml.org/)文件加载,并且可以使用python [ArgumentParser](https://docs.python.org/3/library/argparse.html)进行配置。 + +> `yacs` 的完整用法可以在https://github.com/rbgirshick/yacs 中找到 + +### 1. 
安装 +#### 1.1 通过 `pip`安装 +安装 `yacs` 版本 `0.1.8`: +```shell +$ pip install yacs==0.1.8 +``` +#### 1.2 从源码安装 +你也可以从github下载 `yacs` : +```shell +$ git clone https://github.com/rbgirshick/yacs.git +$ cd yacs +$ python setup.py install +``` + +### 2. 基本概念和用法 +#### 1. CfgNode +`CfgNode` 表示配置树中的一个内部节点,它是一个类似`dict`的容器,允许基于属性对其键进行访问。 +```python +from yacs.config import CfgNode as CN + +_C = CN() + +_C.NUM_GPUS = 4 +_C.NUM_WORKERS = 8 +_C.BATCH_SIZE = 128 + +def get_config(): + return _C.clone() +``` +#### 2. 使用 `merge_from_file()`读取 `.yaml` +`yacs`允许读取YAML文件来覆盖 `CfgNode`. 您可以为每一个实验创建一个`.yaml` 文件,它只会更改实验中的选项。 + +YAML文件的一些基本格式: +```YAML +key: # YAML uses 'key: value' paris, separated using ':' + child_key: value # indent can be used to show different levels + child_KEY2: value3 # YAML is case sensitive + c_arr: [val1, val2, val3] # array can be used in value + c_bool: True # True/true/TRUE are all OK + c_float: 3.1415 # float is allowed + c_string: no quote is allowed # "", '', no quote are all OK +``` + +`merge_from_file()` 可用于覆盖当前 `CfgNode`: +```python +cfg = get_config() +cfg.merge_from_file('experiment_1.yaml') +print(cfg) +``` + +#### 3. 通过 `ArgumentParser`更新配置 +您可以使用python `ArgumentParser` 编写您自己的方法来更新配置,例如: +```python +def update_config(config, args) + if args.cfg: # update from .yaml file + upate_config_from_file(config, args.cfg) + if args.batch_size: # update BATCH_SIZE + config.BATCH_SIZE = args.batch_size + if args.eval: + config.EVAL = True + return config +``` + + + + +### 4. PPViT配置的使用指南: +#### STEP 1: 创建 config.py +创建一个python文件config.py, 用于定义 **所有配置选项**。 它应该为所有选项提供合适的默认值并记录下来。 +通常,`config.py`应该有: +- `DATA`: 定义数据集路径、输入图像尺寸和batch_size等。 +- `MODEL`: + - 模型的常规选项,例如模型名称,类别数量等。 + - `TRANS`: transformer的相关选项,例如mlp维度,hidden维度,heads数量等。 +- `TRAIN`: 与训练相关的选项,例如epochs, lr, weight decay等。 + +在`config.py`中,你应该实现`update_config(config, args)`,它是从`ArgumentParser`中读取当前的 `config` 和 `args`以使用命令行选项更新配置。 + +#### STEP 2: +在你的`main.py`中,创建`ArgumentParser`,它包含`config.py`中`update_config(config, args)` 方法中的所有选项,例如: + +```python + parser = argparse.ArgumentParser('ViT') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-eval', action='store_true') + args = parser.parse_args() + + # get default config + config = get_config() + # update config by arguments + config = update_config(config, args) +``` + +然后,您可以使用基于属性的访问来获取配置选项值。 + +#### STEP 3: +你应该为每个实验创建一个单独的`.yaml` 文件,例如: +```yaml +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: ViT + NAME: vit_large_patch16_224 + TRANS: + PATCH_SIZE: 16 + HIDDEN_SIZE: 1024 + MLP_DIM: 4096 # same as mlp_ratio = 4.0 + NUM_LAYERS: 24 + NUM_HEADS: 16 + QKV_BIAS: True +``` + +如果你将命令行参数`-cfg`设置为`.yaml` 文件路径,配置将被文件选项覆盖。 +> 注意:`.yaml`覆盖了 `args`之前的配置,因此`args`中的选项是当前选项。 diff --git a/docs/paddlevit-config.md b/docs/paddlevit-config.md index b5af2154..986b0a94 100644 --- a/docs/paddlevit-config.md +++ b/docs/paddlevit-config.md @@ -1,3 +1,5 @@ +English | [简体中文](./paddlevit-config-cn.md) + ## PaddleViT: How to use config? 
> sample code: [here](../image_classification/ViT/config.py) diff --git a/docs/paddlevit-multi-gpu-cn.md b/docs/paddlevit-multi-gpu-cn.md new file mode 100644 index 00000000..bbcbe310 --- /dev/null +++ b/docs/paddlevit-multi-gpu-cn.md @@ -0,0 +1,147 @@ +简体中文 | [English](./paddlevit-multi-gpu.md) + +## PaddleViT: 如何使用多GPU ? +本文档介绍如何使用和如何实现多GPU(单结点)以在`PaddleViT`中训练和评估模型的方法。 + +`PaddleViT`实现基于`paddle.distributed` 包的多GPU方案,此外我们还提供了一些用于GPU间通信和数据传输的有用功能。 + +> 详细的官方 `paddle.distribued` 文档可见:[here](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/distributed/Overview_cn.html) + + +### 1. 如何使用多GPU进行训练/验证? +在`PaddleViT`中,多GPU使用方法简单明了。通常,使用一个脚本文件(例如,`run_train_multi.sh`)来运行实验。 这个`.sh`脚本通过命令行选项运行python文件(例如`main_multi_gpu.py`)。 + +例如,验证脚本 `run_eval_multi.sh` 调用带有多个参数的 `main_multi_gpu.py` : +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ +-cfg='./configs/vit_base_patch16_224.yaml' \ +-dataset='imagenet2012' \ +-batch_size=16 \ +-data_path='/dataset/imagenet' \ +-eval \ +-pretrained='./vit_base_patch16_224' \ +``` +在这个shell脚本中: +- `CUDA_VISIBLE_DEVICES` 设置将使用哪些 GPU. +- `batch_size` 设置在单个GPU上的batch_size . + +通过运行以下shell脚本可以开始多GPU训练实验,例如: +``` +$ sh run_train_multi.sh +``` + +### 2. PaddleViT中的多GPU方案是如何实现的? +#### STEP 0: 准备 +我们在`PaddleViT`中使用`paddle.distributed` 包: +```python +import paddle.distributed as distt +``` + +我们介绍了在多GPU上训练/验证的基本概念和步骤: +- 启动多个子流程 +- 每一个进程在1个GPU上运行 +- 每个进程运行自己的训练/验证 +- 数据集被拆分,每个进程处理整个数据集的一部分 +- 在每个GPU上,前向过程应用于其自己的批处理数据。 +- 收集并平均每个GPU上的梯度。 +- 每次迭代的平均梯度在每个GPU上同步。 +- 使用平均梯度在每个GPU上应用梯度下降。 +- 跨所有GPU收集验证结果。 +- GPU之间的通信基于`NCCL2`. + + +#### STEP 1: 创建 `main` 方法 +定义一个`main`方法包含以下步骤: +1. 创建`dataset`和`dataloader`。(见第2步) +2. 获取并设置使用的GPU数量。 +3. 为多GPU训练/验证启动多处理。 + +`main`方法可能类似于: +```python +def main(): + dataset_train = get_dataset(config, mode='train') + dataset_val = get_dataset(config, mode='val') + config.NGPUS = len(paddle.static.cuda_places()) if config.NGPUS == -1 else config.NGPUS + dist.spawn(main_worker, args=(dataset_train, dataset_val, ), nprocs=config.NGPUS) +``` +其中 +- `paddle.static.cuda_places()`获取当前环境中所有可用的GPU. +- `dist.spawn` 启动 `multiprocessing` +- `main_worker` 包含完整的训练/验证过程。 +- `args` 将数据集发送到所有子进程。 +- `nprocs` 确定要启动的子进程的数量,将其设置为GPU的数量。 + +#### STEP 2: 创建 `dataset` 和 `dataloader` +1. Dataset + + `dataset` 的定义方式和使用单GPU时的方式相同.通常,你需要创建一个实现 `paddle.io.Dataset`的数据集类. 需要实现`__getitem__` 和 `__len__` 方法,用于读取数据并获取整个数据集的总长度。 + + 在我们的多GPU方案中,我们在主进程中创建一个single `dataset` ,它将通过`dist.spawn`中的`args`(作为参数)传递给所有子进程。 +2. Dataloader + + `dataloader` 定义了如何加载批处理数据,你可以创建一个 `paddle.io.DataLoader` ,将 `paddle.io.Dataset` 和 `DistributedBatchSampler` 作为其输入。其他常用的输入参数是 `batch_size(int)`, `shuffle(bool)` 和 `collate_fn`. + + 对于多GPU方案, `DistributedBatchSampler` 用于将数据集拆分为 `num_replicas` 并为每个进程/GPU (`rank`)采样批处理数据. 例如: + ```python + sampler = DistributedBatchSampler(dataset, + batch_size=batch_size, + shuffle=(mode == 'train')) + ``` + dataloader在每个进程中初始化(意味着您需要在`main_worker` 方法中初始化实例), `num_replicas` 和 `rank` 将由分布式环境自动确定。 + +#### STEP 3: Multi-GPU 训练 +在STEP1中,`dist.spawn` 中的第一个参数是 `main_worker`, 这是包含完整训练/验证过程的方法。 +你可以理解 `main` 方法在主进程(master)上运行, 它启动了许多子进程(workers). +这些子进程运行`main_worker`中定义的内容. + +具体来说, 在 `main_worker` 中有以下内容: +1. 初始化分布式环境: `dist.init_paralel_env()` +2. (可选) 获取world-size: `dist.get_world_size()` +3. (可选) 获取当前rank: `dist.get_rank()` +4. 为多GPU准备模型: `model=paddle.DataParallel(model)` +5. 使用 `DistributedBatchSampler`获取dataloader +6. 训练 (与single-gpu相同) + +#### STEP 4: Multi-GPU 验证 +在用于验证的 `main_worker` 中,我们将有以下内容: +1. 初始化分布式环境: `dist.init_paralel_env()` +2. 
为多GPU准备环境: `model=paddle.DataParallel(model)` +3. 使用 `DistributedBatchSampler`获取dataloader +4. 验证(同single-gpu) +5. 对于每次迭代, **在所有GPU上收集结果** + +由于每个进程/GPU对其自己的批处理数据进行推理,我们必须收集这些结果以获取整体性能。在Paddle中, `paddle.distributed.all_reduce`跨多个GPU收集张量,可以在每次迭代中调用: +```python +output, _ = model(image) # inference +loss = criterion(output, label) # get loss + +pred = F.softmax(output) # get perds +acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) # top1 acc + +dist.all_reduce(loss) # gather loss from all GPUs +dist.all_reduce(acc1) # gather top1 acc from all GPUS + +loss = loss / dist.get_world_size() # get average loss +acc1 = acc1 / dist.get_world_size() # get average top1 acc +``` +请注意,,默认 `all_reduce` 返回GPU之间张量值的`SUM`,因此我们除以`world_size`以获取平均值。 + +最后,可以使用 `AverageMeter` 将结果记录为使用单GPU: +```python +batch_size = paddle.to_tensor(image.shape[0]) +dist.all_reduce(batch_size) +val_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) +val_acc1_meter.update(acc1.numpy()[0], batch_size.numpy()[0]) +``` + +### 3. 高级应用 +对于需要在`PaddleViT`中的多个GPU之间进行高级通信/数据传输的开发人员,我们为`reduce`dict 对象和`gather`任何(picklable) 对象编写了两种方法,而不仅仅是`paddle.Tensor`. + +具体来说: + +- `reduce_dict(input_dict, average=True)` 被定义为一个 `dict` 存储 key: 张量对, 如果 `average`设置为 `True`, `all_reduce` 将对于字典的每个值上按world size进行 `average` 。如果 `average` 为 `False`, 则常规的 `sum` 操作将被应用于dict中的每个值. + +- `all_gather(data)` 被定义为 `all_gather` 任何可选取的数据, 而不仅仅 `paddle.Tensor`. 输入是一个数据对象,输出是从每个rank手机的数据列表。 + +> 详细的实现可以在PaddleVIT `object_detection/DETR/utils.py`中找到。 diff --git a/docs/paddlevit-multi-gpu.md b/docs/paddlevit-multi-gpu.md index 8789e997..21a281f9 100644 --- a/docs/paddlevit-multi-gpu.md +++ b/docs/paddlevit-multi-gpu.md @@ -1,3 +1,5 @@ +English | [简体中文](./paddlevit-multi-gpu-cn.md) + ## PaddleViT: How to use multi-gpu? This document presents **how to use** and **how to implement** multi-gpu (single node) for training and validation in `PaddleViT` for training and validating your model. diff --git a/docs/paddlevit-port-weights-cn.md b/docs/paddlevit-port-weights-cn.md new file mode 100644 index 00000000..eea9a5fe --- /dev/null +++ b/docs/paddlevit-port-weights-cn.md @@ -0,0 +1,94 @@ +简体中文 | [English](./paddlevit-port-weights.md) + +## PaddleViT: 如何将模型从 Pytorch 移植到 Paddle? +> 源码: [here](../image_classification/SwinTransformer/port_weights/load_pytorch_weights.py) + +### Step 0: +如果你想要从一些ViT模型的PyTorch实现转换到Paddle版本,并需要将预训练权重从pytorch `.pth`文件转换为paddle`.pdparams` 文件。 + +首先需要具有的要素: +- 一个 `torch.nn.Module` 类在pytorch中实现模型. +- 一个与Pytorch模型对应的 `.pth` 预训练权重文件. +- 一个在paddle中实现相同模型的 `paddle.nn.Layer` 类. + +> 注意: `paddle.nn.Layer`类必须以与你引用的 `torch.nn.Module`相似的方式实现. 此处的'similar' 表示参数大小、张量形状和计算逻辑相同,而层/参数的名称或详细实现可能不同。 + +我们还需要实现: +- `load_pytorch_weights.py`, 包含模型转换和名称映射的方法. + +接下来我们展示如何实现 `load_pytorch_weights.py`. + +### Step 1: +加载paddle模型, 例如: + ```python + paddle_model = build_model(config) + paddle_model.eval() + ``` + 你可以只初始化一个模型类用于构建一个模型对象,详细的模型定义和`config`用法请参考我们的PPViT代码。 + + +### Step 2: +加载你的pytorch模型的预训练权重。 + + 例如,如果我们使用来自 `timm` 项目的模型: + ```python + import timm + torch_model = timm.create_model('vit_base_patch16_224', pretrained=True) + torch_model.eval() + ``` +> timm: https://github.com/rwightman/pytorch-image-models + +### Step 3: +检查名称映射(**手动**). 
+在 `torch_to_paddle_mapping` 方法中,你创建了一个字符串元组列表,定义了torch和paddle模型的相应参数和缓冲区名称,例如: +- 在**torch** 模型中,一个名为`patch_embed.proj.weight` 的参数 +- 在 **paddle** 模型中, 相同的参数被命名为 `embeddings.patch_embeddings.weight` +然后你有一个元组 `(patch_embed.proj.weight, embeddings.patch_embeddings.weight)` 保存在映射列表中。 + + > 注意: 你可以使用 **for loop** 和 **prefix strings** 来半自动化你的名称映射过程。 + > 注意: 不要忘记为`model.named_buffers()`添加名称映射 + +通常我们会打印torch 参数/缓存区的名称和形状,并打印paddle 参数/缓冲区的名称和形状,每个都在单独的文本文件中,然后逐行检查映射,并在必要时修改 `torch_to_paddle_mapping`. + +如果所有名称映射都正确,请通过以下方式运行转换: +```python +paddle_model = convert(torch_model, paddle_model) +``` +> 此方法见torch中的参数权重转化为正确格式,然后将值设置为对应的paddle参数。返回的对象是具有与pytorch模型相同的预训练权重的paddle模型对象。 + +> 在 `convert`方法中, `torch.nn.Linear`的权重应用于 `transpose`, 用于匹配 `paddle.nn.Linear`权重的维度. +### Step 4: +检查正确性。 + +创建与模型输入相对应的批处理数据,例如: + +```python +# check correctness +x = np.random.randn(2, 3, 224, 224).astype('float32') +x_paddle = paddle.to_tensor(x) +x_torch = torch.Tensor(x).to(device) +``` +然后进行推理,将输出转换为numpy数组: +``` +out_torch = torch_model(x_torch) +out_paddle = paddle_model(x_paddle) + +out_torch = out_torch.data.cpu().numpy() +out_paddle = out_paddle.cpu().numpy() +``` +最后, 检查`paddle_model` 和 `torch_model`的输出是否相同: +```python +assert np.allclose(out_torch, out_paddle, atol = 1e-5) +``` + +### Step 5: +保存paddle的模型权重: +```python +paddle.save(paddle_model.state_dict(), model_path) +``` + +> **提示:** +> - BN 层通常具有缓冲区,例如 `_mean`和 `_variance` +> - 不要忘记模型中定义的自定义缓冲区, 例如, `paddle.register_buffer()` +> - 使用批处理数据(batchsize > 1)来测试结果。 +> - 一些参数是二维但非线形参数,所以`_set_value` 必须设置为 `transpose=False`. diff --git a/docs/paddlevit-port-weights.md b/docs/paddlevit-port-weights.md index a6912d4b..3bb73f9e 100644 --- a/docs/paddlevit-port-weights.md +++ b/docs/paddlevit-port-weights.md @@ -1,5 +1,7 @@ +English | [简体中文](./paddlevit-port-weights-cn.md) + ## PaddleViT: How to port model from Pytorch to Paddle? -> Sample code: [here](../image_classification/ViT/load_pytorch_weights.py) +> Sample code: [here](../image_classification/SwinTransformer/port_weights/load_pytorch_weights.py) ### Step 0: We assume you are trying to implement your Paddle version of some ViT model, from some PyTorch implementations. You want to port the pretrained weights from pytorch `.pth` file to paddle `.pdparams` file. diff --git a/edu/README.md b/edu/README.md new file mode 100644 index 00000000..1a40c77c --- /dev/null +++ b/edu/README.md @@ -0,0 +1,55 @@ +# PaddleViT在线课: 从零开始学视觉Transformer +

+ +

+ +## 飞桨精选项目课程 +论文分析+逐行coding,从零开始带你掌握视觉Transformer前沿技术。 + +## 课程链接: +AIStudio(包括视频和作业等): [《从零开始学视觉Transformer》](https://aistudio.baidu.com/aistudio/course/introduce/25102) + +YouTube: [here](https://youtu.be/ucmm7yglzgo) (更新中) + +## 课程代码: +[PaddleViT/edu](./) + +## 课程简介: +Vision Transformer是近期深度学习领域最前沿、最火爆的技术,本次课程由百度研究院深度学习实验室研究员朱博士主讲,将通过图解理论基础、手推公式以及从0开始逐行手敲代码,带大家实现最前沿的视觉Transformer算法! + +无论你是刚接触深度学习,还是已经在做科研,无论你是CV想转NLP,还是NLP想搞CV,又或者你想用最新的视觉技术打比赛、发论文,这门课程都会给你们带来一些不一样的体验。 + +## 课程大纲: +### 课程目标 +通过Vision Transformer十讲的学习,能一步一步将论文中的模型图变成一行行的代码,从零搭建一套自己的深度学习模型,掌握和实践最新的技术,告别简单的git clone和调包。 +### 课程列表 +#### 第1讲 +- 理论:什么是Vision Transformer? +- 实践:Warmup:模型搭建和训练 +#### 第2讲 +- 理论:从Transformer到Vision Transformer +- 实践:玩转Tensor操作,开始搭建ViT +#### 第3讲 +- 理论:你看你的,我看我的:详解注意力 +- 实践:Multi-Head Self Attention +#### 第4讲 +- 理论:详解第一个ViT算法 +- 实践:如何实现ViT模型 +#### 第5讲 +- 理论:ViT模型搭建好了,如何高效训练? +- 实践:实战模型搭建和训练 +#### 第6讲 +- 理论:什么是Window Attention? +- 实践:图像窗口上的注意力机制 +#### 第7讲 +- 理论:大名鼎鼎的Swin Transformer +- 实践:实现你的第二个ViT模型 +#### 第8讲 +- 理论:下一个算法:Conv和Transformer的结合 +- 实践:从框架源码看如何实现数据加载 +#### 第9讲 +- 理论:前沿算法介绍:视觉上的BERT? BeiT & MAE +- 实践:模型训练的技巧 +#### 第10讲 & 第11讲 +- 理论:检测算法新范式-DETR +- 实践:实战ViT训练测试全流程 \ No newline at end of file diff --git a/edu/class0/dataset.py b/edu/class0/dataset.py new file mode 100644 index 00000000..074e865e --- /dev/null +++ b/edu/class0/dataset.py @@ -0,0 +1,29 @@ +from paddle.io import Dataset +from paddle.io import DataLoader +from paddle.vision import datasets +from paddle.vision import transforms + + +def get_transforms(mode='train'): + if mode == 'train': + data_transforms = transforms.Compose([ + transforms.RandomCrop(32, padding=4), + transforms.RandomHorizontalFlip(), + transforms.ToTensor(), + transforms.Normalize(mean=[0.4914, 0.4822, 0.4465], std=[0.2023, 0.1994, 0.2010])]) + else: + data_transforms = transforms.Compose([ + transforms.ToTensor(), + transforms.Normalize(mean=[0.4914, 0.4822, 0.4465], std=[0.2023, 0.1994, 0.2010])]) + return data_transforms + + +def get_dataset(name='cifar10', mode='train'): + if name == 'cifar10': + dataset = datasets.Cifar10(mode=mode, transform=get_transforms(mode)) + + return dataset + +def get_dataloader(dataset, batch_size=128, mode='train'): + dataloader = DataLoader(dataset, batch_size=batch_size, num_workers=2, shuffle=(mode == 'train')) + return dataloader diff --git a/edu/class0/main_single_gpu.py b/edu/class0/main_single_gpu.py new file mode 100644 index 00000000..dbdcff99 --- /dev/null +++ b/edu/class0/main_single_gpu.py @@ -0,0 +1,92 @@ +import paddle +import paddle.nn as nn +from resnet18 import ResNet18 +from dataset import get_dataset +from dataset import get_dataloader +from utils import AverageMeter + + +def train_one_epoch(model, dataloader, criterion, optimizer, epoch, total_epoch, report_freq=10): + print(f'----- Training Epoch [{epoch}/{total_epoch}]:') + loss_meter = AverageMeter() + acc_meter = AverageMeter() + + model.train() + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + out = model(image) + loss = criterion(out, label) + + loss.backward() + optimizer.step() + optimizer.clear_grad() + + pred = nn.functional.softmax(out, axis=1) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(-1)) + + batch_size = image.shape[0] + loss_meter.update(loss.cpu().numpy()[0], batch_size) + acc_meter.update(acc1.cpu().numpy()[0], batch_size) + if batch_id > 0 and batch_id % report_freq == 0: + print(f'----- Batch[{batch_id}/{len(dataloader)}], Loss: {loss_meter.avg:.4}, Acc@1: {acc_meter.avg:.2}') + + 
print(f'----- Epoch[{epoch}/{total_epoch}], Loss: {loss_meter.avg:.4}, Acc@1: {acc_meter.avg:.2}') + + +def validate(model, dataloader, critertion): + print(f'----- Validation') + loss_meter = AverageMeter() + acc_meter = AverageMeter() + + model.eval() + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + out = model(image) + loss = criterion(out, label) + + pred = nn.functional.softmax(out, axis=1) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(-1)) + + batch_size = image.shape[0] + loss_meter.update(loss.cpu().numpy()[0], batch_size) + acc_meter.update(acc1.cpu().numpy()[0], batch_size) + if batch_id > 0 and batch_id % report_freq == 0: + print(f'----- Batch[{batch_id}/{len(dataloader)}], Loss: {loss_meter.avg:.4}, Acc@1: {acc_meter.avg:.2}') + print(f'----- Validation Loss: {loss_meter.avg:.4}, Acc@1: {acc_meter.avg:.2}') + + +def main(): + total_epoch = 200 + batch_size = 16 + + model = ResNet18(num_classes=10) + + train_dataset = get_dataset(mode='train') + train_dataloader = get_dataloader(train_dataset, batch_size, mode='train') + val_dataset = get_dataset(mode='test') + val_dataloader = get_dataloader(val_dataset, batch_size, mode='test') + + criterion = nn.CrossEntropyLoss() + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(0.02, total_epoch) + optimizer = paddle.optimizer.Momentum(learning_rate=scheduler, + parameters=model.parameters(), + momentum=0.9, + weight_decay=5e-4) + + for epoch in range(1, total_epoch+1): + train_one_epoch(model, + train_dataloader, + criterion, + optimizer, + epoch, + total_epoch) + scheduler.step() + validate(model, val_dataloader, criterion) + + +if __name__ == "__main__": + main() diff --git a/edu/class0/resnet18.py b/edu/class0/resnet18.py new file mode 100644 index 00000000..cdee8d2d --- /dev/null +++ b/edu/class0/resnet18.py @@ -0,0 +1,89 @@ +import paddle +import paddle.nn as nn +#paddle.set_device('cpu') +class Identity(nn.Layer): + def __init__(self): + super().__init__() + + def forward(self, x): + return x + +class Block(nn.Layer): + def __init__(self, in_dim, out_dim, stride): + super().__init__() + self.conv1 = nn.Conv2D(in_dim, out_dim, 3, stride=stride, padding=1, bias_attr=False) + self.bn1 = nn.BatchNorm2D(out_dim) + self.relu = nn.ReLU() + self.conv2 = nn.Conv2D(out_dim, out_dim, 3, stride=1, padding=1, bias_attr=False) + self.bn2 = nn.BatchNorm2D(out_dim) + + if in_dim != out_dim or stride == 2: + self.downsample = nn.Sequential(*[ + nn.Conv2D(in_dim, out_dim, 1, stride=stride), + nn.BatchNorm2D(out_dim)]) + else: + self.downsample = Identity() + + def forward(self, x): + h = x + x = self.conv1(x) + x = self.bn1(x) + x = self.relu(x) + x = self.conv2(x) + x = self.bn2(x) + identity = self.downsample(h) + x = x + identity + x = self.relu(x) + return x + +class ResNet18(nn.Layer): + def __init__(self, in_dim=64, num_classes=10): + super().__init__() + self.conv1 = nn.Conv2D(in_channels=3, + out_channels=in_dim, + kernel_size=3, + stride=1, + padding=1, + bias_attr=False) + self.bn1 = nn.BatchNorm2D(in_dim) + self.relu = nn.ReLU() + self.in_dim = in_dim + + self.layer1 = self._make_layer(dim=64, n_blocks=2, stride=1) # 16x16 -> 16x16 + self.layer2 = self._make_layer(dim=128, n_blocks=2, stride=2) # 16x16 -> 8x8 + self.layer3 = self._make_layer(dim=256, n_blocks=2, stride=2) # 4x4 + self.layer4 = self._make_layer(dim=512, n_blocks=2, stride=2) # 2x2 + self.avgpool = nn.AdaptiveAvgPool2D(1) + self.classifier = nn.Linear(512, num_classes) + + def _make_layer(self, dim, n_blocks, stride): + 
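        # the first block may downsample (stride=2) and/or change the channel width;
        # the remaining blocks keep stride=1 and the same width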
layer_list = [] + layer_list.append(Block(self.in_dim, dim, stride=stride)) + self.in_dim = dim + for i in range(1, n_blocks): + layer_list.append(Block(self.in_dim, dim, stride=1)) + return nn.Sequential(*layer_list) + + def forward(self, x): + x = self.conv1(x) + x = self.bn1(x) + x = self.relu(x) + # x = self.maxpool(x) + x = self.layer1(x) + x = self.layer2(x) + x = self.layer3(x) + x = self.layer4(x) + x = self.avgpool(x) + x = x.flatten(1) + x = self.classifier(x) + return x + +def main(): + t = paddle.randn([4, 3, 32, 32]) + model = ResNet18() + print(model) + out = model(t) + print(out.shape) + +if __name__ == "__main__": + main() diff --git a/edu/class0/utils.py b/edu/class0/utils.py new file mode 100644 index 00000000..5e853b37 --- /dev/null +++ b/edu/class0/utils.py @@ -0,0 +1,19 @@ +class AverageMeter(): + """ Meter for monitoring losses""" + def __init__(self): + self.avg = 0 + self.sum = 0 + self.cnt = 0 + self.reset() + + def reset(self): + """reset all values to zeros""" + self.avg = 0 + self.sum = 0 + self.cnt = 0 + + def update(self, val, n=1): + """update avg by val and n, where val is the avg of n values""" + self.sum += val * n + self.cnt += n + self.avg = self.sum / self.cnt diff --git a/edu/class1/main_1.py b/edu/class1/main_1.py new file mode 100644 index 00000000..004b4cd1 --- /dev/null +++ b/edu/class1/main_1.py @@ -0,0 +1,62 @@ +# ViT Online Class +# Author: Dr. Zhu +# Project: PaddleViT (https://github.com/BR-IDL/PaddleViT) +# 2021.11 +import paddle +import numpy as np +from PIL import Image + +paddle.set_device('cpu') + +def main(): + # 1. Create a Tensor + t = paddle.zeros([3, 3]) + print(t) + + # 2. Create a random Tensor + t = paddle.randn([4, 3]) + print(t) + + # 3. Create a tensor from Image ./724.jpg 28x28 + img = np.array(Image.open('./724.jpg')) + for i in range(28): + for j in range(28): + print(f'{img[i, j]:03} ', end='') + print() + t = paddle.to_tensor(img, dtype='float32') + + # 4. print tensor type and dtype of tensor + print(type(t)) + print(t.dtype) + + # 5. transpose image tensor + t = t.transpose([1, 0]) + for i in range(28): + for j in range(28): + print(f'{int(t[i, j]):03} ', end='') + print() + + # 6. Reshape a random int tensor from 5x5 to 25 + t = paddle.randint(0, 10, [5, 5]) + print(t) + t1 = t.reshape([25]) + t2 = t.flatten(0) + print(t1) + print(t2) + + # 7. Unsqueeze a random int tensor from 5x5 to 5x5x1 + t = paddle.randint(0, 10, [5, 5]) + print(t) + print(t.shape) + print(t.unsqueeze(-1).shape) + + # 8. chunk a random int tensor from 5x15 to 5x5, 5x5 and 5x5 + t = paddle.randint(0, 10, [5, 15]) + print(t) + qkv = t.chunk(3, -1) + print(type(qkv)) + q, k, v = qkv + print(q) + +if __name__ == "__main__": + main() diff --git a/edu/class1/main_2.py b/edu/class1/main_2.py new file mode 100644 index 00000000..4f499860 --- /dev/null +++ b/edu/class1/main_2.py @@ -0,0 +1,92 @@ +# ViT Online Class +# Author: Dr. 
Zhu +# Project: PaddleViT (https://github.com/BR-IDL/PaddleViT) +# 2021.11 +import paddle +import paddle.nn as nn +import numpy as np +from PIL import Image +paddle.set_device('cpu') + +class Identity(nn.Layer): + def __init__(self): + super().__init__() + + def forward(self, x): + return x + + +class Mlp(nn.Layer): + def __init__(self, embed_dim, mlp_ratio, dropout=0.): + super().__init__() + self.fc1 = nn.Linear(embed_dim, int(embed_dim * mlp_ratio)) + self.fc2 = nn.Linear(int(embed_dim * mlp_ratio), embed_dim) + self.act = nn.GELU() + self.dropout = nn.Dropout(dropout) + + def forward(self, x): + x = self.fc1(x) + x = self.act(x) + x = self.dropout(x) + x = self.fc2(x) + x = self.dropout(x) + return x + + +class PatchEmbedding(nn.Layer): + def __init__(self, image_size, patch_size, in_channels, embed_dim, dropout=0.): + super().__init__() + self.patch_embedding = nn.Conv2D(in_channels=in_channels, + out_channels=embed_dim, + kernel_size=patch_size, + stride=patch_size, + weight_attr=paddle.ParamAttr(initializer=nn.initializer.Constant(1.0)), + bias_attr=False) + self.dropout = nn.Dropout(dropout) + + def forward(self, x): + # [n, c, h, w] + x = self.patch_embedding(x) # [n, c', h', w'] + x = x.flatten(2) # [n, c', h'*w'] + x = x.transpose([0, 2, 1]) # [n, h'*w', c'] + x = self.dropout(x) + return x + + +def main(): + # 1. Load image and convert to tensor + img = Image.open('./724.jpg') + img = np.array(img) + for i in range(28): + for j in range(28): + print(f'{img[i,j]:03} ', end='') + print() + + sample = paddle.to_tensor(img, dtype='float32') + # simulate a batch of data + sample = sample.reshape([1, 1, 28, 28]) + print(sample.shape) + + # 2. Patch Embedding + patch_embedding = PatchEmbedding(image_size=28, patch_size=7, in_channels=1, embed_dim=1) + out = patch_embedding(sample) + print(out) + print(out.shape) + for i in range(0, 28, 7): + for j in range(0, 28, 7): + print(paddle.sum(sample[0, 0, i:i+7, j:j+7]).numpy().item()) + + + + patch_embedding = PatchEmbedding(image_size=28, patch_size=7, in_channels=1, embed_dim=96) + out = patch_embedding(sample) + # 3. mlp + mlp = Mlp(96, 4.0) + out = mlp(out) + print(out) + print(out.shape) + + + +if __name__ == "__main__": + main() diff --git a/edu/class1/vit.py b/edu/class1/vit.py new file mode 100644 index 00000000..1a5f609d --- /dev/null +++ b/edu/class1/vit.py @@ -0,0 +1,115 @@ +# ViT Online Class +# Author: Dr. 
Zhu +# Project: PaddleViT (https://github.com/BR-IDL/PaddleViT) +# 2021.11 +import paddle +import paddle.nn as nn +import numpy as np +from PIL import Image + +paddle.set_device('cpu') + +class Identity(nn.Layer): + def __init__(self): + super().__init__() + + def forward(self, x): + return x + + +class Mlp(nn.Layer): + def __init__(self, embed_dim, mlp_ratio=4.0, dropout=0.): + super().__init__() + self.fc1 = nn.Linear(embed_dim, int(embed_dim * mlp_ratio)) + self.fc2 = nn.Linear(int(embed_dim * mlp_ratio), embed_dim) + self.act = nn.GELU() + self.dropout = nn.Dropout(dropout) + + def forward(self, x): + x = self.fc1(x) + x = self.act(x) + x = self.dropout(x) + x = self.fc2(x) + x = self.dropout(x) + return x + + +class PatchEmbedding(nn.Layer): + def __init__(self, image_size, patch_size, in_channels, embed_dim, dropout=0.): + super().__init__() + self.patch_embedding = nn.Conv2D(in_channels=in_channels, + out_channels=embed_dim, + kernel_size=patch_size, + stride=patch_size, + weight_attr=paddle.ParamAttr(initializer=nn.initializer.Constant(1.0)), + bias_attr=False) + self.dropout = nn.Dropout(dropout) + + def forward(self, x): + # [n, c, h, w] + x = self.patch_embedding(x) # [n, c', h', w'] + x = x.flatten(2) # [n, c', h'*w'] + x = x.transpose([0, 2, 1]) # [n, h'*w', c'] + x = self.dropout(x) + return x + + +class Attention(nn.Layer): + def __init__(self): + super().__init__() + + def forward(self, x): + return x + + +class EncoderLayer(nn.Layer): + def __init__(self, embed_dim): + super().__init__() + self.attn_norm = nn.LayerNorm(embed_dim) + self.attn = Attention() + self.mlp_norm = nn.LayerNorm(embed_dim) + self.mlp = Mlp(embed_dim) + + def forward(self, x): + h = x + x = self.attn_norm(x) + x = self.attn(x) + x = x + h + + h = x + x = self.mlp_norm(x) + x = self.mlp(x) + x = x + h + return x + + +class ViT(nn.Layer): + def __init__(self): + super().__init__() + self.patch_embed = PatchEmbedding(224, 7, 3, 16) + layer_list = [EncoderLayer(16) for i in range(5)] + self.encoders = nn.LayerList(layer_list) + self.head = nn.Linear(16, 10) + self.avgpool = nn.AdaptiveAvgPool1D(1) + + def forward(self, x): + x = self.patch_embed(x) # [n, h*w, c]: 4, 1024, 16 + for encoder in self.encoders: + x = encoder(x) + # avg + x = x.transpose([0, 2, 1]) + x = self.avgpool(x) + x = x.flatten(1) + x = self.head(x) + return x + + +def main(): + t = paddle.randn([4, 3, 224, 224]) + model = ViT() + out = model(t) + print(out.shape) + + +if __name__ == "__main__": + main() diff --git a/edu/class10/main.py b/edu/class10/main.py new file mode 100644 index 00000000..38e43fad --- /dev/null +++ b/edu/class10/main.py @@ -0,0 +1,103 @@ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from resnet import ResNet18 +from transformer import Transformer + +paddle.set_device('cpu') + + +class PositionEmbedding(nn.Layer): + def __init__(self, embed_dim): + super().__init__() + self.row_embed = nn.Embedding(50, embed_dim) + self.col_embed = nn.Embedding(50, embed_dim) + + def forward(self, x): + # x: [b, feat, H, W] + h, w = x.shape[-2:] + i = paddle.arange(w) + j = paddle.arange(h) + x_embed = self.col_embed(i) + y_embed = self.row_embed(i) + pos = paddle.concat([x_embed.unsqueeze(0).expand((h, x_embed.shape[0], x_embed.shape[1])), + y_embed.unsqueeze(1).expand((y_embed.shape[0], w, y_embed.shape[1]))], axis=-1) + pos = pos.transpose([2, 0, 1]) + pos = pos.unsqueeze(0) + pos = pos.expand([x.shape[0]] + pos.shape[1::]) #[batch_size, embed_dim, h, w] + return pos + + +class BboxEmbed(nn.Layer): + 
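    # simple 3-layer MLP (in_dim -> hidden -> hidden -> out_dim) used as the box regression head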
def __init__(self, in_dim, hidden_dim, out_dim): + super().__init__() + self.fc1 = nn.Linear(in_dim, hidden_dim) + self.fc2 = nn.Linear(hidden_dim, hidden_dim) + self.fc3 = nn.Linear(hidden_dim, out_dim) + self.act = nn.ReLU() + + def forward(self, x): + x = self.fc1(x) + x = self.act(x) + x = self.fc2(x) + x = self.act(x) + x = self.fc3(x) + return x + + +class DETR(nn.Layer): + def __init__(self, backbone, pos_embed, transformer, num_classes, num_queries): + super().__init__() + self.num_queries = num_queries + self.transformer = transformer + embed_dim = transformer.embed_dim + + self.class_embed = nn.Linear(embed_dim, num_classes + 1) + self.bbox_embed = BboxEmbed(embed_dim, embed_dim, 4) + self.query_embed = nn.Embedding(num_queries, embed_dim) + + self.input_proj = nn.Conv2D(backbone.num_channels, embed_dim, kernel_size=1) + self.backbone = backbone + self.pos_embed = pos_embed + + def forward(self, x): + print(f'----- INPUT: {x.shape}') + feat = self.backbone(x) + print(f'----- Feature after ResNet18: {feat.shape}') + pos_embed = self.pos_embed(feat) + print(f'----- Positional Embedding: {pos_embed.shape}') + + feat = self.input_proj(feat) + print(f'----- Feature after input_proj: {feat.shape}') + out, _ = self.transformer(feat, self.query_embed.weight, pos_embed) + print(f'----- out after transformer: {out.shape}') + + out_class = self.class_embed(out) + out_coord = self.bbox_embed(out) + print(f'----- out for class: {out_class.shape}') + print(f'----- out for bbox: {out_coord.shape}') + #out_coord = F.sigmoid(out_coord) + + return out_class, out_coord + + +def build_detr(): + backbone = ResNet18() + transformer = Transformer() + pos_embed = PositionEmbedding(16) + detr = DETR(backbone, pos_embed, transformer, 10, 100) + return detr + + +def main(): + t = paddle.randn([3, 3, 224, 224]) + model = build_detr() + out = model(t) + print(out[0].shape, out[1].shape) + + + +if __name__ == "__main__": + main() + + diff --git a/edu/class10/resnet.py b/edu/class10/resnet.py new file mode 100644 index 00000000..e42c6270 --- /dev/null +++ b/edu/class10/resnet.py @@ -0,0 +1,106 @@ +import paddle +import paddle.nn as nn + +paddle.set_device('cpu') + +class Identity(nn.Layer): + def __init__(self): + super().__init__() + + def forward(self, x): + return x + +class Block(nn.Layer): + def __init__(self, in_dim, out_dim, stride): + super().__init__() + self.conv1 = nn.Conv2D(in_dim, out_dim, 3, stride=stride, padding=1, bias_attr=False) + self.bn1 = nn.BatchNorm2D(out_dim) + self.conv2 = nn.Conv2D(out_dim, out_dim, 3, stride=1, padding=1, bias_attr=False) + self.bn2 = nn.BatchNorm2D(out_dim) + self.relu = nn.ReLU() + + if stride == 2 or in_dim != out_dim: + self.downsample = nn.Sequential(*[ + nn.Conv2D(in_dim, out_dim, 1, stride=stride), + nn.BatchNorm2D(out_dim)]) + else: + self.downsample = Identity() + + def forward(self, x): + h = x + x = self.conv1(x) + x = self.bn1(x) + x = self.relu(x) + x = self.conv2(x) + x = self.bn2(x) + identity = self.downsample(h) + x = x + identity + x = self.relu(x) + return x + +class ResNet18(nn.Layer): + def __init__(self, in_dim=64, num_classes=10): + super().__init__() + self.num_channels = 512 + self.in_dim = in_dim + # stem layers + self.conv1 = nn.Conv2D(in_channels=3, + out_channels=in_dim, + kernel_size=3, + stride=1, + padding=1, + bias_attr=False) + self.bn1 = nn.BatchNorm2D(in_dim) + self.relu = nn.ReLU() + #blocks + self.layer1 = self._make_layer(dim=64, n_blocks=2, stride=1) + self.layer2 = self._make_layer(dim=128, n_blocks=2, stride=2) + self.layer3 
= self._make_layer(dim=256, n_blocks=2, stride=2) + self.layer4 = self._make_layer(dim=512, n_blocks=2, stride=2) + # head layer + self.avgpool = nn.AdaptiveAvgPool2D(1) + self.classifier = nn.Linear(512, num_classes) + + def _make_layer(self, dim, n_blocks, stride): + layer_list = [] + layer_list.append(Block(self.in_dim, dim, stride=stride)) + self.in_dim = dim + for i in range(1, n_blocks): + layer_list.append(Block(self.in_dim, dim, stride=1)) + return nn.Sequential(*layer_list) + + + # CLASS 10: Modify the forward, remove the head and classifier + def forward(self, x): + x = self.conv1(x) + x = self.bn1(x) + x = self.relu(x) + x = self.layer1(x) + x = self.layer2(x) + x = self.layer3(x) + x = self.layer4(x) + return x + + #def forward(self, x): + # x = self.conv1(x) + # x = self.bn1(x) + # x = self.relu(x) + # x = self.layer1(x) + # x = self.layer2(x) + # x = self.layer3(x) + # x = self.layer4(x) + # x = self.forward_feature(x) + # x = self.avgpool(x) + # x = x.flatten(1) + # x = self.classifier(x) + # return x + +def main(): + t = paddle.randn([4, 3, 224, 224]) + model = ResNet18() + print(model) + out = model(t) + print(out.shape) + +if __name__ == "__main__": + main() diff --git a/edu/class10/transformer.py b/edu/class10/transformer.py new file mode 100644 index 00000000..8d97eab0 --- /dev/null +++ b/edu/class10/transformer.py @@ -0,0 +1,221 @@ +# ViT Online Class +# Author: Dr. Zhu +# Project: PaddleViT (https:///github.com/BR-IDL/PaddleViT) +# 2021.11 + +import copy +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + +paddle.set_device('cpu') + +class Identity(nn.Layer): + def __init__(self): + super().__init__() + + def forward(self, x): + return x + + +class Mlp(nn.Layer): + def __init__(self, embed_dim, mlp_ratio, dropout=0.): + super().__init__() + self.fc1 = nn.Linear(embed_dim, int(embed_dim * mlp_ratio)) + self.fc2 = nn.Linear(int(embed_dim * mlp_ratio), embed_dim) + self.act = nn.GELU() + self.dropout = nn.Dropout(dropout) + + def forward(self, x): + x = self.fc1(x) + x = self.act(x) + x = self.dropout(x) + x = self.fc2(x) + x = self.dropout(x) + return x + + +class Attention(nn.Layer): + """multi-head self attention""" + def __init__(self, embed_dim, num_heads, qkv_bias=True, dropout=0., attention_dropout=0.): + super().__init__() + self.num_heads = num_heads + self.head_dim = int(embed_dim / num_heads) + self.all_head_dim = self.head_dim * num_heads + self.scales = self.head_dim ** -0.5 + + + # CLASS 10: support decoder + self.q = nn.Linear(embed_dim, + self.all_head_dim) + self.k = nn.Linear(embed_dim, + self.all_head_dim) + self.v = nn.Linear(embed_dim, + self.all_head_dim) + + + self.proj = nn.Linear(self.all_head_dim, embed_dim) + self.dropout = nn.Dropout(dropout) + self.attention_dropout = nn.Dropout(attention_dropout) + self.softmax = nn.Softmax(axis=-1) + + def transpose_multihead(self, x): + # x: [seq_l, batch, all_head_dim] -> [seq_l, batch, n_head, head_dim] + new_shape = x.shape[:-1] + [self.num_heads, self.head_dim] + x = x.reshape(new_shape) + x = x.flatten(1, 2) # merge batch and n_head: [seq_l, batch*n_head, head_dim] + x = x.transpose([1, 0, 2]) #[batch * n_head, seq_l, head_dim] + return x + + def forward(self, query, key, value): + lk = key.shape[0] # when enc-dec: num_patches (sequence len, token len) + b = key.shape[1] # when enc-dec: batch_size + lq = query.shape[0] # when enc-dec: num_queries + d = query.shape[2] # when enc-dec: embed_dim + + q = self.q(query) + k = self.k(key) + v = self.v(value) + q, k, v = 
map(self.transpose_multihead, [q, k, v]) + + print(f'----- ----- ----- ----- [Attn] batch={key.shape[1]}, n_head={self.num_heads}, head_dim={self.head_dim}') + print(f'----- ----- ----- ----- [Attn] q: {q.shape}, k: {k.shape}, v:{v.shape}') + attn = paddle.matmul(q, k, transpose_y=True) # q * k' + attn = attn * self.scales + attn = self.softmax(attn) + attn = self.attention_dropout(attn) + print(f'----- ----- ----- ----- [Attn] attn: {attn.shape}') + + out = paddle.matmul(attn, v) + out = out.transpose([1, 0, 2]) + out = out.reshape([lq, b, d]) + + out = self.proj(out) + out = self.dropout(out) + + return out + + +class EncoderLayer(nn.Layer): + def __init__(self, embed_dim=768, num_heads=4, mlp_ratio=4.0): + super().__init__() + self.attn_norm = nn.LayerNorm(embed_dim) + self.attn = Attention(embed_dim, num_heads) + self.mlp_norm = nn.LayerNorm(embed_dim) + self.mlp = Mlp(embed_dim, mlp_ratio) + + def forward(self, x, pos=None): + + h = x + x = self.attn_norm(x) + q = x + pos if pos is not None else x + k = x + pos if pos is not None else x + print(f'----- ----- ----- encoder q: {q.shape}, k: {k.shape}, v:{x.shape}') + x = self.attn(q, k, x) + x = x + h + + h = x + x = self.mlp_norm(x) + x = self.mlp(x) + x = x + h + print(f'----- ----- ----- encoder out: {x.shape}') + return x + + +class DecoderLayer(nn.Layer): + def __init__(self, embed_dim=768, num_heads=4, mlp_ratio=4.0): + super().__init__() + self.attn_norm = nn.LayerNorm(embed_dim) + self.attn = Attention(embed_dim, num_heads) + self.enc_dec_attn_norm = nn.LayerNorm(embed_dim) + self.enc_dec_attn = Attention(embed_dim, num_heads) + self.mlp_norm = nn.LayerNorm(embed_dim) + self.mlp = Mlp(embed_dim, mlp_ratio) + + def forward(self, x, enc_out, pos=None, query_pos=None): + + h = x + x = self.attn_norm(x) + q = x + query_pos if pos is not None else x + k = x + query_pos if pos is not None else x + print(f'----- ----- ----- decoder(self-attn) q: {q.shape}, k: {k.shape}, v:{x.shape}') + x = self.attn(q, k, x) + x = x + h + + h = x + x = self.enc_dec_attn_norm(x) + q = x + query_pos if pos is not None else x + k = enc_out + pos if pos is not None else x + v = enc_out + print(f'----- ----- ----- decoder(enc-dec attn) q: {q.shape}, k: {k.shape}, v:{v.shape}') + x = self.attn(q, k, v) + x = x + h + + h = x + x = self.mlp_norm(x) + x = self.mlp(x) + x = x + h + print(f'----- ----- ----- decoder out: {x.shape}') + return x + + +class Transformer(nn.Layer): + def __init__(self, embed_dim=32, num_heads=4, num_encoders=2, num_decoders=2): + super().__init__() + self.embed_dim = embed_dim + self.encoder = nn.LayerList([EncoderLayer(embed_dim, num_heads) for i in range(num_encoders)]) + self.decoder = nn.LayerList([DecoderLayer(embed_dim, num_heads) for i in range(num_decoders)]) + self.encoder_norm = nn.LayerNorm(embed_dim) + self.decoder_norm = nn.LayerNorm(embed_dim) + + def forward(self, x, query_embed, pos_embed): + B, C, H, W = x.shape + print(f'----- ----- Transformer INPUT: {x.shape}') + x = x.flatten(2) #[B, C, H*W] + x = x.transpose([2, 0, 1]) # [H*W, B, C] + print(f'----- ----- Transformer INPUT(after reshape): {x.shape}') + + # [B, dim, H, W] + pos_embed = pos_embed.flatten(2) + pos_embed = pos_embed.transpose([2, 0, 1]) #[H*W, B, dim] + print(f'----- ----- pos_embed(after reshape): {pos_embed.shape}') + + # [num_queries, dim] + query_embed = query_embed.unsqueeze(1) + query_embed = query_embed.expand((query_embed.shape[0], B, query_embed.shape[2])) + print(f'----- ----- query_embed(after reshape): {query_embed.shape}') + + target = 
paddle.zeros_like(query_embed) + print(f'----- ----- target (now all zeros): {target.shape}') + + for encoder_layer in self.encoder: + encoder_out = encoder_layer(x, pos_embed) + encoder_out = self.encoder_norm(encoder_out) + print(f'----- ----- encoder out: {encoder_out.shape}') + + for decoder_layer in self.decoder: + decoder_out = decoder_layer(target, + encoder_out, + pos_embed, + query_embed) + decoder_out = self.decoder_norm(decoder_out) + decoder_out = decoder_out.unsqueeze(0) + print(f'----- ----- decoder out: {decoder_out.shape}') + + + decoder_out = decoder_out.transpose([0, 2, 1, 3]) #[1, B, num_queries, embed_dim] + encoder_out = encoder_out.transpose([1, 2, 0]) + encoder_out = encoder_out.reshape([B, C, H, W]) + print(f'----- ----- decoder out(after reshape): {decoder_out.shape}') + + return decoder_out, encoder_out + + +def main(): + trans = Transformer() + print(trans) + + +if __name__ == "__main__": + main() + diff --git a/edu/class2/attention.py b/edu/class2/attention.py new file mode 100644 index 00000000..869022b8 --- /dev/null +++ b/edu/class2/attention.py @@ -0,0 +1,71 @@ +import paddle +import paddle.nn as nn + +paddle.set_device('cpu') + +class Attention(nn.Layer): + def __init__(self, embed_dim, num_heads, + qkv_bias=False, qk_scale=None, dropout=0., attention_dropout=0.): + super().__init__() + self.embed_dim = embed_dim + self.num_heads = num_heads + self.head_dim = int(embed_dim / num_heads) + self.all_head_dim = self.head_dim * num_heads + self.qkv = nn.Linear(embed_dim, + self.all_head_dim * 3, + bias_attr=False if qkv_bias is False else None) + self.scale = self.head_dim ** -0.5 if qk_scale is None else qk_scale + self.dropout = nn.Dropout(dropout) + self.attention_dropout = nn.Dropout(attention_dropout) + + self.proj = nn.Linear(self.all_head_dim, embed_dim) + self.softmax = nn.Softmax(-1) + + def transpose_multi_head(self, x): + # x:[n, num_patches, all_head_dim] + new_shape = x.shape[:-1] + [self.num_heads, self.head_dim] + x = x.reshape(new_shape) + # x:[n, num_patches, num_heads, head_dim] + x = x.transpose([0, 2, 1, 3]) + # x:[n, num_heads, num_patches, head_dim] + return x + + def forward(self, x): + B, N, _ = x.shape + # x: [n, num_patches, embed_dim] + qkv = self.qkv(x).chunk(3, -1) + # qkv: [n, num_patches, all_head_dim] * 3 + q, k, v = map(self.transpose_multi_head, qkv) + # q, k, v:[n, num_heads, num_patches, head_dim] + attn = paddle.matmul(q, k, transpose_y=True) + attn = self.scale * attn + attn = self.softmax(attn) + attn_weights = attn + attn = self.attention_dropout(attn) + # attn: [n, num_heads, num_patches, num_patches] + + out = paddle.matmul(attn, v) + # out: [n, num_heads, num_patches, head_dim] + out = out.transpose([0, 2, 1, 3]) + # out: [n, num_patches, num_heads, head_dim] + out = out.reshape([B, N, -1]) + + out = self.proj(out) + out = self.dropout(out) + return out, attn_weights + +def main(): + t = paddle.randn([4, 16, 96]) + print('input shape = ', t.shape) + + model = Attention(embed_dim=96, num_heads=8, + qkv_bias=False, qk_scale=None, dropout=0., attention_dropout=0.) + print(model) + + out, attn_weights = model(t) + print(out.shape) + print(attn_weights.shape) + + +if __name__ == "__main__": + main() diff --git a/edu/class3/vit_answer.py b/edu/class3/vit_answer.py new file mode 100644 index 00000000..2d033412 --- /dev/null +++ b/edu/class3/vit_answer.py @@ -0,0 +1,191 @@ +# ViT Online Class +# Author: Dr. 
Zhu +# Project: PaddleViT (https:///github.com/BR-IDL/PaddleViT) +# 2021.11 + +import copy +import paddle +import paddle.nn as nn + +class Identity(nn.Layer): + def __init__(self): + super().__init__() + + def forward(self, x): + return x + + +class Mlp(nn.Layer): + def __init__(self, embed_dim, mlp_ratio, dropout=0.): + super().__init__() + self.fc1 = nn.Linear(embed_dim, int(embed_dim * mlp_ratio)) + self.fc2 = nn.Linear(int(embed_dim * mlp_ratio), embed_dim) + self.act = nn.GELU() + self.dropout = nn.Dropout(dropout) + + def forward(self, x): + x = self.fc1(x) + x = self.act(x) + x = self.dropout(x) + x = self.fc2(x) + x = self.dropout(x) + return x + + +class PatchEmbedding(nn.Layer): + def __init__(self, image_size=224, patch_size=16, in_channels=3, embed_dim=768, dropout=0.): + super().__init__() + n_patches = (image_size // patch_size) * (image_size // patch_size) + self.patch_embedding = nn.Conv2D(in_channels=in_channels, + out_channels=embed_dim, + kernel_size=patch_size, + stride=patch_size) + + self.position_embeddings = paddle.create_parameter( + shape=[1, n_patches + 1, embed_dim], + dtype='float32', + default_initializer=nn.initializer.TruncatedNormal(std=.02)) + + self.cls_token = paddle.create_parameter( + shape=[1, 1, embed_dim], + dtype='float32', + default_initializer=nn.initializer.Constant(0)) + + self.dropout = nn.Dropout(dropout) + + def forward(self, x): + # [n, c, h, w] + cls_tokens = self.cls_token.expand((x.shape[0], -1, -1)) + x = self.patch_embedding(x) # [n, c', h', w'] + x = x.flatten(2) # [n, c', h'*w'] + x = x.transpose([0, 2, 1]) # [n, h'*w', c'] + x = paddle.concat((cls_tokens, x), axis=1) + + embeddings = x + self.position_embeddings + embeddings = self.dropout(embeddings) + return embeddings + + +class Attention(nn.Layer): + """multi-head self attention""" + def __init__(self, embed_dim, num_heads, qkv_bias=True, dropout=0., attention_dropout=0.): + super().__init__() + self.num_heads = num_heads + self.head_dim = int(embed_dim / num_heads) + self.all_head_dim = self.head_dim * num_heads + self.scales = self.head_dim ** -0.5 + + self.qkv = nn.Linear(embed_dim, + self.all_head_dim * 3) + + self.proj = nn.Linear(embed_dim, embed_dim) + + self.dropout = nn.Dropout(dropout) + self.attention_dropout = nn.Dropout(attention_dropout) + self.softmax = nn.Softmax(axis=-1) + + def transpose_multihead(self, x): + # x: [N, num_patches, all_head_dim] -> [N, n_heads, num_patches, head_dim] + new_shape = x.shape[:-1] + [self.num_heads, self.head_dim] + x = x.reshape(new_shape) + x = x.transpose([0, 2, 1, 3]) + return x + + def forward(self, x): + B, N, _ = x.shape + # x -> [N, num_patches, dim] + # x -> q, k, v + qkv = self.qkv(x).chunk(3, axis=-1) # list of tensors + q, k, v = map(self.transpose_multihead, qkv) + + attn = paddle.matmul(q, k, transpose_y=True) # q * k' + attn = attn * self.scales + attn = self.softmax(attn) + attn = self.attention_dropout(attn) + + out = paddle.matmul(attn, v) + out = out.transpose([0, 2, 1, 3]) + out = out.reshape([B, N, -1]) + + out = self.proj(out) + out = self.dropout(out) + + return out + + + +class EncoderLayer(nn.Layer): + def __init__(self, embed_dim=768, num_heads=4, qkv_bias=True, mlp_ratio=4.0, dropout=0., attention_dropout=0.): + super().__init__() + self.attn_norm = nn.LayerNorm(embed_dim) + self.attn = Attention(embed_dim, num_heads) + self.mlp_norm = nn.LayerNorm(embed_dim) + self.mlp = Mlp(embed_dim, mlp_ratio) + + def forward(self, x): + h = x + x = self.attn_norm(x) + x = self.attn(x) + x = x + h + + h = x + x = 
self.mlp_norm(x) + x = self.mlp(x) + x = x + h + return x + + +class Encoder(nn.Layer): + def __init__(self, embed_dim, depth): + super().__init__() + layer_list = [] + for i in range(depth): + encoder_layer = EncoderLayer() + layer_list.append(encoder_layer) + self.layers = nn.LayerList(layer_list) + + self.norm = nn.LayerNorm(embed_dim) + + def forward(self, x): + for layer in self.layers: + x = layer(x) + + x = self.norm(x) + return x + + +class VisualTransformer(nn.Layer): + def __init__(self, + image_size=224, + patch_size=16, + in_channels=3, + num_classes=1000, + embed_dim=768, + depth=3, + num_heads=8, + mlp_ratio=4, + qkv_bias=True, + dropout=0., + attention_dropout=0., + droppath=0.): + super().__init__() + self.patch_embedding = PatchEmbedding(image_size, patch_size, in_channels, embed_dim) + self.encoder = Encoder(embed_dim, depth) + self.classifier = nn.Linear(embed_dim, num_classes) + + def forward(self, x): + x = self.patch_embedding(x) + x = self.encoder(x) + x = self.classifier(x[:, 0]) + + return x + + +def main(): + vit = VisualTransformer() + print(vit) + paddle.summary(vit, (4, 3, 224, 224)) # must be tuple + + +if __name__ == "__main__": + main() diff --git a/edu/class3/vit_homework.py b/edu/class3/vit_homework.py new file mode 100644 index 00000000..127ce9ff --- /dev/null +++ b/edu/class3/vit_homework.py @@ -0,0 +1,139 @@ +# ViT Online Class +# Author: Dr. Zhu +# Project: PaddleViT (https:///github.com/BR-IDL/PaddleViT) +# 2021.11 + +import copy +import paddle +import paddle.nn as nn + +class Identity(nn.Layer): + def __init__(self): + super().__init__() + + def forward(self, x): + return x + + +class Mlp(nn.Layer): + def __init__(self, embed_dim, mlp_ratio, dropout=0.): + super().__init__() + self.fc1 = nn.Linear(embed_dim, int(embed_dim * mlp_ratio)) + self.fc2 = nn.Linear(int(embed_dim * mlp_ratio), embed_dim) + self.act = nn.GELU() + self.dropout = nn.Dropout(dropout) + + def forward(self, x): + # TODO + + +class PatchEmbedding(nn.Layer): + def __init__(self, image_size=224, patch_size=16, in_channels=3, embed_dim=768, dropout=0.): + super().__init__() + n_patches = (image_size // patch_size) * (image_size // patch_size) + self.patch_embedding = nn.Conv2D(in_channels=in_channels, + out_channels=embed_dim, + kernel_size=patch_size, + stride=patch_size) + self.dropout = nn.Dropout(dropout) + + # TODO: add class token + + # TODO: add position embedding + + + def forward(self, x): + # [n, c, h, w] + # TODO: forward + + +class Attention(nn.Layer): + """multi-head self attention""" + def __init__(self, embed_dim, num_heads, qkv_bias=True, dropout=0., attention_dropout=0.): + super().__init__() + self.num_heads = num_heads + self.head_dim = int(embed_dim / num_heads) + self.all_head_dim = self.head_dim * num_heads + self.scales = self.head_dim ** -0.5 + + self.qkv = nn.Linear(embed_dim, + self.all_head_dim * 3) + + self.proj = nn.Linear(embed_dim, embed_dim) + + self.dropout = nn.Dropout(dropout) + self.attention_dropout = nn.Dropout(attention_dropout) + self.softmax = nn.Softmax(axis=-1) + + def transpose_multihead(self, x): + # x: [N, num_patches, all_head_dim] -> [N, n_heads, num_patches, head_dim] + new_shape = x.shape[:-1] + [self.num_heads, self.head_dim] + x = x.reshape(new_shape) + x = x.transpose([0, 2, 1, 3]) + return x + + def forward(self, x): + # TODO + + + + +class EncoderLayer(nn.Layer): + def __init__(self, embed_dim=768, num_heads=4, qkv_bias=True, mlp_ratio=4.0, dropout=0., attention_dropout=0.): + super().__init__() + self.attn_norm = 
nn.LayerNorm(embed_dim) + self.attn = Attention(embed_dim, num_heads) + self.mlp_norm = nn.LayerNorm(embed_dim) + self.mlp = Mlp(embed_dim, mlp_ratio) + + def forward(self, x): + # TODO + + +class Encoder(nn.Layer): + def __init__(self, embed_dim, depth): + super().__init__() + layer_list = [] + for i in range(depth): + encoder_layer = EncoderLayer() + layer_list.append(encoder_layer) + self.layers = nn.LayerList(layer_list) + self.norm = nn.LayerNorm(embed_dim) + + def forward(self, x): + # TODO + + + +class VisualTransformer(nn.Layer): + def __init__(self, + image_size=224, + patch_size=16, + in_channels=3, + num_classes=1000, + embed_dim=768, + depth=3, + num_heads=8, + mlp_ratio=4, + qkv_bias=True, + dropout=0., + attention_dropout=0., + droppath=0.): + super().__init__() + self.patch_embedding = PatchEmbedding(image_size, patch_size, in_channels, embed_dim) + self.encoder = Encoder(embed_dim, depth) + self.classifier = nn.Linear(embed_dim, num_classes) + + def forward(self, x): + # TODO: forward + + +def main(): + vit = VisualTransformer() + print(vit) + paddle.summary(vit, (4, 3, 224, 224)) # must be tuple + + +if __name__ == "__main__": + main() + diff --git a/edu/class4/deit.py b/edu/class4/deit.py new file mode 100644 index 00000000..d4d8d81c --- /dev/null +++ b/edu/class4/deit.py @@ -0,0 +1,196 @@ +# ViT Online Class +# Author: Dr. Zhu +# Project: PaddleViT (https:///github.com/BR-IDL/PaddleViT) +# 2021.11 + +import copy +import paddle +import paddle.nn as nn + +class Identity(nn.Layer): + def __init__(self): + super().__init__() + + def forward(self, x): + return x + + +class Mlp(nn.Layer): + def __init__(self, embed_dim, mlp_ratio, dropout=0.): + super().__init__() + self.fc1 = nn.Linear(embed_dim, int(embed_dim * mlp_ratio)) + self.fc2 = nn.Linear(int(embed_dim * mlp_ratio), embed_dim) + self.act = nn.GELU() + self.dropout = nn.Dropout(dropout) + + def forward(self, x): + x = self.fc1(x) + x = self.act(x) + x = self.dropout(x) + x = self.fc2(x) + x = self.dropout(x) + return x + + +class PatchEmbedding(nn.Layer): + def __init__(self, image_size=224, patch_size=16, in_channels=3, embed_dim=768, dropout=0.): + super().__init__() + n_patches = (image_size // patch_size) * (image_size // patch_size) + self.patch_embedding = nn.Conv2D(in_channels=in_channels, + out_channels=embed_dim, + kernel_size=patch_size, + stride=patch_size) + + self.position_embeddings = paddle.create_parameter( + shape=[1, n_patches + 2, embed_dim], + dtype='float32', + default_initializer=nn.initializer.TruncatedNormal(std=.02)) + + self.cls_token = paddle.create_parameter( + shape=[1, 1, embed_dim], + dtype='float32', + default_initializer=nn.initializer.Constant(0)) + + self.distill_token = paddle.create_parameter( + shape=[1, 1, embed_dim], + dtype='float32', + default_initializer=nn.initializer.TruncatedNormal(std=.02)) + + self.dropout = nn.Dropout(dropout) + + def forward(self, x): + # [n, c, h, w] + cls_tokens = self.cls_token.expand((x.shape[0], -1, -1)) + distill_tokens = self.distill_token.expand((x.shape[0], -1, -1)) + x = self.patch_embedding(x) # [n, c', h', w'] + x = x.flatten(2) # [n, c', h'*w'] + x = x.transpose([0, 2, 1]) # [n, h'*w', c'] + x = paddle.concat((cls_tokens, distill_tokens, x), axis=1) + + embeddings = x + self.position_embeddings + embeddings = self.dropout(embeddings) + return embeddings + + +class Attention(nn.Layer): + """multi-head self attention""" + def __init__(self, embed_dim, num_heads, qkv_bias=True, dropout=0., attention_dropout=0.): + super().__init__() + 
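        # embed_dim is split evenly across num_heads; q, k and v come from one fused linear layer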
self.num_heads = num_heads + self.head_dim = int(embed_dim / num_heads) + self.all_head_dim = self.head_dim * num_heads + self.scales = self.head_dim ** -0.5 + + self.qkv = nn.Linear(embed_dim, + self.all_head_dim * 3) + + self.proj = nn.Linear(embed_dim, embed_dim) + + self.dropout = nn.Dropout(dropout) + self.attention_dropout = nn.Dropout(attention_dropout) + self.softmax = nn.Softmax(axis=-1) + + def transpose_multihead(self, x): + # x: [N, num_patches, all_head_dim] -> [N, n_heads, num_patches, head_dim] + new_shape = x.shape[:-1] + [self.num_heads, self.head_dim] + x = x.reshape(new_shape) + x = x.transpose([0, 2, 1, 3]) + return x + + def forward(self, x): + # x -> [N, num_patches, dim] + # x -> q, k, v + qkv = self.qkv(x).chunk(3, axis=-1) # list of tensors + q, k, v = map(self.transpose_multihead, qkv) + + attn = paddle.matmul(q, k, transpose_y=True) # q * k' + attn = attn * self.scales + attn = self.softmax(attn) + attn = self.attention_dropout(attn) + + out = paddle.matmul(attn, v) + out = out.transpose([0, 2, 1, 3]) + out = out.reshape(out.shape[:-2]+[self.all_head_dim]) + + out = self.proj(out) + out = self.dropout(out) + + return out + + +class EncoderLayer(nn.Layer): + def __init__(self, embed_dim=768, num_heads=4, qkv_bias=True, mlp_ratio=4.0, dropout=0., attention_dropout=0.): + super().__init__() + self.attn_norm = nn.LayerNorm(embed_dim) + self.attn = Attention(embed_dim, num_heads) + self.mlp_norm = nn.LayerNorm(embed_dim) + self.mlp = Mlp(embed_dim, mlp_ratio) + + def forward(self, x): + h = x + x = self.attn_norm(x) + x = self.attn(x) + x = x + h + + h = x + x = self.mlp_norm(x) + x = self.mlp(x) + x = x + h + return x + + +class Encoder(nn.Layer): + def __init__(self, embed_dim, depth): + super().__init__() + layer_list = [] + for i in range(depth): + encoder_layer = EncoderLayer() + layer_list.append(encoder_layer) + + self.layers = nn.LayerList(layer_list) + self.norm = nn.LayerNorm(embed_dim) + + def forward(self, x): + for layer in self.layers: + x = layer(x) + x = self.norm(x) + return x[:, 0], x[:, 1] + + +class DeiT(nn.Layer): + def __init__(self, + image_size=224, + patch_size=16, + in_channels=3, + num_classes=1000, + embed_dim=768, + depth=3, + num_heads=8, + mlp_ratio=4, + qkv_bias=True, + dropout=0., + attention_dropout=0., + droppath=0.): + super().__init__() + self.patch_embedding = PatchEmbedding(224, 16, 3, 768) + self.encoder = Encoder(embed_dim, depth) + self.head = nn.Linear(embed_dim, num_classes) + self.head_distill = nn.Linear(embed_dim, num_classes) + + def forward(self, x): + x = self.patch_embedding(x) + x, x_distill = self.encoder(x) + x = self.head(x) + x_distill = self.head_distill(x_distill) + if self.training: + return x, x_distill + return (x + x_distill) / 2 + + +def main(): + model = DeiT() + print(model) + paddle.summary(model, (4, 3, 224, 224)) # must be tuple + +if __name__ == "__main__": + main() diff --git a/edu/class4/transforms.py b/edu/class4/transforms.py new file mode 100644 index 00000000..4432d1f1 --- /dev/null +++ b/edu/class4/transforms.py @@ -0,0 +1,66 @@ +import numpy as np +from PIL import Image +import paddle +import paddle.vision.transforms as T +paddle.set_device('cpu') + + +def crop(image, region): + # region: [i, j, h, w] + cropped_image = T.crop(image, *region) + return cropped_image + + +class CenterCrop(): + def __init__(self, size): + self.size = size + + def __call__(self, image): + w, h = image.size + ch, cw = self.size + crop_top = int(round(h - ch) / 2.) + crop_left = int(round(w - cw) / 2.) 
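+        # region is (top, left, height, width), the argument order expected by T.crop above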
+ return crop(image, (crop_top, crop_left, ch, cw)) + + +class Resize(): + def __init__(self, size): + self.size = size + + def __call__(self, image): + return T.resize(image, self.size) + + +class ToTensor(): + def __init__(self): + pass + + def __call__(self, image): + w, h = image.size + img = paddle.to_tensor(np.array(image)) + if img.dtype == paddle.uint8: + img = paddle.cast(img, 'float32') /255. + img = img.transpose([2, 0, 1]) # 'CHW' + return img + + +class Compose(): + def __init__(self, transforms): + self.transforms = transforms + + def __call__(self, image): + for t in self.transforms: + image = t(image) + return image + + +def main(): + img = Image.open('img.jpg') + transforms = Compose([Resize([256, 256]), CenterCrop([112, 112])]) + #transforms = Compose([Resize([256, 256]), CenterCrop([112, 112]), ToTensor()]) + out = transforms(img) + print(out) + out.save('img_crop.jpg') + +if __name__ == "__main__": + main() diff --git a/edu/class5/main.py b/edu/class5/main.py new file mode 100644 index 00000000..daa88988 --- /dev/null +++ b/edu/class5/main.py @@ -0,0 +1,173 @@ +import paddle +import paddle.nn as nn + +class PatchEmbedding(nn.Layer): + def __init__(self, patch_size=4, embed_dim=96): + super().__init__() + self.patch_embed = nn.Conv2D(3, + embed_dim, + kernel_size=patch_size, + stride=patch_size) + self.norm = nn.LayerNorm(embed_dim) + + def forward(self, x): + x = self.patch_embed(x) # [n, embed_dim, h, w] + x = x.flatten(2) #[n, embed_dim, h*w] + x = x.transpose([0, 2, 1]) #[n, h*w, embed_dim] + x = self.norm(x) #[n, num_patches, embed_dim] + return x + + +class PatchMerging(nn.Layer): + def __init__(self, input_resolution, dim): + super().__init__() + self.resolution = input_resolution + self.dim = dim + self.reduction = nn.Linear(4 * dim, 2 * dim) + self.norm = nn.LayerNorm(4 * dim) + + def forward(self, x): + h, w = self.resolution + b, _, c = x.shape # [n, num_patches, dim] + + x = x.reshape([b, h, w, c]) + + x0 = x[:, 0::2, 0::2, :] + x1 = x[:, 1::2, 0::2, :] + x2 = x[:, 0::2, 1::2, :] + x3 = x[:, 1::2, 1::2, :] # [b, h/2, w/2, c] + + x = paddle.concat([x0, x1, x2, x3], axis=-1) #[b, h/2, w/2, 4c] + x = x.reshape([b, -1, 4*c]) #[b, h*w/4, 4c] + x = self.norm(x) + x = self.reduction(x) + return x + + +class Mlp(nn.Layer): + def __init__(self, dim, mlp_ratio=4.0, dropout=0.): + super().__init__() + self.fc1 = nn.Linear(dim, int(dim * mlp_ratio)) + self.fc2 = nn.Linear(int(dim * mlp_ratio), dim) + self.act = nn.GELU() + self.dropout = nn.Dropout(dropout) + + def forward(self, x): + x = self.fc1(x) + x = self.act(x) + x = self.dropout(x) + x = self.fc2(x) + x = self.dropout(x) + return x + + +def windows_partition(x, window_size): + B, H, W, C = x.shape + x = x.reshape([B, H//window_size, window_size, W//window_size, window_size, C]) + x = x.transpose([0, 1, 3, 2, 4, 5]) #[B, h//ws, w//ws, ws, ws, c] + x = x.reshape([-1, window_size, window_size, C]) # [B * num_window, ws, ws, c] + return x + + +def windows_reverse(windows, window_size, H, W): + # windows: [B*num_windows, ws*ws, c] + B = int(windows.shape[0] // (H / window_size * W / window_size)) + x = windows.reshape([B, H // window_size, W // window_size, window_size, window_size, -1]) + x = x.transpose([0, 1, 3, 2, 4, 5]) + x = x.reshape([B, H, W, -1]) + return x + + +class WindowAttention(nn.Layer): + def __init__(self, dim, window_size, num_heads): + super().__init__() + self.dim = dim + self.dim_head = dim // num_heads + self.num_heads = num_heads + self.scale = self.dim_head ** -0.5 + self.qkv = nn.Linear(dim, 
+ dim * 3) + self.proj = nn.Linear(dim, dim) + self.softmax = nn.Softmax(axis=-1) + + def transpose_multi_head(self, x): + new_shape = x.shape[:-1] + [self.num_heads, self.dim_head] + x = x.reshape(new_shape) + x = x.transpose([0, 2, 1, 3]) #[B, num_heads, num_patches, dim_head] + return x + + def forward(self, x): + B, N, d = x.shape + qkv = self.qkv(x).chunk(3, axis=-1) + q, k, v = map(self.transpose_multi_head, qkv) + q = q * self.scale + attn = paddle.matmul(q, k, transpose_y=True) + attn = self.softmax(attn) + + z = paddle.matmul(attn, v) #[N, num_heads, num_patches, dim_head] + z = z.transpose([0, 2, 1, 3]) #[N, num_patches, num_heads, dim_head] + z = z.reshape([B, N, d]) + z = self.proj(z) + return z + + +class SwinBlock(nn.Layer): + def __init__(self, dim, input_resolution, num_heads, window_size): + super().__init__() + self.dim = dim + self.input_resolution = input_resolution + self.num_heads = num_heads + self.window_size = window_size + + self.attn_norm = nn.LayerNorm(dim) + self.attn = WindowAttention(dim=dim, window_size=window_size, num_heads=num_heads) + + self.mlp_norm = nn.LayerNorm(dim) + self.mlp = Mlp(dim) + + def forward(self, x): + H, W = self.input_resolution + B, N, C = x.shape + + h = x + x = self.attn_norm(x) + + x = x.reshape([B, H, W, C]) + x_windows = windows_partition(x, self.window_size) + # x_windows: [B*num_windows, ws, ws, C] + x_windows = x_windows.reshape([-1, self.window_size * self.window_size, C]) + + attn_windows = self.attn(x_windows) + # [N', num_patches, C] + attn_windows = attn_windows.reshape([-1, self.window_size, self.window_size, C]) + # [N', ws, ws, C] + x = windows_reverse(attn_windows, self.window_size, H, W) + # [B, H, W, C] + + x = x.reshape([B, H*W, C]) + x = h + x + + h = x + x = self.mlp_norm(x) + x = self.mlp(x) + x = h + x + return x + + +def main(): + t = paddle.randn([4, 3, 224, 224]) + patch_embedding = PatchEmbedding(patch_size=4, embed_dim=96) + swin_block = SwinBlock(dim=96, input_resolution=[56, 56], num_heads=4, window_size=7) + patch_merging = PatchMerging(input_resolution=[56, 56], dim=96) + + out = patch_embedding(t) + print('patch_embedding out shape = ', out.shape) # [4, 3136, 96], 56*56 = 3136 + out = swin_block(out) + print('swin_block out shape = ', out.shape) + out = patch_merging(out) + print('patch_merging out shape = ', out.shape) # [4, 784, 192] # 784 = 28 * 28 + + + +if __name__ == "__main__": + main() diff --git a/edu/class6/main.py b/edu/class6/main.py new file mode 100644 index 00000000..7fcb2891 --- /dev/null +++ b/edu/class6/main.py @@ -0,0 +1,265 @@ +import paddle +import paddle.nn as nn +from mask import generate_mask +paddle.set_device('cpu') + +# CLASS 5 +class PatchEmbedding(nn.Layer): + def __init__(self, patch_size=4, embed_dim=96): + super().__init__() + self.patch_embed = nn.Conv2D(3, embed_dim, kernel_size=patch_size, stride=patch_size) + self.norm = nn.LayerNorm(embed_dim) + + def forward(self, x): + x = self.patch_embed(x) # [n, embed_dim, h', w'] + x = x.flatten(2) # [n, embed_dim, h'*w'] + x = x.transpose([0, 2, 1]) #[n, h'*w', embed_dim] + x = self.norm(x) #[n, num_patches, embed_dim] + return x + + +# CLASS 5 +class PatchMerging(nn.Layer): + def __init__(self, input_resolution, dim): + super().__init__() + self.resolution = input_resolution + self.dim = dim + self.reduction = nn.Linear(4 * dim, 2 * dim) + self.norm = nn.LayerNorm(4 * dim) + + def forward(self, x): + h, w = self.resolution + b, _, c = x.shape # [n, num_patches, embed_dim] + x = x.reshape([b, h, w, c]) + + x0 = x[:, 0::2, 
0::2, :] + x1 = x[:, 1::2, 0::2, :] + x2 = x[:, 0::2, 1::2, :] + x3 = x[:, 1::2, 1::2, :] #[b, h/2, w/2, c] + + x = paddle.concat([x0, x1, x2, x3], axis=-1) #[b, h/2, w/2, 4c] + x = x.reshape([b, -1, 4 * c]) #[b, h*2 / 4, 4c] + x = self.norm(x) + x = self.reduction(x) + + return x + + +# CLASS 5 +class Mlp(nn.Layer): + def __init__(self, dim, mlp_ratio=4.0, dropout=0.): + super().__init__() + self.fc1 = nn.Linear(dim, int(dim * mlp_ratio)) + self.fc2 = nn.Linear(int(dim * mlp_ratio), dim) + self.act = nn.GELU() + self.dropout = nn.Dropout(dropout) + + def forward(self, x): + x = self.fc1(x) + x = self.act(x) + x = self.dropout(x) + x = self.fc2(x) + x = self.dropout(x) + return x + +# CLASS 5 +def windows_partition(x, window_size): + B, H, W, C = x.shape + # B, H/ws, ws, W/ws, ws, C + x = x.reshape([B, H//window_size, window_size, W//window_size, window_size, C]) + # B, H/ws, W/ws, ws, ws, c + x = x.transpose([0, 1, 3, 2, 4, 5]) + # B * H/ws * W/ws, ws, ws, c + x = x.reshape([-1, window_size, window_size, C]) # [B*num_windows, ws, ws, C] + return x + + +# CLASS 5 +def windows_reverse(windows, window_size, H, W): + # windows: [B*num_windows, ws*ws, C] + B = int(windows.shape[0] // ( H / window_size * W / window_size)) + x = windows.reshape([B, H//window_size, W//window_size, window_size, window_size, -1]) + x = x.transpose([0, 1, 3, 2, 4, 5]) # [B, H/ws, ws, W/ws, ws, C] + x = x.reshape([B, H, W, -1]) #[B, H, W, C] + return x + + +# CLASS 6 +class WindowAttention(nn.Layer): + def __init__(self, dim, window_size, num_heads): + super().__init__() + self.dim = dim + self.dim_head = dim // num_heads + self.num_heads = num_heads + self.scale = self.dim_head ** -0.5 + self.softmax = nn.Softmax(axis=-1) + self.qkv = nn.Linear(dim, + dim * 3) + self.proj = nn.Linear(dim, dim) + + ###### BEGIN Class 6: Relative Position Bias + self.window_size = window_size + self.relative_position_bias_table = paddle.create_parameter( + shape=[(2*window_size-1)*(2*window_size-1), num_heads], + dtype='float32', + default_initializer=nn.initializer.TruncatedNormal(std=.02)) + coord_h = paddle.arange(self.window_size) + coord_w = paddle.arange(self.window_size) + coords = paddle.stack(paddle.meshgrid([coord_h, coord_w])) #[2, ws, ws] + coords = coords.flatten(1) #[2, ws*ws] + relative_coords = coords.unsqueeze(2) - coords.unsqueeze(1) + relative_coords = relative_coords.transpose([1, 2, 0]) + relative_coords[:, :, 0] += self.window_size - 1 + relative_coords[:, :, 1] += self.window_size - 1 + + relative_coords[:, :, 0] *= 2*self.window_size - 1 + relative_coords_index = relative_coords.sum(2) + print(relative_coords_index) + self.register_buffer('relative_coords_index', relative_coords_index) + ###### END Class 6: Relative Position Bias + + ###### BEGIN Class 6: Relative Position Bias + def get_relative_position_bias_from_index(self): + table = self.relative_position_bias_table # [2m-1 * 2m-1, num_heads] + index = self.relative_coords_index.reshape([-1]) # [M^2, M^2] - > [M^2*M^2] + relative_position_bias = paddle.index_select(x=table, index=index) # [M*M, M*M, num_heads] + return relative_position_bias + ###### END Class 6: Relative Position Bias + + def transpose_multi_head(self, x): + new_shape = x.shape[:-1] + [self.num_heads, self.dim_head] + x = x.reshape(new_shape) + x = x.transpose([0, 2, 1, 3]) #[B, num_heads, num_patches, dim_head] + return x + + # CLASS 6 + def forward(self, x, mask=None): + # x: [B*num_windows, window_size, window_size, c] num_patches = windows_size * window_size + B, N, C = x.shape + 
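+        # NOTE: the caller (SwinBlock.forward) has already flattened each window,
+        # so x arrives here as [B*num_windows, window_size*window_size, C]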
print('xshape=', x.shape) + qkv = self.qkv(x).chunk(3, axis=-1) + q, k, v = map(self.transpose_multi_head, qkv) + q = q * self.scale + attn = paddle.matmul(q, k, transpose_y=True) + # [B*num_windows, num_heads, num_patches, num_patches] num_patches = windows_size * window_size = M * M + + print('attn shape=', attn.shape) + ###### BEGIN Class 6: Relative Position Bias + relative_position_bias = self.get_relative_position_bias_from_index() + relative_position_bias = relative_position_bias.reshape( + [self.window_size * self.window_size, self.window_size * self.window_size, -1]) + # [num_patches, num_patches, num_heads] + relative_position_bias = relative_position_bias.transpose([2, 0, 1]) #[num_heads, num_patches, num_patches] + # attn: [B*num_windows, num_heads, num_patches, num_patches] + attn = attn + relative_position_bias.unsqueeze(0) + ###### END Class 6: Relative Position Bias + + ###### BEGIN Class 6: Mask + if mask is None: + attn = self.softmax(attn) + else: + # mask: [num_windows, num_patches, num_patches] + # attn: [B*num_windows, num_heads, num_patches, num_patches] + attn = attn.reshape([x.shape[0]//mask.shape[0], mask.shape[0], self.num_heads, mask.shape[1], mask.shape[1]]) + # attn: [B, num_windows, num_heads, num_patches, num_patches] + attn = attn + mask.unsqueeze(1).unsqueeze(0) + attn = attn.reshape([-1, self.num_heads, mask.shape[1], mask.shape[1]]) + # attn: [B*num_windows, num_heads, num_patches, num_patches] + attn = self.softmax(attn) + ###### END Class 6: Mask + + out = paddle.matmul(attn, v) + out = out.transpose([0, 2, 1, 3]) + out = out.reshape([B, N, C]) + out = self.proj(out) + return out + + +# CLASS 5 +class SwinBlock(nn.Layer): + def __init__(self, dim, input_resolution, num_heads, window_size, shift_size=0): + super().__init__() + self.dim =dim + self.resolution = input_resolution + self.window_size = window_size + self.shift_size = shift_size + + self.attn_norm = nn.LayerNorm(dim) + self.attn = WindowAttention(dim, window_size, num_heads) + + self.mlp_norm = nn.LayerNorm(dim) + self.mlp = Mlp(dim) + + # CLASS 6 + if self.shift_size > 0: + attn_mask = generate_mask(window_size=self.window_size, + shift_size=self.shift_size, + input_resolution=self.resolution) + else: + attn_mask = None + self.register_buffer('attn_mask', attn_mask) + + def forward(self, x): + H, W = self.resolution + B, N, C = x.shape + + h = x + x = self.attn_norm(x) + + x = x.reshape([B, H, W, C]) + + ##### BEGIN CLASS 6 + if self.shift_size > 0: + shifted_x = paddle.roll(x, shifts=(-self.shift_size, -self.shift_size), axis=(1, 2)) + else: + shifted_x = x + + x_windows = windows_partition(shifted_x, self.window_size) + # x_windows: [B*num_windows, window_size, window_size, c] + x_windows = x_windows.reshape([-1, self.window_size * self.window_size, C]) + # x_windows: [B*num_windows, window_size*window_size, c] + attn_windows = self.attn(x_windows, mask=self.attn_mask) + # attn_windows: [B*num_windows, window_size*window_size, c] + attn_windows = attn_windows.reshape([-1, self.window_size, self.window_size, C]) + # attn_windows: [B*num_windows, window_size, window_size, c] + shifted_x = windows_reverse(attn_windows, self.window_size, H, W) + # shifted_x: [B, H, W, C] + + # reverse cyclic shift + if self.shift_size > 0: + x = paddle.roll(shifted_x, shifts=(self.shift_size, self.shift_size), axis=(1, 2)) + else: + x = shifted_x + ##### END CLASS 6 + + + #[B, H, W, C] + x = x.reshape([B, H*W, C]) + x = h + x + + h = x + x = self.mlp_norm(x) + x = self.mlp(x) + x = h + x + return x + + +def 
main(): + t = paddle.randn((4, 3, 224, 224)) + patch_embedding = PatchEmbedding(patch_size=4, embed_dim=96) + swin_block_w_msa = SwinBlock(dim=96, input_resolution=[56, 56], num_heads=4, window_size=7, shift_size=0) + swin_block_sw_msa = SwinBlock(dim=96, input_resolution=[56, 56], num_heads=4, window_size=7, shift_size=7//2) + patch_merging = PatchMerging(input_resolution=[56, 56], dim=96) + + print('image shape = [4, 3, 224, 224]') + out = patch_embedding(t) # [4, 56, 56, 96] + print('patch_embedding out shape = ', out.shape) + out = swin_block_w_msa(out) + out = swin_block_sw_msa(out) + print('swin_block out shape = ', out.shape) + out = patch_merging(out) + print('patch_merging out shape = ', out.shape) + + +if __name__ == "__main__": + main() diff --git a/edu/class6/mask.py b/edu/class6/mask.py new file mode 100644 index 00000000..5697bbe6 --- /dev/null +++ b/edu/class6/mask.py @@ -0,0 +1,65 @@ +import paddle +from PIL import Image +paddle.set_device('cpu') + +def windows_partition(x, window_size): + """ partite windows into window_size x window_size + Args: + x: Tensor, shape=[b, h, w, c] + window_size: int, window size + Returns: + x: Tensor, shape=[num_windows*b, window_size, window_size, c] + """ + + B, H, W, C = x.shape + x = x.reshape([B, H//window_size, window_size, W//window_size, window_size, C]) + x = x.transpose([0, 1, 3, 2, 4, 5]) + x = x.reshape([-1, window_size, window_size, C]) #(num_windows*B, window_size, window_ + + return x + + +def generate_mask(window_size=4, shift_size=2, input_resolution=(8, 8)): + H, W = input_resolution + img_mask = paddle.zeros([1, H, W, 1]) + h_slices = [slice(0, -window_size), + slice(-window_size, -shift_size), + slice(-shift_size, None)] + w_slices = [slice(0, -window_size), + slice(-window_size, -shift_size), + slice(-shift_size, None)] + cnt = 0 + for h in h_slices: + for w in w_slices: + img_mask[:, h, w, :] = cnt + cnt += 1 + + windows_mask = windows_partition(img_mask, window_size=window_size) + windows_mask = windows_mask.reshape([-1, window_size*window_size]) + #[num_windows, ws*ws] + attn_mask = windows_mask.unsqueeze(1) - windows_mask.unsqueeze(2) + #[n, 1, ws*ws] - [n, ws*ws, 1] = [n, ws*ws, ws*ws] + attn_mask = paddle.where(attn_mask!=0, + paddle.ones_like(attn_mask) * 255, + paddle.zeros_like(attn_mask)) + return attn_mask + + +def main(): + mask = generate_mask() + print(mask.shape) + mask = mask.cpu().numpy().astype('uint8') + for i in range(4): + for j in range(16): + for k in range(16): + print(mask[i,j,k], end='\t') + print() + im = Image.fromarray(mask[i, :, :]) + im.save(f'{i}.png') + print() + print() + print() + + +if __name__ == "__main__": + main() diff --git a/edu/class7/main.py b/edu/class7/main.py new file mode 100644 index 00000000..4ca3ef2d --- /dev/null +++ b/edu/class7/main.py @@ -0,0 +1,349 @@ +import paddle +import paddle.nn as nn +from mask import generate_mask +paddle.set_device('cpu') + +# CLASS 1: +class Identity(nn.Layer): + def __init__(self): + super().__init__() + + def forward(self, x): + return x + +# CLASS 5 +class PatchEmbedding(nn.Layer): + def __init__(self, patch_size=4, embed_dim=96): + super().__init__() + self.patch_embed = nn.Conv2D(3, embed_dim, kernel_size=patch_size, stride=patch_size) + self.norm = nn.LayerNorm(embed_dim) + + def forward(self, x): + x = self.patch_embed(x) # [n, embed_dim, h', w'] + x = x.flatten(2) # [n, embed_dim, h'*w'] + x = x.transpose([0, 2, 1]) #[n, h'*w', embed_dim] + x = self.norm(x) #[n, num_patches, embed_dim] + return x + + +# CLASS 5 +class 
PatchMerging(nn.Layer): + def __init__(self, input_resolution, dim): + super().__init__() + self.resolution = input_resolution + self.dim = dim + self.reduction = nn.Linear(4 * dim, 2 * dim) + self.norm = nn.LayerNorm(4 * dim) + + def forward(self, x): + h, w = self.resolution + b, _, c = x.shape # [n, num_patches, embed_dim] + x = x.reshape([b, h, w, c]) + + x0 = x[:, 0::2, 0::2, :] + x1 = x[:, 1::2, 0::2, :] + x2 = x[:, 0::2, 1::2, :] + x3 = x[:, 1::2, 1::2, :] #[b, h/2, w/2, c] + + x = paddle.concat([x0, x1, x2, x3], axis=-1) #[b, h/2, w/2, 4c] + x = x.reshape([b, -1, 4 * c]) #[b, h*2 / 4, 4c] + x = self.norm(x) + x = self.reduction(x) + + return x + + +# CLASS 5 +class Mlp(nn.Layer): + def __init__(self, dim, mlp_ratio=4.0, dropout=0.): + super().__init__() + self.fc1 = nn.Linear(dim, int(dim * mlp_ratio)) + self.fc2 = nn.Linear(int(dim * mlp_ratio), dim) + self.act = nn.GELU() + self.dropout = nn.Dropout(dropout) + + def forward(self, x): + x = self.fc1(x) + x = self.act(x) + x = self.dropout(x) + x = self.fc2(x) + x = self.dropout(x) + return x + +# CLASS 5 +def windows_partition(x, window_size): + B, H, W, C = x.shape + # B, H/ws, ws, W/ws, ws, C + x = x.reshape([B, H//window_size, window_size, W//window_size, window_size, C]) + # B, H/ws, W/ws, ws, ws, c + x = x.transpose([0, 1, 3, 2, 4, 5]) + # B * H/ws * W/ws, ws, ws, c + x = x.reshape([-1, window_size, window_size, C]) # [B*num_windows, ws, ws, C] + return x + + +# CLASS 5 +def windows_reverse(windows, window_size, H, W): + # windows: [B*num_windows, ws*ws, C] + B = int(windows.shape[0] // ( H / window_size * W / window_size)) + x = windows.reshape([B, H//window_size, W//window_size, window_size, window_size, -1]) + x = x.transpose([0, 1, 3, 2, 4, 5]) # [B, H/ws, ws, W/ws, ws, C] + x = x.reshape([B, H, W, -1]) #[B, H, W, C] + return x + + +# CLASS 6 +class WindowAttention(nn.Layer): + def __init__(self, dim, window_size, num_heads): + super().__init__() + self.dim = dim + self.dim_head = dim // num_heads + self.num_heads = num_heads + self.scale = self.dim_head ** -0.5 + self.softmax = nn.Softmax(axis=-1) + self.qkv = nn.Linear(dim, + dim * 3) + self.proj = nn.Linear(dim, dim) + + ###### BEGIN Class 6: Relative Position Bias + self.window_size = window_size + self.relative_position_bias_table = paddle.create_parameter( + shape=[(2*window_size-1)*(2*window_size-1), num_heads], + dtype='float32', + default_initializer=nn.initializer.TruncatedNormal(std=.02)) + coord_h = paddle.arange(self.window_size) + coord_w = paddle.arange(self.window_size) + coords = paddle.stack(paddle.meshgrid([coord_h, coord_w])) #[2, ws, ws] + coords = coords.flatten(1) #[2, ws*ws] + relative_coords = coords.unsqueeze(2) - coords.unsqueeze(1) + relative_coords = relative_coords.transpose([1, 2, 0]) + relative_coords[:, :, 0] += self.window_size - 1 + relative_coords[:, :, 1] += self.window_size - 1 + + relative_coords[:, :, 0] *= 2*self.window_size - 1 + relative_coords_index = relative_coords.sum(2) + print(relative_coords_index) + self.register_buffer('relative_coords_index', relative_coords_index) + ###### END Class 6: Relative Position Bias + + ###### BEGIN Class 6: Relative Position Bias + def get_relative_position_bias_from_index(self): + table = self.relative_position_bias_table # [2m-1 * 2m-1, num_heads] + index = self.relative_coords_index.reshape([-1]) # [M^2, M^2] - > [M^2*M^2] + relative_position_bias = paddle.index_select(x=table, index=index) # [M*M, M*M, num_heads] + return relative_position_bias + ###### END Class 6: Relative Position Bias 
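+    # How the relative position index above is built, for window size M:
+    #   each axis offset between two positions in a window lies in [-(M-1), M-1], i.e. 2M-1 values;
+    #   adding M-1 shifts both offsets into [0, 2M-2], and multiplying the row offset by 2M-1
+    #   before summing yields a unique index in [0, (2M-1)^2), which selects a row of the
+    #   learnable bias table of shape [(2M-1)*(2M-1), num_heads].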
+ + def transpose_multi_head(self, x): + new_shape = x.shape[:-1] + [self.num_heads, self.dim_head] + x = x.reshape(new_shape) + x = x.transpose([0, 2, 1, 3]) #[B, num_heads, num_patches, dim_head] + return x + + # CLASS 6 + def forward(self, x, mask=None): + # x: [B*num_windows, window_size, window_size, c] num_patches = windows_size * window_size + B, N, C = x.shape + print('xshape=', x.shape) + qkv = self.qkv(x).chunk(3, axis=-1) + q, k, v = map(self.transpose_multi_head, qkv) + q = q * self.scale + attn = paddle.matmul(q, k, transpose_y=True) + # [B*num_windows, num_heads, num_patches, num_patches] num_patches = windows_size * window_size = M * M + + print('attn shape=', attn.shape) + ###### BEGIN Class 6: Relative Position Bias + relative_position_bias = self.get_relative_position_bias_from_index() + relative_position_bias = relative_position_bias.reshape( + [self.window_size * self.window_size, self.window_size * self.window_size, -1]) + # [num_patches, num_patches, num_heads] + relative_position_bias = relative_position_bias.transpose([2, 0, 1]) #[num_heads, num_patches, num_patches] + # attn: [B*num_windows, num_heads, num_patches, num_patches] + attn = attn + relative_position_bias.unsqueeze(0) + ###### END Class 6: Relative Position Bias + + ###### BEGIN Class 6: Mask + if mask is None: + attn = self.softmax(attn) + else: + # mask: [num_windows, num_patches, num_patches] + # attn: [B*num_windows, num_heads, num_patches, num_patches] + attn = attn.reshape([B//mask.shape[0], mask.shape[0], self.num_heads, mask.shape[1], mask.shape[1]]) + # attn: [B, num_windows, num_heads, num_patches, num_patches] + # mask: [1, num_windows, 1, num_patches, num_patches] + attn = attn + mask.unsqueeze(1).unsqueeze(0) + attn = attn.reshape([-1, self.num_heads, mask.shape[1], mask.shape[1]]) + # attn: [B*num_windows, num_heads, num_patches, num_patches] + ###### END Class 6: Mask + + out = paddle.matmul(attn, v) + out = out.transpose([0, 2, 1, 3]) + out = out.reshape([B, N, C]) + out = self.proj(out) + return out + + +# CLASS 5 +class SwinBlock(nn.Layer): + def __init__(self, dim, input_resolution, num_heads, window_size, shift_size=0): + super().__init__() + self.dim =dim + self.resolution = input_resolution + self.window_size = window_size + self.shift_size = shift_size + + # CLASS 7 + if min(self.resolution) <= self.window_size: + self.shift_size = 0 + self.windows_size = min(self.resolution) + + self.attn_norm = nn.LayerNorm(dim) + self.attn = WindowAttention(dim, window_size, num_heads) + + self.mlp_norm = nn.LayerNorm(dim) + self.mlp = Mlp(dim) + + # CLASS 6 + if self.shift_size > 0: + attn_mask = generate_mask(self.window_size, self.shift_size, self.resolution) + else: + attn_mask = None + self.register_buffer('attn_mask', attn_mask) + + def forward(self, x): + H, W = self.resolution + B, N, C = x.shape + + h = x + x = self.attn_norm(x) + + x = x.reshape([B, H, W, C]) + + ##### BEGIN CLASS 6 + if self.shift_size > 0: + shifted_x = paddle.roll(x, shifts=(-self.shift_size, -self.shift_size), axis=(1, 2)) + else: + shifted_x = x + + x_windows = windows_partition(shifted_x, self.window_size) + x_windows = x_windows.reshape([-1, self.window_size * self.window_size, C]) + attn_windows = self.attn(x_windows, mask=self.attn_mask) + attn_windows = attn_windows.reshape([-1, self.window_size, self.window_size, C]) + shifted_x = windows_reverse(attn_windows, self.window_size, H, W) + + if self.shift_size > 0: + x = paddle.roll(shifted_x, shifts=(self.shift_size, self.shift_size), axis=(1, 2)) + else: + x = 
shifted_x + ##### END CLASS 6 + + + #[B, H, W, C] + x = x.reshape([B, H*W, C]) + x = h + x + + h = x + x = self.mlp_norm(x) + x = self.mlp(x) + x = h + x + return x + + +# CLASS 7 +class SwinStage(nn.Layer): + def __init__(self, dim, input_resolution, depth, num_heads, window_size, patch_merging=None): + super().__init__() + self.blocks = nn.LayerList() + for i in range(depth): + self.blocks.append( + SwinBlock(dim=dim, + input_resolution=input_resolution, + num_heads=num_heads, + window_size=window_size, + shift_size=0 if (i % 2) == 0 else window_size // 2)) + if patch_merging is None: + self.patch_merging = Identity() + else: + self.patch_merging = patch_merging(input_resolution, dim=dim) + + def forward(self, x): + for block in self.blocks: + x = block(x) + x = self.patch_merging(x) + return x + +# CLASS 7 +class Swin(nn.Layer): + def __init__(self, + image_size=224, + patch_size=4, + in_channels=3, + embed_dim=96, + window_size=7, + num_heads=[3, 6, 12, 24], + depths=[2, 2, 6, 2], + num_classes=1000): + super().__init__() + self.num_classes = num_classes + self.depths = depths + self.num_heads = num_heads + self.embed_dim = embed_dim + self.num_stages = len(depths) + self.num_features = int(self.embed_dim * 2**(self.num_stages-1)) + + self.patch_embedding = PatchEmbedding(patch_size=patch_size, embed_dim=embed_dim) + self.patch_resolution = [image_size // patch_size, image_size // patch_size] + + self.stages = nn.LayerList() + for idx, (depth, n_heads) in enumerate(zip(self.depths, self.num_heads)): + stage = SwinStage( + dim=int(self.embed_dim * 2**idx), + input_resolution=(self.patch_resolution[0] // (2**idx), self.patch_resolution[1] // (2**idx)), + depth=depth, + num_heads=n_heads, + window_size=window_size, + patch_merging=PatchMerging if (idx < self.num_stages -1) else None) + self.stages.append(stage) + + self.norm = nn.LayerNorm(self.num_features) + self.avgpool = nn.AdaptiveAvgPool1D(1) + self.fc = nn.Linear(self.num_features, self.num_classes) + + def forward(self, x): + x = self.patch_embedding(x) + for stage in self.stages: + print('stage') + x = stage(x) + x = self.norm(x) + x = x.transpose([0, 2, 1]) #[B, embed_dim, num_windows] + x = self.avgpool(x) + # [B, embed_dim, 1] + x = x.flatten(1) + x = self.fc(x) + return x + + +def main(): + t = paddle.randn((4, 3, 224, 224)) + #patch_embedding = PatchEmbedding(patch_size=4, embed_dim=96) + #swin_block_w_msa = SwinBlock(dim=96, input_resolution=[56, 56], num_heads=4, window_size=7, shift_size=0) + #swin_block_sw_msa = SwinBlock(dim=96, input_resolution=[56, 56], num_heads=4, window_size=7, shift_size=7//2) + #patch_merging = PatchMerging(input_resolution=[56, 56], dim=96) + + #print('image shape = [4, 3, 224, 224]') + #out = patch_embedding(t) # [4, 56, 56, 96] + #print('patch_embedding out shape = ', out.shape) + #out = swin_block_w_msa(out) + #out = swin_block_sw_msa(out) + #print('swin_block out shape = ', out.shape) + #out = patch_merging(out) + #print('patch_merging out shape = ', out.shape) + + model = Swin() + print(model) + out = model(t) + print(out.shape) + + +if __name__ == "__main__": + main() diff --git a/edu/class7/tmp.py b/edu/class7/tmp.py new file mode 100644 index 00000000..0acd13d9 --- /dev/null +++ b/edu/class7/tmp.py @@ -0,0 +1,32 @@ + +class MyIterable(): + def __init__(self): + self.data = [1, 2, 3, 4, 5] + def __iter__(self): + return MyIterator(self.data) + + def __getitem__(self, idx): + return self.data[idx] + +class MyIterator(): + def __init__(self, data): + self.data = data + self.counter = 0 + + 
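+    # iterator protocol: __iter__ returns the iterator object itself,
+    # __next__ returns the next item and raises StopIteration when exhausted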
def __iter__(self): + return self + + def __next__(self): + if self.counter >= len(self.data): + raise StopIteration() + data = self.data[self.counter] + self.counter +=1 + return data + + +my_iterable = MyIterable() + +for d in my_iterable: + print(d) + +print(my_iterable[0]) diff --git a/edu/class8/a.yaml b/edu/class8/a.yaml new file mode 100644 index 00000000..f36469ca --- /dev/null +++ b/edu/class8/a.yaml @@ -0,0 +1,6 @@ +DATA: + BATCH_SIZE: 512 +MODEL: + TRANS: + EMBED_DIM: 768 + diff --git a/edu/class8/config.py b/edu/class8/config.py new file mode 100644 index 00000000..6bbf43df --- /dev/null +++ b/edu/class8/config.py @@ -0,0 +1,52 @@ +from yacs.config import CfgNode as CN +import yaml + +_C = CN() +_C.DATA = CN() +_C.DATA.DATASET = 'Cifar10' +_C.DATA.BATCH_SIZE = 128 + +_C.MODEL = CN() +_C.MODEL.NUM_CLASSES = 1000 + +_C.MODEL.TRANS = CN() +_C.MODEL.TRANS.EMBED_DIM = 96 +_C.MODEL.TRANS.DEPTHS = [2, 2, 6, 2] +_C.MODEL.TRANS.QKV_BIAS = False + + +def _update_config_from_file(config, cfg_file): + config.defrost() + config.merge_from_file(cfg_file) + #config.freeze() + + +def update_config(config, args): + if args.cfg: + _update_config_form_file(config, args.cfg) + if args.dataset: + config.DATA.DATASET = args.dataset + if args.batch_size: + config.DATA.BATCH_SIZE = args.batch_size + + return config + + +def get_config(cfg_file=None): + config = _C.clone() + if cfg_file: + _update_config_from_file(config, cfg_file) + return config + + +def main(): + cfg = get_config() + print(cfg) + print('-----') + print(cfg.MODEL.NUM_CLASSES) + print('-----') + print(cfg.MODEL.TRANS.QKV_BIAS) + + +if __name__ == "__main__": + main() diff --git a/edu/class8/main.py b/edu/class8/main.py new file mode 100644 index 00000000..e757e163 --- /dev/null +++ b/edu/class8/main.py @@ -0,0 +1,26 @@ +import argparse +from config import get_config +from config import update_config + +def get_arguments(): + parser = argparse.ArgumentParser('ViT') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + + arguments = parser.parse_args() + return arguments + +def main(): + cfg = get_config() + print(cfg) + print('-----') + cfg = get_config('./a.yaml') + print(cfg) + print('-----') + args = get_arguments() + cfg = update_config(cfg, args) + print(cfg) + +if __name__ == "__main__": + main() diff --git a/edu/class9/main.py b/edu/class9/main.py new file mode 100644 index 00000000..7bb64984 --- /dev/null +++ b/edu/class9/main.py @@ -0,0 +1,67 @@ +import numpy as np +import paddle +import paddle.nn as nn +import paddle.distributed as dist +from paddle.io import Dataset +from paddle.io import DataLoader +from paddle.io import DistributedBatchSampler + +class MyDataset(Dataset): + def __init__(self): + super().__init__() + self.data = np.arange(32).astype('float32')[:, np.newaxis] + + def __getitem__(self, idx): + return paddle.to_tensor(self.data[idx]), paddle.to_tensor(self.data[idx]) + + def __len__(self): + return len(self.data) + +def get_dataset(): + dataset = MyDataset() + return dataset + +def get_dataloader(dataset, batch_size): + sampler = DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=False) + dataloader = DataLoader(dataset, batch_sampler=sampler) + return dataloader + +def build_model(): + model = nn.Sequential(*[ + nn.Linear(1, 8), + nn.ReLU(), + nn.Linear(8,10)]) + + return model + +def main_worker(*args): + dataset = args[0] + dataloader = get_dataloader(dataset, batch_size=1) + 
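+    # initialize the distributed environment for this spawned worker; every process
+    # started by dist.spawn gets its own rank, and DistributedBatchSampler gives each
+    # rank a disjoint shard of the dataset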
dist.init_parallel_env() + world_size = dist.get_world_size() + local_rank = dist.get_rank() + + model = build_model() + model = paddle.DataParallel(model) + print(f'Hello PPViT, I am [{local_rank}]: I built a model for myself.') + + tensor_list = [] + for data in dataloader: + sample = data[0] + label = data[1] + + out = model(sample) + out = out.argmax(1) + print(f'[{local_rank}]: I got data: {sample.cpu().numpy()}, label: {sample.cpu().numpy()}, out: {out.cpu().numpy()}') + + dist.all_gather(tensor_list, out) + if local_rank == 0: + print(f'I am master ([{local_rank}]): I got all_gathered out: {tensor_list}') + break + +def main(): + dataset = get_dataset() + dist.spawn(main_worker, args=(dataset, ), nprocs=8) + +if __name__ == "__main__": + main() diff --git a/edu/class_fig.png b/edu/class_fig.png new file mode 100644 index 00000000..268991e3 Binary files /dev/null and b/edu/class_fig.png differ diff --git a/gan/README.md b/gan/README.md index 91a09b54..efc6982b 100644 --- a/gan/README.md +++ b/gan/README.md @@ -1,3 +1,5 @@ +English | [简体中文](./README_cn.md) + # PaddleViT-GAN: Visual Transformer Models for GAN PaddlePaddle training/validation code and pretrained models for **GAN**. @@ -16,7 +18,7 @@ Update (2021-08-25): Init readme uploaded. ## Installation This module is tested on Python3.6+, and PaddlePaddle 2.1.0+. Most dependencies are installed by PaddlePaddle installation. You only need to install the following packages: ```shell -pip install yacs yaml lmdb +pip install yacs pyyaml lmdb ``` Then download the github repo: ```shell @@ -64,8 +66,8 @@ from generator import Generator config = get_config('./configs/styleformer_cifar10.yaml') # build model model = Generator(config) -# load pretrained weights, .pdparams is NOT needed -model_state_dict = paddle.load('./cifar10') +# load pretrained weights +model_state_dict = paddle.load('./cifar10.pdparams') model.set_dict(model_state_dict) ``` @@ -77,10 +79,10 @@ sh run_generate.sh or ```shell python generate.py \ - -cfg='./configs/styleformer_cifar10.yaml' \ + -cfg=./configs/styleformer_cifar10.yaml \ -num_out_images=16 \ - -out_folder='./images_cifar10' \ - -pretrained='./cifar10.pdparams' + -out_folder=./images_cifar10 \ + -pretrained=/path/to/pretrained/model/cifar10 # .pdparams is NOT needed ``` The output images are stored in `-out_folder` path. diff --git a/gan/README_cn.md b/gan/README_cn.md new file mode 100644 index 00000000..5864025b --- /dev/null +++ b/gan/README_cn.md @@ -0,0 +1,115 @@ +简体中文 | [English](./README.md) + +# PaddleViT-GAN: GAN 领域的 Visual Transformer 模型 + +PaddlePaddle **GAN**的训练/评估代码以及预训练模型。 + +此实现是[PaddleViT](https://github.com/BR-IDL/PaddleViT)项目的一部分。 + +## 更新 +更新 (2021-08-25): 已上传初始化文件. + +## 快速开始 + + 以下链接提供了每个模型架构的代码和详细用法: +1. **[Styleformer](./Styleformer)** +2. **[TransGAN](./transGAN)** + + +## 安装 +该模块在 Python3.6+ 和 PaddlePaddle 2.1.0+ 上进行了测试,大多数依赖项通过PaddlePaddle安装,您只需要安装以下依赖项: + +```shell +pip install yacs pyyaml lmdb +``` +然后 下载 github repo: +```shell +git clone https://github.com/xperzy/PPViT.git +cd PPViT/image_classification +``` + +## 基本用法 +### 数据准备 +**Cifar10**, **STL10**, **Celeba** 和 **LSUNchurch** 数据集以如下结构使用: +#### [Cifar10](https://www.cs.toronto.edu/~kriz/cifar.html): + + 我们使用 `paddle.io.Dataset.Cifar10` 创建Cifar10 dataset, 不需要手动的下载或准备数据. 
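+
+下面给出一个加载 Cifar10 数据集的最小示例(仅作示意,这里假设使用 `paddle.vision.datasets.Cifar10` 接口,首次运行会自动下载数据):
+
+```python
+import paddle.vision.transforms as T
+from paddle.vision.datasets import Cifar10
+
+# 仅作示意:训练集共 50000 张 32x32 图像,无需手动下载
+train_set = Cifar10(mode='train', transform=T.ToTensor())
+print(len(train_set))  # 50000
+```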
+#### [STL10](https://cs.stanford.edu/~acoates/stl10/): +``` +│STL10/ +├── train_X.bin +│── train_y.bin +├── test_X.bin +│── test_y.bin +│── unlabeled.bin +``` +#### [CelebA](https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html): +``` +│Celeba/ +├──img_align_celeba/ +│ ├── 000017.jpg +│ │── 000019.jpg +│ ├── 000026.jpg +│ │── ...... +``` +#### [LSUN-church](https://www.yf.io/p/lsun): +``` +│LSUNchurch/ +├──church_outdoor_train_lmdb/ +│ ├── data.mdb +│ │── lock.mdb +``` +### Demo 示例 +对于具体模型示例,进入模型文件夹,下载预训练权重文件,例如`./cifar10.pdparams`, 然后在python中使用 `styleformer_cifar10` 模型: +```python +from config import get_config +from generator import Generator +# config files in ./configs/ +config = get_config('./configs/styleformer_cifar10.yaml') +# build model +model = Generator(config) +# load pretrained weights +model_state_dict = paddle.load('./cifar10.pdparams') +model.set_dict(model_state_dict) +``` + +### 生成示例图像 +要想从预训练模型中生成示例图像,请先下载预训练权重,然后使用命令行运行以下脚本: +```shell +sh run_generate.sh +``` +or +```shell +python generate.py \ + -cfg=./configs/styleformer_cifar10.yaml \ + -num_out_images=16 \ + -out_folder=./images_cifar10 \ + -pretrained=/path/to/pretrained/model/cifar10 # .pdparams is NOT needed +``` +输出图像存储在 `-out_folder` 路径中. + +> 注意:具体用法见各模型文件夹中的README文件. + +## Basic Concepts +PaddleViT图像分类模块为每个模型在每一个单独文件夹中以相似结构进行开发,每个实现中大约有3种类型的类和2种类型的脚本: +1. **Model classes** 例如 **[ViT_custom.py](./transGAN/models/ViT_custom.py)**, 其中定义了核心 *transformer model* 以及相关方法. + +2. **Dataset classes** 例如 **[dataset.py](./gan/transGAN/datasets.py)**, 其中定义了 dataset, dataloader, data transforms. 我们提供了自定义数据加载的实现方式,并且支持单GPU和多GPU加载. + +3. **Config classes** 例如**[config.py](./gan/transGAN/config.py)**, 其中定义了模型训练/验证的配置. 通常,您不需要在配置文件中更改项目,可以通过python `arguments` 或者 `.yaml` 配置文件来更新配置. 您可以查看[here](../docs/ppvit-config.md) 了解关于配置设计和使用的详细信息. + +4. **main scripts** 例如 **[main_single_gpu.py](./transGAN/main_single_gpu.py)**, 其中定义了整个训练/验证过程。提供了训练/验证的主要步骤,例如日志记录、加载/保存模型、微调等。 多-GPU的训练/验证过程在单独的python脚本 `main_multi_gpu.py`中实现. + +5. **run scripts** 例如 **[run_eval_cifar.sh](./transGAN/run_eval_cifar.sh)**, 其中定义了用于运行使用特定配置和参数的python脚本的命令. + + +## Model Architectures + +PaddleViT 目前支持以下 **transfomer based models**: +1. **[TransGAN](./transGAN)** (from Seoul National University and NUUA), released with paper [TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up](https://arxiv.org/abs/2102.07074), by Yifan Jiang, Shiyu Chang, Zhangyang Wang. +2. **[Styleformer](./Styleformer)** (from Facebook and Sorbonne), released with paper [Styleformer: Transformer based Generative Adversarial Networks with Style Vector](https://arxiv.org/abs/2106.07023), by Jeeseung Park, Younggeun Kim. + + + +## Contact +如果您有任何问题, 请在我们的Github上创建一个[issue](https://github.com/BR-IDL/PaddleViT/issues). 
diff --git a/gan/Styleformer/README.md b/gan/Styleformer/README.md index 39d93cba..0e285871 100644 --- a/gan/Styleformer/README.md +++ b/gan/Styleformer/README.md @@ -72,8 +72,8 @@ from generator import Generator config = get_config('./configs/styleformer_cifar10.yaml') # build model model = Generator(config) -# load pretrained weights, .pdparams is NOT needed -model_state_dict = paddle.load('./cifar10') +# load pretrained weights +model_state_dict = paddle.load('./cifar10.pdparams') model.set_dict(model_state_dict) ``` @@ -85,10 +85,10 @@ sh run_generate.sh or ```shell python generate.py \ - -cfg='./configs/styleformer_cifar10.yaml' \ + -cfg=./configs/styleformer_cifar10.yaml \ -num_out_images=16 \ - -out_folder='./images_cifar10' \ - -pretrained='./cifar10.pdparams' + -out_folder=./images_cifar10 \ + -pretrained=/path/to/pretrained/model/cifar10 # .pdparams is NOT needed ``` The output images are stored in `-out_folder` path. @@ -102,11 +102,11 @@ or ```shell CUDA_VISIBLE_DEVICES=0 \ python main_single_gpu.py \ - -cfg='./configs/styleformer_cifar10.yaml' \ - -dataset='cifar10' \ + -cfg=./configs/styleformer_cifar10.yaml \ + -dataset=cifar10 \ -batch_size=32 \ -eval \ - -pretrained='./cifar10' + -pretrained=/path/to/pretrained/model/cifar10 # .pdparams is NOT needed ```
@@ -123,11 +123,11 @@ or ```shell CUDA_VISIBLE_DEVICES=0,1,2,3 \ python main_single_gpu.py \ - -cfg='./configs/styleformer_cifar10.yaml' \ - -dataset='cifar10' \ + -cfg=./configs/styleformer_cifar10.yaml \ + -dataset=cifar10 \ -batch_size=32 \ -eval \ - -pretrained='./cifar10' + -pretrained=/path/to/pretrained/model/cifar10 # .pdparams is NOT needed ```
@@ -143,9 +143,9 @@ or CUDA_VISIBLE_DEVICES=0 \ python main_single_gpu.py \ -cfg='./configs/styleformer_cifar10.yaml' \ - -dataset='cifar10' \ + -dataset=cifar10 \ -batch_size=32 \ - -pretrained='./cifar10' + -pretrained=/path/to/pretrained/model/cifar10 # .pdparams is NOT needed ```
@@ -162,10 +162,10 @@ or ```shell CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ python main_single_gpu.py \ - -cfg='./configs/styleformer_cifar10.yaml' \ - -dataset='cifar10' \ + -cfg=./configs/styleformer_cifar10.yaml \ + -dataset=cifar10 \ -batch_size=32 \ - -pretrained='./cifar10' + -pretrained=/path/to/pretrained/model/cifar10 # .pdparams is NOT needed ```
diff --git a/gan/transGAN/models/ViT_custom.py b/gan/transGAN/models/ViT_custom.py index 2730987e..b0b38230 100644 --- a/gan/transGAN/models/ViT_custom.py +++ b/gan/transGAN/models/ViT_custom.py @@ -20,6 +20,7 @@ import paddle.nn as nn from utils import trunc_normal_ from utils import gelu +from utils import leakyrelu from utils import pixel_upsample from utils import drop_path @@ -56,7 +57,7 @@ def __init__(self, dim): super().__init__() def forward(self, input): - return input * paddle.rsqrt(paddle.mean(input ** 2, dim=2, keepdim=True) + 1e-8) + return input * paddle.rsqrt(paddle.mean(input ** 2, axis=2, keepdim=True) + 1e-8) class CustomNorm(nn.Layer): """ CustomNorm layer diff --git a/gan/transGAN/readme.md b/gan/transGAN/readme.md index baecde8b..8e2444a7 100644 --- a/gan/transGAN/readme.md +++ b/gan/transGAN/readme.md @@ -46,8 +46,8 @@ from models.ViT_custom import Generator config = get_config('./configs/transgan_cifar10.yaml') # build model model = Generator(config) -# load pretrained weights, .pdparams is NOT needed -model_state_dict = paddle.load('./transgan_cifar10') +# load pretrained weights +model_state_dict = paddle.load('./transgan_cifar10.pdparams') model.set_dict(model_state_dict) ``` @@ -59,10 +59,10 @@ sh run_generate.sh or ```shell python generate.py \ - -cfg='./configs/transgan_cifar10.yaml' \ + -cfg=./configs/transgan_cifar10.yaml \ -num_out_images=16 \ - -out_folder='./images_cifar10' \ - -pretrained='./transgan_cifar10.pdparams' + -out_folder=./images_cifar10 \ + -pretrained=/path/to/pretrained/model/transgan_cifar10 # .pdparams is NOT needed ``` The output images are stored in `-out_folder` path. @@ -76,11 +76,11 @@ or ```shell CUDA_VISIBLE_DEVICES=0 \ python main_single_gpu.py \ - -cfg="./configs/transgan_cifar10.yaml" \ - -dataset='cifar10' \ + -cfg=./configs/transgan_cifar10.yaml \ + -dataset=cifar10 \ -batch_size=32 \ -eval \ - -pretrained='./transgan_cifar10' + -pretrained=/path/to/pretrained/model/transgan_cifar10.pdparams # .pdparams is NOT needed ```
@@ -96,12 +96,12 @@ or ```shell CUDA_VISIBLE_DEVICES=0,1,2,3 \ python main_multi_gpu.py \ - -cfg='./configs/transgan_cifar10.yaml' \ - -dataset='cifar10' \ + -cfg=./configs/transgan_cifar10.yaml \ + -dataset=cifar10 \ -batch_size=32 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/val \ -eval \ - -pretrained='./transgan_cifar10' + -pretrained=/path/to/pretrained/model/transgan_cifar10 # .pdparams is NOT needed ```
@@ -117,8 +117,8 @@ or ```shell CUDA_VISIBLE_DEVICES=0 \ python main_single_gpu.py \ - -cfg="./configs/transgan_cifar10.yaml" \ - -dataset='cifar10' \ + -cfg=./configs/transgan_cifar10.yaml \ + -dataset=cifar10 \ -batch_size=32 \ ```
@@ -135,10 +135,10 @@ or ```shell CUDA_VISIBLE_DEVICES=0,1,2,3 \ python main_multi_gpu.py \ - -cfg='./configs/transgan_cifar10.yaml' \ - -dataset='cifar10' \ + -cfg=./configs/transgan_cifar10.yaml \ + -dataset=cifar10 \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/train ```
diff --git a/gan/transGAN/utils.py b/gan/transGAN/utils.py index c220ac8a..58a68ad5 100644 --- a/gan/transGAN/utils.py +++ b/gan/transGAN/utils.py @@ -61,6 +61,15 @@ def gelu(x): return x * 0.5 * (1.0 + paddle.erf(x / math.sqrt(2.0))) +def leakyrelu(x): + """ An activation function: + if x > 0, return x. else return negative_slope * x. the value of negative_slope + is 0.2. more information can see https://www.paddlepaddle.org.cn/documentation/ + docs/zh/api/paddle/nn/functional/leaky_relu_cn.html#leaky-relu + """ + return F.leaky_relu(x, 0.2) + + def _no_grad_trunc_normal_(tensor, mean, std, a, b): # Cut & paste from PyTorch official master until it's in a few official releases - RW # Method based on https://people.sc.fsu.edu/~jburkardt/presentations/truncated_normal.pdf diff --git a/image_classification/BEiT/README.md b/image_classification/BEiT/README.md new file mode 100644 index 00000000..0236ec11 --- /dev/null +++ b/image_classification/BEiT/README.md @@ -0,0 +1,178 @@ +# BEiT: BERT Pre-Training of Image Transformers, [arxiv](https://arxiv.org/abs/2106.08254) + +PaddlePaddle training/validation code and pretrained models for **BEiT**. + +The official and 3rd party pytorch implementation are [here](https://github.com/microsoft/unilm/tree/master/beit). + + +This implementation is developed by [PaddleViT](https://github.com/BR-IDL/PaddleViT). + +

+![BEiT Model Overview](./beit.png)

+ + + +### Update +- Update (2021-10-19): Bug fix and weights links are updated. +- Update (2021-09-27): Code is released and ported weights are uploaded. + +## Models Zoo + +| Model | Acc@1 | Acc@5 | #Params | FLOPs | Image Size | Crop_pct | Interpolation | Link | +|-------------------------------|-------|-------|---------|--------|------------|----------|---------------|--------------| +| beit_base_patch16_224 | 85.21 | 97.66 | 87M | 12.7G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1lq5NeQRDHkIQi7U61OidaLhNsXTWfh_Z/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1pjblqaESqfXVrpgo58oR6Q)(fshn) | +| beit_base_patch16_384 | 86.81 | 98.14 | 87M | 37.3G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1wn2NS7kUdlERkzWEDeyZKmcRbmWL7TR2/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1WVbNjxuIUh514pKAgZZEzg)(arvc) | +| beit_large_patch16_224 | 87.48 | 98.30 | 304M | 45.0G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/11OR1FKxzfafqT7GzTW225nIQjxmGSbCm/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1bvhERVXN2TyRcRJFzg7sIA)(2ya2) | +| beit_large_patch16_384 | 88.40 | 98.60 | 304M | 131.7G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/10EraafYS8CRpEshxClOmE2S1eFCULF1Y/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1H76G2CGLY3YmmYt4-suoRA)(qtrn) | +| beit_large_patch16_512 | 88.60 | 98.66 | 304M | 234.0G | 512 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1xIIocftsB1PcDHZttPqLdrJ-G4Tyfrs-/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1WtTVK_Wvg-izaF0M6Gzw-Q)(567v) | + + +> *The results are evaluated on ImageNet2012 validation set. +> +> *These models have been fine-tuned (ImageNet 22k -> 1k) +> +> Note: BEiT weights are ported from [here](https://github.com/microsoft/unilm/tree/master/beit) + + + +## Notebooks +We provide a few notebooks in aistudio to help you get started: + +**\*(coming soon)\*** + + +## Requirements +- Python>=3.6 +- yaml>=0.2.5 +- [PaddlePaddle](https://www.paddlepaddle.org.cn/documentation/docs/en/install/index_en.html)>=2.1.0 +- [yacs](https://github.com/rbgirshick/yacs)>=0.1.8 + +## Data +ImageNet2012 dataset is used in the following folder structure: +``` +│imagenet/ +├──train/ +│ ├── n01440764 +│ │ ├── n01440764_10026.JPEG +│ │ ├── n01440764_10027.JPEG +│ │ ├── ...... +│ ├── ...... +├──val/ +│ ├── n01440764 +│ │ ├── ILSVRC2012_val_00000293.JPEG +│ │ ├── ILSVRC2012_val_00002138.JPEG +│ │ ├── ...... +│ ├── ...... +``` + +## Usage +To use the model with pretrained weights, download the `.pdparam` weight file and change related file paths in the following python scripts. The model config files are located in `./configs/`. 
+ +For example, assume the downloaded weight file is stored in `./beit_base_patch16_224_ft22kto1k.pdparams`, to use the `beit_base_patch16_224_ft22kto1k` model in python: +```python +from config import get_config +from beit import build_beit as build_model +# config files in ./configs/ +config = get_config('./configs/beit_base_patch16_224.yaml') +# build model +model = build_model(config) +# load pretrained weights +model_state_dict = paddle.load('./beit_base_patch16_224_ft22kto1k.pdparams') +model.set_dict(model_state_dict) +``` + +## Evaluation +To evaluate BEiT model performance on ImageNet2012 with a single GPU, run the following script using command line: +```shell +sh run_eval.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg=./configs/beit_base_patch16_224.yaml \ + -dataset=imagenet2012 \ + -batch_size=16 \ + -data_path=/path/to/dataset/imagenet/val \ + -eval \ + -pretrained=/path/to/pretrained/model/beit_base_patch16_224_ft22kto1k # .pdparams is NOT needed +``` + +
+ + +Run evaluation using multi-GPUs: + + + +```shell +sh run_eval_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg=./configs/beit_base_patch16_224.yaml \ + -dataset=imagenet2012 \ + -batch_size=16 \ + -data_path=/path/to/dataset/imagenet/val \ + -eval \ + -pretrained=/path/to/pretrained/model/beit_base_patch16_224_ft22kto1k # .pdparams is NOT needed +``` + +
+ +## Training +To train the BEiT Transformer model on ImageNet2012 with single GPUs, run the following script using command line: +```shell +sh run_train.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg=./configs/beit_base_patch16_224.yaml \ + -dataset=imagenet2012 \ + -batch_size=32 \ + -data_path=/path/to/dataset/imagenet/train \ +``` + +
+ + +Run training using multi-GPUs: + + + +```shell +sh run_train_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg=./configs/beit_base_patch16_224.yaml \ + -dataset=imagenet2012 \ + -batch_size=16 \ + -data_path=/path/to/dataset/imagenet/train \ +``` + +
+ + +## Visualization Attention Map +**(coming soon)** + +## Reference +``` +@article{beit, + title={{BEiT}: {BERT} Pre-Training of Image Transformers}, + author={Hangbo Bao and Li Dong and Furu Wei}, + year={2021}, + eprint={2106.08254}, + archivePrefix={arXiv}, + primaryClass={cs.CV} +} +``` diff --git a/image_classification/BEiT/__init_.py b/image_classification/BEiT/__init_.py new file mode 100644 index 00000000..e2cbd538 --- /dev/null +++ b/image_classification/BEiT/__init_.py @@ -0,0 +1 @@ +#init diff --git a/image_classification/BEiT/augment.py b/image_classification/BEiT/augment.py new file mode 100644 index 00000000..19276756 --- /dev/null +++ b/image_classification/BEiT/augment.py @@ -0,0 +1,285 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Augmentation""" +""" Rand Augmentation """ +# reference: RandAugment: Practical automated data augmentation with a reduced search space +# https://arxiv.org/abs/1909.13719 + +""" Auto Augmentation """ +# reference: AutoAugment: Learning Augmentation Policies from Data +# https://arxiv.org/abs/1805.09501 + +import random +import numpy as np +from PIL import Image, ImageEnhance, ImageOps + + +def auto_augment_policy_original(): + """25 types of augment policies in original paper""" + policy = [ + [('Posterize', 0.4, 8), ('Rotate', 0.6, 9)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + [('Posterize', 0.6, 7), ('Posterize', 0.6, 6)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Equalize', 0.4, 4), ('Rotate', 0.8, 8)], + [('Solarize', 0.6, 3), ('Equalize', 0.6, 7)], + [('Posterize', 0.8, 5), ('Equalize', 1.0, 2)], + [('Rotate', 0.2, 3), ('Solarize', 0.6, 8)], + [('Equalize', 0.6, 8), ('Posterize', 0.4, 6)], + [('Rotate', 0.8, 8), ('Color', 0.4, 0)], + [('Rotate', 0.4, 9), ('Equalize', 0.6, 2)], + [('Equalize', 0.0, 7), ('Equalize', 0.8, 8)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Rotate', 0.8, 8), ('Color', 1.0, 2)], + [('Color', 0.8, 8), ('Solarize', 0.8, 7)], + [('Sharpness', 0.4, 7), ('Invert', 0.6, 8)], + [('ShearX', 0.6, 5), ('Equalize', 1.0, 9)], + [('Color', 0.4, 0), ('Equalize', 0.6, 3)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + ] + policy = [[SubPolicy(*args) for args in subpolicy] for subpolicy in policy] + return policy + + +def rand_augment_policy_original(magnitude_idx): + """ + 14 types of augment policies in original paper + Args: + magnitude_idx: M + """ + policy = [ + ('Posterize', 1, magnitude_idx), ('Rotate', 1, magnitude_idx), + ('Solarize', 1, magnitude_idx), ('AutoContrast', 1, magnitude_idx), + ('Equalize', 1, magnitude_idx), ('Contrast', 1, magnitude_idx), + ('Color', 1, magnitude_idx), ('Invert', 1, magnitude_idx), + ('Sharpness', 1, 
magnitude_idx), ('Brightness', 1, magnitude_idx), + ('ShearX', 1, magnitude_idx), ('ShearY', 1, magnitude_idx), + ('TranslateX', 1, magnitude_idx), ('TranslateY', 1, magnitude_idx), + ] + policy = [SubPolicy(*args) for args in policy] + return policy + + +class AutoAugment(): + """Auto Augment + Randomly choose a tuple of augment ops from a list of policy + Then apply the tuple of augment ops to input image + + Examples: + policy = auto_augment_policy_original() + augment = AutoAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy): + self.policy = policy + + def __call__(self, image, policy_idx=None): + if policy_idx is None: + policy_idx = random.randint(0, len(self.policy) - 1) + + sub_policy = self.policy[policy_idx] + for op in sub_policy: + image = op(image) + return image + + +class RandAugment(): + """Rand Augment + Randomly choose N augment ops from a list of K policies + Then apply the N ops to input image + + Examples: + policy = rand_augment_policy_original(magnitude_idx) + augment = RandAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy, num_layers): + """ + Args: + policy: list of SubPolicy + num_layers: int + """ + self.policy = policy + self.num_layers = num_layers + + def __call__(self, image): + selected_idx = np.random.choice(len(self.policy), self.num_layers) + + for policy_idx in selected_idx: + sub_policy = self.policy[policy_idx] + image = sub_policy(image) + return image + + +class SubPolicy: + """Subpolicy + Read augment name and magnitude, apply augment with probability + Args: + op_name: str, augment operation name + prob: float, if prob > random prob, apply augment + magnitude_idx: int, index of magnitude in preset magnitude ranges + """ + + def __init__(self, op_name, prob, magnitude_idx): + # ranges of operations' magnitude + ranges = { + 'ShearX': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'ShearY': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'TranslateX': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'TranslateY': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'Rotate': np.linspace(0, 30, 10), # [-30, 30] (by random negative) + 'Color': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Posterize': np.round(np.linspace(8, 4, 10), 0).astype(np.int), # [0, 4] + 'Solarize': np.linspace(256, 0, 10), # [0, 256] + 'Contrast': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Sharpness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Brightness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'AutoContrast': [0] * 10, # no range + 'Equalize': [0] * 10, # no range + 'Invert': [0] * 10, # no range + } + + # augmentation operations + # Lambda is not pickleable for DDP + # image_ops = { + # 'ShearX': lambda image, magnitude: shear_x(image, magnitude), + # 'ShearY': lambda image, magnitude: shear_y(image, magnitude), + # 'TranslateX': lambda image, magnitude: translate_x(image, magnitude), + # 'TranslateY': lambda image, magnitude: translate_y(image, magnitude), + # 'Rotate': lambda image, magnitude: rotate(image, magnitude), + # 'AutoContrast': lambda image, magnitude: auto_contrast(image, magnitude), + # 'Invert': lambda image, magnitude: invert(image, magnitude), + # 'Equalize': lambda image, magnitude: equalize(image, magnitude), + # 'Solarize': lambda image, magnitude: solarize(image, magnitude), + # 'Posterize': lambda image, magnitude: posterize(image, 
magnitude), + # 'Contrast': lambda image, magnitude: contrast(image, magnitude), + # 'Color': lambda image, magnitude: color(image, magnitude), + # 'Brightness': lambda image, magnitude: brightness(image, magnitude), + # 'Sharpness': lambda image, magnitude: sharpness(image, magnitude), + # } + image_ops = { + 'ShearX': shear_x, + 'ShearY': shear_y, + 'TranslateX': translate_x_relative, + 'TranslateY': translate_y_relative, + 'Rotate': rotate, + 'AutoContrast': auto_contrast, + 'Invert': invert, + 'Equalize': equalize, + 'Solarize': solarize, + 'Posterize': posterize, + 'Contrast': contrast, + 'Color': color, + 'Brightness': brightness, + 'Sharpness': sharpness, + } + + self.prob = prob + self.magnitude = ranges[op_name][magnitude_idx] + self.op = image_ops[op_name] + + def __call__(self, image): + if self.prob > random.random(): + image = self.op(image, self.magnitude) + return image + + +# PIL Image transforms +# https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.transform +def shear_x(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, factor, 0, 0, 1, 0), fillcolor=fillcolor) + + +def shear_y(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, factor, 1, 0), fillcolor=fillcolor) + + +def translate_x_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, pixels, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, pixels), fillcolor=fillcolor) + + +def translate_x_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, magnitude, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, magnitude), fillcolor=fillcolor) + + +def rotate(image, magnitude): + rot = image.convert("RGBA").rotate(magnitude) + return Image.composite(rot, + Image.new('RGBA', rot.size, (128,) * 4), + rot).convert(image.mode) + + +def auto_contrast(image, magnitude=None): + return ImageOps.autocontrast(image) + + +def invert(image, magnitude=None): + return ImageOps.invert(image) + + +def equalize(image, magnitude=None): + return ImageOps.equalize(image) + + +def solarize(image, magnitude): + return ImageOps.solarize(image, magnitude) + + +def posterize(image, magnitude): + return ImageOps.posterize(image, magnitude) + + +def contrast(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Contrast(image).enhance(1 + magnitude) + + +def color(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Color(image).enhance(1 + magnitude) + + +def brightness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Brightness(image).enhance(1 + 
magnitude) + + +def sharpness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Sharpness(image).enhance(1 + magnitude) + diff --git a/image_classification/BEiT/beit.png b/image_classification/BEiT/beit.png new file mode 100644 index 00000000..268b2fc7 Binary files /dev/null and b/image_classification/BEiT/beit.png differ diff --git a/image_classification/BEiT/beit.py b/image_classification/BEiT/beit.py new file mode 100644 index 00000000..867e0ae4 --- /dev/null +++ b/image_classification/BEiT/beit.py @@ -0,0 +1,517 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Implement Transformer Class for BEiT +""" + +import math +import copy +from functools import partial +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from droppath import DropPath + +trunc_normal_ = nn.initializer.TruncatedNormal(std=0.02) +zeros_ = nn.initializer.Constant(value=0.0) +ones_ = nn.initializer.Constant(value=1.0) + + +class Mlp(nn.Layer): + """MLP module + + MLP using nn.Linear and activation is GELU, dropout is applied. + Ops: fc1 -> act -> dropout -> fc2 -> dropout + + """ + + def __init__(self, + in_features, + hidden_features=None, + out_features=None, + act_layer=nn.GELU, + drop=0.0): + super().__init__() + out_features = out_features or in_features + hidden_features = hidden_features or in_features + self.fc1 = nn.Linear(in_features, hidden_features) + self.act = act_layer() + self.fc2 = nn.Linear(hidden_features, out_features) + self.drop = nn.Dropout(drop) + + def forward(self, x): + x = self.fc1(x) + x = self.act(x) + x = self.drop(x) + x = self.fc2(x) + x = self.drop(x) + return x + + +class PatchEmbed(nn.Layer): + """2D Image to Patch Embedding + + Apply patch embeddings on input images. Embeddings is implemented using a Conv2D op. + + """ + def __init__(self, + img_size=224, + patch_size=16, + in_chans=3, + embed_dim=768, + norm_layer=None, + flatten=True): + super().__init__() + img_size = (img_size, img_size) + patch_size = (patch_size, patch_size) + self.img_size = img_size + self.patch_size = patch_size + self.grid_size = (img_size[0] // patch_size[0], img_size[1] // patch_size[1]) + self.num_patches = self.grid_size[0] * self.grid_size[1] + self.flatten = flatten + + self.proj = nn.Conv2D( + in_chans, embed_dim, kernel_size=patch_size, stride=patch_size + ) + self.norm = norm_layer(embed_dim) if norm_layer else Identity() + + def forward(self, x): + B, C, H, W = x.shape + assert ( + H == self.img_size[0] and W == self.img_size[1] + ), f"Input image size ({H}*{W}) doesn't match model ({self.img_size[0]}*{self.img_size[1]})" + x = self.proj(x) + if self.flatten: + x = x.flatten(2).transpose((0, 2, 1)) # BCHW -> BNC + x = self.norm(x) + return x + + +class Identity(nn.Layer): + """Identity layer + + The output of this layer is the input without any change. 
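+    (for example, PatchEmbed above sets `self.norm = norm_layer(embed_dim) if norm_layer else Identity()`,
+    so `self.norm(x)` can always be called in its forward)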
+ Use this layer to avoid if condition in some forward methods + + """ + def __init__(self): + super().__init__() + + def forward(self, inputs): + return inputs + + +class Attention(nn.Layer): + """Attention Layer""" + def __init__(self, + dim, + num_heads=8, + qkv_bias=False, + attn_drop=0.0, + proj_drop=0.0, + window_size=None, + attn_head_dim=None): + super().__init__() + self.num_heads = num_heads + head_dim = dim // num_heads + if attn_head_dim is not None: + head_dim = attn_head_dim + all_head_dim = head_dim * self.num_heads + self.scale = head_dim ** -0.5 + + self.qkv = nn.Linear(dim, all_head_dim * 3, bias_attr=False) + if qkv_bias: + self.q_bias = paddle.create_parameter( + shape=[all_head_dim], dtype="float32", default_initializer=zeros_ + ) + + self.v_bias = paddle.create_parameter( + shape=[all_head_dim], dtype="float32", default_initializer=zeros_ + ) + else: + self.q_bias = None + self.v_bias = None + + if window_size: + self.window_size = window_size + self.num_relative_distance = (2 * window_size[0] - 1) * ( + 2 * window_size[1] - 1 + ) + 3 + + self.relative_position_bias_table = paddle.create_parameter( + shape=[self.num_relative_distance, num_heads], + dtype="float32", + default_initializer=zeros_, + ) # 2*Wh-1 * 2*Ww-1, nH + # cls to token & token 2 cls & cls to cls + + # get pair-wise relative position index for each token inside the window + coords_h = paddle.arange(window_size[0]) + coords_w = paddle.arange(window_size[1]) + coords = paddle.stack(paddle.meshgrid([coords_h, coords_w])) # 2, Wh, Ww + coords_flatten = paddle.flatten(coords, 1) # 2, Wh*Ww + relative_coords = coords_flatten.unsqueeze( + axis=2 + ) - coords_flatten.unsqueeze( + axis=1 + ) # 2, Wh*Ww, Wh*Ww #?? + relative_coords = relative_coords.transpose([1, 2, 0]) # Wh*Ww, Wh*Ww, 2 + relative_coords[:, :, 0] += window_size[0] - 1 # shift to start from 0 + relative_coords[:, :, 1] += window_size[1] - 1 + relative_coords[:, :, 0] *= 2 * window_size[1] - 1 + relative_position_index = paddle.zeros( + [ + window_size[0] * window_size[1] + 1, + window_size[0] * window_size[1] + 1, + ], + dtype=relative_coords.dtype, + ) + # Wh*Ww, Wh*Ww + relative_position_index[1:, 1:] = relative_coords.sum(-1) + relative_position_index[0, 0:] = self.num_relative_distance - 3 + relative_position_index[0:, 0] = self.num_relative_distance - 2 + relative_position_index[0, 0] = self.num_relative_distance - 1 + + self.register_buffer("relative_position_index", relative_position_index) + else: + self.window_size = None + self.relative_position_bias_table = None + self.relative_position_index = None + + self.attn_drop = nn.Dropout(attn_drop) + self.proj = nn.Linear(all_head_dim, dim) + self.proj_drop = nn.Dropout(proj_drop) + + def forward(self, x, rel_pos_bias): + B, N, C = x.shape + qkv_bias = None + if self.q_bias is not None: + qkv_bias = paddle.concat( + (self.q_bias, paddle.zeros_like(self.v_bias), self.v_bias) + ) + + qkv = F.linear(x=x, weight=self.qkv.weight, bias=qkv_bias) + + qkv = qkv.reshape([B, N, 3, self.num_heads, -1]).transpose([2, 0, 3, 1, 4]) + # make torchscript happy (cannot use tensor as tuple) + q, k, v = qkv[0], qkv[1], qkv[2] + + q = q * self.scale + + attn = q @ k.transpose([0, 1, 3, 2]) + + if self.relative_position_bias_table is not None: + relative_position_bias = self.relative_position_bias_table[ + self.relative_position_index.reshape([-1]) + ].reshape( + [ + self.window_size[0] * self.window_size[1] + 1, + self.window_size[0] * self.window_size[1] + 1, + -1, + ] + ) # Wh*Ww,Wh*Ww,nH + 
relative_position_bias = relative_position_bias.transpose( + [2, 0, 1] + ) # nH, Wh*Ww, Wh*Ww + + attn = attn + relative_position_bias.unsqueeze(axis=0) + + if rel_pos_bias is not None: + attn = attn + rel_pos_bias + + attn = F.softmax(attn, axis=-1) + attn = self.attn_drop(attn) + + x = (attn @ v).transpose([0, 2, 1, 3]).reshape([B, N, -1]) + x = self.proj(x) + x = self.proj_drop(x) + return x + + +class Block(nn.Layer): + def __init__(self, + dim, + num_heads, + mlp_ratio=4.0, + qkv_bias=False, + drop=0.0, + attn_drop=0.0, + drop_path=0.0, + init_values=None, + act_layer=nn.GELU, + norm_layer=nn.LayerNorm, + window_size=None, + attn_head_dim=None): + super().__init__() + self.norm1 = norm_layer(dim) + self.attn = Attention( + dim, + num_heads=num_heads, + qkv_bias=qkv_bias, + attn_drop=attn_drop, + proj_drop=drop, + window_size=window_size, + attn_head_dim=attn_head_dim, + ) + # NOTE: drop path for stochastic depth, we shall see if this is better than dropout here + self.drop_path = DropPath(drop_path) if drop_path > 0.0 else Identity() + self.norm2 = norm_layer(dim) + mlp_hidden_dim = int(dim * mlp_ratio) + self.mlp = Mlp( + in_features=dim, + hidden_features=mlp_hidden_dim, + act_layer=act_layer, + drop=drop, + ) + + if init_values: + self.gamma_1 = paddle.create_parameter( + shape=[dim], + dtype="float32", + default_initializer=nn.initializer.Constant(value=init_values), + ) + self.gamma_2 = paddle.create_parameter( + shape=[dim], + dtype="float32", + default_initializer=nn.initializer.Constant(value=init_values), + ) + else: + self.gamma_1, self.gamma_2 = None, None + + def forward(self, x, rel_pos_bias): + if self.gamma_1 is None: + x = x + self.drop_path(self.attn(self.norm1(x), rel_pos_bias=rel_pos_bias)) + x = x + self.drop_path(self.mlp(self.norm2(x))) + else: + x = x + self.drop_path( + self.gamma_1 * self.attn(self.norm1(x), rel_pos_bias=rel_pos_bias) + ) + x = x + self.drop_path(self.gamma_2 * self.mlp(self.norm2(x))) + return x + + +class RelativePositionBias(nn.Layer): + def __init__(self, window_size, num_heads): + super().__init__() + self.window_size = window_size + self.num_relative_distance = (2 * window_size[0] - 1) * ( + 2 * window_size[1] - 1 + ) + 3 + + self.relative_position_bias_table = paddle.create_parameter( + shape=[self.num_relative_distance, num_heads], + dtype="float32", + default_initializer=zeros_, + ) # 2*Wh-1 * 2*Ww-1, nH + # cls to token & token 2 cls & cls to cls + + # get pair-wise relative position index for each token inside the window + coords_h = paddle.arange(window_size[0]) + coords_w = paddle.arange(window_size[1]) + coords = paddle.stack(paddle.meshgrid([coords_h, coords_w])) # 2, Wh, Ww + coords_flatten = paddle.flatten(coords, 1) # 2, Wh*Ww + relative_coords = coords_flatten.unsqueeze(axis=2) - coords_flatten.unsqueeze( + axis=1 + ) # 2, Wh*Ww, Wh*Ww + relative_coords = relative_coords.transpose([1, 2, 0]) # Wh*Ww, Wh*Ww, 2 + relative_coords[:, :, 0] += window_size[0] - 1 # shift to start from 0 + relative_coords[:, :, 1] += window_size[1] - 1 + relative_coords[:, :, 0] *= 2 * window_size[1] - 1 + relative_position_index = paddle.zeros( + [window_size[0] * window_size[1] + 1, window_size[0] * window_size[1] + 1] + ) + relative_position_index[1:, 1:] = relative_coords.sum(-1) # Wh*Ww, Wh*Ww + relative_position_index[0, 0:] = self.num_relative_distance - 3 + relative_position_index[0:, 0] = self.num_relative_distance - 2 + relative_position_index[0, 0] = self.num_relative_distance - 1 + + self.register_buffer("relative_position_index", 
relative_position_index) + + # trunc_normal_(self.relative_position_bias_table, std=.02) + + def forward(self): + relative_position_bias = self.relative_position_bias_table[ + self.relative_position_index.reshape([-1])].reshape( + self.window_size[0] * self.window_size[1] + 1, + self.window_size[0] * self.window_size[1] + 1, -1) # Wh*Ww,Wh*Ww,nH + return relative_position_bias.transpose([2, 0, 1]) # nH, Wh*Ww, Wh*Ww + + +class Beit(nn.Layer): + """Beit Layer""" + def __init__(self, + img_size=224, + patch_size=16, + in_chans=3, + num_classes=1000, + embed_dim=768, + depth=12, + num_heads=12, + mlp_ratio=4.0, + qkv_bias=True, + drop_rate=0.0, + attn_drop_rate=0.0, + drop_path_rate=0.0, + norm_layer=partial(nn.LayerNorm, epsilon=1e-6), + init_values=None, + use_abs_pos_emb=True, + use_rel_pos_bias=False, + use_shared_rel_pos_bias=False, + use_mean_pooling=True, + init_scale=0.001): + super().__init__() + self.num_classes = num_classes + # num_features for consistency with other models + self.num_features = self.embed_dim = embed_dim + + self.patch_embed = PatchEmbed( + img_size=img_size, + patch_size=patch_size, + in_chans=in_chans, + embed_dim=embed_dim, + ) + num_patches = self.patch_embed.num_patches + + self.cls_token = paddle.create_parameter( + shape=[1, 1, embed_dim], + dtype="float32", + default_initializer=trunc_normal_, + ) + + if use_abs_pos_emb: + self.pos_embed = paddle.create_parameter( + shape=[1, num_patches + 1, embed_dim], + dtype="float32", + default_initializer=trunc_normal_, + ) + else: + self.pos_embed = None + self.pos_drop = nn.Dropout(p=drop_rate) + + if use_shared_rel_pos_bias: + self.rel_pos_bias = RelativePositionBias( + window_size=self.patch_embed.grid_size, num_heads=num_heads + ) + else: + self.rel_pos_bias = None + + # stochastic depth decay rule + dpr = [x.item() for x in paddle.linspace(0, drop_path_rate, depth)] + self.use_rel_pos_bias = use_rel_pos_bias + self.blocks = nn.LayerList( + [ + Block( + dim=embed_dim, + num_heads=num_heads, + mlp_ratio=mlp_ratio, + qkv_bias=qkv_bias, + drop=drop_rate, + attn_drop=attn_drop_rate, + drop_path=dpr[i], + norm_layer=norm_layer, + init_values=init_values, + window_size=self.patch_embed.grid_size if use_rel_pos_bias else None, + ) + for i in range(depth) + ] + ) + self.norm = Identity() if use_mean_pooling else norm_layer(embed_dim) + self.fc_norm = norm_layer(embed_dim) if use_mean_pooling else None + self.head = nn.Linear(embed_dim, num_classes) if num_classes > 0 else Identity() + + self.apply(self._init_weights) + self.fix_init_weight() + if isinstance(self.head, nn.Linear): + trunc_normal_(self.head.weight) + self.head.weight.set_value( + self.head.weight.multiply(paddle.to_tensor(init_scale)) + ) + self.head.bias.set_value( + self.head.bias.multiply(paddle.to_tensor(init_scale)) + ) + + def fix_init_weight(self): + def rescale(param, layer_id): + param.set_value(param.divide(paddle.to_tensor(math.sqrt(2.0 * layer_id)))) + + for layer_id, layer in enumerate(self.blocks): + rescale(layer.attn.proj.weight, layer_id + 1) + rescale(layer.mlp.fc2.weight, layer_id + 1) + + def _init_weights(self, m): + if isinstance(m, nn.Linear): + trunc_normal_(m.weight) + if isinstance(m, nn.Linear) and m.bias is not None: + zeros_(m.bias) + elif isinstance(m, nn.LayerNorm): + zeros_(m.bias) + ones_(m.weight) + + def get_num_layers(self): + return len(self.blocks) + + def get_classifier(self): + return self.head + + def reset_classifier(self, num_classes): + self.num_classes = num_classes + self.head = ( + nn.Linear(self.embed_dim, 
num_classes) if num_classes > 0 else Identity() + ) + + def forward_features(self, x): + x = self.patch_embed(x) + batch_size, seq_len, _ = x.shape + + cls_tokens = self.cls_token.expand([batch_size, -1, -1]) + + x = paddle.concat((cls_tokens, x), axis=1) + + if self.pos_embed is not None: + x = x + self.pos_embed + x = self.pos_drop(x) + + rel_pos_bias = self.rel_pos_bias() if self.rel_pos_bias is not None else None + for blk in self.blocks: + x = blk(x, rel_pos_bias=rel_pos_bias) + + x = self.norm(x) + if self.fc_norm is not None: + t = x[:, 1:, :] + return self.fc_norm(t.mean(1)) + + return x[:, 0] + + def forward(self, x): + x = self.forward_features(x) + x = self.head(x) + return x + + +def build_beit(config): + """ build beit from config""" + model = Beit( + img_size=config.DATA.IMAGE_SIZE, + num_classes=config.MODEL.NUM_CLASSES, + patch_size=config.MODEL.TRANS.PATCH_SIZE, + embed_dim=config.MODEL.TRANS.EMBED_DIM, + depth=config.MODEL.TRANS.DEPTH, + num_heads=config.MODEL.TRANS.NUM_HEADS, + mlp_ratio=config.MODEL.TRANS.MLP_RATIO, + use_abs_pos_emb=config.MODEL.TRANS.USE_ABS_POS_EMB, + use_rel_pos_bias=config.MODEL.TRANS.USE_REL_POS_BIAS, + init_values=config.MODEL.TRANS.INIT_VALUES, + ) + return model diff --git a/image_classification/BEiT/config.py b/image_classification/BEiT/config.py new file mode 100644 index 00000000..07043cc4 --- /dev/null +++ b/image_classification/BEiT/config.py @@ -0,0 +1,189 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Configuration + +Configuration for data, model archtecture, and training, etc. 
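+A default config object is created by `get_config()` (optionally merged from a .yaml file) and then
+overridden from command-line arguments by `update_config(config, args)`, as done in main_multi_gpu.py.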
+Config can be set by .yaml file or by argparser(limited usage) + + +""" + +import os +from yacs.config import CfgNode as CN +import yaml + +_C = CN() +_C.BASE = [''] + +# data settings +_C.DATA = CN() +_C.DATA.BATCH_SIZE = 8 #1024 batch_size for single GPU +_C.DATA.BATCH_SIZE_EVAL = 8 #1024 batch_size for single GPU +_C.DATA.DATA_PATH = '/dataset/imagenet/' # path to dataset +_C.DATA.DATASET = 'imagenet2012' # dataset name +_C.DATA.IMAGE_SIZE = 224 # input image size +_C.DATA.CROP_PCT = 0.9 # input image scale ratio, scale is applied before centercrop in eval mode +_C.DATA.NUM_WORKERS = 4 # number of data loading threads +_C.DATA.IMAGENET_MEAN = [0.5, 0.5, 0.5] #[0.485, 0.456, 0.406] +_C.DATA.IMAGENET_STD = [0.5, 0.5, 0.5] #[0.229, 0.224, 0.225] + +# model settings +_C.MODEL = CN() +_C.MODEL.TYPE = 'BEiT' +_C.MODEL.NAME = 'BEiT' +_C.MODEL.RESUME = None +_C.MODEL.PRETRAINED = None +_C.MODEL.NUM_CLASSES = 1000 + +# transformer settings +_C.MODEL.TRANS = CN() +_C.MODEL.TRANS.PATCH_SIZE = 16 +_C.MODEL.TRANS.EMBED_DIM = 768 +_C.MODEL.TRANS.DEPTH = 12 +_C.MODEL.TRANS.NUM_HEADS = 12 +_C.MODEL.TRANS.MLP_RATIO = 4 +_C.MODEL.TRANS.QKV_BIAS = True +_C.MODEL.TRANS.USE_ABS_POS_EMB = False +_C.MODEL.TRANS.USE_REL_POS_BIAS = True +_C.MODEL.TRANS.INIT_VALUES = 0.1 + +# training settings +_C.TRAIN = CN() +_C.TRAIN.LAST_EPOCH = 0 +_C.TRAIN.NUM_EPOCHS = 300 +_C.TRAIN.WARMUP_EPOCHS = 20 +_C.TRAIN.WEIGHT_DECAY = 0.05 +_C.TRAIN.BASE_LR = 5e-4 +_C.TRAIN.WARMUP_START_LR = 5e-7 +_C.TRAIN.END_LR = 5e-6 +_C.TRAIN.GRAD_CLIP = 5.0 +_C.TRAIN.ACCUM_ITER = 1 + +_C.TRAIN.LR_SCHEDULER = CN() +_C.TRAIN.LR_SCHEDULER.NAME = 'warmupcosine' +_C.TRAIN.LR_SCHEDULER.MILESTONES = "30, 60, 90" # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_EPOCHS = 30 # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_RATE = 0.1 # only used in StepLRScheduler + +_C.TRAIN.OPTIMIZER = CN() +_C.TRAIN.OPTIMIZER.NAME = 'AdamW' +_C.TRAIN.OPTIMIZER.EPS = 1e-8 +_C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999) # for adamW +_C.TRAIN.OPTIMIZER.MOMENTUM = 0.9 + +# train augmentation +_C.TRAIN.MIXUP_ALPHA = 0.8 +_C.TRAIN.CUTMIX_ALPHA = 1.0 +_C.TRAIN.CUTMIX_MINMAX = None +_C.TRAIN.MIXUP_PROB = 1.0 +_C.TRAIN.MIXUP_SWITCH_PROB = 0.5 +_C.TRAIN.MIXUP_MODE = 'batch' + +_C.TRAIN.SMOOTHING = 0.1 +_C.TRAIN.COLOR_JITTER = 0.4 +_C.TRAIN.AUTO_AUGMENT = True #'rand-m9-mstd0.5-inc1' + +_C.TRAIN.RANDOM_ERASE_PROB = 0.25 +_C.TRAIN.RANDOM_ERASE_MODE = 'pixel' +_C.TRAIN.RANDOM_ERASE_COUNT = 1 +_C.TRAIN.RANDOM_ERASE_SPLIT = False + +# augmentation +_C.AUG = CN() +_C.AUG.COLOR_JITTER = 0.4 # color jitter factor +_C.AUG.AUTO_AUGMENT = 'rand-m9-mstd0.5-inc1' +_C.AUG.RE_PROB = 0.25 # random earse prob +_C.AUG.RE_MODE = 'pixel' # random earse mode +_C.AUG.RE_COUNT = 1 # random earse count +_C.AUG.MIXUP = 0.8 # mixup alpha, enabled if >0 +_C.AUG.CUTMIX = 1.0 # cutmix alpha, enabled if >0 +_C.AUG.CUTMIX_MINMAX = None # cutmix min/max ratio, overrides alpha +_C.AUG.MIXUP_PROB = 1.0 # prob of mixup or cutmix when either/both is enabled +_C.AUG.MIXUP_SWITCH_PROB = 0.5 # prob of switching cutmix when both mixup and cutmix enabled +_C.AUG.MIXUP_MODE = 'batch' #how to apply mixup/curmix params, per 'batch', 'pair', or 'elem' + +# misc +_C.SAVE = "./output" +_C.TAG = "default" +_C.SAVE_FREQ = 1 # freq to save chpt +_C.REPORT_FREQ = 50 # freq to logging info +_C.VALIDATE_FREQ = 10 # freq to do validation +_C.SEED = 0 +_C.EVAL = False # run evaluation only +_C.AMP = False +_C.LOCAL_RANK = 0 +_C.NGPUS = -1 + + +def _update_config_from_file(config, cfg_file): + config.defrost() + with 
open(cfg_file, 'r') as infile: + yaml_cfg = yaml.load(infile, Loader=yaml.FullLoader) + for cfg in yaml_cfg.setdefault('BASE', ['']): + if cfg: + _update_config_from_file( + config, os.path.join(os.path.dirname(cfg_file), cfg) + ) + print('merging config from {}'.format(cfg_file)) + config.merge_from_file(cfg_file) + config.freeze() + +def update_config(config, args): + """Update config by ArgumentParser + Args: + args: ArgumentParser contains options + Return: + config: updated config + """ + if args.cfg: + _update_config_from_file(config, args.cfg) + config.defrost() + if args.dataset: + config.DATA.DATASET = args.dataset + if args.batch_size: + config.DATA.BATCH_SIZE = args.batch_size + if args.image_size: + config.DATA.IMAGE_SIZE = args.image_size + if args.data_path: + config.DATA.DATA_PATH = args.data_path + if args.output is not None: + config.SAVE = args.output + if args.ngpus: + config.NGPUS = args.ngpus + if args.eval: + config.EVAL = True + config.DATA.BATCH_SIZE_EVAL = args.batch_size + if args.pretrained: + config.MODEL.PRETRAINED = args.pretrained + if args.resume: + config.MODEL.RESUME = args.resume + if args.last_epoch: + config.TRAIN.LAST_EPOCH = args.last_epoch + if args.amp: # only during training + if config.EVAL is True: + config.AMP = False + else: + config.AMP = True + + #config.freeze() + return config + + +def get_config(cfg_file=None): + """Return a clone of config or load from yaml file""" + config = _C.clone() + if cfg_file: + _update_config_from_file(config, cfg_file) + return config diff --git a/image_classification/BEiT/configs/beit_base_patch16_224.yaml b/image_classification/BEiT/configs/beit_base_patch16_224.yaml new file mode 100644 index 00000000..37bf72e2 --- /dev/null +++ b/image_classification/BEiT/configs/beit_base_patch16_224.yaml @@ -0,0 +1,15 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.9 +MODEL: + TYPE: BEiT + NAME: beit_base_patch16_224 + TRANS: + PATCH_SIZE: 16 + EMBED_DIM: 768 + DEPTH: 12 + NUM_HEADS: 12 + MLP_RATIO: 4 + USE_ABS_POS_EMB: False + USE_REL_POS_BIAS: True + INIT_VALUES: 0.1 \ No newline at end of file diff --git a/image_classification/BEiT/configs/beit_base_patch16_384.yaml b/image_classification/BEiT/configs/beit_base_patch16_384.yaml new file mode 100644 index 00000000..57a2296d --- /dev/null +++ b/image_classification/BEiT/configs/beit_base_patch16_384.yaml @@ -0,0 +1,15 @@ +DATA: + IMAGE_SIZE: 384 + CROP_PCT: 1.0 +MODEL: + TYPE: BEiT + NAME: beit_base_patch16_384 + TRANS: + PATCH_SIZE: 16 + EMBED_DIM: 768 + DEPTH: 12 + NUM_HEADS: 12 + MLP_RATIO: 4 + USE_ABS_POS_EMB: False + USE_REL_POS_BIAS: True + INIT_VALUES: 0.1 \ No newline at end of file diff --git a/image_classification/BEiT/configs/beit_large_patch16_224.yaml b/image_classification/BEiT/configs/beit_large_patch16_224.yaml new file mode 100644 index 00000000..938b9501 --- /dev/null +++ b/image_classification/BEiT/configs/beit_large_patch16_224.yaml @@ -0,0 +1,15 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.9 +MODEL: + TYPE: BEiT + NAME: beit_large_patch16_224 + TRANS: + PATCH_SIZE: 16 + EMBED_DIM: 1024 + DEPTH: 24 + NUM_HEADS: 16 + MLP_RATIO: 4 + USE_ABS_POS_EMB: False + USE_REL_POS_BIAS: True + INIT_VALUES: 1e-5 \ No newline at end of file diff --git a/image_classification/BEiT/configs/beit_large_patch16_384.yaml b/image_classification/BEiT/configs/beit_large_patch16_384.yaml new file mode 100644 index 00000000..e8df4b09 --- /dev/null +++ b/image_classification/BEiT/configs/beit_large_patch16_384.yaml @@ -0,0 +1,15 @@ +DATA: + IMAGE_SIZE: 384 + CROP_PCT: 1.0 +MODEL: + TYPE: 
BEiT + NAME: beit_large_patch16_384 + TRANS: + PATCH_SIZE: 16 + EMBED_DIM: 1024 + DEPTH: 24 + NUM_HEADS: 16 + MLP_RATIO: 4 + USE_ABS_POS_EMB: False + USE_REL_POS_BIAS: True + INIT_VALUES: 1e-5 \ No newline at end of file diff --git a/image_classification/BEiT/configs/beit_large_patch16_512.yaml b/image_classification/BEiT/configs/beit_large_patch16_512.yaml new file mode 100644 index 00000000..bda24cd5 --- /dev/null +++ b/image_classification/BEiT/configs/beit_large_patch16_512.yaml @@ -0,0 +1,15 @@ +DATA: + IMAGE_SIZE: 512 + CROP_PCT: 1.0 +MODEL: + TYPE: BEiT + NAME: beit_large_patch16_512 + TRANS: + PATCH_SIZE: 16 + EMBED_DIM: 1024 + DEPTH: 24 + NUM_HEADS: 16 + MLP_RATIO: 4 + USE_ABS_POS_EMB: False + USE_REL_POS_BIAS: True + INIT_VALUES: 1e-5 \ No newline at end of file diff --git a/image_classification/BEiT/datasets.py b/image_classification/BEiT/datasets.py new file mode 100644 index 00000000..e57d0332 --- /dev/null +++ b/image_classification/BEiT/datasets.py @@ -0,0 +1,214 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Dataset related classes and methods for ViT training and validation +Cifar10, Cifar100 and ImageNet2012 are supported +""" + +import os +import math +from paddle.io import Dataset +from paddle.io import DataLoader +from paddle.io import DistributedBatchSampler +from paddle.vision import transforms +from paddle.vision import datasets +from paddle.vision import image_load +from augment import auto_augment_policy_original +from augment import AutoAugment +from transforms import RandomHorizontalFlip +from random_erasing import RandomErasing + +class ImageNet2012Dataset(Dataset): + """Build ImageNet2012 dataset + + This class gets train/val imagenet datasets, which loads transfomed data and labels. 
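+    The file_folder is expected to contain a train_list.txt / val_list.txt file with one
+    "relative/image/path label" pair per line; see __init__ below for how it is parsed.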
+ + Attributes: + file_folder: path where imagenet images are stored + transform: preprocessing ops to apply on image + img_path_list: list of full path of images in whole dataset + label_list: list of labels of whole dataset + """ + + def __init__(self, file_folder, mode="train", transform=None): + """Init ImageNet2012 Dataset with dataset file path, mode(train/val), and transform""" + super(ImageNet2012Dataset, self).__init__() + assert mode in ["train", "val"] + self.file_folder = file_folder + self.transform = transform + self.img_path_list = [] + self.label_list = [] + + if mode == "train": + self.list_file = os.path.join(self.file_folder, "train_list.txt") + else: + self.list_file = os.path.join(self.file_folder, "val_list.txt") + + with open(self.list_file, 'r') as infile: + for line in infile: + img_path = line.strip().split()[0] + img_label = int(line.strip().split()[1]) + self.img_path_list.append(os.path.join(self.file_folder, img_path)) + self.label_list.append(img_label) + print(f'----- Imagenet2012 image {mode} list len = {len(self.label_list)}') + + def __len__(self): + return len(self.label_list) + + def __getitem__(self, index): + data = image_load(self.img_path_list[index]).convert('RGB') + data = self.transform(data) + label = self.label_list[index] + + return data, label + + +def get_train_transforms(config): + """ Get training transforms + + For training, a RandomResizedCrop is applied, then normalization is applied with + [0.5, 0.5, 0.5] mean and std. The input pixel values must be rescaled to [0, 1.] + Outputs is converted to tensor + + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + + aug_op_list = [] + # STEP1: random crop and resize + aug_op_list.append( + transforms.RandomResizedCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE), + scale=(0.05, 1.0), interpolation='bicubic')) + # STEP2: auto_augment or color jitter + if config.TRAIN.AUTO_AUGMENT: + policy = auto_augment_policy_original() + auto_augment = AutoAugment(policy) + aug_op_list.append(auto_augment) + else: + jitter = (float(config.TRAIN.COLOR_JITTER), ) * 3 + aug_op_list.append(transforms.ColorJitter(*jitter)) + # STEP3: other ops + aug_op_list.append(transforms.ToTensor()) + aug_op_list.append(transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, + std=config.DATA.IMAGENET_STD)) + # STEP4: random erasing + if config.TRAIN.RANDOM_ERASE_PROB > 0.: + random_erasing = RandomErasing(prob=config.TRAIN.RANDOM_ERASE_PROB, + mode=config.TRAIN.RANDOM_ERASE_MODE, + max_count=config.TRAIN.RANDOM_ERASE_COUNT, + num_splits=config.TRAIN.RANDOM_ERASE_SPLIT) + aug_op_list.append(random_erasing) + # Final: compose transforms and return + transforms_train = transforms.Compose(aug_op_list) + return transforms_train + + +def get_val_transforms(config): + """ Get training transforms + + For validation, image is first Resize then CenterCrop to image_size. + Then normalization is applied with [0.5, 0.5, 0.5] mean and std. + The input pixel values must be rescaled to [0, 1.] 
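+    Resize uses scale_size = floor(IMAGE_SIZE / CROP_PCT), e.g. floor(224 / 0.9) = 248 with the
+    default 224 config, before center-cropping back to IMAGE_SIZE.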
+ Outputs is converted to tensor + + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + + scale_size = int(math.floor(config.DATA.IMAGE_SIZE / config.DATA.CROP_PCT)) + transforms_val = transforms.Compose([ + transforms.Resize(scale_size, interpolation='bicubic'), + transforms.CenterCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE)), + transforms.ToTensor(), + transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD), + ]) + return transforms_val + + +def get_dataset(config, mode='train'): + """ Get dataset from config and mode (train/val) + + Returns the related dataset object according to configs and mode(train/val) + + Args: + config: configs contains dataset related settings. see config.py for details + Returns: + dataset: dataset object + """ + + assert mode in ['train', 'val'] + if config.DATA.DATASET == "cifar10": + if mode == 'train': + dataset = datasets.Cifar10(mode=mode, transform=get_train_transforms(config)) + else: + mode = 'test' + dataset = datasets.Cifar10(mode=mode, transform=get_val_transforms(config)) + elif config.DATA.DATASET == "cifar100": + if mode == 'train': + dataset = datasets.Cifar100(mode=mode, transform=get_train_transforms(config)) + else: + mode = 'test' + dataset = datasets.Cifar100(mode=mode, transform=get_val_transforms(config)) + elif config.DATA.DATASET == "imagenet2012": + if mode == 'train': + dataset = ImageNet2012Dataset(config.DATA.DATA_PATH, + mode=mode, + transform=get_train_transforms(config)) + else: + dataset = ImageNet2012Dataset(config.DATA.DATA_PATH, + mode=mode, + transform=get_val_transforms(config)) + else: + raise NotImplementedError( + "[{config.DATA.DATASET}] Only cifar10, cifar100, imagenet2012 are supported now") + return dataset + + +def get_dataloader(config, dataset, mode='train', multi_process=False): + """Get dataloader with config, dataset, mode as input, allows multiGPU settings. + + Multi-GPU loader is implements as distributedBatchSampler. + + Args: + config: see config.py for details + dataset: paddle.io.dataset object + mode: train/val + multi_process: if True, use DistributedBatchSampler to support multi-processing + Returns: + dataloader: paddle.io.DataLoader object. + """ + + if mode == 'train': + batch_size = config.DATA.BATCH_SIZE + else: + batch_size = config.DATA.BATCH_SIZE_EVAL + + if multi_process is True: + sampler = DistributedBatchSampler(dataset, + batch_size=batch_size, + shuffle=(mode == 'train')) + dataloader = DataLoader(dataset, + batch_sampler=sampler, + num_workers=config.DATA.NUM_WORKERS) + else: + dataloader = DataLoader(dataset, + batch_size=batch_size, + num_workers=config.DATA.NUM_WORKERS, + shuffle=(mode == 'train')) + return dataloader diff --git a/image_classification/BEiT/droppath.py b/image_classification/BEiT/droppath.py new file mode 100644 index 00000000..65e0a782 --- /dev/null +++ b/image_classification/BEiT/droppath.py @@ -0,0 +1,49 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Droppath, reimplement from https://github.com/yueatsprograms/Stochastic_Depth +""" + +import paddle +import paddle.nn as nn + +def drop_path(inputs, drop_prob=0., training=False): + """drop path op + Args: + input: tensor with arbitrary shape + drop_prob: float number of drop path probability, default: 0.0 + training: bool, if current mode is training, default: False + Returns: + output: output tensor after drop path + """ + # if prob is 0 or eval mode, return original input + if drop_prob == 0. or not training: + return inputs + keep_prob = 1 - drop_prob + shape = (inputs.shape[0], ) + (1, ) * (inputs.ndim - 1) # shape=(N, 1, 1, 1) + random_tensor = keep_prob + paddle.rand(shape, dtype=inputs.dtype) + random_tensor = random_tensor.floor() # mask + output = inputs.divide(keep_prob) * random_tensor # divide is to keep same output expectation + return output + + +class DropPath(nn.Layer): + """DropPath class""" + def __init__(self, drop_prob=None): + super(DropPath, self).__init__() + self.drop_prob = drop_prob + + def forward(self, inputs): + return drop_path(inputs, self.drop_prob, self.training) diff --git a/image_classification/BEiT/losses.py b/image_classification/BEiT/losses.py new file mode 100644 index 00000000..f67780a2 --- /dev/null +++ b/image_classification/BEiT/losses.py @@ -0,0 +1,121 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +""" Implement Loss functions """ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class LabelSmoothingCrossEntropyLoss(nn.Layer): + """ cross entropy loss for label smoothing + Args: + smoothing: float, smoothing rate + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, target label with shape [N] + Return: + loss: float, cross entropy loss value + """ + def __init__(self, smoothing=0.1): + super().__init__() + assert 0 <= smoothing < 1.0 + self.smoothing = smoothing + self.confidence = 1 - smoothing + + def forward(self, x, target): + log_probs = F.log_softmax(x) # [N, num_classes] + # target_index is used to get prob for each of the N samples + target_index = paddle.zeros([x.shape[0], 2], dtype='int64') # [N, 2] + target_index[:, 0] = paddle.arange(x.shape[0]) + target_index[:, 1] = target + + nll_loss = -log_probs.gather_nd(index=target_index) # index: [N] + smooth_loss = -log_probs.mean(axis=-1) + loss = self.confidence * nll_loss + self.smoothing * smooth_loss + return loss.mean() + + +class SoftTargetCrossEntropyLoss(nn.Layer): + """ cross entropy loss for soft target + Args: + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, soft target with shape [N, num_classes] + Returns: + loss: float, the mean loss value + """ + def __init__(self): + super().__init__() + + def forward(self, x, target): + loss = paddle.sum(-target * F.log_softmax(x, axis=-1), axis=-1) + return loss.mean() + + +class DistillationLoss(nn.Layer): + """Distillation loss function + This layer includes the orginal loss (criterion) and a extra + distillation loss (criterion), which computes the loss with + different type options, between current model and + a teacher model as its supervision. 
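+    With 'soft', the distillation term is the KL divergence between the temperature-scaled
+    log-softmax of the student's distillation output and of the teacher's output (scaled by tau^2);
+    with 'hard', it is cross entropy against the teacher's argmax; with 'none', only the base loss is returned.
+    The final loss is base_loss * (1 - alpha) + distillation_loss * alpha.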
+ + Args: + base_criterion: nn.Layer, the original criterion + teacher_model: nn.Layer, the teacher model as supervision + distillation_type: str, one of ['none', 'soft', 'hard'] + alpha: float, ratio of base loss (* (1-alpha)) + and distillation loss( * alpha) + tao: float, temperature in distillation + """ + def __init__(self, + base_criterion, + teacher_model, + distillation_type, + alpha, + tau): + super().__init__() + assert distillation_type in ['none', 'soft', 'hard'] + self.base_criterion = base_criterion + self.teacher_model = teacher_model + self.type = distillation_type + self.alpha = alpha + self.tau = tau + + def forward(self, inputs, outputs, targets): + """ + Args: + inputs: tensor, the orginal model inputs + outputs: tensor, the outputs of the model + outputds_kd: tensor, the distillation outputs of the model, + this is usually obtained by a separate branch + in the last layer of the model + targets: tensor, the labels for the base criterion + """ + outputs, outputs_kd = outputs[0], outputs[1] + base_loss = self.base_criterion(outputs, targets) + if self.type == 'none': + return base_loss + + with paddle.no_grad(): + teacher_outputs = self.teacher_model(inputs) + + if self.type == 'soft': + distillation_loss = F.kl_div( + F.log_softmax(outputs_kd / self.tau, axis=1), + F.log_softmax(teacher_outputs / self.tau, axis=1), + reduction='sum') * (self.tau * self.tau) / outputs_kd.numel() + elif self.type == 'hard': + distillation_loss = F.cross_entropy(outputs_kd, teacher_outputs.argmax(axis=1)) + + loss = base_loss * (1 - self.alpha) + distillation_loss * self.alpha + return loss diff --git a/image_classification/BEiT/main_multi_gpu.py b/image_classification/BEiT/main_multi_gpu.py new file mode 100644 index 00000000..7fd3ff8c --- /dev/null +++ b/image_classification/BEiT/main_multi_gpu.py @@ -0,0 +1,583 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +"""BEiT training/validation using multiple GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +import paddle.distributed as dist +from datasets import get_dataloader +from datasets import get_dataset +from utils import AverageMeter +from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn +from config import get_config +from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from losses import DistillationLoss +from beit import build_beit as build_model + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('Swin') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + mixup_fn=None, + amp=False, + local_logger=None, + master_logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + master_train_loss_meter.avg: float, average loss on all processes/gpus + master_train_acc_meter.avg: float, average top1 accuracy on all processes/gpus + train_time: float, training time + """ + model.train() + train_loss_meter = AverageMeter() + train_acc_meter 
= AverageMeter() + master_train_loss_meter = AverageMeter() + master_train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(image, output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() + + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = paddle.to_tensor(image.shape[0]) + + # sync from other gpus for overall loss and acc + master_loss = loss.clone() + master_acc = acc.clone() + master_batch_size = batch_size.clone() + dist.all_reduce(master_loss) + dist.all_reduce(master_acc) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc = master_acc / dist.get_world_size() + master_train_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_train_acc_meter.update(master_acc.numpy()[0], master_batch_size.numpy()[0]) + + train_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + train_acc_meter.update(acc.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_train_loss_meter.avg:.4f}, " + + f"Avg Acc: {master_train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return (train_loss_meter.avg, + train_acc_meter.avg, + master_train_loss_meter.avg, + master_train_acc_meter.avg, + train_time) + + +def validate(dataloader, + model, + criterion, + total_batch, + debug_steps=100, + local_logger=None, + master_logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + total_epoch: int, total num of epoch, for logging + debug_steps: int, num of iters to log info, default: 100 + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + master_val_loss_meter.avg: float, average loss on all processes/gpus + master_val_acc1_meter.avg: float, 
average top1 accuracy on all processes/gpus + master_val_acc5_meter.avg: float, average top5 accuracy on all processes/gpus + val_time: float, validation time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = AverageMeter() + master_val_loss_meter = AverageMeter() + master_val_acc1_meter = AverageMeter() + master_val_acc5_meter = AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = paddle.to_tensor(image.shape[0]) + + master_loss = loss.clone() + master_acc1 = acc1.clone() + master_acc5 = acc5.clone() + master_batch_size = batch_size.clone() + + dist.all_reduce(master_loss) + dist.all_reduce(master_acc1) + dist.all_reduce(master_acc5) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc1 = master_acc1 / dist.get_world_size() + master_acc5 = master_acc5 / dist.get_world_size() + + master_val_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc1_meter.update(master_acc1.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc5_meter.update(master_acc5.numpy()[0], master_batch_size.numpy()[0]) + + val_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + val_acc1_meter.update(acc1.numpy()[0], batch_size.numpy()[0]) + val_acc5_meter.update(acc5.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {master_val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {master_val_acc5_meter.avg:.4f}") + val_time = time.time() - time_st + return (val_loss_meter.avg, + val_acc1_meter.avg, + val_acc5_meter.avg, + master_val_loss_meter.avg, + master_val_acc1_meter.avg, + master_val_acc5_meter.avg, + val_time) + + +def main_worker(*args): + # STEP 0: Preparation + config = args[0] + dist.init_parallel_env() + last_epoch = config.TRAIN.LAST_EPOCH + world_size = dist.get_world_size() + local_rank = dist.get_rank() + seed = config.SEED + local_rank + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + # logger for each process/gpu + local_logger = get_logger( + filename=os.path.join(config.SAVE, 'log_{}.txt'.format(local_rank)), + logger_name='local_logger') + # overall logger + if local_rank == 0: + master_logger = get_logger( + filename=os.path.join(config.SAVE, 'log.txt'), + logger_name='master_logger') + master_logger.info(f'\n{config}') + else: + master_logger = None + local_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + if local_rank == 0: + master_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + + # STEP 1: Create model + model = build_model(config) + model = paddle.DataParallel(model) + + # STEP 2: Create train and val dataloader + dataset_train, dataset_val = args[1], args[2] + # Create training dataloader + if not config.EVAL: + dataloader_train = 
get_dataloader(config, dataset_train, 'train', True) + total_batch_train = len(dataloader_train) + local_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + if local_rank == 0: + master_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + # Create validation dataloader + dataloader_val = get_dataloader(config, dataset_val, 'test', True) + total_batch_val = len(dataloader_val) + local_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + if local_rank == 0: + master_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from official code) + linear_scaled_lr = (config.TRAIN.BASE_LR * + config.DATA.BATCH_SIZE * dist.get_world_size()) / 512.0 + linear_scaled_warmup_start_lr = (config.TRAIN.WARMUP_START_LR * + config.DATA.BATCH_SIZE * dist.get_world_size()) / 512.0 + linear_scaled_end_lr = (config.TRAIN.END_LR * + config.DATA.BATCH_SIZE * dist.get_world_size()) / 512.0 + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH, + ) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + local_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = 
paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip, + apply_decay_param_fun=get_exclude_from_weight_decay_fn([ + 'absolute_pos_embed', 'relative_position_bias_table']), + ) + else: + local_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 6: Load pretrained model / load resumt model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + model.set_dict(model_state) + local_logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + if local_rank == 0: + master_logger.info( + f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME+'.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME+'.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME+'.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') + optimizer.set_state_dict(opt_state) + local_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + if local_rank == 0: + master_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + + # STEP 7: Validation (eval mode) + if config.EVAL: + local_logger.info('----- Start Validating') + if local_rank == 0: + master_logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 8: Start training and validation (train mode) + local_logger.info(f"Start training from epoch {last_epoch+1}.") + if local_rank == 0: + master_logger.info(f"Start training from epoch {last_epoch+1}.") + for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + local_logger.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") + if local_rank == 0: + master_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + train_loss, train_acc, avg_loss, avg_acc, train_time = train( + dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=total_batch_train, + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + mixup_fn=mixup_fn, + amp=config.AMP, + local_logger=local_logger, + master_logger=master_logger) + + scheduler.step() + + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {avg_loss:.4f}, " + + f"Train Acc: {avg_acc:.4f}, " + + f"time: {train_time:.2f}") + + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + local_logger.info(f'----- Validation after Epoch: {epoch}') + if local_rank == 0: + master_logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if local_rank == 0: + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + master_logger.info(f"----- Save model: {model_path}.pdparams") + master_logger.info(f"----- Save optim: {model_path}.pdopt") + + +def main(): + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + + # get dataset and start DDP + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + else: + dataset_train = None + dataset_val = get_dataset(config, mode='val') + config.NGPUS = len(paddle.static.cuda_places()) if config.NGPUS == -1 else config.NGPUS + dist.spawn(main_worker, args=(config, dataset_train, dataset_val, ), nprocs=config.NGPUS) + + +if __name__ == "__main__": + main() diff --git a/image_classification/BEiT/main_single_gpu.py b/image_classification/BEiT/main_single_gpu.py new file mode 100644 index 00000000..7fc075f3 --- /dev/null +++ b/image_classification/BEiT/main_single_gpu.py @@ -0,0 +1,424 
@@ + +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""BEiT training/validation using single GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from datasets import get_dataloader +from datasets import get_dataset +from utils import AverageMeter +from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn +from config import get_config +from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from losses import DistillationLoss +from beit import build_beit as build_model + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('Swin') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + mixup_fn=None, + amp=False, + logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision 
training, default: False + logger: logger for logging, default: None + Returns: + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + train_time: float, training time + """ + model.train() + train_loss_meter = AverageMeter() + train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(image, output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() + + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = image.shape[0] + train_loss_meter.update(loss.numpy()[0], batch_size) + train_acc_meter.update(acc.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return train_loss_meter.avg, train_acc_meter.avg, train_time + + +def validate(dataloader, model, criterion, total_batch, debug_steps=100, logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + logger: logger for logging, default: None + Returns: + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + val_time: float, valitaion time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = image.shape[0] + val_loss_meter.update(loss.numpy()[0], batch_size) + val_acc1_meter.update(acc1.numpy()[0], batch_size) + val_acc5_meter.update(acc5.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: 
{val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + + val_time = time.time() - time_st + return val_loss_meter.avg, val_acc1_meter.avg, val_acc5_meter.avg, val_time + + +def main(): + # STEP 0: Preparation + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + last_epoch = config.TRAIN.LAST_EPOCH + seed = config.SEED + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + logger = get_logger(filename=os.path.join(config.SAVE, 'log.txt')) + logger.info(f'\n{config}') + + # STEP 1: Create model + model = build_model(config) + + # STEP 2: Create train and val dataloader + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + dataloader_train = get_dataloader(config, dataset_train, 'train', False) + dataset_val = get_dataset(config, mode='val') + dataloader_val = get_dataloader(config, dataset_val, 'val', False) + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from official code) + linear_scaled_lr = (config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE) / 512.0 + linear_scaled_warmup_start_lr = (config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE) / 512.0 + linear_scaled_end_lr = (config.TRAIN.END_LR * config.DATA.BATCH_SIZE) / 512.0 + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH, + ) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in 
config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip, + apply_decay_param_fun=get_exclude_from_weight_decay_fn([ + 'absolute_pos_embed', 'relative_position_bias_table']), + ) + else: + logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 6: Load pretrained model or load resume model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + model.set_dict(model_state) + logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME+'.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME+'.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME+'.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') + optimizer.set_state_dict(opt_state) + logger.info( + f"----- Resume: Load model and optmizer from {config.MODEL.RESUME}") + + # STEP 7: Validation (eval mode) + if config.EVAL: + logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 8: Start training and validation (train mode) + logger.info(f"Start training from epoch {last_epoch+1}.") + for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + logger.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") + train_loss, train_acc, train_time = train(dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=len(dataloader_train), + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + mixup_fn=mixup_fn, + amp=config.AMP, + logger=logger) + scheduler.step() + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + logger.info(f"----- Save model: {model_path}.pdparams") + logger.info(f"----- Save optim: {model_path}.pdopt") + + +if __name__ == "__main__": + main() diff --git a/image_classification/BEiT/mixup.py b/image_classification/BEiT/mixup.py new file mode 100644 index 00000000..1d2db493 --- /dev/null +++ b/image_classification/BEiT/mixup.py @@ -0,0 +1,225 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""mixup and cutmix for batch data""" +import numpy as np +import paddle + + +def rand_bbox(image_shape, lam, count=None): + """ CutMix bbox by lam value + Generate 1 random bbox by value lam. lam is the cut size rate. + The cut_size is computed by sqrt(1-lam) * image_size. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + count: int, number of bbox to generate + """ + image_h, image_w = image_shape[-2:] + cut_rate = np.sqrt(1. 
- lam) + cut_h = int(cut_rate * image_h) + cut_w = int(cut_rate * image_w) + + # get random bbox center + cy = np.random.randint(0, image_h, size=count) + cx = np.random.randint(0, image_w, size=count) + + # get bbox coords + bbox_x1 = np.clip(cx - cut_w // 2, 0, image_w) + bbox_y1 = np.clip(cy - cut_h // 2, 0, image_h) + bbox_x2 = np.clip(cx + cut_w // 2, 0, image_w) + bbox_y2 = np.clip(cy + cut_h // 2, 0, image_h) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # while in pytorch, it will return [] tensor + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def rand_bbox_minmax(image_shape, minmax, count=None): + """ CutMix bbox by min and max value + Generate 1 random bbox by min and max percentage values. + Minmax is a tuple/list of min and max percentage vlaues + applied to the image width and height. + + Args: + image_shape: tuple/list, image height and width + minmax: tuple/list, min and max percentage values of image size + count: int, number of bbox to generate + """ + assert len(minmax) == 2 + image_h, image_w = image_shape[-2:] + min_ratio = minmax[0] + max_ratio = minmax[1] + cut_h = np.random.randint(int(image_h * min_ratio), int(image_h * max_ratio), size=count) + cut_w = np.random.randint(int(image_w * min_ratio), int(image_w * max_ratio), size=count) + + bbox_x1 = np.random.randint(0, image_w - cut_w, size=count) + bbox_y1 = np.random.randint(0, image_h - cut_h, size=count) + bbox_x2 = bbox_x1 + cut_w + bbox_y2 = bbox_y1 + cut_h + + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def cutmix_generate_bbox_adjust_lam(image_shape, lam, minmax=None, correct_lam=True, count=None): + """Generate bbox and apply correction for lambda + If the mimmax is None, apply the standard cutmix by lam value, + If the minmax is set, apply the cutmix by min and max percentage values. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + minmax: tuple/list, min and max percentage values of image size + correct_lam: bool, if True, correct the lam value by the generated bbox + count: int, number of bbox to generate + """ + if minmax is not None: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox_minmax(image_shape, minmax, count) + else: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox(image_shape, lam, count) + + if correct_lam or minmax is not None: + image_h, image_w = image_shape[-2:] + bbox_area = (bbox_y2 - bbox_y1) * (bbox_x2 - bbox_x1) + lam = 1. - bbox_area / float(image_h * image_w) + return (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam + + +def one_hot(x, num_classes, on_value=1., off_value=0.): + """ Generate one-hot vector for label smoothing + Args: + x: tensor, contains label/class indices + num_classes: int, num of classes (len of the one-hot vector) + on_value: float, the vector value at label index, default=1. + off_value: float, the vector value at non-label indices, default=0. + Returns: + one_hot: tensor, tensor with on value at label index and off value + at non-label indices. + """ + x = x.reshape_([-1, 1]) + x_smoothed = paddle.full((x.shape[0], num_classes), fill_value=off_value) + for i in range(x.shape[0]): + x_smoothed[i, x[i]] = on_value + return x_smoothed + + +def mixup_one_hot(label, num_classes, lam=1., smoothing=0.): + """ mixup and label smoothing in batch + label smoothing is firstly applied, then + mixup is applied by mixing the bacth and its flip, + with a mixup rate. 
+ + Args: + label: tensor, label tensor with shape [N], contains the class indices + num_classes: int, num of all classes + lam: float, mixup rate, default=1.0 + smoothing: float, label smoothing rate + """ + off_value = smoothing / num_classes + on_value = 1. - smoothing + off_value + y1 = one_hot(label, num_classes, on_value, off_value) + y2 = one_hot(label.flip(axis=[0]), num_classes, on_value, off_value) + return y2 * (1 - lam) + y1 * lam + + +class Mixup: + """Mixup class + Args: + mixup_alpha: float, mixup alpha for beta distribution, default=1.0, + cutmix_alpha: float, cutmix alpha for beta distribution, default=0.0, + cutmix_minmax: list/tuple, min and max value for cutmix ratio, default=None, + prob: float, if random prob < prob, do not use mixup, default=1.0, + switch_prob: float, prob of switching mixup and cutmix, default=0.5, + mode: string, mixup up, now only 'batch' is supported, default='batch', + correct_lam: bool, if True, apply correction of lam, default=True, + label_smoothing: float, label smoothing rate, default=0.1, + num_classes: int, num of classes, default=1000 + """ + def __init__(self, + mixup_alpha=1.0, + cutmix_alpha=0.0, + cutmix_minmax=None, + prob=1.0, + switch_prob=0.5, + mode='batch', + correct_lam=True, + label_smoothing=0.1, + num_classes=1000): + self.mixup_alpha = mixup_alpha + self.cutmix_alpha = cutmix_alpha + self.cutmix_minmax = cutmix_minmax + if cutmix_minmax is not None: + assert len(cutmix_minmax) == 2 + self.cutmix_alpha = 1.0 + self.mix_prob = prob + self.switch_prob = switch_prob + self.label_smoothing = label_smoothing + self.num_classes = num_classes + self.mode = mode + self.correct_lam = correct_lam + assert mode == 'batch', 'Now only batch mode is supported!' + + def __call__(self, x, target): + assert x.shape[0] % 2 == 0, "Batch size should be even" + lam = self._mix_batch(x) + target = mixup_one_hot(target, self.num_classes, lam, self.label_smoothing) + return x, target + + def get_params(self): + """Decide to use cutmix or regular mixup by sampling and + sample lambda for mixup + """ + lam = 1. + use_cutmix = False + use_mixup = np.random.rand() < self.mix_prob + if use_mixup: + if self.mixup_alpha > 0. and self.cutmix_alpha > 0.: + use_cutmix = np.random.rand() < self.switch_prob + alpha = self.cutmix_alpha if use_cutmix else self.mixup_alpha + lam_mix = np.random.beta(alpha, alpha) + elif self.mixup_alpha == 0. and self.cutmix_alpha > 0.: + use_cutmix=True + lam_mix = np.random.beta(self.cutmix_alpha, self.cutmix_alpha) + elif self.mixup_alpha > 0. 
and self.cutmix_alpha == 0.: + lam_mix = np.random.beta(self.mixup_alpha, self.mixup_alpha) + else: + raise ValueError('mixup_alpha and cutmix_alpha cannot be all 0') + lam = float(lam_mix) + return lam, use_cutmix + + def _mix_batch(self, x): + """mixup/cutmix by adding batch data and its flipped version""" + lam, use_cutmix = self.get_params() + if lam == 1.: + return lam + if use_cutmix: + (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam = cutmix_generate_bbox_adjust_lam( + x.shape, + lam, + minmax=self.cutmix_minmax, + correct_lam=self.correct_lam) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # but in pytorch, it will return [] tensor without errors + if int(bbox_x1) != int(bbox_x2) and int(bbox_y1) != int(bbox_y2): + x[:, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] = x.flip(axis=[0])[ + :, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] + else: + x_flipped = x.flip(axis=[0]) + x_flipped = x_flipped * (1 - lam) + x.set_value(x * (lam) + x_flipped) + return lam diff --git a/image_classification/BEiT/random_erasing.py b/image_classification/BEiT/random_erasing.py new file mode 100644 index 00000000..faecb310 --- /dev/null +++ b/image_classification/BEiT/random_erasing.py @@ -0,0 +1,119 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Random Erasing for image tensor""" + +import random +import math +import paddle + + +def _get_pixels(per_pixel, rand_color, patch_size, dtype="float32"): + if per_pixel: + return paddle.normal(shape=patch_size).astype(dtype) + if rand_color: + return paddle.normal(shape=(patch_size[0], 1, 1)).astype(dtype) + # patch size + return paddle.zeros((patch_size[0], 1, 1)).astype(dtype) + + +class RandomErasing(object): + """ + Args: + prob: probability of performing random erasing + min_area: Minimum percentage of erased area wrt input image area + max_area: Maximum percentage of erased area wrt input image area + min_aspect: Minimum aspect ratio of earsed area + max_aspect: Maximum aspect ratio of earsed area + mode: pixel color mode, in ['const', 'rand', 'pixel'] + 'const' - erase block is constant valued 0 for all channels + 'rand' - erase block is valued random color (same per-channel) + 'pixel' - erase block is vauled random color per pixel + min_count: Minimum # of ereasing blocks per image. + max_count: Maximum # of ereasing blocks per image. 
Area per box is scaled by count + per-image count is randomly chosen between min_count to max_count + """ + def __init__(self, prob=0.5, min_area=0.02, max_area=1/3, min_aspect=0.3, max_aspect=None, + mode='const', min_count=1, max_count=None, num_splits=0): + self.prob = prob + self.min_area = min_area + self.max_area = max_area + max_aspect = max_aspect or 1 / min_aspect + self.log_aspect_ratio = (math.log(min_aspect), math.log(max_aspect)) + self.min_count = min_count + self.max_count = max_count or min_count + self.num_splits = num_splits + mode = mode.lower() + self.rand_color = False + self.per_pixel = False + if mode == "rand": + self.rand_color = True + elif mode == "pixel": + self.per_pixel = True + else: + assert not mode or mode == "const" + + def _erase(self, img, chan, img_h, img_w, dtype): + if random.random() > self.prob: + return + area = img_h * img_w + count = self.min_count if self.min_count == self.max_count else \ + random.randint(self.min_count, self.max_count) + for _ in range(count): + for attempt in range(10): + target_area = random.uniform(self.min_area, self.max_area) * area / count + aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio)) + h = int(round(math.sqrt(target_area * aspect_ratio))) + w = int(round(math.sqrt(target_area / aspect_ratio))) + if w < img_w and h < img_h: + top = random.randint(0, img_h - h) + left = random.randint(0, img_w - w) + img[:, top:top+h, left:left+w] = _get_pixels( + self.per_pixel, self.rand_color, (chan, h, w), + dtype=dtype) + break + + def __call__(self, inputs): + if len(inputs.shape) == 3: + self._erase(inputs, *inputs.shape, inputs.dtype) + else: + batch_size, chan, img_h, img_w = inputs.shape + batch_start = batch_size // self.num_splits if self.num_splits > 1 else 0 + for i in range(batch_start, batch_size): + self._erase(inputs[i], chan, img_h, img_w, inputs.dtype) + return inputs + + + +#def main(): +# re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='rand') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='const') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='pixel') +# import PIL.Image as Image +# import numpy as np +# paddle.set_device('cpu') +# img = paddle.to_tensor(np.asarray(Image.open('./lenna.png'))).astype('float32') +# img = img / 255.0 +# img = paddle.transpose(img, [2, 0, 1]) +# new_img = re(img) +# new_img = new_img * 255.0 +# new_img = paddle.transpose(new_img, [1, 2, 0]) +# new_img = new_img.cpu().numpy() +# new_img = Image.fromarray(new_img.astype('uint8')) +# new_img.save('./res.png') +# +# +# +#if __name__ == "__main__": +# main() diff --git a/image_classification/BEiT/run_eval.sh b/image_classification/BEiT/run_eval.sh new file mode 100644 index 00000000..7e83c1c9 --- /dev/null +++ b/image_classification/BEiT/run_eval.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ +-cfg='./configs/beit_base_patch16_384.yaml' \ +-dataset='imagenet2012' \ +-batch_size=32 \ +-data_path='/dataset/imagenet' \ +-eval \ +-pretrained='./beit_base_patch16_384_ft22kto1k' diff --git a/image_classification/BEiT/run_eval_multi.sh b/image_classification/BEiT/run_eval_multi.sh new file mode 100644 index 00000000..8041b5cd --- /dev/null +++ b/image_classification/BEiT/run_eval_multi.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ +-cfg='./configs/beit_base_patch16_224.yaml' \ +-dataset='imagenet2012' \ +-batch_size=32 \ +-data_path='/dataset/imagenet' \ +-eval \ +-pretrained='./beit_base_patch16_224_ft22kto1k' diff 
--git a/image_classification/BEiT/transforms.py b/image_classification/BEiT/transforms.py new file mode 100644 index 00000000..5a046912 --- /dev/null +++ b/image_classification/BEiT/transforms.py @@ -0,0 +1,14 @@ +import random +import paddle +import paddle.nn +import paddle.vision.transforms as T + + +class RandomHorizontalFlip(): + def __init__(self, p=0.5): + self.p = p + + def __call__(self, image): + if random.random() < self.p: + return T.hflip(image) + return image diff --git a/image_classification/BEiT/utils.py b/image_classification/BEiT/utils.py new file mode 100644 index 00000000..f5bdb636 --- /dev/null +++ b/image_classification/BEiT/utils.py @@ -0,0 +1,120 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""utils for ViT + +Contains AverageMeter for monitoring, get_exclude_from_decay_fn for training +and WarmupCosineScheduler for training + +""" + +import math +from paddle.optimizer.lr import LRScheduler + + +class AverageMeter(): + """ Meter for monitoring losses""" + def __init__(self): + self.avg = 0 + self.sum = 0 + self.cnt = 0 + self.reset() + + def reset(self): + """reset all values to zeros""" + self.avg = 0 + self.sum = 0 + self.cnt = 0 + + def update(self, val, n=1): + """update avg by val and n, where val is the avg of n values""" + self.sum += val * n + self.cnt += n + self.avg = self.sum / self.cnt + + + +def get_exclude_from_weight_decay_fn(exclude_list=[]): + """ Set params with no weight decay during the training + + For certain params, e.g., positional encoding in ViT, weight decay + may not needed during the learning, this method is used to find + these params. + + Args: + exclude_list: a list of params names which need to exclude + from weight decay. + Returns: + exclude_from_weight_decay_fn: a function returns True if param + will be excluded from weight decay + """ + if len(exclude_list) == 0: + exclude_from_weight_decay_fn = None + else: + def exclude_fn(param): + for name in exclude_list: + if param.endswith(name): + return False + return True + exclude_from_weight_decay_fn = exclude_fn + return exclude_from_weight_decay_fn + + +class WarmupCosineScheduler(LRScheduler): + """Warmup Cosine Scheduler + + First apply linear warmup, then apply cosine decay schedule. + Linearly increase learning rate from "warmup_start_lr" to "start_lr" over "warmup_epochs" + Cosinely decrease learning rate from "start_lr" to "end_lr" over remaining + "total_epochs - warmup_epochs" + + Attributes: + learning_rate: the starting learning rate (without warmup), not used here! 
+ warmup_start_lr: warmup starting learning rate + start_lr: the starting learning rate (without warmup) + end_lr: the ending learning rate after whole loop + warmup_epochs: # of epochs for warmup + total_epochs: # of total epochs (include warmup) + """ + def __init__(self, + learning_rate, + warmup_start_lr, + start_lr, + end_lr, + warmup_epochs, + total_epochs, + cycles=0.5, + last_epoch=-1, + verbose=False): + """init WarmupCosineScheduler """ + self.warmup_epochs = warmup_epochs + self.total_epochs = total_epochs + self.warmup_start_lr = warmup_start_lr + self.start_lr = start_lr + self.end_lr = end_lr + self.cycles = cycles + super(WarmupCosineScheduler, self).__init__(learning_rate, last_epoch, verbose) + + def get_lr(self): + """ return lr value """ + if self.last_epoch < self.warmup_epochs: + val = (self.start_lr - self.warmup_start_lr) * float( + self.last_epoch)/float(self.warmup_epochs) + self.warmup_start_lr + return val + + progress = float(self.last_epoch - self.warmup_epochs) / float( + max(1, self.total_epochs - self.warmup_epochs)) + val = max(0.0, 0.5 * (1. + math.cos(math.pi * float(self.cycles) * 2.0 * progress))) + val = max(0.0, val * (self.start_lr - self.end_lr) + self.end_lr) + return val diff --git a/image_classification/BoTNet/README.md b/image_classification/BoTNet/README.md new file mode 100644 index 00000000..28c46f37 --- /dev/null +++ b/image_classification/BoTNet/README.md @@ -0,0 +1,165 @@ +# Bottleneck Transformers for Visual Recognition, [arxiv](https://arxiv.org/abs/2101.11605) + +PaddlePaddle training/validation code and pretrained models for **BoTNet**. + +The official pytorch implementation is N/A. The 3rd party timm pytorch implementation is [here](rwightman/pytorch-image-models) + +This implementation is developed by [PaddleViT](https://github.com/BR-IDL/PaddleViT.git). + + +

+*(Figure: BoTNet architecture)*
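+A quick way to inspect the attention stage illustrated above is to run `BoTStack` (defined in `botnet.py` of this PR) on the 1024-channel 14x14 feature map that the truncated ResNet-50 backbone produces for a 224x224 input; the snippet below is only a sketch for that shape check:
+```python
+import paddle
+from botnet import BoTStack
+
+# same arguments as used inside botnet50()
+stage = BoTStack(dim=1024, fmap_size=(14, 14), dim_out=2048, stride=1)
+x = paddle.randn([1, 1024, 14, 14])
+print(stage(x).shape)  # [1, 2048, 14, 14]
+```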

+ +### Update +* Update (2021-12-22): Initial code and ported weights are released. + +## Models Zoo +| Model | Acc@1 | Acc@5 | #Params | FLOPs | Image Size | Crop_pct | Interpolation | Link | +|----------------|----------|-------|---------|--------|------------|----------|---------------|--------------| +| botnet50 | 77.38 | 93.56 | 20.9M | 5.3G | 224 | 0.875 | bicubic |[google](https://drive.google.com/file/d/1S4nxgRkElT3K4lMx2JclPevmP3YUHNLw/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1CW40ShBJQYeFgdBIZZLSjg)(wh13) | + + +> *The results are evaluated on ImageNet2012 validation set. + + +## Notebooks +We provide a few notebooks in aistudio to help you get started: + +**\*(coming soon)\*** + + +## Requirements +- Python>=3.6 +- yaml>=0.2.5 +- [PaddlePaddle](https://www.paddlepaddle.org.cn/documentation/docs/en/install/index_en.html)>=2.1.0 +- [yacs](https://github.com/rbgirshick/yacs)>=0.1.8 + +## Data +ImageNet2012 dataset is used in the following folder structure: +``` +│imagenet/ +├──train/ +│ ├── n01440764 +│ │ ├── n01440764_10026.JPEG +│ │ ├── n01440764_10027.JPEG +│ │ ├── ...... +│ ├── ...... +├──val/ +│ ├── n01440764 +│ │ ├── ILSVRC2012_val_00000293.JPEG +│ │ ├── ILSVRC2012_val_00002138.JPEG +│ │ ├── ...... +│ ├── ...... +``` + +## Usage +To use the model with pretrained weights, download the `.pdparam` weight file and change related file paths in the following python scripts. The model config files are located in `./configs/`. + +For example, assume the downloaded weight file is stored in `./botnet50.pdparams`, to use the `botnet50` model in python: +```python +from config import get_config +from botnet import build_botnet50 +# config files in ./configs/ +config = get_config('./configs/botnet50.yaml') +# build model +model = build_model(config) +# load pretrained weights, .pdparams is NOT needed +model_state_dict = paddle.load('./botnet50') +model.set_dict(model_state_dict) +``` + +## Evaluation +To evaluate botnet50 model performance on ImageNet2012 with a single GPU, run the following script using command line: +```shell +sh run_eval.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg='./configs/botnet50.yaml' \ + -dataset='imagenet2012' \ + -batch_size=16 \ + -data_path='/dataset/imagenet' \ + -eval \ + -pretrained='./botnet50' +``` + +
+ + +Run evaluation using multi-GPUs: + + + +```shell +sh run_eval_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg='./configs/botnet50.yaml' \ + -dataset='imagenet2012' \ + -batch_size=16 \ + -data_path='/dataset/imagenet' \ + -eval \ + -pretrained='./botnet50' +``` + +
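+For a quick sanity check outside the evaluation scripts, here is a minimal single-image inference sketch (assumptions: the weight file sits at `./botnet50.pdparams`, and a random tensor stands in for a properly resized and normalized 224x224 image):
+```python
+import paddle
+import paddle.nn.functional as F
+from config import get_config
+from botnet import build_botnet50
+
+config = get_config('./configs/botnet50.yaml')
+model = build_botnet50(config)
+model.set_dict(paddle.load('./botnet50.pdparams'))
+model.eval()
+
+# dummy input standing in for a preprocessed RGB image, NCHW layout
+image = paddle.randn([1, 3, 224, 224])
+with paddle.no_grad():
+    logits = model(image)
+probs = F.softmax(logits, axis=-1)
+print(probs.argmax(axis=-1))  # predicted ImageNet class index
+```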
+ + +## Training +To train the botnet50 model on ImageNet2012 with single GPU, run the following script using command line: +```shell +sh run_train.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0 \ +python main_singel_gpu.py \ + -cfg='./configs/botnet50.yaml' \ + -dataset='imagenet2012' \ + -batch_size=32 \ + -data_path='/dataset/imagenet' \ +``` + +
+ + +Run training using multi-GPUs: + + + +```shell +sh run_train_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg='./configs/botnet50.yaml' \ + -dataset='imagenet2012' \ + -batch_size=16 \ + -data_path='/dataset/imagenet' \ +``` + +
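+Note that the training entry points rescale the configured learning rate by the global batch size before building the LR scheduler; assuming BoTNet's scripts follow the same linear scaling rule as the BEiT `main_multi_gpu.py` in this PR, the effective base LR for the command above works out as in this small illustrative sketch:
+```python
+# illustrative only, not part of the training scripts
+base_lr = 5e-4      # _C.TRAIN.BASE_LR in config.py
+batch_size = 16     # per-GPU batch size passed via -batch_size
+world_size = 4      # CUDA_VISIBLE_DEVICES=0,1,2,3
+accum_iter = 1      # _C.TRAIN.ACCUM_ITER in config.py
+
+scaled_lr = base_lr * batch_size * world_size / 512.0 * accum_iter
+print(scaled_lr)    # 6.25e-05
+```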
+ + +## Visualization Attention Map +**(coming soon)** + +## Reference +``` +@inproceedings{srinivas2021bottleneck, + title={Bottleneck transformers for visual recognition}, + author={Srinivas, Aravind and Lin, Tsung-Yi and Parmar, Niki and Shlens, Jonathon and Abbeel, Pieter and Vaswani, Ashish}, + booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition}, + pages={16519--16529}, + year={2021} +} +``` diff --git a/image_classification/BoTNet/augment.py b/image_classification/BoTNet/augment.py new file mode 100644 index 00000000..7a7f081c --- /dev/null +++ b/image_classification/BoTNet/augment.py @@ -0,0 +1,285 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Augmentation""" +""" Rand Augmentation """ +# reference: RandAugment: Practical automated data augmentation with a reduced search space +# https://arxiv.org/abs/1909.13719 + +""" Auto Augmentation """ +# reference: AutoAugment: Learning Augmentation Policies from Data +# https://arxiv.org/abs/1805.09501 + +import random +import numpy as np +from PIL import Image, ImageEnhance, ImageOps + + +def auto_augment_policy_original(): + """25 types of augment policies in original paper""" + policy = [ + [('Posterize', 0.4, 8), ('Rotate', 0.6, 9)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + [('Posterize', 0.6, 7), ('Posterize', 0.6, 6)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Equalize', 0.4, 4), ('Rotate', 0.8, 8)], + [('Solarize', 0.6, 3), ('Equalize', 0.6, 7)], + [('Posterize', 0.8, 5), ('Equalize', 1.0, 2)], + [('Rotate', 0.2, 3), ('Solarize', 0.6, 8)], + [('Equalize', 0.6, 8), ('Posterize', 0.4, 6)], + [('Rotate', 0.8, 8), ('Color', 0.4, 0)], + [('Rotate', 0.4, 9), ('Equalize', 0.6, 2)], + [('Equalize', 0.0, 7), ('Equalize', 0.8, 8)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Rotate', 0.8, 8), ('Color', 1.0, 2)], + [('Color', 0.8, 8), ('Solarize', 0.8, 7)], + [('Sharpness', 0.4, 7), ('Invert', 0.6, 8)], + [('ShearX', 0.6, 5), ('Equalize', 1.0, 9)], + [('Color', 0.4, 0), ('Equalize', 0.6, 3)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + ] + policy = [[SubPolicy(*args) for args in subpolicy] for subpolicy in policy] + return policy + + +def rand_augment_policy_original(magnitude_idx=9): + """ + 14 types of augment policies in original paper + Args: + magnitude_idx: M + """ + policy = [ + ('Posterize', 1, magnitude_idx), ('Rotate', 1, magnitude_idx), + ('Solarize', 1, magnitude_idx), ('AutoContrast', 1, magnitude_idx), + ('Equalize', 1, magnitude_idx), ('Contrast', 1, magnitude_idx), + ('Color', 1, magnitude_idx), ('Invert', 1, magnitude_idx), + ('Sharpness', 1, magnitude_idx), ('Brightness', 1, magnitude_idx), + ('ShearX', 1, 
magnitude_idx), ('ShearY', 1, magnitude_idx), + ('TranslateX', 1, magnitude_idx), ('TranslateY', 1, magnitude_idx), + ] + policy = [SubPolicy(*args) for args in policy] + return policy + + +class AutoAugment(): + """Auto Augment + Randomly choose a tuple of augment ops from a list of policy + Then apply the tuple of augment ops to input image + + Examples: + policy = auto_augment_policy_original() + augment = AutoAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy): + self.policy = policy + + def __call__(self, image, policy_idx=None): + if policy_idx is None: + policy_idx = random.randint(0, len(self.policy) - 1) + + sub_policy = self.policy[policy_idx] + for op in sub_policy: + image = op(image) + return image + + +class RandAugment(): + """Rand Augment + Randomly choose N augment ops from a list of K policies + Then apply the N ops to input image + + Examples: + policy = rand_augment_policy_original(magnitude_idx) + augment = RandAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy, num_layers=2): + """ + Args: + policy: list of SubPolicy + num_layers: int + """ + self.policy = policy + self.num_layers = num_layers + + def __call__(self, image): + selected_idx = np.random.choice(len(self.policy), self.num_layers) + + for policy_idx in selected_idx: + sub_policy = self.policy[policy_idx] + image = sub_policy(image) + return image + + +class SubPolicy: + """Subpolicy + Read augment name and magnitude, apply augment with probability + Args: + op_name: str, augment operation name + prob: float, if prob > random prob, apply augment + magnitude_idx: int, index of magnitude in preset magnitude ranges + """ + + def __init__(self, op_name, prob, magnitude_idx): + # ranges of operations' magnitude + ranges = { + 'ShearX': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'ShearY': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'TranslateX': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'TranslateY': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'Rotate': np.linspace(0, 30, 10), # [-30, 30] (by random negative) + 'Color': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Posterize': np.round(np.linspace(8, 4, 10), 0).astype(np.int), # [0, 4] + 'Solarize': np.linspace(256, 0, 10), # [0, 256] + 'Contrast': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Sharpness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Brightness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'AutoContrast': [0] * 10, # no range + 'Equalize': [0] * 10, # no range + 'Invert': [0] * 10, # no range + } + + # augmentation operations + # Lambda is not pickleable for DDP + # image_ops = { + # 'ShearX': lambda image, magnitude: shear_x(image, magnitude), + # 'ShearY': lambda image, magnitude: shear_y(image, magnitude), + # 'TranslateX': lambda image, magnitude: translate_x(image, magnitude), + # 'TranslateY': lambda image, magnitude: translate_y(image, magnitude), + # 'Rotate': lambda image, magnitude: rotate(image, magnitude), + # 'AutoContrast': lambda image, magnitude: auto_contrast(image, magnitude), + # 'Invert': lambda image, magnitude: invert(image, magnitude), + # 'Equalize': lambda image, magnitude: equalize(image, magnitude), + # 'Solarize': lambda image, magnitude: solarize(image, magnitude), + # 'Posterize': lambda image, magnitude: posterize(image, magnitude), + # 'Contrast': lambda image, magnitude: 
contrast(image, magnitude), + # 'Color': lambda image, magnitude: color(image, magnitude), + # 'Brightness': lambda image, magnitude: brightness(image, magnitude), + # 'Sharpness': lambda image, magnitude: sharpness(image, magnitude), + # } + image_ops = { + 'ShearX': shear_x, + 'ShearY': shear_y, + 'TranslateX': translate_x_relative, + 'TranslateY': translate_y_relative, + 'Rotate': rotate, + 'AutoContrast': auto_contrast, + 'Invert': invert, + 'Equalize': equalize, + 'Solarize': solarize, + 'Posterize': posterize, + 'Contrast': contrast, + 'Color': color, + 'Brightness': brightness, + 'Sharpness': sharpness, + } + + self.prob = prob + self.magnitude = ranges[op_name][magnitude_idx] + self.op = image_ops[op_name] + + def __call__(self, image): + if self.prob > random.random(): + image = self.op(image, self.magnitude) + return image + + +# PIL Image transforms +# https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.transform +def shear_x(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, factor, 0, 0, 1, 0), fillcolor=fillcolor) + + +def shear_y(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, factor, 1, 0), fillcolor=fillcolor) + + +def translate_x_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, pixels, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, pixels), fillcolor=fillcolor) + + +def translate_x_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, magnitude, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, magnitude), fillcolor=fillcolor) + + +def rotate(image, magnitude): + rot = image.convert("RGBA").rotate(magnitude) + return Image.composite(rot, + Image.new('RGBA', rot.size, (128,) * 4), + rot).convert(image.mode) + + +def auto_contrast(image, magnitude=None): + return ImageOps.autocontrast(image) + + +def invert(image, magnitude=None): + return ImageOps.invert(image) + + +def equalize(image, magnitude=None): + return ImageOps.equalize(image) + + +def solarize(image, magnitude): + return ImageOps.solarize(image, magnitude) + + +def posterize(image, magnitude): + return ImageOps.posterize(image, magnitude) + + +def contrast(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Contrast(image).enhance(1 + magnitude) + + +def color(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Color(image).enhance(1 + magnitude) + + +def brightness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Brightness(image).enhance(1 + magnitude) + + +def sharpness(image, magnitude): + 
magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Sharpness(image).enhance(1 + magnitude) + diff --git a/image_classification/BoTNet/botnet.py b/image_classification/BoTNet/botnet.py new file mode 100644 index 00000000..6547ad03 --- /dev/null +++ b/image_classification/BoTNet/botnet.py @@ -0,0 +1,316 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +''' +Implement BoTNet +''' + +import paddle +import paddle.nn as nn +from resnet import resnet50 + + +def expand_dim(t, dim, k): + """ + + Expand dims for t at dim to k + + """ + t = t.unsqueeze(axis=dim) + expand_shape = [-1] * len(t.shape) + expand_shape[dim] = k + return paddle.expand(t, expand_shape) + + +def rel_to_abs(x): + """ + + x: [B, Nh * H, L, 2L - 1] + Convert relative position between the key and query to their absolute position respectively. + + """ + B, Nh, L, _ = x.shape + # pad to shift from relative to absolute indexing + col_pad = paddle.zeros([B, Nh, L, 1]) + x = paddle.concat([x, col_pad], axis=3) + flat_x = x.reshape([B, Nh, L * 2 * L]) + flat_pad = paddle.zeros([B, Nh, L - 1]) + flat_x = paddle.concat([flat_x, flat_pad], axis=2) + # Reshape and slice out the padded elements + final_x = flat_x.reshape([B, Nh, L + 1, 2 * L - 1]) + return final_x[:, :, :L, L - 1 :] + + +def relative_logits_1d(q, rel_k): + """ + + q: [B, Nh, H, W, d] + rel_k: [2W - 1, d] + Computes relative logits along one dimension. 
+ + """ + B, Nh, H, W, _ = q.shape + rel_logits = paddle.matmul(q, rel_k.T) + # Collapse height and heads + rel_logits = rel_logits.reshape([-1, Nh * H, W, 2 * W - 1]) + rel_logits = rel_to_abs(rel_logits) + rel_logits = rel_logits.reshape([-1, Nh, H, W, W]) + rel_logits = expand_dim(rel_logits, dim=3, k=H) + return rel_logits + + +class RelPosEmb(nn.Layer): + '''Relative position encoding''' + def __init__(self, + height, + width, + dim_head): + super().__init__() + + scale = dim_head ** -0.5 + self.height = height + self.width = width + h_shape = [height * 2 - 1, dim_head] + w_shape = [width * 2 - 1, dim_head] + self.rel_height = paddle.create_parameter( + shape=h_shape, dtype='float32', + default_initializer=paddle.nn.initializer.Assign(paddle.randn(h_shape)*scale) + ) + self.rel_width = paddle.create_parameter( + shape=w_shape, dtype='float32', + default_initializer=paddle.nn.initializer.Assign(paddle.randn(w_shape)*scale) + ) + + def forward(self, q): + + H = self.height + W = self.width + B, N, _, D = q.shape + q = q.reshape([B, N, H, W, D]) # "B N (H W) D -> B N H W D" + rel_logits_w = relative_logits_1d(q, self.rel_width) + rel_logits_w = rel_logits_w.transpose(perm=[0, 1, 2, 4, 3, 5]) + B, N, X, I, Y, J = rel_logits_w.shape + rel_logits_w = rel_logits_w.reshape([B, N, X*Y, I*J]) # "B N X I Y J-> B N (X Y) (I J)" + + q = q.transpose(perm=[0, 1, 3, 2, 4]) # "B N H W D -> B N W H D" + rel_logits_h = relative_logits_1d(q, self.rel_height) + rel_logits_h = rel_logits_h.transpose(perm=[0, 1, 4, 2, 5, 3]) + B, N, X, I, Y, J = rel_logits_h.shape + rel_logits_h = rel_logits_h.reshape([B, N, Y*X, J*I]) # "B N X I Y J -> B N (Y X) (J I)" + + return rel_logits_w + rel_logits_h + + +class BoTBlock(nn.Layer): + '''BoTBlock''' + def __init__(self, + dim, + fmap_size, + dim_out, + stride=1, + heads=4, + proj_factor=4, + dim_qk=128, + dim_v=128, + activation=nn.ReLU()): + """ + dim: channels in feature map + dim_out: output channels for feature map + """ + super().__init__() + + if dim != dim_out or stride != 1: + self.shortcut = nn.Sequential( + nn.Conv2D(dim, dim_out, kernel_size=1, stride=stride, bias_attr=False), + nn.BatchNorm2D(dim_out), + activation, + ) + else: + self.shortcut = nn.Identity() + + bottleneck_dimension = dim_out // proj_factor + attn_dim_out = heads * dim_v + + self.net = nn.Sequential( + nn.Conv2D(dim, bottleneck_dimension, kernel_size=1, stride=1, bias_attr=False), + nn.BatchNorm2D(bottleneck_dimension), + activation, + MHSA( + dim=bottleneck_dimension, + fmap_size=fmap_size, + heads=heads, + dim_qk=dim_qk, + dim_v=dim_v, + ), + nn.AvgPool2D(2) if stride == 2 else nn.Identity(), + nn.BatchNorm2D(attn_dim_out), + activation, + nn.Conv2D(attn_dim_out, dim_out, kernel_size=1, stride=1, bias_attr=False), + nn.BatchNorm2D(dim_out), + ) + + self.activation = activation + + def forward(self, featuremap): + shortcut = self.shortcut(featuremap) + featuremap = self.net(featuremap) + featuremap += shortcut + return self.activation(featuremap) + + +class MHSA(nn.Layer): + '''Multi-Head Self-Attention''' + def __init__(self, + dim, + fmap_size, + heads=4, + dim_qk=128, + dim_v=128): + """ + dim: number of channels of feature map + fmap_size: [H, W] + dim_qk: vector dimension for q, k + dim_v: vector dimension for v (not necessarily the same with q, k) + """ + super().__init__() + + self.scale = dim_qk ** -0.5 + self.heads = heads + out_channels_qk = heads * dim_qk + out_channels_v = heads * dim_v + + self.to_qk = nn.Conv2D(dim, out_channels_qk * 2, 1, bias_attr=False) + self.to_v = 
nn.Conv2D(dim, out_channels_v, 1, bias_attr=False) + self.softmax = nn.Softmax(axis=-1) + + height, width = fmap_size + self.pos_emb = RelPosEmb(height, width, dim_qk) + + def transpose_multihead(self, x): + B, N, H, W = x.shape + x = x.reshape([B, self.heads, -1, H, W]) # "B (h D) H W -> B h D H W" + x = x.transpose(perm=[0, 1, 3, 4, 2]) # "B h D H W -> B h H W D" + x = x.reshape([B, self.heads, H*W, -1]) # "B h H W D -> B h (H W) D" + return x + + def forward(self, featuremap): + """ + featuremap: [B, d_in, H, W] + Output: [B, H, W, head * d_v] + """ + B, C, H, W = featuremap.shape + q, k = self.to_qk(featuremap).chunk(2, axis=1) + v = self.to_v(featuremap) + q, k, v = map(self.transpose_multihead, [q, k, v]) + q *= self.scale + + logits = paddle.matmul(q, k.transpose(perm=[0, 1, 3, 2])) + logits += self.pos_emb(q) + + weights = self.softmax(logits) + attn_out = paddle.matmul(weights, v) + a_B, a_N, a_, a_D = attn_out.shape + attn_out = attn_out.reshape([a_B, a_N, H, -1, a_D]) # "B N (H W) D -> B N H W D" + attn_out = attn_out.transpose(perm=[0, 1, 4, 2, 3]) # "B N H W D -> B N D H W" + attn_out = attn_out.reshape([a_B, a_N*a_D, H, -1]) # "B N D H W -> B (N D) H W" + return attn_out + + +class BoTStack(nn.Layer): + '''BoTStack''' + def __init__(self, + dim, + fmap_size, + dim_out=2048, + heads=4, + proj_factor=4, + num_layers=3, + stride=2, + dim_qk=128, + dim_v=128, + activation=nn.ReLU()): + """ + dim: channels in feature map + fmap_size: [H, W] + """ + super().__init__() + + self.dim = dim + self.fmap_size = fmap_size + + layers = [] + + for i in range(num_layers): + is_first = i == 0 + dim = dim if is_first else dim_out + + fmap_divisor = 2 if stride == 2 and not is_first else 1 + layer_fmap_size = tuple(map(lambda t: t // fmap_divisor, fmap_size)) + + layers.append( + BoTBlock( + dim=dim, + fmap_size=layer_fmap_size, + dim_out=dim_out, + stride=stride if is_first else 1, + heads=heads, + proj_factor=proj_factor, + dim_qk=dim_qk, + dim_v=dim_v, + activation=activation, + ) + ) + + self.net = nn.Sequential(*layers) + + def forward(self, x): + _, c, h, w = x.shape + assert c == self.dim, f"assert {c} == self.dim {self.dim}" + assert h == self.fmap_size[0] and w == self.fmap_size[1] + return self.net(x) + + +def botnet50(pretrained=False, + image_size=224, + fmap_size=(14, 14), + num_classes=1000, + embed_dim=2048, + **kwargs): + """ + Bottleneck Transformers for Visual Recognition. + """ + resnet = resnet50(pretrained=False, **kwargs) + layer = BoTStack(dim=1024, dim_out=embed_dim, fmap_size=fmap_size, stride=1) + backbone = list(resnet.children()) + model = nn.Sequential( + *backbone[:-3], + layer, + nn.AdaptiveAvgPool2D([1, 1]), + nn.Flatten(1), + nn.Linear(embed_dim, num_classes), + ) + if pretrained: + state_dict = paddle.load('botnet50.pdparams') + model.set_state_dict(state_dict) + return model + + +def build_botnet50(config): + model = botnet50( + image_size=config.DATA.IMAGE_SIZE, + fmap_size=config.DATA.FMAP_SIZE, + num_classes=config.MODEL.NUM_CLASSES, + embed_dim=config.MODEL.TRANS.EMBED_DIM, + ) + return model diff --git a/image_classification/BoTNet/config.py b/image_classification/BoTNet/config.py new file mode 100644 index 00000000..0c604ea3 --- /dev/null +++ b/image_classification/BoTNet/config.py @@ -0,0 +1,165 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +from yacs.config import CfgNode as CN +import yaml + +_C = CN() +_C.BASE = [''] + +# data settings +_C.DATA = CN() +_C.DATA.BATCH_SIZE = 8 #1024 batch_size for single GPU +_C.DATA.BATCH_SIZE_EVAL = 8 #1024 batch_size for single GPU +_C.DATA.DATA_PATH = '/dataset/imagenet/' # path to dataset +_C.DATA.DATASET = 'imagenet2012' # dataset name +_C.DATA.IMAGE_SIZE = 224 # input image size +_C.DATA.FMAP_SIZE = (14, 14) +_C.DATA.CROP_PCT = 0.9 # input image scale ratio, scale is applied before centercrop in eval mode +_C.DATA.NUM_WORKERS = 4 # number of data loading threads +_C.DATA.IMAGENET_MEAN = [0.485, 0.456, 0.406] # [0.5, 0.5, 0.5] +_C.DATA.IMAGENET_STD = [0.229, 0.224, 0.225] # [0.5, 0.5, 0.5] + +# model settings +_C.MODEL = CN() +_C.MODEL.TYPE = 'BoTNet' +_C.MODEL.NAME = 'BoTNet' +_C.MODEL.RESUME = None +_C.MODEL.PRETRAINED = None +_C.MODEL.NUM_CLASSES = 1000 +_C.MODEL.WEIGHTS = None +_C.MODEL.DUMMY_INPUT = False + +# transformer settings +_C.MODEL.TRANS = CN() +_C.MODEL.TRANS.EMBED_DIM = 2048 +_C.MODEL.TRANS.NUM_HEADS = 4 + +# training settings +_C.TRAIN = CN() +_C.TRAIN.LAST_EPOCH = 0 +_C.TRAIN.NUM_EPOCHS = 300 +_C.TRAIN.WARMUP_EPOCHS = 5 +_C.TRAIN.WEIGHT_DECAY = 5e-5 +_C.TRAIN.BASE_LR = 5e-4 +_C.TRAIN.WARMUP_START_LR = 5e-7 +_C.TRAIN.END_LR = 5e-6 +_C.TRAIN.GRAD_CLIP = 5.0 +_C.TRAIN.ACCUM_ITER = 1 +_C.TRAIN.LINEAR_SCALED_LR = None + +_C.TRAIN.LR_SCHEDULER = CN() +_C.TRAIN.LR_SCHEDULER.NAME = 'warmupcosine' +_C.TRAIN.LR_SCHEDULER.MILESTONES = "30, 60, 90" # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_EPOCHS = 30 # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_RATE = 0.1 # only used in StepLRScheduler + +_C.TRAIN.OPTIMIZER = CN() +_C.TRAIN.OPTIMIZER.NAME = 'AdamW' +_C.TRAIN.OPTIMIZER.EPS = 1e-8 +_C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999) # for adamW +_C.TRAIN.OPTIMIZER.MOMENTUM = 0.9 + +# train augmentation +_C.TRAIN.MIXUP_ALPHA = 0.8 +_C.TRAIN.CUTMIX_ALPHA = 1.0 +_C.TRAIN.CUTMIX_MINMAX = None +_C.TRAIN.MIXUP_PROB = 1.0 +_C.TRAIN.MIXUP_SWITCH_PROB = 0.5 +_C.TRAIN.MIXUP_MODE = 'batch' + +_C.TRAIN.SMOOTHING = 0.1 +_C.TRAIN.COLOR_JITTER = 0.4 +_C.TRAIN.AUTO_AUGMENT = True #'rand-m9-mstd0.5-inc1' + +_C.TRAIN.RANDOM_ERASE_PROB = 0.25 +_C.TRAIN.RANDOM_ERASE_MODE = 'pixel' +_C.TRAIN.RANDOM_ERASE_COUNT = 1 +_C.TRAIN.RANDOM_ERASE_SPLIT = False + +# misc +_C.SAVE = "./output" +_C.TAG = "default" +_C.SAVE_FREQ = 1 # freq to save chpt +_C.REPORT_FREQ = 50 # freq to logging info +_C.VALIDATE_FREQ = 10 # freq to do validation +_C.SEED = 0 +_C.EVAL = False # run evaluation only +_C.AMP = False +_C.LOCAL_RANK = 0 +_C.NGPUS = -1 + + +def _update_config_from_file(config, cfg_file): + config.defrost() + with open(cfg_file, 'r') as infile: + yaml_cfg = yaml.load(infile, Loader=yaml.FullLoader) + for cfg in yaml_cfg.setdefault('BASE', ['']): + if cfg: + _update_config_from_file( + config, os.path.join(os.path.dirname(cfg_file), cfg) + ) + print('merging config from {}'.format(cfg_file)) + config.merge_from_file(cfg_file) + config.freeze() + +def update_config(config, args): + """Update config by ArgumentParser + Args: + args: ArgumentParser contains options + 
Return: + config: updated config + """ + if args.cfg: + _update_config_from_file(config, args.cfg) + config.defrost() + if args.dataset: + config.DATA.DATASET = args.dataset + if args.batch_size: + config.DATA.BATCH_SIZE = args.batch_size + if args.image_size: + config.DATA.IMAGE_SIZE = args.image_size + if args.num_classes: + config.MODEL.NUM_CLASSES = args.num_classes + if args.data_path: + config.DATA.DATA_PATH = args.data_path + if args.output is not None: + config.SAVE = args.output + if args.ngpus: + config.NGPUS = args.ngpus + if args.eval: + config.EVAL = True + config.DATA.BATCH_SIZE_EVAL = args.batch_size + if args.pretrained: + config.MODEL.PRETRAINED = args.pretrained + if args.resume: + config.MODEL.RESUME = args.resume + if args.last_epoch: + config.TRAIN.LAST_EPOCH = args.last_epoch + if args.amp: # only during training + if config.EVAL is True: + config.AMP = False + else: + config.AMP = True + + #config.freeze() + return config + +def get_config(cfg_file=None): + """Return a clone of config or load from yaml file""" + config = _C.clone() + if cfg_file: + _update_config_from_file(config, cfg_file) + return config diff --git a/image_classification/BoTNet/configs/botnet50.yaml b/image_classification/BoTNet/configs/botnet50.yaml new file mode 100644 index 00000000..a84660da --- /dev/null +++ b/image_classification/BoTNet/configs/botnet50.yaml @@ -0,0 +1,17 @@ +DATA: + IMAGE_SIZE: 224 + FMAP_SIZE: (14, 14) + CROP_PCT: 0.875 +MODEL: + TYPE: BoTNet + NAME: botnet50_224 + NUM_CLASSES: 1000 + PRETRAINED: None + TRANS: + EMBED_DIM: 2048 + NUM_HEADS: 12 +TRAIN: + BASE_LR: 0.2 + NUM_EPOCHS: 300 + WARMUP_EPOCHS: 5 + WEIGHT_DECAY: 5.0e-05 diff --git a/image_classification/BoTNet/datasets.py b/image_classification/BoTNet/datasets.py new file mode 100644 index 00000000..cc377c90 --- /dev/null +++ b/image_classification/BoTNet/datasets.py @@ -0,0 +1,221 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Dataset related classes and methods for ViT training and validation +Cifar10, Cifar100 and ImageNet2012 are supported +""" + +import os +import math +from PIL import Image +from paddle.io import Dataset +from paddle.io import DataLoader +from paddle.io import DistributedBatchSampler +from paddle.vision import transforms +from paddle.vision import datasets +from paddle.vision import image_load +from augment import auto_augment_policy_original +from augment import AutoAugment +from augment import rand_augment_policy_original +from augment import RandAugment +from random_erasing import RandomErasing + + +class ImageNet2012Dataset(Dataset): + """Build ImageNet2012 dataset + + This class gets train/val imagenet datasets, which loads transfomed data and labels. 
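+    Each line of train_list.txt / val_list.txt is expected to contain an image path
+    (relative to file_folder) and an integer label separated by whitespace, e.g. an
+    illustrative line: "train/n01440764/n01440764_10026.JPEG 0".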
+ + Attributes: + file_folder: path where imagenet images are stored + transform: preprocessing ops to apply on image + img_path_list: list of full path of images in whole dataset + label_list: list of labels of whole dataset + """ + + def __init__(self, file_folder, mode="train", transform=None): + """Init ImageNet2012 Dataset with dataset file path, mode(train/val), and transform""" + super(ImageNet2012Dataset, self).__init__() + assert mode in ["train", "val"] + self.file_folder = file_folder + self.transform = transform + self.img_path_list = [] + self.label_list = [] + + if mode == "train": + self.list_file = os.path.join(self.file_folder, "train_list.txt") + else: + self.list_file = os.path.join(self.file_folder, "val_list.txt") + + with open(self.list_file, 'r') as infile: + for line in infile: + img_path = line.strip().split()[0] + img_label = int(line.strip().split()[1]) + self.img_path_list.append(os.path.join(self.file_folder, img_path)) + self.label_list.append(img_label) + print(f'----- Imagenet2012 image {mode} list len = {len(self.label_list)}') + + def __len__(self): + return len(self.label_list) + + def __getitem__(self, index): + data = Image.open(self.img_path_list[index]).convert('RGB') + data = self.transform(data) + label = self.label_list[index] + + return data, label + + +def get_train_transforms(config): + """ Get training transforms + + For training, a RandomResizedCrop is applied, then normalization is applied with + [0.5, 0.5, 0.5] mean and std. The input pixel values must be rescaled to [0, 1.] + Outputs is converted to tensor + + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + aug_op_list = [] + # random crop and resize + aug_op_list.append( + transforms.RandomResizedCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE), + scale=(0.05, 1.0))) + # auto_augment / color jitter + if config.TRAIN.AUTO_AUGMENT: + policy = auto_augment_policy_original() + auto_augment = AutoAugment(policy) + aug_op_list.append(auto_augment) + elif config.TRAIN.RAND_AUGMENT: + policy = rand_augment_policy_original() + rand_augment = RandAugment(policy) + aug_op_list.append(rand_augment) + else: + jitter = (float(config.TRAIN.COLOR_JITTER),) * 3 + aug_op_list.append(transforms.ColorJitter(*jitter)) + # other ops + aug_op_list.append(transforms.ToTensor()) + aug_op_list.append(transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, + std=config.DATA.IMAGENET_STD)) + # random erasing + if config.TRAIN.RANDOM_ERASE_PROB > 0.: + random_erasing = RandomErasing(prob=config.TRAIN.RANDOM_ERASE_PROB, + mode=config.TRAIN.RANDOM_ERASE_MODE, + max_count=config.TRAIN.RANDOM_ERASE_COUNT, + num_splits=config.TRAIN.RANDOM_ERASE_SPLIT) + aug_op_list.append(random_erasing) + + transforms_train = transforms.Compose(aug_op_list) + return transforms_train + + +def get_val_transforms(config): + """ Get training transforms + + For validation, image is first Resize then CenterCrop to image_size. + Then normalization is applied with [0.5, 0.5, 0.5] mean and std. + The input pixel values must be rescaled to [0, 1.] 
+ Outputs is converted to tensor + + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + + scale_size = int(math.floor(config.DATA.IMAGE_SIZE / config.DATA.CROP_PCT)) + transforms_val = transforms.Compose([ + #transforms.Resize(scale_size, 'bilinear'), # single int for resize shorter side of image + transforms.Resize(scale_size, 'bicubic'), # single int for resize shorter side of image + transforms.CenterCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE)), + transforms.ToTensor(), + transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD), + ]) + return transforms_val + + +def get_dataset(config, mode='train'): + """ Get dataset from config and mode (train/val) + + Returns the related dataset object according to configs and mode(train/val) + + Args: + config: configs contains dataset related settings. see config.py for details + Returns: + dataset: dataset object + """ + + assert mode in ['train', 'val'] + if config.DATA.DATASET == "cifar10": + if mode == 'train': + dataset = datasets.Cifar10(mode=mode, transform=get_train_transforms(config)) + else: + mode = 'test' + dataset = datasets.Cifar10(mode=mode, transform=get_val_transforms(config)) + elif config.DATA.DATASET == "cifar100": + if mode == 'train': + dataset = datasets.Cifar100(mode=mode, transform=get_train_transforms(config)) + else: + mode = 'test' + dataset = datasets.Cifar100(mode=mode, transform=get_val_transforms(config)) + elif config.DATA.DATASET == "imagenet2012": + if mode == 'train': + dataset = ImageNet2012Dataset(config.DATA.DATA_PATH, + mode=mode, + transform=get_train_transforms(config)) + else: + dataset = ImageNet2012Dataset(config.DATA.DATA_PATH, + mode=mode, + transform=get_val_transforms(config)) + else: + raise NotImplementedError( + "[{config.DATA.DATASET}] Only cifar10, cifar100, imagenet2012 are supported now") + return dataset + + +def get_dataloader(config, dataset, mode='train', multi_process=False): + """Get dataloader with config, dataset, mode as input, allows multiGPU settings. + + Multi-GPU loader is implements as distributedBatchSampler. + + Args: + config: see config.py for details + dataset: paddle.io.dataset object + mode: train/val + multi_process: if True, use DistributedBatchSampler to support multi-processing + Returns: + dataloader: paddle.io.DataLoader object. + """ + + if mode == 'train': + batch_size = config.DATA.BATCH_SIZE + else: + batch_size = config.DATA.BATCH_SIZE_EVAL + + if multi_process is True: + sampler = DistributedBatchSampler(dataset, + batch_size=batch_size, + shuffle=(mode == 'train')) + dataloader = DataLoader(dataset, + batch_sampler=sampler, + num_workers=config.DATA.NUM_WORKERS) + else: + dataloader = DataLoader(dataset, + batch_size=batch_size, + num_workers=config.DATA.NUM_WORKERS, + shuffle=(mode == 'train')) + return dataloader diff --git a/image_classification/BoTNet/droppath.py b/image_classification/BoTNet/droppath.py new file mode 100644 index 00000000..f5d3fcaa --- /dev/null +++ b/image_classification/BoTNet/droppath.py @@ -0,0 +1,60 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Droppath, reimplement from https://github.com/yueatsprograms/Stochastic_Depth +""" + +import paddle +import paddle.nn as nn + + +class DropPath(nn.Layer): + """DropPath class""" + def __init__(self, drop_prob=None): + super(DropPath, self).__init__() + self.drop_prob = drop_prob + + def drop_path(self, inputs): + """drop path op + Args: + input: tensor with arbitrary shape + drop_prob: float number of drop path probability, default: 0.0 + training: bool, if current mode is training, default: False + Returns: + output: output tensor after drop path + """ + # if prob is 0 or eval mode, return original input + if self.drop_prob == 0. or not self.training: + return inputs + keep_prob = 1 - self.drop_prob + keep_prob = paddle.to_tensor(keep_prob, dtype='float32') + shape = (inputs.shape[0], ) + (1, ) * (inputs.ndim - 1) # shape=(N, 1, 1, 1) + random_tensor = keep_prob + paddle.rand(shape, dtype=inputs.dtype) + random_tensor = random_tensor.floor() # mask + output = inputs.divide(keep_prob) * random_tensor #divide is to keep same output expectation + return output + + def forward(self, inputs): + return self.drop_path(inputs) + + +#def main(): +# tmp = paddle.to_tensor(np.random.rand(8, 16, 8, 8), dtype='float32') +# dp = DropPath(0.5) +# out = dp(tmp) +# print(out) +# +#if __name__ == "__main__": +# main() diff --git a/image_classification/BoTNet/img.png b/image_classification/BoTNet/img.png new file mode 100644 index 00000000..80b7a153 Binary files /dev/null and b/image_classification/BoTNet/img.png differ diff --git a/image_classification/BoTNet/losses.py b/image_classification/BoTNet/losses.py new file mode 100644 index 00000000..f67780a2 --- /dev/null +++ b/image_classification/BoTNet/losses.py @@ -0,0 +1,121 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
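+# Note (illustrative summary of the classes defined below): LabelSmoothingCrossEntropyLoss
+# computes (1 - smoothing) * NLL(target) + smoothing * mean_c(-log p_c), which equals the
+# cross entropy against a target distribution that keeps (1 - smoothing) on the true class
+# and spreads `smoothing` uniformly over all classes.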
+ +""" Implement Loss functions """ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class LabelSmoothingCrossEntropyLoss(nn.Layer): + """ cross entropy loss for label smoothing + Args: + smoothing: float, smoothing rate + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, target label with shape [N] + Return: + loss: float, cross entropy loss value + """ + def __init__(self, smoothing=0.1): + super().__init__() + assert 0 <= smoothing < 1.0 + self.smoothing = smoothing + self.confidence = 1 - smoothing + + def forward(self, x, target): + log_probs = F.log_softmax(x) # [N, num_classes] + # target_index is used to get prob for each of the N samples + target_index = paddle.zeros([x.shape[0], 2], dtype='int64') # [N, 2] + target_index[:, 0] = paddle.arange(x.shape[0]) + target_index[:, 1] = target + + nll_loss = -log_probs.gather_nd(index=target_index) # index: [N] + smooth_loss = -log_probs.mean(axis=-1) + loss = self.confidence * nll_loss + self.smoothing * smooth_loss + return loss.mean() + + +class SoftTargetCrossEntropyLoss(nn.Layer): + """ cross entropy loss for soft target + Args: + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, soft target with shape [N, num_classes] + Returns: + loss: float, the mean loss value + """ + def __init__(self): + super().__init__() + + def forward(self, x, target): + loss = paddle.sum(-target * F.log_softmax(x, axis=-1), axis=-1) + return loss.mean() + + +class DistillationLoss(nn.Layer): + """Distillation loss function + This layer includes the orginal loss (criterion) and a extra + distillation loss (criterion), which computes the loss with + different type options, between current model and + a teacher model as its supervision. 
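+    Roughly, loss = (1 - alpha) * base_loss + alpha * distillation_loss, where the
+    'soft' option uses a temperature-scaled KL term between the student's distillation
+    output and the teacher's output (scaled by tau * tau), and the 'hard' option uses
+    the teacher's argmax prediction as the label for a plain cross entropy.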
+ + Args: + base_criterion: nn.Layer, the original criterion + teacher_model: nn.Layer, the teacher model as supervision + distillation_type: str, one of ['none', 'soft', 'hard'] + alpha: float, ratio of base loss (* (1-alpha)) + and distillation loss( * alpha) + tao: float, temperature in distillation + """ + def __init__(self, + base_criterion, + teacher_model, + distillation_type, + alpha, + tau): + super().__init__() + assert distillation_type in ['none', 'soft', 'hard'] + self.base_criterion = base_criterion + self.teacher_model = teacher_model + self.type = distillation_type + self.alpha = alpha + self.tau = tau + + def forward(self, inputs, outputs, targets): + """ + Args: + inputs: tensor, the orginal model inputs + outputs: tensor, the outputs of the model + outputds_kd: tensor, the distillation outputs of the model, + this is usually obtained by a separate branch + in the last layer of the model + targets: tensor, the labels for the base criterion + """ + outputs, outputs_kd = outputs[0], outputs[1] + base_loss = self.base_criterion(outputs, targets) + if self.type == 'none': + return base_loss + + with paddle.no_grad(): + teacher_outputs = self.teacher_model(inputs) + + if self.type == 'soft': + distillation_loss = F.kl_div( + F.log_softmax(outputs_kd / self.tau, axis=1), + F.log_softmax(teacher_outputs / self.tau, axis=1), + reduction='sum') * (self.tau * self.tau) / outputs_kd.numel() + elif self.type == 'hard': + distillation_loss = F.cross_entropy(outputs_kd, teacher_outputs.argmax(axis=1)) + + loss = base_loss * (1 - self.alpha) + distillation_loss * self.alpha + return loss diff --git a/image_classification/BoTNet/main_multi_gpu.py b/image_classification/BoTNet/main_multi_gpu.py new file mode 100644 index 00000000..d54afe8e --- /dev/null +++ b/image_classification/BoTNet/main_multi_gpu.py @@ -0,0 +1,591 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
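+# Example usage (illustrative; dataset path, batch size and GPU count are placeholders):
+#   python main_multi_gpu.py -cfg='./configs/botnet50.yaml' -dataset='imagenet2012' \
+#       -batch_size=32 -data_path='/path/to/ILSVRC2012' -ngpus=8
+# Evaluation only (weights are read from <path>.pdparams, so pass the path without the suffix):
+#   python main_multi_gpu.py -cfg='./configs/botnet50.yaml' -dataset='imagenet2012' \
+#       -batch_size=32 -data_path='/path/to/ILSVRC2012' -eval -pretrained='/path/to/botnet50'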
+ +"""BoTNet training/validation using multiple GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +import paddle.distributed as dist +from datasets import get_dataloader +from datasets import get_dataset +from utils import AverageMeter +from utils import WarmupCosineScheduler +from mixup import Mixup +from config import get_config +from config import update_config +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from botnet import build_botnet50 as build_model + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('BoTNet') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + mixup_fn=None, + amp=False, + local_logger=None, + master_logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + master_train_loss_meter.avg: float, average loss on all processes/gpus + master_train_acc_meter.avg: float, average top1 accuracy on all processes/gpus + train_time: float, training time + """ + model.train() + train_loss_meter = AverageMeter() + train_acc_meter = AverageMeter() 
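+    # The two meters above track this process only; the master_* meters below hold the
+    # values averaged over all GPUs via dist.all_reduce inside the batch loop.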
+ master_train_loss_meter = AverageMeter() + master_train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() + + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = paddle.to_tensor(image.shape[0]) + + # sync from other gpus for overall loss and acc + master_loss = loss.clone() + master_acc = acc.clone() + master_batch_size = batch_size.clone() + dist.all_reduce(master_loss) + dist.all_reduce(master_acc) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc = master_acc / dist.get_world_size() + master_train_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_train_acc_meter.update(master_acc.numpy()[0], master_batch_size.numpy()[0]) + + train_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + train_acc_meter.update(acc.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_train_loss_meter.avg:.4f}, " + + f"Avg Acc: {master_train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return (train_loss_meter.avg, + train_acc_meter.avg, + master_train_loss_meter.avg, + master_train_acc_meter.avg, + train_time) + + +def validate(dataloader, + model, + criterion, + total_batch, + debug_steps=100, + local_logger=None, + master_logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + total_epoch: int, total num of epoch, for logging + debug_steps: int, num of iters to log info, default: 100 + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + master_val_loss_meter.avg: float, average loss on all processes/gpus + master_val_acc1_meter.avg: float, average top1 accuracy on all 
processes/gpus + master_val_acc5_meter.avg: float, average top5 accuracy on all processes/gpus + val_time: float, validation time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = AverageMeter() + master_val_loss_meter = AverageMeter() + master_val_acc1_meter = AverageMeter() + master_val_acc5_meter = AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = paddle.to_tensor(image.shape[0]) + + master_loss = loss.clone() + master_acc1 = acc1.clone() + master_acc5 = acc5.clone() + master_batch_size = batch_size.clone() + + dist.all_reduce(master_loss) + dist.all_reduce(master_acc1) + dist.all_reduce(master_acc5) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc1 = master_acc1 / dist.get_world_size() + master_acc5 = master_acc5 / dist.get_world_size() + + master_val_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc1_meter.update(master_acc1.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc5_meter.update(master_acc5.numpy()[0], master_batch_size.numpy()[0]) + + val_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + val_acc1_meter.update(acc1.numpy()[0], batch_size.numpy()[0]) + val_acc5_meter.update(acc5.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {master_val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {master_val_acc5_meter.avg:.4f}") + val_time = time.time() - time_st + return (val_loss_meter.avg, + val_acc1_meter.avg, + val_acc5_meter.avg, + master_val_loss_meter.avg, + master_val_acc1_meter.avg, + master_val_acc5_meter.avg, + val_time) + + +def main_worker(*args): + """main method for each process""" + # STEP 0: Preparation + config = args[0] + dist.init_parallel_env() + last_epoch = config.TRAIN.LAST_EPOCH + world_size = dist.get_world_size() + local_rank = dist.get_rank() + seed = config.SEED + local_rank + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + # logger for each process/gpu + local_logger = get_logger( + filename=os.path.join(config.SAVE, 'log_{}.txt'.format(local_rank)), + logger_name='local_logger') + # overall logger + if local_rank == 0: + master_logger = get_logger( + filename=os.path.join(config.SAVE, 'log.txt'), + logger_name='master_logger') + master_logger.info(f'\n{config}') + else: + master_logger = None + local_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + if local_rank == 0: + master_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + + # STEP 1: Create model + model = build_model(config) + model = paddle.DataParallel(model) + + # STEP 2: Create train and val dataloader + dataset_train, dataset_val = args[1], args[2] + # Create training dataloader + if not config.EVAL: + dataloader_train = 
get_dataloader(config, dataset_train, 'train', True) + total_batch_train = len(dataloader_train) + local_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + if local_rank == 0: + master_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + # Create validation dataloader + dataloader_val = get_dataloader(config, dataset_val, 'test', True) + total_batch_val = len(dataloader_val) + local_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + if local_rank == 0: + master_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH, + ) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + local_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported 
Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip, + #apply_decay_param_fun=get_exclude_from_weight_decay_fn(['pos_embed', 'cls_token']), + ) + else: + local_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 6: Load pretrained model / load resumt model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + model.set_dict(model_state) + local_logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + if local_rank == 0: + master_logger.info( + f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') + optimizer.set_state_dict(opt_state) + local_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + if local_rank == 0: + master_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + + # STEP 6: Validation (eval mode) + if config.EVAL: + local_logger.info('----- Start Validating') + if local_rank == 0: + master_logger.info('----- Start Validating') + + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + + local_logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 7: Start training and validation (train mode) + local_logger.info(f"Start training from epoch {last_epoch+1}.") + if local_rank == 0: + master_logger.info(f"Start training from epoch {last_epoch+1}.") + for epoch in 
range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + local_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + if local_rank == 0: + master_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + + train_loss, train_acc, avg_loss, avg_acc, train_time = train( + dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=total_batch_train, + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + mixup_fn=mixup_fn, + amp=config.AMP, + local_logger=local_logger, + master_logger=master_logger) + + scheduler.step() + + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {avg_loss:.4f}, " + + f"Train Acc: {avg_acc:.4f}, " + + f"time: {train_time:.2f}") + + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + local_logger.info(f'----- Validation after Epoch: {epoch}') + if local_rank == 0: + master_logger.info(f'----- Validation after Epoch: {epoch}') + + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if local_rank == 0: + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + master_logger.info(f"----- Save model: {model_path}.pdparams") + master_logger.info(f"----- Save optim: {model_path}.pdopt") + + +def main(): + """main method for spawning multi process training/validation""" + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + + # get dataset and start DDP + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + else: + dataset_train = None + dataset_val = get_dataset(config, mode='val') + config.NGPUS = len(paddle.static.cuda_places()) if config.NGPUS == -1 else config.NGPUS + dist.spawn(main_worker, args=(config, dataset_train, dataset_val, ), nprocs=config.NGPUS) + + +if __name__ == "__main__": + main() diff --git 
a/image_classification/BoTNet/main_single_gpu.py b/image_classification/BoTNet/main_single_gpu.py new file mode 100644 index 00000000..4c2d7fd5 --- /dev/null +++ b/image_classification/BoTNet/main_single_gpu.py @@ -0,0 +1,422 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""BoTNet training/validation using single GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from datasets import get_dataloader +from datasets import get_dataset +from utils import AverageMeter +from utils import WarmupCosineScheduler +from config import get_config +from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from botnet import build_botnet50 as build_model + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('BoTNet') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + mixup_fn=None, + amp=False, + logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: 
int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + logger: logger for logging, default: None + Returns: + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + train_time: float, training time + """ + model.train() + train_loss_meter = AverageMeter() + train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() + + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = image.shape[0] + train_loss_meter.update(loss.numpy()[0], batch_size) + train_acc_meter.update(acc.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return train_loss_meter.avg, train_acc_meter.avg, train_time + + +def validate(dataloader, model, criterion, total_batch, debug_steps=100, logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + logger: logger for logging, default: None + Returns: + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + val_time: float, valitaion time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = image.shape[0] + val_loss_meter.update(loss.numpy()[0], batch_size) + val_acc1_meter.update(acc1.numpy()[0], batch_size) + val_acc5_meter.update(acc5.numpy()[0], 
batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + + val_time = time.time() - time_st + return val_loss_meter.avg, val_acc1_meter.avg, val_acc5_meter.avg, val_time + + +def main(): + # STEP 0: Preparation + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + last_epoch = config.TRAIN.LAST_EPOCH + seed = config.SEED + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + logger = get_logger(filename=os.path.join(config.SAVE, 'log.txt')) + logger.info(f'\n{config}') + + # STEP 1: Create model + model = build_model(config) + + # STEP 2: Create train and val dataloader + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + dataloader_train = get_dataloader(config, dataset_train, 'train', False) + dataset_val = get_dataset(config, mode='val') + dataloader_val = get_dataloader(config, dataset_val, 'val', False) + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + 
total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip) + else: + logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 6: Load pretrained model or load resume model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + model.set_dict(model_state) + logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') + optimizer.set_state_dict(opt_state) + logger.info( + f"----- Resume: Load model and optmizer from {config.MODEL.RESUME}") + + # STEP 7: Validation (eval mode) + if config.EVAL: + logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 8: Start training and validation (train mode) + logger.info(f"Start training from epoch {last_epoch+1}.") + for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + logger.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") + train_loss, train_acc, train_time = train(dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=len(dataloader_train), + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + mixup_fn=mixup_fn, + amp=config.AMP, + logger=logger) + scheduler.step() + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + logger.info(f"----- Save model: {model_path}.pdparams") + logger.info(f"----- Save optim: {model_path}.pdopt") + + +if __name__ == "__main__": + main() diff --git a/image_classification/BoTNet/mixup.py b/image_classification/BoTNet/mixup.py new file mode 100644 index 00000000..1d2db493 --- /dev/null +++ b/image_classification/BoTNet/mixup.py @@ -0,0 +1,225 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""mixup and cutmix for batch data""" +import numpy as np +import paddle + + +def rand_bbox(image_shape, lam, count=None): + """ CutMix bbox by lam value + Generate 1 random bbox by value lam. lam is the cut size rate. + The cut_size is computed by sqrt(1-lam) * image_size. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + count: int, number of bbox to generate + """ + image_h, image_w = image_shape[-2:] + cut_rate = np.sqrt(1. 
- lam) + cut_h = int(cut_rate * image_h) + cut_w = int(cut_rate * image_w) + + # get random bbox center + cy = np.random.randint(0, image_h, size=count) + cx = np.random.randint(0, image_w, size=count) + + # get bbox coords + bbox_x1 = np.clip(cx - cut_w // 2, 0, image_w) + bbox_y1 = np.clip(cy - cut_h // 2, 0, image_h) + bbox_x2 = np.clip(cx + cut_w // 2, 0, image_w) + bbox_y2 = np.clip(cy + cut_h // 2, 0, image_h) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # while in pytorch, it will return [] tensor + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def rand_bbox_minmax(image_shape, minmax, count=None): + """ CutMix bbox by min and max value + Generate 1 random bbox by min and max percentage values. + Minmax is a tuple/list of min and max percentage vlaues + applied to the image width and height. + + Args: + image_shape: tuple/list, image height and width + minmax: tuple/list, min and max percentage values of image size + count: int, number of bbox to generate + """ + assert len(minmax) == 2 + image_h, image_w = image_shape[-2:] + min_ratio = minmax[0] + max_ratio = minmax[1] + cut_h = np.random.randint(int(image_h * min_ratio), int(image_h * max_ratio), size=count) + cut_w = np.random.randint(int(image_w * min_ratio), int(image_w * max_ratio), size=count) + + bbox_x1 = np.random.randint(0, image_w - cut_w, size=count) + bbox_y1 = np.random.randint(0, image_h - cut_h, size=count) + bbox_x2 = bbox_x1 + cut_w + bbox_y2 = bbox_y1 + cut_h + + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def cutmix_generate_bbox_adjust_lam(image_shape, lam, minmax=None, correct_lam=True, count=None): + """Generate bbox and apply correction for lambda + If the mimmax is None, apply the standard cutmix by lam value, + If the minmax is set, apply the cutmix by min and max percentage values. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + minmax: tuple/list, min and max percentage values of image size + correct_lam: bool, if True, correct the lam value by the generated bbox + count: int, number of bbox to generate + """ + if minmax is not None: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox_minmax(image_shape, minmax, count) + else: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox(image_shape, lam, count) + + if correct_lam or minmax is not None: + image_h, image_w = image_shape[-2:] + bbox_area = (bbox_y2 - bbox_y1) * (bbox_x2 - bbox_x1) + lam = 1. - bbox_area / float(image_h * image_w) + return (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam + + +def one_hot(x, num_classes, on_value=1., off_value=0.): + """ Generate one-hot vector for label smoothing + Args: + x: tensor, contains label/class indices + num_classes: int, num of classes (len of the one-hot vector) + on_value: float, the vector value at label index, default=1. + off_value: float, the vector value at non-label indices, default=0. + Returns: + one_hot: tensor, tensor with on value at label index and off value + at non-label indices. + """ + x = x.reshape_([-1, 1]) + x_smoothed = paddle.full((x.shape[0], num_classes), fill_value=off_value) + for i in range(x.shape[0]): + x_smoothed[i, x[i]] = on_value + return x_smoothed + + +def mixup_one_hot(label, num_classes, lam=1., smoothing=0.): + """ mixup and label smoothing in batch + label smoothing is firstly applied, then + mixup is applied by mixing the bacth and its flip, + with a mixup rate. 
+ + Args: + label: tensor, label tensor with shape [N], contains the class indices + num_classes: int, num of all classes + lam: float, mixup rate, default=1.0 + smoothing: float, label smoothing rate + """ + off_value = smoothing / num_classes + on_value = 1. - smoothing + off_value + y1 = one_hot(label, num_classes, on_value, off_value) + y2 = one_hot(label.flip(axis=[0]), num_classes, on_value, off_value) + return y2 * (1 - lam) + y1 * lam + + +class Mixup: + """Mixup class + Args: + mixup_alpha: float, mixup alpha for beta distribution, default=1.0, + cutmix_alpha: float, cutmix alpha for beta distribution, default=0.0, + cutmix_minmax: list/tuple, min and max value for cutmix ratio, default=None, + prob: float, if random prob < prob, do not use mixup, default=1.0, + switch_prob: float, prob of switching mixup and cutmix, default=0.5, + mode: string, mixup up, now only 'batch' is supported, default='batch', + correct_lam: bool, if True, apply correction of lam, default=True, + label_smoothing: float, label smoothing rate, default=0.1, + num_classes: int, num of classes, default=1000 + """ + def __init__(self, + mixup_alpha=1.0, + cutmix_alpha=0.0, + cutmix_minmax=None, + prob=1.0, + switch_prob=0.5, + mode='batch', + correct_lam=True, + label_smoothing=0.1, + num_classes=1000): + self.mixup_alpha = mixup_alpha + self.cutmix_alpha = cutmix_alpha + self.cutmix_minmax = cutmix_minmax + if cutmix_minmax is not None: + assert len(cutmix_minmax) == 2 + self.cutmix_alpha = 1.0 + self.mix_prob = prob + self.switch_prob = switch_prob + self.label_smoothing = label_smoothing + self.num_classes = num_classes + self.mode = mode + self.correct_lam = correct_lam + assert mode == 'batch', 'Now only batch mode is supported!' + + def __call__(self, x, target): + assert x.shape[0] % 2 == 0, "Batch size should be even" + lam = self._mix_batch(x) + target = mixup_one_hot(target, self.num_classes, lam, self.label_smoothing) + return x, target + + def get_params(self): + """Decide to use cutmix or regular mixup by sampling and + sample lambda for mixup + """ + lam = 1. + use_cutmix = False + use_mixup = np.random.rand() < self.mix_prob + if use_mixup: + if self.mixup_alpha > 0. and self.cutmix_alpha > 0.: + use_cutmix = np.random.rand() < self.switch_prob + alpha = self.cutmix_alpha if use_cutmix else self.mixup_alpha + lam_mix = np.random.beta(alpha, alpha) + elif self.mixup_alpha == 0. and self.cutmix_alpha > 0.: + use_cutmix=True + lam_mix = np.random.beta(self.cutmix_alpha, self.cutmix_alpha) + elif self.mixup_alpha > 0. 
and self.cutmix_alpha == 0.: + lam_mix = np.random.beta(self.mixup_alpha, self.mixup_alpha) + else: + raise ValueError('mixup_alpha and cutmix_alpha cannot be all 0') + lam = float(lam_mix) + return lam, use_cutmix + + def _mix_batch(self, x): + """mixup/cutmix by adding batch data and its flipped version""" + lam, use_cutmix = self.get_params() + if lam == 1.: + return lam + if use_cutmix: + (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam = cutmix_generate_bbox_adjust_lam( + x.shape, + lam, + minmax=self.cutmix_minmax, + correct_lam=self.correct_lam) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # but in pytorch, it will return [] tensor without errors + if int(bbox_x1) != int(bbox_x2) and int(bbox_y1) != int(bbox_y2): + x[:, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] = x.flip(axis=[0])[ + :, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] + else: + x_flipped = x.flip(axis=[0]) + x_flipped = x_flipped * (1 - lam) + x.set_value(x * (lam) + x_flipped) + return lam diff --git a/image_classification/BoTNet/random_erasing.py b/image_classification/BoTNet/random_erasing.py new file mode 100644 index 00000000..05c90938 --- /dev/null +++ b/image_classification/BoTNet/random_erasing.py @@ -0,0 +1,118 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Random Erasing for image tensor""" + +import random +import math +import paddle + + +def _get_pixels(per_pixel, rand_color, patch_size, dtype="float32"): + if per_pixel: + return paddle.normal(shape=patch_size).astype(dtype) + elif rand_color: + return paddle.normal(shape=(patch_size[0], 1, 1)).astype(dtype) + return paddle.zeros((patch_size[0], 1, 1)).astype(dtype) + + +class RandomErasing(object): + """ + Args: + prob: probability of performing random erasing + min_area: Minimum percentage of erased area wrt input image area + max_area: Maximum percentage of erased area wrt input image area + min_aspect: Minimum aspect ratio of earsed area + max_aspect: Maximum aspect ratio of earsed area + mode: pixel color mode, in ['const', 'rand', 'pixel'] + 'const' - erase block is constant valued 0 for all channels + 'rand' - erase block is valued random color (same per-channel) + 'pixel' - erase block is vauled random color per pixel + min_count: Minimum # of ereasing blocks per image. + max_count: Maximum # of ereasing blocks per image. 
Area per box is scaled by count + per-image count is randomly chosen between min_count to max_count + """ + def __init__(self, prob=0.5, min_area=0.02, max_area=1/3, min_aspect=0.3, max_aspect=None, + mode='const', min_count=1, max_count=None, num_splits=0): + self.prob = prob + self.min_area = min_area + self.max_area = max_area + max_aspect = max_aspect or 1 / min_aspect + self.log_aspect_ratio = (math.log(min_aspect), math.log(max_aspect)) + self.min_count = min_count + self.max_count = max_count or min_count + self.num_splits = num_splits + mode = mode.lower() + self.rand_color = False + self.per_pixel = False + if mode == "rand": + self.rand_color = True + elif mode == "pixel": + self.per_pixel = True + else: + assert not mode or mode == "const" + + def _erase(self, img, chan, img_h, img_w, dtype): + if random.random() > self.prob: + return + area = img_h * img_w + count = self.min_count if self.min_count == self.max_count else \ + random.randint(self.min_count, self.max_count) + for _ in range(count): + for attempt in range(10): + target_area = random.uniform(self.min_area, self.max_area) * area / count + aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio)) + h = int(round(math.sqrt(target_area * aspect_ratio))) + w = int(round(math.sqrt(target_area / aspect_ratio))) + if w < img_w and h < img_h: + top = random.randint(0, img_h - h) + left = random.randint(0, img_w - w) + img[:, top:top+h, left:left+w] = _get_pixels( + self.per_pixel, self.rand_color, (chan, h, w), + dtype=dtype) + break + + def __call__(self, inputs): + if len(inputs.shape) == 3: + self._erase(inputs, *inputs.shape, inputs.dtype) + else: + batch_size, chan, img_h, img_w = inputs.shape + batch_start = batch_size // self.num_splits if self.num_splits > 1 else 0 + for i in range(batch_start, batch_size): + self._erase(inputs[i], chan, img_h, img_w, inputs.dtype) + return inputs + + + +#def main(): +# re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='rand') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='const') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='pixel') +# import PIL.Image as Image +# import numpy as np +# paddle.set_device('cpu') +# img = paddle.to_tensor(np.asarray(Image.open('./lenna.png'))).astype('float32') +# img = img / 255.0 +# img = paddle.transpose(img, [2, 0, 1]) +# new_img = re(img) +# new_img = new_img * 255.0 +# new_img = paddle.transpose(new_img, [1, 2, 0]) +# new_img = new_img.cpu().numpy() +# new_img = Image.fromarray(new_img.astype('uint8')) +# new_img.save('./res.png') +# +# +# +#if __name__ == "__main__": +# main() diff --git a/image_classification/BoTNet/resnet.py b/image_classification/BoTNet/resnet.py new file mode 100644 index 00000000..1fc6a8ad --- /dev/null +++ b/image_classification/BoTNet/resnet.py @@ -0,0 +1,288 @@ +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +""" +Implement Transformer Class for ResNet +""" + +import paddle +import paddle.nn as nn + +from paddle.utils.download import get_weights_path_from_url + +__all__ = [ + 'ResNet', 'resnet18', 'resnet34', 'resnet50', 'resnet101', 'resnet152' +] + +model_urls = { + 'resnet18': ('https://paddle-hapi.bj.bcebos.com/models/resnet18.pdparams', + 'cf548f46534aa3560945be4b95cd11c4'), + 'resnet34': ('https://paddle-hapi.bj.bcebos.com/models/resnet34.pdparams', + '8d2275cf8706028345f78ac0e1d31969'), + 'resnet50': ('https://paddle-hapi.bj.bcebos.com/models/resnet50.pdparams', + 'ca6f485ee1ab0492d38f323885b0ad80'), + 'resnet101': ('https://paddle-hapi.bj.bcebos.com/models/resnet101.pdparams', + '02f35f034ca3858e1e54d4036443c92d'), + 'resnet152': ('https://paddle-hapi.bj.bcebos.com/models/resnet152.pdparams', + '7ad16a2f1e7333859ff986138630fd7a'), +} + + + +class BasicBlock(nn.Layer): + ''' + basic block + ''' + expansion = 1 + def __init__(self, + inplanes, + planes, + stride=1, + downsample=None, + dilation=1, + norm_layer=None): + super(BasicBlock, self).__init__() + if dilation > 1: + raise ValueError('Basic block does not support dilation') + if norm_layer is None: + norm_layer = nn.BatchNorm2D + self.conv1 = nn.Conv2D( + inplanes, planes, 3, padding=1, stride=stride, bias_attr=False) + self.bn1 = norm_layer(planes) + self.relu = nn.ReLU() + self.conv2 = nn.Conv2D( + planes, planes, 3, padding=1, bias_attr=False) + self.bn2 = norm_layer(planes) + self.downsample = downsample + self.stride = stride + + def forward(self, x): + identity = x + out = self.conv1(x) + out = self.bn1(out) + out = self.relu(out) + out = self.conv2(out) + out = self.bn2(out) + if self.downsample is not None: + identity = self.downsample(x) + out += identity + out = self.relu(out) + return out + + +class BottleneckBlock(nn.Layer): + ''' + bottleneck block + ''' + expansion = 4 + def __init__(self, + inplanes, + planes, + stride=1, + downsample=None, + groups=1, + base_width=64, + dilation=1, + norm_layer=None): + super(BottleneckBlock, self).__init__() + if norm_layer is None: + norm_layer = nn.BatchNorm2D + width = int(planes * (base_width / 64.)) * groups + self.conv1 = nn.Conv2D(inplanes, width, 1, bias_attr=False) + self.bn1 = norm_layer(width) + + self.conv2 = nn.Conv2D(width, + width, + 3, + padding=dilation, + stride=stride, + groups=groups, + dilation=dilation, + bias_attr=False) + self.bn2 = norm_layer(width) + self.conv3 = nn.Conv2D(width, planes * self.expansion, 1, bias_attr=False) + self.bn3 = norm_layer(planes * self.expansion) + self.relu = nn.ReLU() + self.downsample = downsample + self.stride = stride + + def forward(self, x): + identity = x + + out = self.conv1(x) + out = self.bn1(out) + out = self.relu(out) + + out = self.conv2(out) + out = self.bn2(out) + out = self.relu(out) + + out = self.conv3(out) + out = self.bn3(out) + + if self.downsample is not None: + identity = self.downsample(x) + + out += identity + out = self.relu(out) + + return out + + +class ResNet(nn.Layer): + ''' + ResNet + ''' + def __init__(self, + block, + depth, + num_classes=1000, + with_pool=True, + norm_layer=None, + replace_stride_with_dilation=None, + dilation=1): + super(ResNet, self).__init__() + layer_cfg = { + 18: [2, 2, 2, 2], + 34: [3, 4, 6, 3], + 50: [3, 4, 6, 3], + 101: [3, 4, 23, 3], + 152: [3, 8, 36, 3] + } + + if replace_stride_with_dilation is None: + replace_stride_with_dilation = [False, False, False] + if len(replace_stride_with_dilation) != 3: + raise ValueError('replace_stride_with_dilation shoule be None or 
3-element tuple') + + layers = layer_cfg[depth] + self.num_classes = num_classes + self.with_pool = with_pool + + if norm_layer is None: + norm_layer = nn.BatchNorm2D + self._norm_layer = norm_layer + + + self.inplanes = 64 + self.dilation = dilation + + self.conv1 = nn.Conv2D( + 3, + self.inplanes, + kernel_size=7, + stride=2, + padding=3, + bias_attr=False) + self.bn1 = self._norm_layer(self.inplanes) + self.relu = nn.ReLU() + self.maxpool = nn.MaxPool2D(kernel_size=3, stride=2, padding=1) + self.layer1 = self._make_layer(block, 64, layers[0]) + self.layer2 = self._make_layer(block, 128, layers[1], stride=2, + dilate=replace_stride_with_dilation[0]) + self.layer3 = self._make_layer(block, 256, layers[2], stride=2, + dilate=replace_stride_with_dilation[1]) + self.layer4 = self._make_layer(block, 512, layers[3], stride=2, + dilate=replace_stride_with_dilation[2]) + if with_pool: + self.avgpool = nn.AdaptiveAvgPool2D((1, 1)) + if num_classes > 0: + self.fc_layer = nn.Linear(512 * block.expansion, num_classes) + + def _make_layer(self, block, planes, blocks, stride=1, dilate=False): + norm_layer = self._norm_layer + downsample = None + previous_dilation = self.dilation + if dilate: + self.dilation *= stride + stride = 1 + if stride != 1 or self.inplanes != planes * block.expansion: + downsample = nn.Sequential( + nn.Conv2D(self.inplanes, + planes*block.expansion, + 1, + stride=stride, + bias_attr=False), + norm_layer(planes * block.expansion), + ) + + layers = [] + layers.append( + block(self.inplanes, planes, stride, downsample, 1, 64, + previous_dilation, norm_layer)) + self.inplanes = planes * block.expansion + for _ in range(1, blocks): + layers.append(block(self.inplanes, planes, norm_layer=norm_layer)) + + return nn.Sequential(*layers) + + def forward(self, x): + x = self.conv1(x) + x = self.bn1(x) + x = self.relu(x) + x = self.maxpool(x) + x = self.layer1(x) + x = self.layer2(x) + x = self.layer3(x) + x = self.layer4(x) + + if self.with_pool: + x = self.avgpool(x) + if self.num_classes > 0: + x = paddle.flatten(x, 1) + x = self._layer(x) + + return x + +def _resnet(arch, block, depth, pretrained, **kwargs): + ''' + build ResNet + ''' + model = ResNet(block, depth, **kwargs) + if pretrained: + assert arch in model_urls, f"{arch} model do not have a pretrained model now" + weight_path = get_weights_path_from_url(model_urls[arch][0], + model_urls[arch][1]) + param = paddle.load(weight_path) + model.set_dict(param) + return model + + +def resnet18(pretrained=False, **kwargs): + ''' + build ResNet18 + ''' + return _resnet('resnet18', BasicBlock, 18, pretrained, **kwargs) + +def resnet34(pretrained=False, **kwargs): + ''' + build ResNet34 + ''' + return _resnet('resnet34', BasicBlock, 34, pretrained, **kwargs) + +def resnet50(pretrained=False, **kwargs): + ''' + build ResNet50 + ''' + return _resnet('resnet50', BottleneckBlock, 50, pretrained, **kwargs) + +def resnet101(pretrained=False, **kwargs): + ''' + build ResNet101 + ''' + return _resnet('resnet101', BottleneckBlock, 101, pretrained, **kwargs) + +def resnet152(pretrained=False, **kwargs): + ''' + build ResNet152 + ''' + return _resnet('resnet152', BottleneckBlock, 152, pretrained, **kwargs) diff --git a/image_classification/BoTNet/run_eval.sh b/image_classification/BoTNet/run_eval.sh new file mode 100644 index 00000000..94a02647 --- /dev/null +++ b/image_classification/BoTNet/run_eval.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ +-cfg='./configs/botnet50_224.yaml' \ +-dataset='imagenet2012' \ 
+-batch_size=4 \ +-data_path='/dataset/imagenet' \ +-eval \ +-pretrained='./botnet50' \ diff --git a/image_classification/BoTNet/run_eval_multi.sh b/image_classification/BoTNet/run_eval_multi.sh new file mode 100644 index 00000000..d1740548 --- /dev/null +++ b/image_classification/BoTNet/run_eval_multi.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ +-cfg='./configs/botnet50_224.yaml' \ +-dataset='imagenet2012' \ +-batch_size=128 \ +-data_path='/dataset/imagenet' \ +-eval \ +-pretrained='./botnet50_new' \ diff --git a/image_classification/BoTNet/run_train.sh b/image_classification/BoTNet/run_train.sh new file mode 100644 index 00000000..3fad1e39 --- /dev/null +++ b/image_classification/BoTNet/run_train.sh @@ -0,0 +1,6 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ +-cfg='./configs/botnet50.yaml' \ +-dataset='imagenet2012' \ +-batch_size=16 \ +-data_path='/dataset/imagenet' \ diff --git a/image_classification/BoTNet/run_train_multi.sh b/image_classification/BoTNet/run_train_multi.sh new file mode 100644 index 00000000..058c70b0 --- /dev/null +++ b/image_classification/BoTNet/run_train_multi.sh @@ -0,0 +1,6 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ +-cfg='./configs/botnet50.yaml' \ +-dataset='imagenet2012' \ +-batch_size=16 \ +-data_path='/dataset/imagenet' \ diff --git a/image_classification/BoTNet/stat_define.py b/image_classification/BoTNet/stat_define.py new file mode 100644 index 00000000..f1b3be1c --- /dev/null +++ b/image_classification/BoTNet/stat_define.py @@ -0,0 +1,61 @@ +import os +import glob +import paddle +from config import get_config +from botnet import build_botnet50 as build_model + +def count_gelu(layer, input, output): + activation_flops = 8 + x = input[0] + num = x.numel() + layer.total_ops += num * activation_flops + + +def count_softmax(layer, input, output): + softmax_flops = 5 # max/substract, exp, sum, divide + x = input[0] + num = x.numel() + layer.total_ops += num * softmax_flops + + +def count_layernorm(layer, input, output): + layer_norm_flops = 5 # get mean (sum), get variance (square and sum), scale(multiply) + x = input[0] + num = x.numel() + layer.total_ops += num * layer_norm_flops + + +#cfg = './configs/xcit_nano_12_p8_224.yaml' +#input_size = (1, 3, 224, 224) +#cfg = './configs/xcit_large_24_p16_384.yaml' +#input_size = (1, 3, 384, 384) +#config = get_config(cfg) +#model = build_model(config) + +#custom_ops = {paddle.nn.GELU: count_gelu, +# paddle.nn.LayerNorm: count_layernorm, +# paddle.nn.Softmax: count_softmax, +# } +#print(os.path.basename(cfg)) +#paddle.flops(model, +# input_size=input_size, +# custom_ops=custom_ops, +# print_detail=False) + + +for cfg in glob.glob('./configs/*.yaml'): + input_size = (1, 3, 224, 224) + config = get_config(cfg) + model = build_model(config) + + + custom_ops = {paddle.nn.GELU: count_gelu, + paddle.nn.LayerNorm: count_layernorm, + paddle.nn.Softmax: count_softmax, + } + print(os.path.basename(cfg)) + paddle.flops(model, + input_size=input_size, + custom_ops=custom_ops, + print_detail=False) + print('-----------') diff --git a/image_classification/BoTNet/utils.py b/image_classification/BoTNet/utils.py new file mode 100644 index 00000000..ab0345aa --- /dev/null +++ b/image_classification/BoTNet/utils.py @@ -0,0 +1,120 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""utils for ViT + +Contains AverageMeter for monitoring, get_exclude_from_decay_fn for training +and WarmupCosineScheduler for training + +""" + +import math +from paddle.optimizer.lr import LRScheduler + + +class AverageMeter(): + """ Meter for monitoring losses""" + def __init__(self): + self.avg = 0 + self.sum = 0 + self.cnt = 0 + self.reset() + + def reset(self): + """reset all values to zeros""" + self.avg = 0 + self.sum = 0 + self.cnt = 0 + + def update(self, val, n=1): + """update avg by val and n, where val is the avg of n values""" + self.sum += val * n + self.cnt += n + self.avg = self.sum / self.cnt + + + +def get_exclude_from_weight_decay_fn(exclude_list=[]): + """ Set params with no weight decay during the training + + For certain params, e.g., positional encoding in ViT, weight decay + may not needed during the learning, this method is used to find + these params. + + Args: + exclude_list: a list of params names which need to exclude + from weight decay. + Returns: + exclude_from_weight_decay_fn: a function returns True if param + will be excluded from weight decay + """ + if len(exclude_list) == 0: + exclude_from_weight_decay_fn = None + else: + def exclude_fn(param): + for name in exclude_list: + if param.endswith(name): + return False + return True + exclude_from_weight_decay_fn = exclude_fn + return exclude_from_weight_decay_fn + + +class WarmupCosineScheduler(LRScheduler): + """Warmup Cosine Scheduler + + First apply linear warmup, then apply cosine decay schedule. + Linearly increase learning rate from "warmup_start_lr" to "start_lr" over "warmup_epochs" + Cosinely decrease learning rate from "start_lr" to "end_lr" over remaining + "total_epochs - warmup_epochs" + + Attributes: + learning_rate: the starting learning rate (without warmup), not used here! + warmup_start_lr: warmup starting learning rate + start_lr: the starting learning rate (without warmup) + end_lr: the ending learning rate after whole loop + warmup_epochs: # of epochs for warmup + total_epochs: # of total epochs (include warmup) + """ + def __init__(self, + learning_rate, + warmup_start_lr, + start_lr, + end_lr, + warmup_epochs, + total_epochs, + cycles=0.5, + last_epoch=-1, + verbose=False): + """init WarmupCosineScheduler """ + self.warmup_epochs = warmup_epochs + self.total_epochs = total_epochs + self.warmup_start_lr = warmup_start_lr + self.start_lr = start_lr + self.end_lr = end_lr + self.cycles = cycles + super(WarmupCosineScheduler, self).__init__(learning_rate, last_epoch, verbose) + + def get_lr(self): + """ return lr value """ + if self.last_epoch < self.warmup_epochs: + val = (self.start_lr - self.warmup_start_lr) * float( + self.last_epoch)/float(self.warmup_epochs) + self.warmup_start_lr + return val + + progress = float(self.last_epoch - self.warmup_epochs) / float( + max(1, self.total_epochs - self.warmup_epochs)) + val = max(0.0, 0.5 * (1. 
+ math.cos(math.pi * float(self.cycles) * 2.0 * progress))) + val = max(0.0, val * (self.start_lr - self.end_lr) + self.end_lr) + return val diff --git a/image_classification/CSwin/README.md b/image_classification/CSwin/README.md index a1363e7f..f8be714c 100644 --- a/image_classification/CSwin/README.md +++ b/image_classification/CSwin/README.md @@ -16,17 +16,18 @@ This implementation is developed by [PaddleViT](https://github.com/BR-IDL/Paddle ### Update -Update (2021-08-11): Code is released and ported weights are uploaded. +- Update (2021-09-27): Model FLOPs and # params are uploaded. +- Update (2021-08-11): Code is released and ported weights are uploaded. ## Models Zoo -| Model | Acc@1 | Acc@5 | Image Size | Crop_pct | Interpolation | Link | -|--------------------------------|-------|-------|------------|----------|---------------|--------------| -| cswin_tiny_224 | 82.81 | 96.30 | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1l-JY0u7NGyD6SjkyiyNnDx3wFFT1nAYO/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1L5FqU7ImWAhQHAlSilqVAw)(4q3h) | -| cswin_small_224 | 83.60 | 96.58 | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/10eEBk3wvJdQ8Dy58LvQ11Wk1K2UfPy-E/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1FiaNiWyAuWu1IBsUFLUaAw)(gt1a) | -| cswin_base_224 | 84.23 | 96.91 | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1YufKh3DKol4-HrF-I22uiorXSZDIXJmZ/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1koy8hXyGwvgAfUxdlkWofg)(wj8p) | -| cswin_large_224 | 86.52 | 97.99 | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1V1hteGK27t1nI84Ac7jdWfydBLLo7Fxt/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1KgIX6btML6kPiPGkIzvyVA)(b5fs) | -| cswin_base_384 | 85.51 | 97.48 | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1qCaFItzFoTYBo-4UbGzL6M5qVDGmJt4y/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1WNkY7o_vP9KJ8cd5c7n2sQ)(rkf5) | -| cswin_large_384 | 87.49 | 98.35 | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1LRN_6qUz71yP-OAOpN4Lscb8fkUytMic/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1eCIpegPj1HIbJccPMaAsew)(6235) | +| Model | Acc@1 | Acc@5 | #Params | FLOPs | Image Size | Crop_pct | Interpolation | Link | +|-------------------------------|-------|-------|---------|--------|------------|----------|---------------|--------------| +| cswin_tiny_224 | 82.81 | 96.30 | 22.3M | 4.2G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1l-JY0u7NGyD6SjkyiyNnDx3wFFT1nAYO/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1L5FqU7ImWAhQHAlSilqVAw)(4q3h) | +| cswin_small_224 | 83.60 | 96.58 | 34.6M | 6.5G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/10eEBk3wvJdQ8Dy58LvQ11Wk1K2UfPy-E/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1FiaNiWyAuWu1IBsUFLUaAw)(gt1a) | +| cswin_base_224 | 84.23 | 96.91 | 77.4M | 14.6G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1YufKh3DKol4-HrF-I22uiorXSZDIXJmZ/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1koy8hXyGwvgAfUxdlkWofg)(wj8p) | +| cswin_base_384 | 85.51 | 97.48 | 77.4M | 43.1G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1qCaFItzFoTYBo-4UbGzL6M5qVDGmJt4y/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1WNkY7o_vP9KJ8cd5c7n2sQ)(rkf5) | +| cswin_large_224 | 86.52 | 97.99 | 173.3M | 32.5G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1V1hteGK27t1nI84Ac7jdWfydBLLo7Fxt/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1KgIX6btML6kPiPGkIzvyVA)(b5fs) | +| 
cswin_large_384 | 87.49 | 98.35 | 173.3M | 96.1G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1LRN_6qUz71yP-OAOpN4Lscb8fkUytMic/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1eCIpegPj1HIbJccPMaAsew)(6235) | > *The results are evaluated on ImageNet2012 validation set. @@ -71,8 +72,8 @@ from cswin import build_cswin as build_model config = get_config('./configs/cswin_base_224.yaml') # build model model = build_model(config) -# load pretrained weights, .pdparams is NOT needed -model_state_dict = paddle.load('./cswin_base_224') +# load pretrained weights +model_state_dict = paddle.load('./cswin_base_224.pdparams') model.set_dict(model_state_dict) ``` @@ -85,12 +86,12 @@ or ```shell CUDA_VISIBLE_DEVICES=0 \ python main_single_gpu.py \ - -cfg='./configs/cswin_base_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/cswin_base_224.yaml \ + -dataset=imagenet2012 \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/val \ -eval \ - -pretrained='./cswin_base_224' + -pretrained=/path/to/pretrained/model/cswin_base_224 # .pdparams is NOT needed ```
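The `Crop_pct` column in the table above controls how each validation image is resized before the center crop: the shorter side is resized to `floor(image_size / crop_pct)` with bicubic interpolation, then an `image_size` center crop is taken. Below is a minimal sketch of that preprocessing, mirroring `get_val_transforms` in `datasets.py`; the mean/std constants follow the ImageNet defaults in `config.py`.

```python
import math
from paddle.vision import transforms

def build_val_transforms(image_size=224, crop_pct=0.9):
    # resize the shorter side to floor(image_size / crop_pct), then center crop
    scale_size = int(math.floor(image_size / crop_pct))
    return transforms.Compose([
        transforms.Resize(scale_size, interpolation='bicubic'),
        transforms.CenterCrop((image_size, image_size)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

# cswin_base_224: 224 / 0.9 -> resize shorter side to 248, crop 224
# cswin_base_384: 384 / 1.0 -> resize shorter side to 384, crop 384
val_transforms = build_val_transforms(image_size=224, crop_pct=0.9)
```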
@@ -107,12 +108,12 @@ or ```shell CUDA_VISIBLE_DEVICES=0,1,2,3 \ python main_multi_gpu.py \ - -cfg='./configs/cswin_base_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/cswin_base_224.yaml \ + -dataset=imagenet2012 \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/val \ -eval \ - -pretrained='./cswin_base_224' + -pretrained=/path/to/pretrained/model/cswin_base_224 # .pdparams is NOT needed ```
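The reported Acc@1/Acc@5 values are running averages accumulated with `AverageMeter` over the validation set. A minimal sketch of the per-batch top-k accuracy computation is shown below; the helper name `topk_accuracy` and the use of `paddle.metric.accuracy` are illustrative, not part of the evaluation scripts.

```python
import paddle
import paddle.nn.functional as F

def topk_accuracy(logits, labels):
    """Top-1 / top-5 accuracy for one batch.
    logits: float tensor [N, num_classes]; labels: int64 tensor [N, 1]."""
    probs = F.softmax(logits, axis=-1)
    acc1 = paddle.metric.accuracy(input=probs, label=labels, k=1)
    acc5 = paddle.metric.accuracy(input=probs, label=labels, k=5)
    return acc1, acc5
```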
@@ -126,10 +127,10 @@ or ```shell CUDA_VISIBLE_DEVICES=0 \ python main_single_gpu.py \ - -cfg='./configs/cswin_base_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/cswin_base_224.yaml \ + -dataset=imagenet2012 \ -batch_size=32 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/train \ ```
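Training uses the `warmupcosine` scheduler by default: the learning rate ramps linearly from `WARMUP_START_LR` to `BASE_LR` over `WARMUP_EPOCHS`, then decays cosinely to `END_LR` over the remaining epochs. A minimal sketch of the per-epoch value, following `WarmupCosineScheduler.get_lr` in `utils.py` (the default arguments below only mirror typical CSwin config values):

```python
import math

def warmup_cosine_lr(epoch, warmup_start_lr=1e-6, start_lr=1e-3, end_lr=1e-5,
                     warmup_epochs=20, total_epochs=300, cycles=0.5):
    if epoch < warmup_epochs:
        # linear warmup from warmup_start_lr to start_lr
        return (start_lr - warmup_start_lr) * epoch / warmup_epochs + warmup_start_lr
    # cosine decay from start_lr to end_lr over the remaining epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    val = max(0.0, 0.5 * (1.0 + math.cos(math.pi * cycles * 2.0 * progress)))
    return max(0.0, val * (start_lr - end_lr) + end_lr)

# lr at the start, right after warmup, and at the final epoch
print(warmup_cosine_lr(0), warmup_cosine_lr(20), warmup_cosine_lr(300))
```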
@@ -146,10 +147,10 @@ or ```shell CUDA_VISIBLE_DEVICES=0,1,2,3 \ python main_multi_gpu.py \ - -cfg='./configs/cswin_base_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/cswin_base_224.yaml \ + -dataset=imagenet2012 \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/train \ ```
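During training, mixup/cutmix mixes each batch with its flipped copy and replaces the hard labels with soft targets (label smoothing is applied first), which are then consumed by `SoftTargetCrossEntropyLoss`. A minimal sketch of the soft-target construction, following `one_hot` and `mixup_one_hot` in `mixup.py`; the helper name `soft_targets` is illustrative.

```python
import paddle

def soft_targets(labels, num_classes, lam=0.7, smoothing=0.1):
    # label smoothing: on/off values for the smoothed one-hot vectors
    off_value = smoothing / num_classes
    on_value = 1.0 - smoothing + off_value
    one_hot = paddle.full([labels.shape[0], num_classes], fill_value=off_value)
    for i in range(labels.shape[0]):
        one_hot[i, int(labels[i])] = on_value
    # mix the batch targets with the targets of the flipped batch
    flipped = one_hot.flip(axis=[0])
    return one_hot * lam + flipped * (1.0 - lam)

targets = soft_targets(paddle.to_tensor([3, 7]), num_classes=10, lam=0.6)
```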
diff --git a/image_classification/CSwin/__init__.py b/image_classification/CSwin/__init__.py new file mode 100644 index 00000000..e2cbd538 --- /dev/null +++ b/image_classification/CSwin/__init__.py @@ -0,0 +1 @@ +#init diff --git a/image_classification/CSwin/augment.py b/image_classification/CSwin/augment.py new file mode 100644 index 00000000..7a7f081c --- /dev/null +++ b/image_classification/CSwin/augment.py @@ -0,0 +1,285 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Augmentation""" +""" Rand Augmentation """ +# reference: RandAugment: Practical automated data augmentation with a reduced search space +# https://arxiv.org/abs/1909.13719 + +""" Auto Augmentation """ +# reference: AutoAugment: Learning Augmentation Policies from Data +# https://arxiv.org/abs/1805.09501 + +import random +import numpy as np +from PIL import Image, ImageEnhance, ImageOps + + +def auto_augment_policy_original(): + """25 types of augment policies in original paper""" + policy = [ + [('Posterize', 0.4, 8), ('Rotate', 0.6, 9)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + [('Posterize', 0.6, 7), ('Posterize', 0.6, 6)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Equalize', 0.4, 4), ('Rotate', 0.8, 8)], + [('Solarize', 0.6, 3), ('Equalize', 0.6, 7)], + [('Posterize', 0.8, 5), ('Equalize', 1.0, 2)], + [('Rotate', 0.2, 3), ('Solarize', 0.6, 8)], + [('Equalize', 0.6, 8), ('Posterize', 0.4, 6)], + [('Rotate', 0.8, 8), ('Color', 0.4, 0)], + [('Rotate', 0.4, 9), ('Equalize', 0.6, 2)], + [('Equalize', 0.0, 7), ('Equalize', 0.8, 8)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Rotate', 0.8, 8), ('Color', 1.0, 2)], + [('Color', 0.8, 8), ('Solarize', 0.8, 7)], + [('Sharpness', 0.4, 7), ('Invert', 0.6, 8)], + [('ShearX', 0.6, 5), ('Equalize', 1.0, 9)], + [('Color', 0.4, 0), ('Equalize', 0.6, 3)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + ] + policy = [[SubPolicy(*args) for args in subpolicy] for subpolicy in policy] + return policy + + +def rand_augment_policy_original(magnitude_idx=9): + """ + 14 types of augment policies in original paper + Args: + magnitude_idx: M + """ + policy = [ + ('Posterize', 1, magnitude_idx), ('Rotate', 1, magnitude_idx), + ('Solarize', 1, magnitude_idx), ('AutoContrast', 1, magnitude_idx), + ('Equalize', 1, magnitude_idx), ('Contrast', 1, magnitude_idx), + ('Color', 1, magnitude_idx), ('Invert', 1, magnitude_idx), + ('Sharpness', 1, magnitude_idx), ('Brightness', 1, magnitude_idx), + ('ShearX', 1, magnitude_idx), ('ShearY', 1, magnitude_idx), + ('TranslateX', 1, magnitude_idx), ('TranslateY', 1, magnitude_idx), + ] + policy = [SubPolicy(*args) for args in policy] + return policy + + +class AutoAugment(): + """Auto Augment + 
Randomly choose a tuple of augment ops from a list of policy + Then apply the tuple of augment ops to input image + + Examples: + policy = auto_augment_policy_original() + augment = AutoAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy): + self.policy = policy + + def __call__(self, image, policy_idx=None): + if policy_idx is None: + policy_idx = random.randint(0, len(self.policy) - 1) + + sub_policy = self.policy[policy_idx] + for op in sub_policy: + image = op(image) + return image + + +class RandAugment(): + """Rand Augment + Randomly choose N augment ops from a list of K policies + Then apply the N ops to input image + + Examples: + policy = rand_augment_policy_original(magnitude_idx) + augment = RandAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy, num_layers=2): + """ + Args: + policy: list of SubPolicy + num_layers: int + """ + self.policy = policy + self.num_layers = num_layers + + def __call__(self, image): + selected_idx = np.random.choice(len(self.policy), self.num_layers) + + for policy_idx in selected_idx: + sub_policy = self.policy[policy_idx] + image = sub_policy(image) + return image + + +class SubPolicy: + """Subpolicy + Read augment name and magnitude, apply augment with probability + Args: + op_name: str, augment operation name + prob: float, if prob > random prob, apply augment + magnitude_idx: int, index of magnitude in preset magnitude ranges + """ + + def __init__(self, op_name, prob, magnitude_idx): + # ranges of operations' magnitude + ranges = { + 'ShearX': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'ShearY': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'TranslateX': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'TranslateY': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'Rotate': np.linspace(0, 30, 10), # [-30, 30] (by random negative) + 'Color': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Posterize': np.round(np.linspace(8, 4, 10), 0).astype(np.int), # [0, 4] + 'Solarize': np.linspace(256, 0, 10), # [0, 256] + 'Contrast': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Sharpness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Brightness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'AutoContrast': [0] * 10, # no range + 'Equalize': [0] * 10, # no range + 'Invert': [0] * 10, # no range + } + + # augmentation operations + # Lambda is not pickleable for DDP + # image_ops = { + # 'ShearX': lambda image, magnitude: shear_x(image, magnitude), + # 'ShearY': lambda image, magnitude: shear_y(image, magnitude), + # 'TranslateX': lambda image, magnitude: translate_x(image, magnitude), + # 'TranslateY': lambda image, magnitude: translate_y(image, magnitude), + # 'Rotate': lambda image, magnitude: rotate(image, magnitude), + # 'AutoContrast': lambda image, magnitude: auto_contrast(image, magnitude), + # 'Invert': lambda image, magnitude: invert(image, magnitude), + # 'Equalize': lambda image, magnitude: equalize(image, magnitude), + # 'Solarize': lambda image, magnitude: solarize(image, magnitude), + # 'Posterize': lambda image, magnitude: posterize(image, magnitude), + # 'Contrast': lambda image, magnitude: contrast(image, magnitude), + # 'Color': lambda image, magnitude: color(image, magnitude), + # 'Brightness': lambda image, magnitude: brightness(image, magnitude), + # 'Sharpness': lambda image, magnitude: sharpness(image, magnitude), + # } + 
image_ops = { + 'ShearX': shear_x, + 'ShearY': shear_y, + 'TranslateX': translate_x_relative, + 'TranslateY': translate_y_relative, + 'Rotate': rotate, + 'AutoContrast': auto_contrast, + 'Invert': invert, + 'Equalize': equalize, + 'Solarize': solarize, + 'Posterize': posterize, + 'Contrast': contrast, + 'Color': color, + 'Brightness': brightness, + 'Sharpness': sharpness, + } + + self.prob = prob + self.magnitude = ranges[op_name][magnitude_idx] + self.op = image_ops[op_name] + + def __call__(self, image): + if self.prob > random.random(): + image = self.op(image, self.magnitude) + return image + + +# PIL Image transforms +# https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.transform +def shear_x(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, factor, 0, 0, 1, 0), fillcolor=fillcolor) + + +def shear_y(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, factor, 1, 0), fillcolor=fillcolor) + + +def translate_x_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, pixels, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, pixels), fillcolor=fillcolor) + + +def translate_x_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, magnitude, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, magnitude), fillcolor=fillcolor) + + +def rotate(image, magnitude): + rot = image.convert("RGBA").rotate(magnitude) + return Image.composite(rot, + Image.new('RGBA', rot.size, (128,) * 4), + rot).convert(image.mode) + + +def auto_contrast(image, magnitude=None): + return ImageOps.autocontrast(image) + + +def invert(image, magnitude=None): + return ImageOps.invert(image) + + +def equalize(image, magnitude=None): + return ImageOps.equalize(image) + + +def solarize(image, magnitude): + return ImageOps.solarize(image, magnitude) + + +def posterize(image, magnitude): + return ImageOps.posterize(image, magnitude) + + +def contrast(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Contrast(image).enhance(1 + magnitude) + + +def color(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Color(image).enhance(1 + magnitude) + + +def brightness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Brightness(image).enhance(1 + magnitude) + + +def sharpness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Sharpness(image).enhance(1 + magnitude) + diff --git a/image_classification/CSwin/config.py b/image_classification/CSwin/config.py index 2aeb03c8..a11bd6e2 
100644 --- a/image_classification/CSwin/config.py +++ b/image_classification/CSwin/config.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -34,7 +34,9 @@ _C.DATA.DATASET = 'imagenet2012' # dataset name _C.DATA.IMAGE_SIZE = 224 # input image size _C.DATA.CROP_PCT = 0.9 # input image scale ratio, scale is applied before centercrop in eval mode -_C.DATA.NUM_WORKERS = 2 # number of data loading threads +_C.DATA.NUM_WORKERS = 4 # number of data loading threads +_C.DATA.IMAGENET_MEAN = [0.485, 0.456, 0.406] # [0.5, 0.5, 0.5] +_C.DATA.IMAGENET_STD = [0.229, 0.224, 0.225] # [0.5, 0.5, 0.5] # model settings _C.MODEL = CN() @@ -43,8 +45,8 @@ _C.MODEL.RESUME = None _C.MODEL.PRETRAINED = None _C.MODEL.NUM_CLASSES = 1000 -_C.MODEL.DROPOUT = 0.1 -_C.MODEL.DROPPATH = 0.0 +_C.MODEL.DROPOUT = 0.0 +_C.MODEL.DROPPATH = 0.2 _C.MODEL.ATTENTION_DROPOUT = 0.0 # transformer settings @@ -63,13 +65,16 @@ _C.TRAIN = CN() _C.TRAIN.LAST_EPOCH = 0 _C.TRAIN.NUM_EPOCHS = 300 -_C.TRAIN.WARMUP_EPOCHS = 3 #34 # ~ 10k steps for 4096 batch size -_C.TRAIN.WEIGHT_DECAY = 0.05 #0.3 # 0.0 for finetune -_C.TRAIN.BASE_LR = 0.001 #0.003 for pretrain # 0.03 for finetune -_C.TRAIN.WARMUP_START_LR = 1e-6 #0.0 -_C.TRAIN.END_LR = 5e-4 -_C.TRAIN.GRAD_CLIP = 1.0 -_C.TRAIN.ACCUM_ITER = 2 #1 +_C.TRAIN.WARMUP_EPOCHS = 20 +_C.TRAIN.WEIGHT_DECAY = 0.05 +_C.TRAIN.BASE_LR = 0.001 +_C.TRAIN.WARMUP_START_LR = 1e-6 +_C.TRAIN.END_LR = 1e-5 +_C.TRAIN.GRAD_CLIP = None +_C.TRAIN.ACCUM_ITER = 1 +_C.TRAIN.MODEL_EMA = True +_C.TRAIN.MODEL_EMA_DECAY = 0.99992 +_C.TRAIN.LINEAR_SCALED_LR = None _C.TRAIN.LR_SCHEDULER = CN() _C.TRAIN.LR_SCHEDULER.NAME = 'warmupcosine' @@ -83,28 +88,34 @@ _C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999) # for adamW _C.TRAIN.OPTIMIZER.MOMENTUM = 0.9 -# augmentation -_C.AUG = CN() -_C.AUG.COLOR_JITTER = 0.4 # color jitter factor -_C.AUG.AUTO_AUGMENT = 'rand-m9-mstd0.5-inc1' -_C.AUG.RE_PROB = 0.25 # random earse prob -_C.AUG.RE_MODE = 'pixel' # random earse mode -_C.AUG.RE_COUNT = 1 # random earse count -_C.AUG.MIXUP = 0.8 # mixup alpha, enabled if >0 -_C.AUG.CUTMIX = 1.0 # cutmix alpha, enabled if >0 -_C.AUG.CUTMIX_MINMAX = None # cutmix min/max ratio, overrides alpha -_C.AUG.MIXUP_PROB = 1.0 # prob of mixup or cutmix when either/both is enabled -_C.AUG.MIXUP_SWITCH_PROB = 0.5 # prob of switching cutmix when both mixup and cutmix enabled -_C.AUG.MIXUP_MODE = 'batch' #how to apply mixup/curmix params, per 'batch', 'pair', or 'elem' +# train augmentation +_C.TRAIN.MIXUP_ALPHA = 0.8 +_C.TRAIN.CUTMIX_ALPHA = 1.0 +_C.TRAIN.CUTMIX_MINMAX = None +_C.TRAIN.MIXUP_PROB = 1.0 +_C.TRAIN.MIXUP_SWITCH_PROB = 0.5 +_C.TRAIN.MIXUP_MODE = 'batch' + +_C.TRAIN.SMOOTHING = 0.1 +_C.TRAIN.COLOR_JITTER = 0.4 +_C.TRAIN.AUTO_AUGMENT = False #'rand-m9-mstd0.5-inc1' +_C.TRAIN.RAND_AUGMENT = True + +_C.TRAIN.RANDOM_ERASE_PROB = 0.25 +_C.TRAIN.RANDOM_ERASE_MODE = 'pixel' +_C.TRAIN.RANDOM_ERASE_COUNT = 1 +_C.TRAIN.RANDOM_ERASE_SPLIT = False + # misc _C.SAVE = "./output" _C.TAG = "default" -_C.SAVE_FREQ = 20 # freq to save chpt +_C.SAVE_FREQ = 1 # freq to save chpt _C.REPORT_FREQ = 50 # freq to logging info -_C.VALIDATE_FREQ = 20 # freq to do validation -_C.SEED = 0 +_C.VALIDATE_FREQ = 10 # freq to do validation +_C.SEED = 42 _C.EVAL = False # run evaluation only +_C.AMP = False # mix precision training _C.LOCAL_RANK = 0 _C.NGPUS = -1 @@ -122,6 +133,7 
@@ def _update_config_from_file(config, cfg_file): config.merge_from_file(cfg_file) config.freeze() + def update_config(config, args): """Update config by ArgumentParser Args: @@ -138,8 +150,12 @@ def update_config(config, args): config.DATA.BATCH_SIZE = args.batch_size if args.image_size: config.DATA.IMAGE_SIZE = args.image_size + if args.num_classes: + config.MODEL.NUM_CLASSES = args.num_classes if args.data_path: config.DATA.DATA_PATH = args.data_path + if args.output is not None: + config.SAVE = args.output if args.ngpus: config.NGPUS = args.ngpus if args.eval: @@ -151,6 +167,11 @@ def update_config(config, args): config.MODEL.RESUME = args.resume if args.last_epoch: config.TRAIN.LAST_EPOCH = args.last_epoch + if args.amp: # only during training + if config.EVAL is True: + config.AMP = False + else: + config.AMP = True #config.freeze() return config diff --git a/image_classification/CSwin/configs/cswin_base_224.yaml b/image_classification/CSwin/configs/cswin_base_224.yaml index 0a3d906a..3c572d58 100644 --- a/image_classification/CSwin/configs/cswin_base_224.yaml +++ b/image_classification/CSwin/configs/cswin_base_224.yaml @@ -4,9 +4,17 @@ DATA: MODEL: TYPE: cswin NAME: cswin_base_224 + DROPPATH: 0.5 TRANS: PATCH_SIZE: 4 EMBED_DIM: 96 DEPTHS: [2, 4, 32, 2] SPLIT_SIZES: [1, 2, 7, 7] NUM_HEADS: [4, 8, 16, 32] +TRAIN: + NUM_EPOCHS: 300 + WARMUP_EPOCHS: 20 + BASE_LR: 1e-3 + WEIGHT_DECAY: 0.1 + MODEL_EMA: True + MODEL_EMA_DECAY: 0.99992 diff --git a/image_classification/CSwin/configs/cswin_small_224.yaml b/image_classification/CSwin/configs/cswin_small_224.yaml index f5cf5ab1..5048294a 100644 --- a/image_classification/CSwin/configs/cswin_small_224.yaml +++ b/image_classification/CSwin/configs/cswin_small_224.yaml @@ -4,9 +4,17 @@ DATA: MODEL: TYPE: cswin NAME: cswin_small_224 + DROPPATH: 0.4 TRANS: PATCH_SIZE: 4 EMBED_DIM: 64 DEPTHS: [2, 4, 32, 2] SPLIT_SIZES: [1, 2, 7, 7] NUM_HEADS: [2, 4, 8, 16] +TRAIN: + NUM_EPOCHS: 300 + WARMUP_EPOCHS: 20 + BASE_LR: 2e-3 + WEIGHT_DECAY: 0.05 + MODEL_EMA: True + MODEL_EMA_DECAY: 0.99984 diff --git a/image_classification/CSwin/configs/cswin_tiny_224.yaml b/image_classification/CSwin/configs/cswin_tiny_224.yaml index 77f643b9..a308647c 100644 --- a/image_classification/CSwin/configs/cswin_tiny_224.yaml +++ b/image_classification/CSwin/configs/cswin_tiny_224.yaml @@ -4,9 +4,18 @@ DATA: MODEL: TYPE: cswin NAME: cswin_tiny_224 + DROPPATH: 0.2 TRANS: PATCH_SIZE: 4 EMBED_DIM: 64 DEPTHS: [1, 2, 21, 1] SPLIT_SIZES: [1, 2, 7, 7] NUM_HEADS: [2, 4, 8, 16] +TRAIN: + NUM_EPOCHS: 300 + WARMUP_EPOCHS: 20 + BASE_LR: 2e-3 + WEIGHT_DECAY: 0.05 + MODEL_EMA: True + MODEL_EMA_DECAY: 0.99984 + diff --git a/image_classification/CSwin/configs/cswin_tiny_224_finetune.yaml b/image_classification/CSwin/configs/cswin_tiny_224_finetune.yaml new file mode 100644 index 00000000..a09b2620 --- /dev/null +++ b/image_classification/CSwin/configs/cswin_tiny_224_finetune.yaml @@ -0,0 +1,19 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.9 +MODEL: + TYPE: cswin + NAME: cswin_tiny_224 + DROPPATH: 0.1 + TRANS: + PATCH_SIZE: 4 + EMBED_DIM: 64 + DEPTHS: [1, 2, 21, 1] + SPLIT_SIZES: [1, 2, 7, 7] + NUM_HEADS: [2, 4, 8, 16] +TRAIN: + NUM_EPOCHS: 30 + WARMUP_EPOCHS: 0 + BASE_LR: 1e-5 + WEIGHT_DECAY: 1e-8 + diff --git a/image_classification/CSwin/cswin.py b/image_classification/CSwin/cswin.py index 86c40d30..3ac93c1a 100644 --- a/image_classification/CSwin/cswin.py +++ b/image_classification/CSwin/cswin.py @@ -55,7 +55,21 @@ def __init__(self, patch_stride=4, in_channels=3, embed_dim=96): kernel_size=7, 
stride=patch_stride, padding=2) - self.norm = nn.LayerNorm(embed_dim) + + w_attr, b_attr = self._init_weights_layernorm() + self.norm = nn.LayerNorm(embed_dim, + weight_attr=w_attr, + bias_attr=b_attr) + + def _init_weights_layernorm(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(1.)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.)) + return weight_attr, bias_attr + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.)) + return weight_attr, bias_attr def forward(self, x): x = self.patch_embed(x) # [batch, embed_dim, h, w], h = w = image_size / 4 @@ -95,8 +109,8 @@ def __init__(self, in_features, hidden_features, dropout): self.dropout = nn.Dropout(dropout) def _init_weights(self): - weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.XavierUniform()) - bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Normal(std=1e-6)) + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.)) return weight_attr, bias_attr def forward(self, x): @@ -276,8 +290,15 @@ def __init__(self, self.dim_head = dim // num_heads self.mlp_ratio = mlp_ratio self.split_size = split_size - self.norm1 = nn.LayerNorm(dim) - self.qkv = nn.Linear(dim, dim * 3, bias_attr=qkv_bias) + w_attr_1, b_attr_1 = self._init_weights_layernorm() + self.norm1 = nn.LayerNorm(dim, + weight_attr=w_attr_1, + bias_attr=b_attr_1) + w_attr_2, b_attr_2 = self._init_weights() + self.qkv = nn.Linear(dim, + dim * 3, + weight_attr=w_attr_2, + bias_attr=b_attr_2 if qkv_bias else False) self.attns = nn.LayerList() self.split_heads = split_heads @@ -300,13 +321,31 @@ def __init__(self, # NOTE: may need to change for different H and W splits[0], splits[1] = splits[1], splits[0] - self.proj = nn.Linear(dim, dim) + w_attr_3, b_attr_3 = self._init_weights() + self.proj = nn.Linear(dim, + dim, + weight_attr=w_attr_3, + bias_attr=b_attr_3) self.drop_path = DropPath(droppath) if droppath > 0. 
else Identity() - self.norm2 = nn.LayerNorm(dim) + + w_attr_4, b_attr_4 = self._init_weights_layernorm() + self.norm2 = nn.LayerNorm(dim, + weight_attr=w_attr_4, + bias_attr=b_attr_4) self.mlp = Mlp(in_features=dim, hidden_features=int(dim * mlp_ratio), dropout=dropout) + def _init_weights_layernorm(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(1.)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.)) + return weight_attr, bias_attr + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.)) + return weight_attr, bias_attr + def chunk_qkv(self, x, chunks=1, axis=-1): x = x.chunk(chunks, axis=axis) return x @@ -347,7 +386,16 @@ def __init__(self, dim_in, dim_out): kernel_size=3, stride=2, padding=1) - self.norm = nn.LayerNorm(dim_out) + + w_attr_1, b_attr_1 = self._init_weights_layernorm() + self.norm = nn.LayerNorm(dim_out, + weight_attr=w_attr_1, + bias_attr=b_attr_1) + + def _init_weights_layernorm(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(1.)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.)) + return weight_attr, bias_attr def forward(self, x): B, HW, C = x.shape @@ -484,8 +532,25 @@ def __init__(self, dim = dim * 2 resolution = resolution // 2 # last norm and classification head layers - self.norm = nn.LayerNorm(dim) - self.head = nn.Linear(dim, num_classes) + w_attr_1, b_attr_1 = self._init_weights_layernorm() + self.norm = nn.LayerNorm(dim, + weight_attr=w_attr_1, + bias_attr=b_attr_1) + w_attr_2, b_attr_2 = self._init_weights() + self.head = nn.Linear(dim, + num_classes, + weight_attr=w_attr_2, + bias_attr=b_attr_2) + + def _init_weights_layernorm(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(1.)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.)) + return weight_attr, bias_attr + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.)) + return weight_attr, bias_attr def forward_features(self, x): x = self.patch_embedding(x) @@ -515,5 +580,5 @@ def build_cswin(config): qk_scale=config.MODEL.TRANS.QK_SCALE, dropout=config.MODEL.DROPOUT, attention_dropout=config.MODEL.ATTENTION_DROPOUT, - droppath=config.MODEL.DROP_PATH) + droppath=config.MODEL.DROPPATH) return model diff --git a/image_classification/CSwin/datasets.py b/image_classification/CSwin/datasets.py index eeb16f89..7e178b57 100644 --- a/image_classification/CSwin/datasets.py +++ b/image_classification/CSwin/datasets.py @@ -20,8 +20,19 @@ import os import math from PIL import Image -from paddle.io import Dataset, DataLoader, DistributedBatchSampler -from paddle.vision import transforms, datasets, image_load +from paddle.io import Dataset +from paddle.io import DataLoader +from paddle.io import DistributedBatchSampler +from paddle.vision import transforms +from paddle.vision import datasets +from paddle.vision import image_load +from augment import auto_augment_policy_original +from augment import AutoAugment +from augment import rand_augment_policy_original +from augment import RandAugment +from transforms import RandomHorizontalFlip +from random_erasing import RandomErasing + class ImageNet2012Dataset(Dataset): """Build ImageNet2012 dataset @@ -81,13 
+92,36 @@ def get_train_transforms(config): transforms_train: training transforms """ - transforms_train = transforms.Compose([ + aug_op_list = [] + # STEP1: random crop and resize + aug_op_list.append( transforms.RandomResizedCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE), - scale=(0.05, 1.0)), - transforms.ToTensor(), - #transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]), - transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]), - ]) + scale=(0.05, 1.0), interpolation='bicubic')) + # STEP2: auto_augment or color jitter + if config.TRAIN.AUTO_AUGMENT: + policy = auto_augment_policy_original() + auto_augment = AutoAugment(policy) + aug_op_list.append(auto_augment) + elif config.TRAIN.RAND_AUGMENT: + policy = rand_augment_policy_original() + rand_augment = RandAugment(policy) + aug_op_list.append(rand_augment) + else: + jitter = (float(config.TRAIN.COLOR_JITTER), ) * 3 + aug_op_list.append(transforms.ColorJitter(*jitter)) + # STEP3: other ops + aug_op_list.append(transforms.ToTensor()) + aug_op_list.append(transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, + std=config.DATA.IMAGENET_STD)) + # STEP4: random erasing + if config.TRAIN.RANDOM_ERASE_PROB > 0.: + random_erasing = RandomErasing(prob=config.TRAIN.RANDOM_ERASE_PROB, + mode=config.TRAIN.RANDOM_ERASE_MODE, + max_count=config.TRAIN.RANDOM_ERASE_COUNT, + num_splits=config.TRAIN.RANDOM_ERASE_SPLIT) + aug_op_list.append(random_erasing) + # Final: compose transforms and return + transforms_train = transforms.Compose(aug_op_list) return transforms_train @@ -107,12 +141,10 @@ def get_val_transforms(config): scale_size = int(math.floor(config.DATA.IMAGE_SIZE / config.DATA.CROP_PCT)) transforms_val = transforms.Compose([ - # scale_size must be single int, which will resize the shorter side of image - transforms.Resize(scale_size, 'bicubic'), + transforms.Resize(scale_size, interpolation='bicubic'), transforms.CenterCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE)), transforms.ToTensor(), - #transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]), - transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]), + transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD), ]) return transforms_val diff --git a/image_classification/CSwin/droppath.py b/image_classification/CSwin/droppath.py index 72b012d0..08472aea 100644 --- a/image_classification/CSwin/droppath.py +++ b/image_classification/CSwin/droppath.py @@ -32,7 +32,7 @@ def drop_path(self, inputs): Args: input: tensor with arbitrary shape drop_prob: float number of drop path probability, default: 0.0 - training: bool, set if current mode is training, default: False + training: bool, if current mode is training, default: False Returns: output: output tensor after drop path """ diff --git a/image_classification/CSwin/losses.py b/image_classification/CSwin/losses.py new file mode 100644 index 00000000..082467a3 --- /dev/null +++ b/image_classification/CSwin/losses.py @@ -0,0 +1,123 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +""" Implement Loss functions """ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class LabelSmoothingCrossEntropyLoss(nn.Layer): + """ cross entropy loss for label smoothing + Args: + smoothing: float, smoothing rate + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, target label with shape [N] + Return: + loss: float, cross entropy loss value + """ + def __init__(self, smoothing=0.1): + super().__init__() + assert 0 <= smoothing < 1.0 + self.smoothing = smoothing + self.confidence = 1 - smoothing + + def forward(self, x, target): + log_probs = F.log_softmax(x) # [N, num_classes] + # target_index is used to get prob for each of the N samples + target_index = paddle.zeros([x.shape[0], 2], dtype='int64') # [N, 2] + target_index[:, 0] = paddle.arange(x.shape[0]) + target_index[:, 1] = target + + nll_loss = -log_probs.gather_nd(index=target_index) # index: [N] + smooth_loss = -log_probs.mean(axis=-1) + loss = self.confidence * nll_loss + self.smoothing * smooth_loss + return loss.mean() + + +class SoftTargetCrossEntropyLoss(nn.Layer): + """ cross entropy loss for soft target + Args: + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, soft target with shape [N, num_classes] + Returns: + loss: float, the mean loss value + """ + def __init__(self): + super().__init__() + + def forward(self, x, target): + loss = paddle.sum(-target * F.log_softmax(x, axis=-1), axis=-1) + return loss.mean() + + +class DistillationLoss(nn.Layer): + """Distillation loss function + This layer includes the orginal loss (criterion) and a extra + distillation loss (criterion), which computes the loss with + different type options, between current model and + a teacher model as its supervision. 
+ + Args: + base_criterion: nn.Layer, the original criterion + teacher_model: nn.Layer, the teacher model as supervision + distillation_type: str, one of ['none', 'soft', 'hard'] + alpha: float, ratio of base loss (* (1-alpha)) + and distillation loss (* alpha) + tau: float, temperature in distillation + """ + def __init__(self, + base_criterion, + teacher_model, + distillation_type, + alpha, + tau): + super().__init__() + assert distillation_type in ['none', 'soft', 'hard'] + self.base_criterion = base_criterion + self.teacher_model = teacher_model + self.type = distillation_type + self.alpha = alpha + self.tau = tau + + def forward(self, inputs, outputs, targets): + """ + Args: + inputs: tensor, the original model inputs + outputs: tensor, the outputs of the model + outputs_kd: tensor, the distillation outputs of the model, + this is usually obtained by a separate branch + in the last layer of the model + targets: tensor, the labels for the base criterion + """ + outputs, outputs_kd = outputs[0], outputs[1] + base_loss = self.base_criterion(outputs, targets) + if self.type == 'none': + return base_loss + + with paddle.no_grad(): + teacher_outputs = self.teacher_model(inputs) + + if self.type == 'soft': + distillation_loss = F.kl_div( + F.log_softmax(outputs_kd / self.tau, axis=1), + F.log_softmax(teacher_outputs / self.tau, axis=1), + reduction='sum') * (self.tau * self.tau) / outputs_kd.numel() + elif self.type == 'hard': + distillation_loss = F.cross_entropy(outputs_kd, teacher_outputs.argmax(axis=1)) + + loss = base_loss * (1 - self.alpha) + distillation_loss * self.alpha + return loss + + diff --git a/image_classification/CSwin/main_multi_gpu.py b/image_classification/CSwin/main_multi_gpu.py index 77c94fd0..5c6bbec5 100644 --- a/image_classification/CSwin/main_multi_gpu.py +++ b/image_classification/CSwin/main_multi_gpu.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -12,7 +12,7 @@ # See the License for the specific language governing permissions and # limitations under the License.
-"""CSwin Transformer training/validation using multiple GPU """ +"""CSwin training/validation using multiple GPU """ import sys import os @@ -25,52 +25,57 @@ import paddle.nn as nn import paddle.nn.functional as F import paddle.distributed as dist -from datasets import get_dataloader, get_dataset -from cswin import build_cswin as build_model +from datasets import get_dataloader +from datasets import get_dataset from utils import AverageMeter from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn from config import get_config from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from model_ema import ModelEma +from cswin import build_cswin as build_model -parser = argparse.ArgumentParser('CSwin Transformer') -parser.add_argument('-cfg', type=str, default=None) -parser.add_argument('-dataset', type=str, default=None) -parser.add_argument('-batch_size', type=int, default=None) -parser.add_argument('-image_size', type=int, default=None) -parser.add_argument('-data_path', type=str, default=None) -parser.add_argument('-ngpus', type=int, default=None) -parser.add_argument('-pretrained', type=str, default=None) -parser.add_argument('-resume', type=str, default=None) -parser.add_argument('-last_epoch', type=int, default=None) -parser.add_argument('-eval', action='store_true') -arguments = parser.parse_args() - - -log_format = "%(asctime)s %(message)s" -logging.basicConfig(stream=sys.stdout, level=logging.INFO, - format=log_format, datefmt="%m%d %I:%M:%S %p") - -# get default config -config = get_config() -# update config by arguments -config = update_config(config, arguments) - -# set output folder -if not config.EVAL: - config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) -else: - config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) - -if not os.path.exists(config.SAVE): - os.makedirs(config.SAVE, exist_ok=True) - -# set logging format -logger = logging.getLogger() -fh = logging.FileHandler(os.path.join(config.SAVE, 'log.txt')) -fh.setFormatter(logging.Formatter(log_format)) -logger.addHandler(fh) -logger.info(f'config= {config}') +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('CSwin') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, 
level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger def train(dataloader, @@ -78,83 +83,157 @@ def train(dataloader, criterion, optimizer, epoch, + total_epochs, total_batch, debug_steps=100, - accum_iter=1): + accum_iter=1, + model_ema=None, + mixup_fn=None, + amp=False, + local_logger=None, + master_logger=None): """Training for one epoch Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion epoch: int, current epoch - total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info - accum_iter: int, num of iters for accumulating gradients + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + model_ema: ModelEma, model moving average instance + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None Returns: - train_loss_meter.avg - train_acc_meter.avg - train_time + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + master_train_loss_meter.avg: float, average loss on all processes/gpus + master_train_acc_meter.avg: float, average top1 accuracy on all processes/gpus + train_time: float, training time """ model.train() train_loss_meter = AverageMeter() train_acc_meter = AverageMeter() + master_train_loss_meter = AverageMeter() + master_train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) time_st = time.time() for batch_id, data in enumerate(dataloader): image = data[0] label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() - output = model(image) - loss = criterion(output, label) + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() - #NOTE: division may be needed depending on the loss function - # Here no division is needed: - # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' - # - #loss = loss / accum_iter + if model_ema is not None and dist.get_rank() == 0: + model_ema.update(model) - loss.backward() + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) - if ((batch_id +1) % 
accum_iter == 0) or (batch_id + 1 == len(dataloader)): - optimizer.step() - optimizer.clear_grad() + batch_size = paddle.to_tensor(image.shape[0]) - pred = F.softmax(output) - acc = paddle.metric.accuracy(pred, label.unsqueeze(1)) + # sync from other gpus for overall loss and acc + master_loss = loss.clone() + master_acc = acc.clone() + master_batch_size = batch_size.clone() + dist.all_reduce(master_loss) + dist.all_reduce(master_acc) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc = master_acc / dist.get_world_size() + master_train_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_train_acc_meter.update(master_acc.numpy()[0], master_batch_size.numpy()[0]) - batch_size = image.shape[0] - train_loss_meter.update(loss.numpy()[0], batch_size) - train_acc_meter.update(acc.numpy()[0], batch_size) + train_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + train_acc_meter.update(acc.numpy()[0], batch_size.numpy()[0]) if batch_id % debug_steps == 0: - logger.info( - f"Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + - f"Step[{batch_id:04d}/{total_batch:04d}], " + - f"Avg Loss: {train_loss_meter.avg:.4f}, " + - f"Avg Acc: {train_acc_meter.avg:.4f}") + if local_logger: + local_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_train_loss_meter.avg:.4f}, " + + f"Avg Acc: {master_train_acc_meter.avg:.4f}") train_time = time.time() - time_st - return train_loss_meter.avg, train_acc_meter.avg, train_time - - -def validate(dataloader, model, criterion, total_batch, debug_steps=100): + return (train_loss_meter.avg, + train_acc_meter.avg, + master_train_loss_meter.avg, + master_train_acc_meter.avg, + train_time) + + +def validate(dataloader, + model, + criterion, + total_batch, + debug_steps=100, + local_logger=None, + master_logger=None): """Validation for whole dataset Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info + debug_steps: int, num of iters to log info, default: 100 + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None Returns: - val_loss_meter.avg - val_acc1_meter.avg - val_acc5_meter.avg - val_time + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + master_val_loss_meter.avg: float, average loss on all processes/gpus + master_val_acc1_meter.avg: float, average top1 accuracy on all processes/gpus + master_val_acc5_meter.avg: float, average top5 accuracy on all processes/gpus + val_time: float, validation time """ model.eval() val_loss_meter = AverageMeter() val_acc1_meter = AverageMeter() val_acc5_meter = AverageMeter() + master_val_loss_meter = AverageMeter() + master_val_acc1_meter = AverageMeter() + master_val_acc5_meter = AverageMeter() time_st = time.time() with paddle.no_grad(): @@ -169,56 +248,144 @@ def validate(dataloader, model, criterion, total_batch, debug_steps=100): acc1 = 
paddle.metric.accuracy(pred, label.unsqueeze(1)) acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) - dist.all_reduce(loss) - dist.all_reduce(acc1) - dist.all_reduce(acc5) - loss = loss / dist.get_world_size() - acc1 = acc1 / dist.get_world_size() - acc5 = acc5 / dist.get_world_size() - batch_size = paddle.to_tensor(image.shape[0]) - dist.all_reduce(batch_size) + + master_loss = loss.clone() + master_acc1 = acc1.clone() + master_acc5 = acc5.clone() + master_batch_size = batch_size.clone() + + dist.all_reduce(master_loss) + dist.all_reduce(master_acc1) + dist.all_reduce(master_acc5) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc1 = master_acc1 / dist.get_world_size() + master_acc5 = master_acc5 / dist.get_world_size() + + master_val_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc1_meter.update(master_acc1.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc5_meter.update(master_acc5.numpy()[0], master_batch_size.numpy()[0]) val_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) val_acc1_meter.update(acc1.numpy()[0], batch_size.numpy()[0]) val_acc5_meter.update(acc5.numpy()[0], batch_size.numpy()[0]) if batch_id % debug_steps == 0: - logger.info( - f"Val Step[{batch_id:04d}/{total_batch:04d}], " + - f"Avg Loss: {val_loss_meter.avg:.4f}, " + - f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + - f"Avg Acc@5: {val_acc5_meter.avg:.4f}") - + if local_logger: + local_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {master_val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {master_val_acc5_meter.avg:.4f}") val_time = time.time() - time_st - return val_loss_meter.avg, val_acc1_meter.avg, val_acc5_meter.avg, val_time + return (val_loss_meter.avg, + val_acc1_meter.avg, + val_acc5_meter.avg, + master_val_loss_meter.avg, + master_val_acc1_meter.avg, + master_val_acc5_meter.avg, + val_time) def main_worker(*args): - # 0. Preparation + # STEP 0: Preparation + config = args[0] dist.init_parallel_env() last_epoch = config.TRAIN.LAST_EPOCH - world_size = paddle.distributed.get_world_size() - local_rank = paddle.distributed.get_rank() - logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + world_size = dist.get_world_size() + local_rank = dist.get_rank() seed = config.SEED + local_rank paddle.seed(seed) np.random.seed(seed) random.seed(seed) - # 1. 
Create model + # logger for each process/gpu + local_logger = get_logger( + filename=os.path.join(config.SAVE, 'log_{}.txt'.format(local_rank)), + logger_name='local_logger') + # overall logger + if local_rank == 0: + master_logger = get_logger( + filename=os.path.join(config.SAVE, 'log.txt'), + logger_name='master_logger') + master_logger.info(f'\n{config}') + else: + master_logger = None + local_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + if local_rank == 0: + master_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + + # STEP 1: Create model model = build_model(config) + # define model ema + model_ema = None + if not config.EVAL and config.TRAIN.MODEL_EMA and local_rank == 0: + model_ema = ModelEma(model, decay=config.TRAIN.MODEL_EMA_DECAY) model = paddle.DataParallel(model) - # 2. Create train and val dataloader - dataset_train, dataset_val = args[0], args[1] - dataloader_train = get_dataloader(config, dataset_train, 'train', True) + + # STEP 2: Create train and val dataloader + dataset_train, dataset_val = args[1], args[2] + # Create training dataloader + if not config.EVAL: + dataloader_train = get_dataloader(config, dataset_train, 'train', True) + total_batch_train = len(dataloader_train) + local_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + if local_rank == 0: + master_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + # Create validation dataloader dataloader_val = get_dataloader(config, dataset_val, 'test', True) - total_batch_train = len(dataloader_train) total_batch_val = len(dataloader_val) - logging.info(f'----- Total # of train batch (single gpu): {total_batch_train}') - logging.info(f'----- Total # of val batch (single gpu): {total_batch_val}') - # 3. Define criterion - criterion = nn.CrossEntropyLoss() - # 4. 
Define optimizer and lr_scheduler + local_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + if local_rank == 0: + master_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + scheduler = None if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, @@ -240,7 +407,9 @@ def main_worker(*args): gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, last_epoch=last_epoch) else: - logging.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + local_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") if config.TRAIN.OPTIMIZER.NAME == "SGD": @@ -267,79 +436,132 @@ def main_worker(*args): weight_decay=config.TRAIN.WEIGHT_DECAY, epsilon=config.TRAIN.OPTIMIZER.EPS, grad_clip=clip, - #apply_decay_param_fun=get_exclude_from_weight_decay_fn(['pos_embed', 'cls_token']), + #apply_decay_param_fun=get_exclude_from_weight_decay_fn([ + # 'absolute_pos_embed', 'relative_position_bias_table']), ) else: - logging.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + local_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") - # 5. 
Load pretrained model / load resumt model and optimizer states + # STEP 6: Load pretrained model / load resumt model and optimizer states if config.MODEL.PRETRAINED: if (config.MODEL.PRETRAINED).endswith('.pdparams'): raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') model.set_dict(model_state) - logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + local_logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + if local_rank == 0: + master_logger.info( + f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") if config.MODEL.RESUME: - assert os.path.isfile(config.MODEL.RESUME+'.pdparams') is True - assert os.path.isfile(config.MODEL.RESUME+'.pdopt') is True - model_state = paddle.load(config.MODEL.RESUME+'.pdparams') + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') model.set_dict(model_state) opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') optimizer.set_state_dict(opt_state) - logger.info( - f"----- Resume Training: Load model and optmizer states from {config.MODEL.RESUME}") + local_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + if local_rank == 0: + master_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + # load ema model + if model_ema is not None and os.path.isfile(config.MODEL.RESUME + '-EMA.pdparams'): + model_ema_state = paddle.load(config.MODEL.RESUME + '-EMA.pdparams') + model_ema.module.set_state_dict(model_ema_state) + local_logger.info(f'----- Load model ema from {config.MODEL.RESUME}-EMA.pdparams') + if local_rank == 0: + master_logger.info(f'----- Load model ema from {config.MODEL.RESUME}-EMA.pdparams') - # 6. Validation + # STEP 7: Validation (eval mode) if config.EVAL: - logger.info('----- Start Validating') - val_loss, val_acc1, val_acc5, val_time = validate( + local_logger.info('----- Start Validating') + if local_rank == 0: + master_logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=criterion, + criterion=criterion_val, total_batch=total_batch_val, - debug_steps=config.REPORT_FREQ) - logger.info(f"Validation Loss: {val_loss:.4f}, " + - f"Validation Acc@1: {val_acc1:.4f}, " + - f"Validation Acc@5: {val_acc5:.4f}, " + - f"time: {val_time:.2f}") + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") return - # 6. 
Start training and validation - logging.info(f"Start training from epoch {last_epoch+1}.") + # STEP 8: Start training and validation (train mode) + local_logger.info(f"Start training from epoch {last_epoch+1}.") + if local_rank == 0: + master_logger.info(f"Start training from epoch {last_epoch+1}.") for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): # train - logging.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") - train_loss, train_acc, train_time = train(dataloader=dataloader_train, - model=model, - criterion=criterion, - optimizer=optimizer, - epoch=epoch, - total_batch=total_batch_train, - debug_steps=config.REPORT_FREQ, - accum_iter=config.TRAIN.ACCUM_ITER) + local_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + if local_rank == 0: + master_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + train_loss, train_acc, avg_loss, avg_acc, train_time = train( + dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=total_batch_train, + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + model_ema=model_ema, + mixup_fn=mixup_fn, + amp=config.AMP, + local_logger=local_logger, + master_logger=master_logger) + scheduler.step() - logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + - f"Train Loss: {train_loss:.4f}, " + - f"Train Acc: {train_acc:.4f}, " + - f"time: {train_time:.2f}") + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {avg_loss:.4f}, " + + f"Train Acc: {avg_acc:.4f}, " + + f"time: {train_time:.2f}") + # validation if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: - logger.info(f'----- Validation after Epoch: {epoch}') - val_loss, val_acc1, val_acc5, val_time = validate( + local_logger.info(f'----- Validation after Epoch: {epoch}') + if local_rank == 0: + master_logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=criterion, + criterion=criterion_val, total_batch=total_batch_val, - debug_steps=config.REPORT_FREQ) - logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + - f"Validation Loss: {val_loss:.4f}, " + - f"Validation Acc@1: {val_acc1:.4f}, " + - f"Validation Acc@5: {val_acc5:.4f}, " + - f"time: {val_time:.2f}") + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") # model save if local_rank == 0: if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: @@ -347,15 +569,38 @@ def main_worker(*args): config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") paddle.save(model.state_dict(), 
model_path + '.pdparams') paddle.save(optimizer.state_dict(), model_path + '.pdopt') - logger.info(f"----- Save model: {model_path}.pdparams") - logger.info(f"----- Save optim: {model_path}.pdopt") + master_logger.info(f"----- Save model: {model_path}.pdparams") + master_logger.info(f"----- Save optim: {model_path}.pdopt") + if model_ema is not None: + model_ema_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}-EMA") + paddle.save(model_ema.state_dict(), model_ema_path + '.pdparams') + master_logger.info(f"----- Save ema model: {model_ema_path}.pdparams") def main(): - dataset_train = get_dataset(config, mode='train') + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + + # get dataset and start DDP + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + else: + dataset_train = None dataset_val = get_dataset(config, mode='val') config.NGPUS = len(paddle.static.cuda_places()) if config.NGPUS == -1 else config.NGPUS - dist.spawn(main_worker, args=(dataset_train, dataset_val, ), nprocs=config.NGPUS) + dist.spawn(main_worker, args=(config, dataset_train, dataset_val, ), nprocs=config.NGPUS) if __name__ == "__main__": diff --git a/image_classification/CSwin/main_single_gpu.py b/image_classification/CSwin/main_single_gpu.py index 2662c0e5..731bfc3d 100644 --- a/image_classification/CSwin/main_single_gpu.py +++ b/image_classification/CSwin/main_single_gpu.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -12,7 +12,7 @@ # See the License for the specific language governing permissions and # limitations under the License. 
-"""CSwin Transformer training/validation using single GPU """ +"""CSwin training/validation using single GPU """ import sys import os @@ -26,53 +26,55 @@ import paddle.nn.functional as F from datasets import get_dataloader from datasets import get_dataset -from cswin import build_cswin as build_model from utils import AverageMeter from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn from config import get_config from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from model_ema import ModelEma +from cswin import build_cswin as build_model -parser = argparse.ArgumentParser('CSwin Transformer') -parser.add_argument('-cfg', type=str, default=None) -parser.add_argument('-dataset', type=str, default=None) -parser.add_argument('-batch_size', type=int, default=None) -parser.add_argument('-image_size', type=int, default=None) -parser.add_argument('-data_path', type=str, default=None) -parser.add_argument('-ngpus', type=int, default=None) -parser.add_argument('-pretrained', type=str, default=None) -parser.add_argument('-resume', type=str, default=None) -parser.add_argument('-last_epoch', type=int, default=None) -parser.add_argument('-eval', action='store_true') -args = parser.parse_args() - - -log_format = "%(asctime)s %(message)s" -logging.basicConfig(stream=sys.stdout, level=logging.INFO, - format=log_format, datefmt="%m%d %I:%M:%S %p") - -# get default config -config = get_config() -# update config by arguments -config = update_config(config, args) - -# set output folder -if not config.EVAL: - config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) -else: - config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) - -#config.freeze() - -if not os.path.exists(config.SAVE): - os.makedirs(config.SAVE, exist_ok=True) - -# set logging format -logger = logging.getLogger() -fh = logging.FileHandler(os.path.join(config.SAVE, 'log.txt')) -fh.setFormatter(logging.Formatter(log_format)) -logger.addHandler(fh) -logger.info(f'config= {config}') +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('CSwin') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when 
creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger def train(dataloader, @@ -80,56 +82,88 @@ def train(dataloader, criterion, optimizer, epoch, + total_epochs, total_batch, debug_steps=100, - accum_iter=1): + accum_iter=1, + model_ema=None, + mixup_fn=None, + amp=False, + logger=None): """Training for one epoch Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion epoch: int, current epoch - total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info - accum_iter: int, num of iters for accumulating gradients + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + model_ema: ModelEma, model moving average instance + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + logger: logger for logging, default: None Returns: - train_loss_meter.avg - train_acc_meter.avg - train_time + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + train_time: float, training time """ model.train() train_loss_meter = AverageMeter() train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) time_st = time.time() + for batch_id, data in enumerate(dataloader): image = data[0] label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() - output = model(image) - loss = criterion(output, label) - - #NOTE: division may be needed depending on the loss function - # Here no division is needed: - # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' - #loss = loss / accum_iter - - loss.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() - if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): - optimizer.step() - optimizer.clear_grad() + if model_ema is not None: + model_ema.update(model) pred = F.softmax(output) - acc = paddle.metric.accuracy(pred, label.unsqueeze(1)) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) batch_size = image.shape[0] train_loss_meter.update(loss.numpy()[0], batch_size) train_acc_meter.update(acc.numpy()[0], batch_size) - if batch_id % debug_steps == 0: + if logger and batch_id % debug_steps == 0: logger.info( - f"Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + 
f"Epoch[{epoch:03d}/{total_epochs:03d}], " + f"Step[{batch_id:04d}/{total_batch:04d}], " + f"Avg Loss: {train_loss_meter.avg:.4f}, " + f"Avg Acc: {train_acc_meter.avg:.4f}") @@ -138,19 +172,20 @@ def train(dataloader, return train_loss_meter.avg, train_acc_meter.avg, train_time -def validate(dataloader, model, criterion, total_batch, debug_steps=100): +def validate(dataloader, model, criterion, total_batch, debug_steps=100, logger=None): """Validation for whole dataset Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion - total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + logger: logger for logging, default: None Returns: - val_loss_meter.avg - val_acc1_meter.avg - val_acc5_meter.avg - val_time + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + val_time: float, valitaion time """ model.eval() val_loss_meter = AverageMeter() @@ -175,7 +210,7 @@ def validate(dataloader, model, criterion, total_batch, debug_steps=100): val_acc1_meter.update(acc1.numpy()[0], batch_size) val_acc5_meter.update(acc5.numpy()[0], batch_size) - if batch_id % debug_steps == 0: + if logger and batch_id % debug_steps == 0: logger.info( f"Val Step[{batch_id:04d}/{total_batch:04d}], " + f"Avg Loss: {val_loss_meter.avg:.4f}, " + @@ -187,24 +222,81 @@ def validate(dataloader, model, criterion, total_batch, debug_steps=100): def main(): - # 0. Preparation + # STEP 0: Preparation + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) last_epoch = config.TRAIN.LAST_EPOCH seed = config.SEED paddle.seed(seed) np.random.seed(seed) random.seed(seed) - #paddle.set_device('gpu:0') - # 1. Create model + logger = get_logger(filename=os.path.join(config.SAVE, 'log.txt')) + logger.info(f'\n{config}') + + # STEP 1: Create model model = build_model(config) - #model = paddle.DataParallel(model) - # 2. Create train and val dataloader - dataset_train = get_dataset(config, mode='train') + # define model ema + model_ema = None + if not config.EVAL and config.TRAIN.MODEL_EMA: + model_ema = ModelEma(model, decay=config.TRAIN.MODEL_EMA_DECAY) + + # STEP 2: Create train and val dataloader + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + dataloader_train = get_dataloader(config, dataset_train, 'train', False) dataset_val = get_dataset(config, mode='val') - dataloader_train = get_dataloader(config, dataset_train, 'train', False) dataloader_val = get_dataloader(config, dataset_val, 'val', False) - # 3. Define criterion - criterion = nn.CrossEntropyLoss() - # 4. 
Define lr_scheduler + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + scheduler = None if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, @@ -213,8 +305,7 @@ def main(): end_lr=config.TRAIN.END_LR, warmup_epochs=config.TRAIN.WARMUP_EPOCHS, total_epochs=config.TRAIN.NUM_EPOCHS, - last_epoch=config.TRAIN.LAST_EPOCH, - ) + last_epoch=config.TRAIN.LAST_EPOCH) elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, T_max=config.TRAIN.NUM_EPOCHS, @@ -226,9 +317,9 @@ def main(): gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, last_epoch=last_epoch) else: - logging.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") - # 5. 
Define optimizer + if config.TRAIN.OPTIMIZER.NAME == "SGD": if config.TRAIN.GRAD_CLIP: clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) @@ -248,58 +339,76 @@ def main(): optimizer = paddle.optimizer.AdamW( parameters=model.parameters(), learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, - weight_decay=config.TRAIN.WEIGHT_DECAY, beta1=config.TRAIN.OPTIMIZER.BETAS[0], beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, epsilon=config.TRAIN.OPTIMIZER.EPS, - grad_clip=clip) + grad_clip=clip, + #apply_decay_param_fun=get_exclude_from_weight_decay_fn([ + # 'absolute_pos_embed', 'relative_position_bias_table']), + ) else: - logging.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") - # 6. Load pretrained model or load resume model and optimizer states + + # STEP 6: Load pretrained model or load resume model and optimizer states if config.MODEL.PRETRAINED: - assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') model.set_dict(model_state) logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") if config.MODEL.RESUME: - assert os.path.isfile(config.MODEL.RESUME+'.pdparams') - assert os.path.isfile(config.MODEL.RESUME+'.pdopt') - model_state = paddle.load(config.MODEL.RESUME+'.pdparams') + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') model.set_dict(model_state) opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') - optimizer.set_dict(opt_state) + optimizer.set_state_dict(opt_state) logger.info( - f"----- Resume: Load model and optmizer from {config.MODEL.RESUME}") - # 7. Validation + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + # load ema model + if model_ema is not None and os.path.isfile(config.MODEL.RESUME + '-EMA.pdparams'): + model_ema_state = paddle.load(config.MODEL.RESUME + '-EMA.pdparams') + model_ema.module.set_state_dict(model_ema_state) + logger.info(f'----- Load model ema from {config.MODEL.RESUME}-EMA.pdparams') + + # STEP 7: Validation (eval mode) if config.EVAL: logger.info('----- Start Validating') val_loss, val_acc1, val_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=criterion, + criterion=criterion_val, total_batch=len(dataloader_val), - debug_steps=config.REPORT_FREQ) + debug_steps=config.REPORT_FREQ, + logger=logger) logger.info(f"Validation Loss: {val_loss:.4f}, " + f"Validation Acc@1: {val_acc1:.4f}, " + f"Validation Acc@5: {val_acc5:.4f}, " + f"time: {val_time:.2f}") return - # 8. Start training and validation - logging.info(f"Start training from epoch {last_epoch+1}.") + + # STEP 8: Start training and validation (train mode) + logger.info(f"Start training from epoch {last_epoch+1}.") for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): # train - logging.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + logger.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") train_loss, train_acc, train_time = train(dataloader=dataloader_train, model=model, criterion=criterion, optimizer=optimizer, epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, total_batch=len(dataloader_train), debug_steps=config.REPORT_FREQ, accum_iter=config.TRAIN.ACCUM_ITER, - ) + model_ema=model_ema, + mixup_fn=mixup_fn, + amp=config.AMP, + logger=logger) scheduler.step() logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + f"Train Loss: {train_loss:.4f}, " + @@ -311,9 +420,10 @@ def main(): val_loss, val_acc1, val_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=criterion, + criterion=criterion_val, total_batch=len(dataloader_val), - debug_steps=config.REPORT_FREQ) + debug_steps=config.REPORT_FREQ, + logger=logger) logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + f"Validation Loss: {val_loss:.4f}, " + f"Validation Acc@1: {val_acc1:.4f}, " + @@ -327,6 +437,11 @@ def main(): paddle.save(optimizer.state_dict(), model_path + '.pdopt') logger.info(f"----- Save model: {model_path}.pdparams") logger.info(f"----- Save optim: {model_path}.pdopt") + if model_ema is not None: + model_ema_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}-EMA") + paddle.save(model_ema.state_dict(), model_ema_path + '.pdparams') + logger.info(f"----- Save ema model: {model_ema_path}.pdparams") if __name__ == "__main__": diff --git a/image_classification/CSwin/mixup.py b/image_classification/CSwin/mixup.py new file mode 100644 index 00000000..1d2db493 --- /dev/null +++ b/image_classification/CSwin/mixup.py @@ -0,0 +1,225 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""mixup and cutmix for batch data""" +import numpy as np +import paddle + + +def rand_bbox(image_shape, lam, count=None): + """ CutMix bbox by lam value + Generate 1 random bbox by value lam. lam is the cut size rate. + The cut_size is computed by sqrt(1-lam) * image_size. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + count: int, number of bbox to generate + """ + image_h, image_w = image_shape[-2:] + cut_rate = np.sqrt(1. - lam) + cut_h = int(cut_rate * image_h) + cut_w = int(cut_rate * image_w) + + # get random bbox center + cy = np.random.randint(0, image_h, size=count) + cx = np.random.randint(0, image_w, size=count) + + # get bbox coords + bbox_x1 = np.clip(cx - cut_w // 2, 0, image_w) + bbox_y1 = np.clip(cy - cut_h // 2, 0, image_h) + bbox_x2 = np.clip(cx + cut_w // 2, 0, image_w) + bbox_y2 = np.clip(cy + cut_h // 2, 0, image_h) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # while in pytorch, it will return [] tensor + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def rand_bbox_minmax(image_shape, minmax, count=None): + """ CutMix bbox by min and max value + Generate 1 random bbox by min and max percentage values. 
+ + Minmax is a tuple/list of min and max percentage values + applied to the image width and height. + + Args: + image_shape: tuple/list, image height and width + minmax: tuple/list, min and max percentage values of image size + count: int, number of bbox to generate + """ + assert len(minmax) == 2 + image_h, image_w = image_shape[-2:] + min_ratio = minmax[0] + max_ratio = minmax[1] + cut_h = np.random.randint(int(image_h * min_ratio), int(image_h * max_ratio), size=count) + cut_w = np.random.randint(int(image_w * min_ratio), int(image_w * max_ratio), size=count) + + bbox_x1 = np.random.randint(0, image_w - cut_w, size=count) + bbox_y1 = np.random.randint(0, image_h - cut_h, size=count) + bbox_x2 = bbox_x1 + cut_w + bbox_y2 = bbox_y1 + cut_h + + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def cutmix_generate_bbox_adjust_lam(image_shape, lam, minmax=None, correct_lam=True, count=None): + """Generate bbox and apply correction for lambda + If the minmax is None, apply the standard cutmix by lam value, + If the minmax is set, apply the cutmix by min and max percentage values. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + minmax: tuple/list, min and max percentage values of image size + correct_lam: bool, if True, correct the lam value by the generated bbox + count: int, number of bbox to generate + """ + if minmax is not None: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox_minmax(image_shape, minmax, count) + else: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox(image_shape, lam, count) + + if correct_lam or minmax is not None: + image_h, image_w = image_shape[-2:] + bbox_area = (bbox_y2 - bbox_y1) * (bbox_x2 - bbox_x1) + lam = 1. - bbox_area / float(image_h * image_w) + return (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam + + +def one_hot(x, num_classes, on_value=1., off_value=0.): + """ Generate one-hot vector for label smoothing + Args: + x: tensor, contains label/class indices + num_classes: int, num of classes (len of the one-hot vector) + on_value: float, the vector value at label index, default=1. + off_value: float, the vector value at non-label indices, default=0. + Returns: + one_hot: tensor, tensor with on value at label index and off value + at non-label indices. + """ + x = x.reshape_([-1, 1]) + x_smoothed = paddle.full((x.shape[0], num_classes), fill_value=off_value) + for i in range(x.shape[0]): + x_smoothed[i, x[i]] = on_value + return x_smoothed + + +def mixup_one_hot(label, num_classes, lam=1., smoothing=0.): + """ mixup and label smoothing in batch + label smoothing is applied first, then + mixup is applied by mixing the batch and its flip, + with a mixup rate. + + Args: + label: tensor, label tensor with shape [N], contains the class indices + num_classes: int, num of all classes + lam: float, mixup rate, default=1.0 + smoothing: float, label smoothing rate + """ + off_value = smoothing / num_classes + on_value = 1.
- smoothing + off_value + y1 = one_hot(label, num_classes, on_value, off_value) + y2 = one_hot(label.flip(axis=[0]), num_classes, on_value, off_value) + return y2 * (1 - lam) + y1 * lam + + +class Mixup: + """Mixup class + Args: + mixup_alpha: float, mixup alpha for beta distribution, default=1.0, + cutmix_alpha: float, cutmix alpha for beta distribution, default=0.0, + cutmix_minmax: list/tuple, min and max value for cutmix ratio, default=None, + prob: float, if random prob < prob, do not use mixup, default=1.0, + switch_prob: float, prob of switching mixup and cutmix, default=0.5, + mode: string, mixup up, now only 'batch' is supported, default='batch', + correct_lam: bool, if True, apply correction of lam, default=True, + label_smoothing: float, label smoothing rate, default=0.1, + num_classes: int, num of classes, default=1000 + """ + def __init__(self, + mixup_alpha=1.0, + cutmix_alpha=0.0, + cutmix_minmax=None, + prob=1.0, + switch_prob=0.5, + mode='batch', + correct_lam=True, + label_smoothing=0.1, + num_classes=1000): + self.mixup_alpha = mixup_alpha + self.cutmix_alpha = cutmix_alpha + self.cutmix_minmax = cutmix_minmax + if cutmix_minmax is not None: + assert len(cutmix_minmax) == 2 + self.cutmix_alpha = 1.0 + self.mix_prob = prob + self.switch_prob = switch_prob + self.label_smoothing = label_smoothing + self.num_classes = num_classes + self.mode = mode + self.correct_lam = correct_lam + assert mode == 'batch', 'Now only batch mode is supported!' + + def __call__(self, x, target): + assert x.shape[0] % 2 == 0, "Batch size should be even" + lam = self._mix_batch(x) + target = mixup_one_hot(target, self.num_classes, lam, self.label_smoothing) + return x, target + + def get_params(self): + """Decide to use cutmix or regular mixup by sampling and + sample lambda for mixup + """ + lam = 1. + use_cutmix = False + use_mixup = np.random.rand() < self.mix_prob + if use_mixup: + if self.mixup_alpha > 0. and self.cutmix_alpha > 0.: + use_cutmix = np.random.rand() < self.switch_prob + alpha = self.cutmix_alpha if use_cutmix else self.mixup_alpha + lam_mix = np.random.beta(alpha, alpha) + elif self.mixup_alpha == 0. and self.cutmix_alpha > 0.: + use_cutmix=True + lam_mix = np.random.beta(self.cutmix_alpha, self.cutmix_alpha) + elif self.mixup_alpha > 0. 
and self.cutmix_alpha == 0.: + lam_mix = np.random.beta(self.mixup_alpha, self.mixup_alpha) + else: + raise ValueError('mixup_alpha and cutmix_alpha cannot be all 0') + lam = float(lam_mix) + return lam, use_cutmix + + def _mix_batch(self, x): + """mixup/cutmix by adding batch data and its flipped version""" + lam, use_cutmix = self.get_params() + if lam == 1.: + return lam + if use_cutmix: + (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam = cutmix_generate_bbox_adjust_lam( + x.shape, + lam, + minmax=self.cutmix_minmax, + correct_lam=self.correct_lam) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # but in pytorch, it will return [] tensor without errors + if int(bbox_x1) != int(bbox_x2) and int(bbox_y1) != int(bbox_y2): + x[:, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] = x.flip(axis=[0])[ + :, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] + else: + x_flipped = x.flip(axis=[0]) + x_flipped = x_flipped * (1 - lam) + x.set_value(x * (lam) + x_flipped) + return lam diff --git a/image_classification/CSwin/model_ema.py b/image_classification/CSwin/model_ema.py new file mode 100644 index 00000000..d12383b2 --- /dev/null +++ b/image_classification/CSwin/model_ema.py @@ -0,0 +1,62 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" Implement the Exponential Model Averaging +This is paddle hack from: +https://github.com/rwightman/pytorch-image-models/blob/master/timm/utils/model_ema.py +""" + +import copy +from collections import OrderedDict +import paddle +import paddle.nn as nn + + +class ModelEma: + """Model Ema + A moving average is kept of model weights and buffers. + Note that for multiple gpu, ema must be defined after mode init, + but before DataParallel. 
+ + Args: + model: nn.Layer, original model with learnable params + decay: float, decay rate for each update, default: 0.999 + """ + def __init__(self, model, decay=0.999): + self.module = copy.deepcopy(model) + self.module.eval() + self.module.to('cpu') + self.decay = decay + + @paddle.no_grad() + def _update(self, model, update_fn): + # update ema model parameters by model parameters + for (_, ema_param), (_, model_param) in zip( + self.module.named_parameters(), model.named_parameters()): + ema_param.set_value(copy.deepcopy(update_fn(ema_param, model_param))) + + # update ema model buffers by model buffers + for (_, ema_buf), (_, model_buf) in zip( + self.module.named_buffers(), model.named_buffers()): + ema_buf.set_value(copy.deepcopy(update_fn(ema_buf, model_buf))) + + def update(self, model): + self._update(model, update_fn=lambda e, m: self.decay * e + (1 - self.decay) * m) + + def set(self, model): + self._update(model, update_fn=lambda e, m: m) + + def state_dict(self): + return self.module.state_dict() + diff --git a/image_classification/CSwin/port_weights/__init__.py b/image_classification/CSwin/port_weights/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/image_classification/CSwin/random_erasing.py b/image_classification/CSwin/random_erasing.py new file mode 100644 index 00000000..662c4e62 --- /dev/null +++ b/image_classification/CSwin/random_erasing.py @@ -0,0 +1,118 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Random Erasing for image tensor""" + +import random +import math +import paddle + + +def _get_pixels(per_pixel, rand_color, patch_size, dtype="float32"): + if per_pixel: + return paddle.normal(shape=patch_size).astype(dtype) + if rand_color: + return paddle.normal(shape=(patch_size[0], 1, 1)).astype(dtype) + return paddle.zeros((patch_size[0], 1, 1)).astype(dtype) + + +class RandomErasing(object): + """ + Args: + prob: probability of performing random erasing + min_area: Minimum percentage of erased area wrt input image area + max_area: Maximum percentage of erased area wrt input image area + min_aspect: Minimum aspect ratio of erased area + max_aspect: Maximum aspect ratio of erased area + mode: pixel color mode, in ['const', 'rand', 'pixel'] + 'const' - erase block is constant valued 0 for all channels + 'rand' - erase block is valued random color (same per-channel) + 'pixel' - erase block is valued random color per pixel + min_count: Minimum # of erasing blocks per image. + max_count: Maximum # of erasing blocks per image.
Area per box is scaled by count + per-image count is randomly chosen between min_count to max_count + """ + def __init__(self, prob=0.5, min_area=0.02, max_area=1/3, min_aspect=0.3, max_aspect=None, + mode='const', min_count=1, max_count=None, num_splits=0): + self.prob = prob + self.min_area = min_area + self.max_area = max_area + max_aspect = max_aspect or 1 / min_aspect + self.log_aspect_ratio = (math.log(min_aspect), math.log(max_aspect)) + self.min_count = min_count + self.max_count = max_count or min_count + self.num_splits = num_splits + mode = mode.lower() + self.rand_color = False + self.per_pixel = False + if mode == "rand": + self.rand_color = True + elif mode == "pixel": + self.per_pixel = True + else: + assert not mode or mode == "const" + + def _erase(self, img, chan, img_h, img_w, dtype): + if random.random() > self.prob: + return + area = img_h * img_w + count = self.min_count if self.min_count == self.max_count else \ + random.randint(self.min_count, self.max_count) + for _ in range(count): + for attempt in range(10): + target_area = random.uniform(self.min_area, self.max_area) * area / count + aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio)) + h = int(round(math.sqrt(target_area * aspect_ratio))) + w = int(round(math.sqrt(target_area / aspect_ratio))) + if w < img_w and h < img_h: + top = random.randint(0, img_h - h) + left = random.randint(0, img_w - w) + img[:, top:top+h, left:left+w] = _get_pixels( + self.per_pixel, self.rand_color, (chan, h, w), + dtype=dtype) + break + + def __call__(self, input): + if len(input.shape) == 3: + self._erase(input, *input.shape, input.dtype) + else: + batch_size, chan, img_h, img_w = input.shape + batch_start = batch_size // self.num_splits if self.num_splits > 1 else 0 + for i in range(batch_start, batch_size): + self._erase(input[i], chan, img_h, img_w, input.dtype) + return input + + + +#def main(): +# re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='rand') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='const') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='pixel') +# import PIL.Image as Image +# import numpy as np +# paddle.set_device('cpu') +# img = paddle.to_tensor(np.asarray(Image.open('./lenna.png'))).astype('float32') +# img = img / 255.0 +# img = paddle.transpose(img, [2, 0, 1]) +# new_img = re(img) +# new_img = new_img * 255.0 +# new_img = paddle.transpose(new_img, [1, 2, 0]) +# new_img = new_img.cpu().numpy() +# new_img = Image.fromarray(new_img.astype('uint8')) +# new_img.save('./res.png') +# +# +# +#if __name__ == "__main__": +# main() diff --git a/image_classification/CSwin/run_eval_multi_tiny.sh b/image_classification/CSwin/run_eval_multi_tiny.sh index b9c3c8f7..4b54d4b2 100644 --- a/image_classification/CSwin/run_eval_multi_tiny.sh +++ b/image_classification/CSwin/run_eval_multi_tiny.sh @@ -2,7 +2,8 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 \ python main_multi_gpu.py \ -cfg='./configs/cswin_tiny_224.yaml' \ -dataset='imagenet2012' \ --batch_size=64 \ +-batch_size=128 \ -data_path='/dataset/imagenet' \ -eval \ --pretrained='./cswin_tiny_224' \ +-pretrained='./output/my-train-20211012-21-08-50/cswin-Epoch-285-Loss-2.8287537443471553-EMA' +#-pretrained='./cswin_tiny_224' \ diff --git a/image_classification/CSwin/run_train_tiny.sh b/image_classification/CSwin/run_train_tiny.sh index 0203ffda..025283d3 100644 --- a/image_classification/CSwin/run_train_tiny.sh +++ b/image_classification/CSwin/run_train_tiny.sh @@ -1,6 +1,7 @@ -CUDA_VISIBLE_DEVICES=7 \ 
+CUDA_VISIBLE_DEVICES=0 \ python main_single_gpu.py \ -cfg='./configs/cswin_tiny_224.yaml' \ -dataset='imagenet2012' \ --batch_size=64 \ +-batch_size=100 \ -data_path='/dataset/imagenet' \ +-amp diff --git a/image_classification/CSwin/run_train_tiny_multi.sh b/image_classification/CSwin/run_train_tiny_multi.sh index c86fc37a..8e0bc9b2 100644 --- a/image_classification/CSwin/run_train_tiny_multi.sh +++ b/image_classification/CSwin/run_train_tiny_multi.sh @@ -1,6 +1,7 @@ -CUDA_VISIBLE_DEVICES=0,1,2,3 \ +CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ python main_multi_gpu.py \ -cfg='./configs/cswin_tiny_224.yaml' \ -dataset='imagenet2012' \ --batch_size=64 \ +-batch_size=100 \ -data_path='/dataset/imagenet' \ +-amp diff --git a/image_classification/CSwin/run_train_tiny_multi_finetune.sh b/image_classification/CSwin/run_train_tiny_multi_finetune.sh new file mode 100644 index 00000000..4219c794 --- /dev/null +++ b/image_classification/CSwin/run_train_tiny_multi_finetune.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ +python main_multi_gpu.py \ +-cfg='./configs/cswin_tiny_224_finetune.yaml' \ +-dataset='imagenet2012' \ +-batch_size=100 \ +-data_path='/dataset/imagenet' \ +-amp +-pretrained='./XXX' diff --git a/image_classification/CSwin/run_train_tiny_multi_resume.sh b/image_classification/CSwin/run_train_tiny_multi_resume.sh new file mode 100644 index 00000000..55fc3bf4 --- /dev/null +++ b/image_classification/CSwin/run_train_tiny_multi_resume.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ +python main_multi_gpu.py \ +-cfg='./configs/cswin_tiny_224.yaml' \ +-dataset='imagenet2012' \ +-batch_size=100 \ +-data_path='/dataset/imagenet' \ +-resume='./output/train-20211012-21-08-50/cswin-Epoch-285-Loss-2.8287537443471553' \ +-last_epoch=285 diff --git a/image_classification/CSwin/stat.py b/image_classification/CSwin/stat.py new file mode 100644 index 00000000..7c751ddb --- /dev/null +++ b/image_classification/CSwin/stat.py @@ -0,0 +1,64 @@ +import os +import glob +import paddle +from config import get_config +from cswin import build_cswin as build_model + +def count_gelu(layer, input, output): + activation_flops = 8 + x = input[0] + num = x.numel() + layer.total_ops += num * activation_flops + + +def count_softmax(layer, input, output): + softmax_flops = 5 # max/substract, exp, sum, divide + x = input[0] + num = x.numel() + layer.total_ops += num * softmax_flops + + +def count_layernorm(layer, input, output): + layer_norm_flops = 5 # get mean (sum), get variance (square and sum), scale(multiply) + x = input[0] + num = x.numel() + layer.total_ops += num * layer_norm_flops + + +for cfg in glob.glob('./configs/*_384.yaml'): + #cfg = './configs/pvtv2_b0.yaml' + #input_size = (1, 3, 512, 512) + #input_size = (1, 3, 448, 448) + input_size = (1, 3, 384, 384) + #input_size = (1, 3, 224, 224) + config = get_config(cfg) + model = build_model(config) + + custom_ops = {paddle.nn.GELU: count_gelu, + paddle.nn.LayerNorm: count_layernorm, + paddle.nn.Softmax: count_softmax, + } + print(os.path.basename(cfg)) + paddle.flops(model, + input_size=input_size, + custom_ops=custom_ops, + print_detail=False) + + +#for cfg in glob.glob('./configs/*.yaml'): +# #cfg = './configs/swin_base_patch4_window7_224.yaml' +# input_size = (1, 3, int(cfg[-8:-5]), int(cfg[-8:-5])) +# config = get_config(cfg) +# model = build_model(config) +# +# +# custom_ops = {paddle.nn.GELU: count_gelu, +# paddle.nn.LayerNorm: count_layernorm, +# paddle.nn.Softmax: count_softmax, +# } +# print(os.path.basename(cfg)) +# 
paddle.flops(model, +# input_size=input_size, +# custom_ops=custom_ops, +# print_detail=False) +# print('-----------') diff --git a/image_classification/CSwin/transforms.py b/image_classification/CSwin/transforms.py new file mode 100644 index 00000000..676fe1ff --- /dev/null +++ b/image_classification/CSwin/transforms.py @@ -0,0 +1,13 @@ +import paddle +import paddle.nn +import paddle.vision.transforms as T + + +class RandomHorizontalFlip(): + def __init__(self, p=0.5): + self.p = p + + def __call__(self, image): + if random.random() < self.p: + return T.hflip(image) + return image diff --git a/image_classification/CaiT/README.md b/image_classification/CaiT/README.md index b9ed741d..082ecf10 100644 --- a/image_classification/CaiT/README.md +++ b/image_classification/CaiT/README.md @@ -14,14 +14,22 @@ This implementation is developed by [PaddleViT](https://github.com/BR-IDL/Paddle ### Update -Update (2021-08-11): Code is released and ported weights are uploaded. +- Update (2021-09-27): More weights are uploaded. +- Update (2021-08-11): Code is released and ported weights are uploaded. ## Models Zoo -| Model | Acc@1 | Acc@5 | Image Size | Crop_pct | Interpolation | Link | -|--------------------------------|-------|-------|------------|----------|---------------|--------------| -| cait_xxs24_224 | 78.38 | 94.32 | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1LKsQUr824oY4E42QeUEaFt41I8xHNseR/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1YIaBLopKIK5_p7NlgWHpGA)(j9m8) | -| cait_s24_384 | 85.05 | 97.34 | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1GU0esukDvMg3u40FZB_5GiB6qpShjvGh/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1qvhNckJjcEf5HyVn8LuEeA)(qb86) | -| cait_m48_448 | 86.49 | 97.75 | 448 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1lJSP__dVERBNFnp7im-1xM3s_lqEe82-/view?usp=sharing)/[baidu](https://pan.baidu.com/s/179MA3MkG2qxFle0K944Gkg)(imk5) | +| Model | Acc@1 | Acc@5 | #Params | FLOPs | Image Size | Crop_pct | Interpolation | Link | +|-------------------------------|-------|-------|---------|--------|------------|----------|---------------|--------------| +| cait_xxs24_224 | 78.38 | 94.32 | 11.9M | 2.2G | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1LKsQUr824oY4E42QeUEaFt41I8xHNseR/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1YIaBLopKIK5_p7NlgWHpGA)(j9m8) | +| cait_xxs36_224 | 79.75 | 94.88 | 17.2M | 33.1G | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1zZx4aQJPJElEjN5yejUNsocPsgnd_3tS/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1pdyFreRRXUn0yPel00-62Q)(nebg) | +| cait_xxs24_384 | 80.97 | 95.64 | 11.9M | 6.8G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1J27ipknh_kwqYwR0qOqE9Pj3_bTcTx95/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1uYSDzROqCVT7UdShRiiDYg)(2j95) | +| cait_xxs36_384 | 82.20 | 96.15 | 17.2M | 10.1G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/13IvgI3QrJDixZouvvLWVkPY0J6j0VYwL/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1GafA8B6T3h_vtmNNq2HYKg)(wx5d) | +| cait_s24_224 | 83.45 | 96.57 | 46.8M | 8.7G | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1sdCxEw328yfPJArf6Zwrvok-91gh7PhS/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1BPsAMEcrjtnbOnVDQwZJYw)(m4pn) | +| cait_xs24_384 | 84.06 | 96.89 | 26.5M | 15.1G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1zKL6cZwqmvuRMci-17FlKk-lA-W4RVte/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1w10DPJvK8EwhOCm-tZUpww)(scsv) | +| 
cait_s24_384 | 85.05 | 97.34 | 46.8M | 26.5G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1klqBDhJDgw28omaOpgzInMmfeuDa7NAi/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1-aNO6c7Ipm9x1hJY6N6G2g)(dnp7) | +| cait_s36_384 | 85.45 | 97.48 | 68.1M | 39.5G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1m-55HryznHbiUxG38J2rAa01BYcjxsRZ/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1-uWg-JHLEKeMukFFctoufg)(e3ui) | +| cait_m36_384 | 86.06 | 97.73 | 270.7M | 156.2G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1WJjaGiONX80KBHB3YN8mNeusPs3uDhR2/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1aZ9bEU5AycmmfmHAqZIaLA)(r4hu) | +| cait_m48_448 | 86.49 | 97.75 | 355.8M | 287.3G | 448 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1lJSP__dVERBNFnp7im-1xM3s_lqEe82-/view?usp=sharing)/[baidu](https://pan.baidu.com/s/179MA3MkG2qxFle0K944Gkg)(imk5) | > *The results are evaluated on ImageNet2012 validation set. @@ -66,8 +74,8 @@ from cait import build_cait as build_model config = get_config('./configs/cait_xxs24_224.yaml') # build model model = build_model(config) -# load pretrained weights, .pdparams is NOT needed -model_state_dict = paddle.load('./cait_xxs24_224') +# load pretrained weights +model_state_dict = paddle.load('./cait_xxs24_224.pdparams') model.set_dict(model_state_dict) ``` @@ -80,12 +88,12 @@ or ```shell CUDA_VISIBLE_DEVICES=0 \ python main_single_gpu.py \ - -cfg='./configs/cait_xxs24_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/cait_xxs24_224.yaml \ + -dataset=imagenet2012 \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/val \ -eval \ - -pretrained='./cait_xxs24_224' + -pretrained=/path/to/pretrained/model/cait_xxs24_224 # .pdparams is NOT needed ```
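As a quick sanity check that a downloaded checkpoint loads correctly, the model can also be exercised on a single image directly in Python. The following is only a minimal sketch and not part of the released scripts: the config name, weight path, and image path are placeholders, and the preprocessing mirrors the validation transforms defined in `datasets.py`.

```python
import math
import paddle
from PIL import Image
from paddle.vision import transforms
from config import get_config
from cait import build_cait

config = get_config('./configs/cait_xxs24_224.yaml')
model = build_cait(config)
model.set_dict(paddle.load('./cait_xxs24_224.pdparams'))  # placeholder path
model.eval()

# mirror get_val_transforms() in datasets.py
scale_size = int(math.floor(config.DATA.IMAGE_SIZE / config.DATA.CROP_PCT))
val_transform = transforms.Compose([
    transforms.Resize(scale_size, 'bicubic'),
    transforms.CenterCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE)),
    transforms.ToTensor(),
    transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD),
])

image = val_transform(Image.open('demo.jpg').convert('RGB'))  # placeholder image
with paddle.no_grad():
    logits = model(image.unsqueeze(0))
print('predicted class index:', int(paddle.argmax(logits, axis=-1).numpy()[0]))
```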
@@ -102,12 +110,12 @@ or ```shell CUDA_VISIBLE_DEVICES=0,1,2,3 \ python main_multi_gpu.py \ - -cfg='./configs/cait_xxs24_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/cait_xxs24_224.yaml \ + -dataset=imagenet2012 \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/val \ -eval \ - -pretrained='./cait_xxs24_224' + -pretrained=/path/to/pretrained/model/cait_xxs24_224 # .pdparams is NOT needed ```
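The training entry points in this PR also wire in mixup/cutmix with label smoothing (see the new `mixup.py` helper and the `TRAIN.MIXUP_*` / `TRAIN.SMOOTHING` config keys). As a rough illustration only, the sketch below applies the `Mixup` class to a dummy batch; the alpha and probability values simply echo the config defaults and are not prescriptive.

```python
import paddle
from mixup import Mixup  # helper added alongside each model in this PR

# values mirror the TRAIN.MIXUP_* / TRAIN.SMOOTHING defaults in config.py
mixup_fn = Mixup(mixup_alpha=0.8,
                 cutmix_alpha=1.0,
                 cutmix_minmax=None,
                 prob=1.0,
                 switch_prob=0.5,
                 mode='batch',
                 label_smoothing=0.1,
                 num_classes=1000)

images = paddle.rand([8, 3, 224, 224])          # batch size must be even
labels = paddle.randint(0, 1000, [8])
images, soft_labels = mixup_fn(images, labels)  # soft_labels: [8, 1000] mixed one-hot targets

# mixed (soft) targets require a soft-target criterion,
# e.g. the SoftTargetCrossEntropyLoss added in losses.py
```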
@@ -121,10 +129,10 @@ or ```shell CUDA_VISIBLE_DEVICES=0 \ python main_single_gpu.py \ - -cfg='./configs/cait_xxs24_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/cait_xxs24_224.yaml \ + -dataset=imagenet2012 \ -batch_size=32 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/train \ ```
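The new `-amp` flag turns on mixed-precision training. Below is a self-contained sketch of the pattern used inside `train()` (a `GradScaler` plus `auto_cast`), with toy stand-ins for the model, criterion, and optimizer; the actual scripts use the CaiT model and the configured optimizer instead.

```python
import paddle
import paddle.nn as nn

# toy stand-ins so the snippet runs on its own
model = nn.Linear(16, 10)
criterion = nn.CrossEntropyLoss()
optimizer = paddle.optimizer.AdamW(learning_rate=1e-3, parameters=model.parameters())

scaler = paddle.amp.GradScaler(init_loss_scaling=1024)
for _ in range(3):
    image = paddle.rand([4, 16])
    label = paddle.randint(0, 10, [4])
    with paddle.amp.auto_cast():          # run ops in float16 where it is safe
        loss = criterion(model(image), label)
    scaled = scaler.scale(loss)           # scale the loss to avoid fp16 underflow
    scaled.backward()
    scaler.minimize(optimizer, scaled)    # unscale gradients and take the optimizer step
    optimizer.clear_grad()
```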
@@ -141,10 +149,10 @@ or ```shell CUDA_VISIBLE_DEVICES=0,1,2,3 \ python main_multi_gpu.py \ - -cfg='./configs/cait_xxs24_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/cait_xxs24_224.yaml \ + -dataset=imagenet2012 \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/train \ ```
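Several of the updated training scripts also keep an exponential moving average of the weights (`TRAIN.MODEL_EMA` in the config, `model_ema.py` in each model folder) and save it as a separate `...-EMA.pdparams` checkpoint. The snippet below is only a sketch of that flow with a toy model; the decay value echoes the `TRAIN.MODEL_EMA_DECAY` default.

```python
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
from model_ema import ModelEma  # added in this PR

model = nn.Linear(16, 10)       # toy model for illustration
optimizer = paddle.optimizer.AdamW(learning_rate=1e-3, parameters=model.parameters())
model_ema = ModelEma(model, decay=0.99996)  # TRAIN.MODEL_EMA_DECAY default

for _ in range(3):
    x = paddle.rand([4, 16])
    loss = F.cross_entropy(model(x), paddle.randint(0, 10, [4]))
    loss.backward()
    optimizer.step()
    optimizer.clear_grad()
    model_ema.update(model)     # EMA weights track the trained weights after each step

# EMA weights are saved as their own checkpoint, e.g. '<model>-Epoch-<n>-Loss-<l>-EMA.pdparams'
paddle.save(model_ema.state_dict(), './toy-EMA.pdparams')
```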
diff --git a/image_classification/CaiT/__init__.py b/image_classification/CaiT/__init__.py new file mode 100644 index 00000000..e2cbd538 --- /dev/null +++ b/image_classification/CaiT/__init__.py @@ -0,0 +1 @@ +#init diff --git a/image_classification/CaiT/augment.py b/image_classification/CaiT/augment.py new file mode 100644 index 00000000..7a7f081c --- /dev/null +++ b/image_classification/CaiT/augment.py @@ -0,0 +1,285 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Augmentation""" +""" Rand Augmentation """ +# reference: RandAugment: Practical automated data augmentation with a reduced search space +# https://arxiv.org/abs/1909.13719 + +""" Auto Augmentation """ +# reference: AutoAugment: Learning Augmentation Policies from Data +# https://arxiv.org/abs/1805.09501 + +import random +import numpy as np +from PIL import Image, ImageEnhance, ImageOps + + +def auto_augment_policy_original(): + """25 types of augment policies in original paper""" + policy = [ + [('Posterize', 0.4, 8), ('Rotate', 0.6, 9)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + [('Posterize', 0.6, 7), ('Posterize', 0.6, 6)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Equalize', 0.4, 4), ('Rotate', 0.8, 8)], + [('Solarize', 0.6, 3), ('Equalize', 0.6, 7)], + [('Posterize', 0.8, 5), ('Equalize', 1.0, 2)], + [('Rotate', 0.2, 3), ('Solarize', 0.6, 8)], + [('Equalize', 0.6, 8), ('Posterize', 0.4, 6)], + [('Rotate', 0.8, 8), ('Color', 0.4, 0)], + [('Rotate', 0.4, 9), ('Equalize', 0.6, 2)], + [('Equalize', 0.0, 7), ('Equalize', 0.8, 8)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Rotate', 0.8, 8), ('Color', 1.0, 2)], + [('Color', 0.8, 8), ('Solarize', 0.8, 7)], + [('Sharpness', 0.4, 7), ('Invert', 0.6, 8)], + [('ShearX', 0.6, 5), ('Equalize', 1.0, 9)], + [('Color', 0.4, 0), ('Equalize', 0.6, 3)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + ] + policy = [[SubPolicy(*args) for args in subpolicy] for subpolicy in policy] + return policy + + +def rand_augment_policy_original(magnitude_idx=9): + """ + 14 types of augment policies in original paper + Args: + magnitude_idx: M + """ + policy = [ + ('Posterize', 1, magnitude_idx), ('Rotate', 1, magnitude_idx), + ('Solarize', 1, magnitude_idx), ('AutoContrast', 1, magnitude_idx), + ('Equalize', 1, magnitude_idx), ('Contrast', 1, magnitude_idx), + ('Color', 1, magnitude_idx), ('Invert', 1, magnitude_idx), + ('Sharpness', 1, magnitude_idx), ('Brightness', 1, magnitude_idx), + ('ShearX', 1, magnitude_idx), ('ShearY', 1, magnitude_idx), + ('TranslateX', 1, magnitude_idx), ('TranslateY', 1, magnitude_idx), + ] + policy = [SubPolicy(*args) for args in policy] + return policy + + +class AutoAugment(): + """Auto Augment + 
Randomly choose a tuple of augment ops from a list of policy + Then apply the tuple of augment ops to input image + + Examples: + policy = auto_augment_policy_original() + augment = AutoAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy): + self.policy = policy + + def __call__(self, image, policy_idx=None): + if policy_idx is None: + policy_idx = random.randint(0, len(self.policy) - 1) + + sub_policy = self.policy[policy_idx] + for op in sub_policy: + image = op(image) + return image + + +class RandAugment(): + """Rand Augment + Randomly choose N augment ops from a list of K policies + Then apply the N ops to input image + + Examples: + policy = rand_augment_policy_original(magnitude_idx) + augment = RandAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy, num_layers=2): + """ + Args: + policy: list of SubPolicy + num_layers: int + """ + self.policy = policy + self.num_layers = num_layers + + def __call__(self, image): + selected_idx = np.random.choice(len(self.policy), self.num_layers) + + for policy_idx in selected_idx: + sub_policy = self.policy[policy_idx] + image = sub_policy(image) + return image + + +class SubPolicy: + """Subpolicy + Read augment name and magnitude, apply augment with probability + Args: + op_name: str, augment operation name + prob: float, if prob > random prob, apply augment + magnitude_idx: int, index of magnitude in preset magnitude ranges + """ + + def __init__(self, op_name, prob, magnitude_idx): + # ranges of operations' magnitude + ranges = { + 'ShearX': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'ShearY': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'TranslateX': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'TranslateY': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'Rotate': np.linspace(0, 30, 10), # [-30, 30] (by random negative) + 'Color': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Posterize': np.round(np.linspace(8, 4, 10), 0).astype(np.int), # [0, 4] + 'Solarize': np.linspace(256, 0, 10), # [0, 256] + 'Contrast': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Sharpness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Brightness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'AutoContrast': [0] * 10, # no range + 'Equalize': [0] * 10, # no range + 'Invert': [0] * 10, # no range + } + + # augmentation operations + # Lambda is not pickleable for DDP + # image_ops = { + # 'ShearX': lambda image, magnitude: shear_x(image, magnitude), + # 'ShearY': lambda image, magnitude: shear_y(image, magnitude), + # 'TranslateX': lambda image, magnitude: translate_x(image, magnitude), + # 'TranslateY': lambda image, magnitude: translate_y(image, magnitude), + # 'Rotate': lambda image, magnitude: rotate(image, magnitude), + # 'AutoContrast': lambda image, magnitude: auto_contrast(image, magnitude), + # 'Invert': lambda image, magnitude: invert(image, magnitude), + # 'Equalize': lambda image, magnitude: equalize(image, magnitude), + # 'Solarize': lambda image, magnitude: solarize(image, magnitude), + # 'Posterize': lambda image, magnitude: posterize(image, magnitude), + # 'Contrast': lambda image, magnitude: contrast(image, magnitude), + # 'Color': lambda image, magnitude: color(image, magnitude), + # 'Brightness': lambda image, magnitude: brightness(image, magnitude), + # 'Sharpness': lambda image, magnitude: sharpness(image, magnitude), + # } + 
image_ops = { + 'ShearX': shear_x, + 'ShearY': shear_y, + 'TranslateX': translate_x_relative, + 'TranslateY': translate_y_relative, + 'Rotate': rotate, + 'AutoContrast': auto_contrast, + 'Invert': invert, + 'Equalize': equalize, + 'Solarize': solarize, + 'Posterize': posterize, + 'Contrast': contrast, + 'Color': color, + 'Brightness': brightness, + 'Sharpness': sharpness, + } + + self.prob = prob + self.magnitude = ranges[op_name][magnitude_idx] + self.op = image_ops[op_name] + + def __call__(self, image): + if self.prob > random.random(): + image = self.op(image, self.magnitude) + return image + + +# PIL Image transforms +# https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.transform +def shear_x(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, factor, 0, 0, 1, 0), fillcolor=fillcolor) + + +def shear_y(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, factor, 1, 0), fillcolor=fillcolor) + + +def translate_x_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, pixels, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, pixels), fillcolor=fillcolor) + + +def translate_x_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, magnitude, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, magnitude), fillcolor=fillcolor) + + +def rotate(image, magnitude): + rot = image.convert("RGBA").rotate(magnitude) + return Image.composite(rot, + Image.new('RGBA', rot.size, (128,) * 4), + rot).convert(image.mode) + + +def auto_contrast(image, magnitude=None): + return ImageOps.autocontrast(image) + + +def invert(image, magnitude=None): + return ImageOps.invert(image) + + +def equalize(image, magnitude=None): + return ImageOps.equalize(image) + + +def solarize(image, magnitude): + return ImageOps.solarize(image, magnitude) + + +def posterize(image, magnitude): + return ImageOps.posterize(image, magnitude) + + +def contrast(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Contrast(image).enhance(1 + magnitude) + + +def color(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Color(image).enhance(1 + magnitude) + + +def brightness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Brightness(image).enhance(1 + magnitude) + + +def sharpness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Sharpness(image).enhance(1 + magnitude) + diff --git a/image_classification/CaiT/cait.py b/image_classification/CaiT/cait.py index d8038106..3fd982ab 100644 --- 
a/image_classification/CaiT/cait.py +++ b/image_classification/CaiT/cait.py @@ -104,8 +104,8 @@ def __init__(self, in_features, hidden_features, dropout=0.): self.dropout = nn.Dropout(dropout) def _init_weights(self): - weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.XavierUniform()) - bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Normal(std=1e-6)) + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) return weight_attr, bias_attr def forward(self, x): @@ -144,15 +144,24 @@ def __init__(self, self.dim_head = dim // num_heads self.scale = qk_scale or self.dim_head ** -0.5 - self.q = nn.Linear(dim, dim, bias_attr=qkv_bias) - self.k = nn.Linear(dim, dim, bias_attr=qkv_bias) - self.v = nn.Linear(dim, dim, bias_attr=qkv_bias) + w_attr_1, b_attr_1 = self._init_weights() + self.q = nn.Linear(dim, dim, weight_attr=w_attr_1, bias_attr=b_attr_1 if qkv_bias else False) + w_attr_2, b_attr_2 = self._init_weights() + self.k = nn.Linear(dim, dim, weight_attr=w_attr_2, bias_attr=b_attr_2 if qkv_bias else False) + w_attr_3, b_attr_3 = self._init_weights() + self.v = nn.Linear(dim, dim, weight_attr=w_attr_3, bias_attr=b_attr_3 if qkv_bias else False) self.attn_dropout = nn.Dropout(attention_dropout) - self.proj = nn.Linear(dim, dim) + w_attr_4, b_attr_4 = self._init_weights() + self.proj = nn.Linear(dim, dim, weight_attr=w_attr_4, bias_attr=b_attr_4) self.proj_dropout = nn.Dropout(dropout) self.softmax = nn.Softmax(axis=-1) + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + def forward(self, x): B, N, C = x.shape @@ -206,15 +215,24 @@ def __init__(self, self.dim_head = dim // num_heads self.scale = self.dim_head ** -0.5 - self.qkv = nn.Linear(dim, dim * 3, bias_attr=qkv_bias) + w_attr_1, b_attr_1 = self._init_weights() + self.qkv = nn.Linear(dim, dim * 3, weight_attr=w_attr_1, bias_attr=b_attr_1 if qkv_bias else False) self.attn_dropout = nn.Dropout(attention_dropout) self.softmax = nn.Softmax(axis=-1) - self.proj = nn.Linear(dim, dim) + w_attr_2, b_attr_2 = self._init_weights() + self.proj = nn.Linear(dim, dim, weight_attr=w_attr_2, bias_attr=b_attr_2) self.proj_dropout = nn.Dropout(dropout) # talking head - self.proj_l = nn.Linear(num_heads, num_heads) - self.proj_w = nn.Linear(num_heads, num_heads) + w_attr_3, b_attr_3 = self._init_weights() + self.proj_l = nn.Linear(num_heads, num_heads, weight_attr=w_attr_3, bias_attr=b_attr_3) + w_attr_4, b_attr_4 = self._init_weights() + self.proj_w = nn.Linear(num_heads, num_heads, weight_attr=w_attr_4, bias_attr=b_attr_4) + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr def transpose_multihead(self, x): new_shape = x.shape[:-1] + [self.num_heads, self.dim_head] @@ -280,14 +298,16 @@ def __init__(self, droppath=0., init_values=1e-4): super().__init__() - self.norm1 = nn.LayerNorm(dim, epsilon=1e-6) + w_attr_1, b_attr_1 = self._init_weights() + self.norm1 = nn.LayerNorm(dim, weight_attr=w_attr_1, bias_attr=b_attr_1, epsilon=1e-6) self.attn = ClassAttention(dim, num_heads=num_heads, qkv_bias=qkv_bias, dropout=dropout, attention_dropout=attention_dropout)
self.drop_path = DropPath(droppath) if droppath > 0. else Identity() - self.norm2 = nn.LayerNorm(dim, epsilon=1e-6) + w_attr_2, b_attr_2 = self._init_weights() + self.norm2 = nn.LayerNorm(dim, weight_attr=w_attr_2, bias_attr=b_attr_2, epsilon=1e-6) self.mlp = Mlp(in_features=dim, hidden_features=int(dim * mlp_ratio), dropout=dropout) @@ -301,6 +321,11 @@ def __init__(self, dtype='float32', default_initializer=nn.initializer.Constant(init_values)) + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(1.0)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + def forward(self, x, x_cls): u = paddle.concat([x_cls, x], axis=1) @@ -346,14 +371,16 @@ def __init__(self, droppath=0., init_values=1e-4): super().__init__() - self.norm1 = nn.LayerNorm(dim, epsilon=1e-6) + w_attr_1, b_attr_1 = self._init_weights() + self.norm1 = nn.LayerNorm(dim, weight_attr=w_attr_1, bias_attr=b_attr_1, epsilon=1e-6) self.attn = TalkingHeadAttention(dim, num_heads=num_heads, qkv_bias=qkv_bias, dropout=dropout, attention_dropout=attention_dropout) self.drop_path = DropPath(droppath) if droppath > 0. else Identity() - self.norm2 = nn.LayerNorm(dim, epsilon=1e-6) + w_attr_2, b_attr_2 = self._init_weights() + self.norm2 = nn.LayerNorm(dim, weight_attr=w_attr_2, bias_attr=b_attr_2, epsilon=1e-6) self.mlp = Mlp(in_features=dim, hidden_features=int(dim * mlp_ratio), dropout=dropout) @@ -367,6 +394,11 @@ def __init__(self, dtype='float32', default_initializer=nn.initializer.Constant(init_values)) + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(1.0)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + def forward(self, x): h = x x = self.norm1(x) @@ -469,8 +501,23 @@ def __init__(self, layer_list.append(copy.deepcopy(block_layers)) self.blocks_token_only = nn.LayerList(layer_list) - self.norm = nn.LayerNorm(embed_dim, epsilon=1e-6) - self.head = nn.Linear(embed_dim, num_classes) if num_classes > 0 else Identity() + w_attr_1, b_attr_1 = self._init_weights_norm() + self.norm = nn.LayerNorm(embed_dim, weight_attr=w_attr_1, bias_attr=b_attr_1, epsilon=1e-6) + w_attr_2, b_attr_2 = self._init_weights_linear() + self.head = nn.Linear(embed_dim, + num_classes, + weight_attr=w_attr_2, + bias_attr=b_attr_2) if num_classes > 0 else Identity() + + def _init_weights_norm(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(1.0)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + + def _init_weights_linear(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr def forward_features(self, x): # Patch Embedding @@ -498,10 +545,18 @@ def forward(self, x): def build_cait(config): """build cait model using config""" model = Cait(image_size=config.DATA.IMAGE_SIZE, + num_classes=config.MODEL.NUM_CLASSES, + in_channels=config.MODEL.TRANS.IN_CHANNELS, patch_size=config.MODEL.TRANS.PATCH_SIZE, embed_dim=config.MODEL.TRANS.EMBED_DIM, depth=config.MODEL.TRANS.DEPTH, num_heads=config.MODEL.TRANS.NUM_HEADS, mlp_ratio=config.MODEL.TRANS.MLP_RATIO, - qkv_bias=config.MODEL.TRANS.QKV_BIAS) + qkv_bias=config.MODEL.TRANS.QKV_BIAS, + dropout=config.MODEL.DROPOUT, + 
attention_dropout=config.MODEL.ATTENTION_DROPOUT, + droppath=config.MODEL.DROPPATH, + init_values=config.MODEL.TRANS.INIT_VALUES, + mlp_ratio_class_token=config.MODEL.TRANS.MLP_RATIO, + depth_token_only=config.MODEL.TRANS.DEPTH_TOKEN_ONLY): return model diff --git a/image_classification/CaiT/config.py b/image_classification/CaiT/config.py index 163a1fcd..99f4e221 100644 --- a/image_classification/CaiT/config.py +++ b/image_classification/CaiT/config.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -34,7 +34,9 @@ _C.DATA.DATASET = 'imagenet2012' # dataset name _C.DATA.IMAGE_SIZE = 224 # input image size: 224 for pretrain, 384 for finetune _C.DATA.CROP_PCT = 0.875 # input image scale ratio, scale is applied before centercrop in eval mode -_C.DATA.NUM_WORKERS = 2 # number of data loading threads +_C.DATA.NUM_WORKERS = 2 # number of data loading threads +_C.DATA.IMAGENET_MEAN = [0.485, 0.456, 0.406] # [0.5, 0.5, 0.5] +_C.DATA.IMAGENET_STD = [0.229, 0.224, 0.225] # [0.5, 0.5, 0.5] # model settings _C.MODEL = CN() @@ -43,7 +45,8 @@ _C.MODEL.RESUME = None _C.MODEL.PRETRAINED = None _C.MODEL.NUM_CLASSES = 1000 -_C.MODEL.DROPOUT = 0.1 +_C.MODEL.DROPOUT = 0.0 +_C.MODEL.DROPPATH = 0.1 _C.MODEL.ATTENTION_DROPOUT = 0.0 # transformer settings @@ -56,20 +59,23 @@ _C.MODEL.TRANS.MLP_RATIO = 4.0 _C.MODEL.TRANS.NUM_HEADS = 4 _C.MODEL.TRANS.QKV_BIAS = True -_C.MODEL.TRANS.INIT_VALUES = 1e-5 +_C.MODEL.TRANS.INIT_VALUES = 1e-4 # training settings _C.TRAIN = CN() _C.TRAIN.LAST_EPOCH = 0 _C.TRAIN.NUM_EPOCHS = 300 -_C.TRAIN.WARMUP_EPOCHS = 3 #34 # ~ 10k steps for 4096 batch size -_C.TRAIN.WEIGHT_DECAY = 0.05 #0.3 # 0.0 for finetune -_C.TRAIN.BASE_LR = 0.001 #0.003 for pretrain # 0.03 for finetune -_C.TRAIN.WARMUP_START_LR = 1e-6 #0.0 -_C.TRAIN.END_LR = 5e-4 -_C.TRAIN.GRAD_CLIP = 1.0 -_C.TRAIN.ACCUM_ITER = 2 #1 +_C.TRAIN.WARMUP_EPOCHS = 5 +_C.TRAIN.WEIGHT_DECAY = 0.05 +_C.TRAIN.BASE_LR = 0.0005 +_C.TRAIN.WARMUP_START_LR = 1e-6 +_C.TRAIN.END_LR = 1e-5 +_C.TRAIN.GRAD_CLIP = None +_C.TRAIN.ACCUM_ITER = 1 +_C.TRAIN.MODEL_EMA = True +_C.TRAIN.MODEL_EMA_DECAY = 0.99996 +_C.TRAIN.LINEAR_SCALED_LR = None _C.TRAIN.LR_SCHEDULER = CN() _C.TRAIN.LR_SCHEDULER.NAME = 'warmupcosine' @@ -83,14 +89,38 @@ _C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999) # for adamW _C.TRAIN.OPTIMIZER.MOMENTUM = 0.9 +# train augmentation +_C.TRAIN.MIXUP_ALPHA = 0.8 # mixup alpha, enabled if >0 +_C.TRAIN.CUTMIX_ALPHA = 1.0 # cutmix alpha, enabled if >0 +_C.TRAIN.CUTMIX_MINMAX = None # cutmix min/max ratio, overrides alpha +_C.TRAIN.MIXUP_PROB = 1.0 # prob of mixup or cutmix when either/both is enabled +_C.TRAIN.MIXUP_SWITCH_PROB = 0.5 # prob of switching cutmix when both mixup and cutmix enabled +_C.TRAIN.MIXUP_MODE = 'batch' # how to apply mixup/cutmix params, per 'batch', 'pair' or 'elem' + +_C.TRAIN.SMOOTHING = 0.1 +_C.TRAIN.COLOR_JITTER = 0.4 # color jitter factor +_C.TRAIN.AUTO_AUGMENT = False #'rand-m9-mstd0.5-inc1' +_C.TRAIN.RAND_AUGMENT = True + +_C.TRAIN.RANDOM_ERASE_PROB = 0.25 # random erase prob +_C.TRAIN.RANDOM_ERASE_MODE = 'pixel' # random erase mode +_C.TRAIN.RANDOM_ERASE_COUNT = 1 # random erase count +_C.TRAIN.RANDOM_ERASE_SPLIT = False + +_C.TRAIN.DISTILLATION_TYPE = 'hard' # hard, soft, none +_C.TRAIN.DISTILLATION_ALPHA = 0.5 +_C.TRAIN.DISTILLATION_TAU = 1.0 + + # misc _C.SAVE = "./output" _C.TAG = "default" 
-_C.SAVE_FREQ = 5 # freq to save chpt -_C.REPORT_FREQ = 100 # freq to logging info -_C.VALIDATE_FREQ = 100 # freq to do validation -_C.SEED = 0 +_C.SAVE_FREQ = 1 # freq to save chpt +_C.REPORT_FREQ = 50 # freq to logging info +_C.VALIDATE_FREQ = 10 # freq to do validation +_C.SEED = 42 _C.EVAL = False # run evaluation only +_C.AMP = False # mix precision training _C.LOCAL_RANK = 0 _C.NGPUS = -1 @@ -124,8 +154,12 @@ def update_config(config, args): config.DATA.BATCH_SIZE = args.batch_size if args.image_size: config.DATA.IMAGE_SIZE = args.image_size + if args.num_classes: + config.MODEL.NUM_CLASSES = args.num_classes if args.data_path: config.DATA.DATA_PATH = args.data_path + if args.output is not None: + config.SAVE = args.output if args.ngpus: config.NGPUS = args.ngpus if args.eval: @@ -137,6 +171,11 @@ def update_config(config, args): config.MODEL.RESUME = args.resume if args.last_epoch: config.TRAIN.LAST_EPOCH = args.last_epoch + if args.amp: # only during training + if config.EVAL is True: + config.AMP = False + else: + config.AMP = True #config.freeze() return config diff --git a/image_classification/CaiT/configs/cait_m36_384.yaml b/image_classification/CaiT/configs/cait_m36_384.yaml new file mode 100644 index 00000000..8317c049 --- /dev/null +++ b/image_classification/CaiT/configs/cait_m36_384.yaml @@ -0,0 +1,14 @@ +DATA: + IMAGE_SIZE: 384 + CROP_PCT: 1.0 + +MODEL: + TYPE: cait + NAME: cait_m36_384 + TRANS: + PATCH_SIZE: 16 + EMBED_DIM: 768 + DEPTH: 36 + NUM_HEADS: 16 + INIT_VALUES: 1e-6 + DEPTH_TOKEN_ONLY: 2 diff --git a/image_classification/CaiT/configs/cait_s24_224.yaml b/image_classification/CaiT/configs/cait_s24_224.yaml new file mode 100644 index 00000000..497e992e --- /dev/null +++ b/image_classification/CaiT/configs/cait_s24_224.yaml @@ -0,0 +1,14 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 1.0 + +MODEL: + TYPE: cait + NAME: cait_s24_224 + TRANS: + PATCH_SIZE: 16 + EMBED_DIM: 384 + DEPTH: 24 + NUM_HEADS: 8 + INIT_VALUES: 1e-5 + DEPTH_TOKEN_ONLY: 2 diff --git a/image_classification/CaiT/configs/cait_s24_384.yaml b/image_classification/CaiT/configs/cait_s24_384.yaml index 9e042574..a2aca032 100644 --- a/image_classification/CaiT/configs/cait_s24_384.yaml +++ b/image_classification/CaiT/configs/cait_s24_384.yaml @@ -4,7 +4,7 @@ DATA: MODEL: TYPE: cait - NAME: cait_s24_284 + NAME: cait_s24_384 TRANS: PATCH_SIZE: 16 EMBED_DIM: 384 diff --git a/image_classification/CaiT/configs/cait_s36_384.yaml b/image_classification/CaiT/configs/cait_s36_384.yaml new file mode 100644 index 00000000..5707cf7d --- /dev/null +++ b/image_classification/CaiT/configs/cait_s36_384.yaml @@ -0,0 +1,14 @@ +DATA: + IMAGE_SIZE: 384 + CROP_PCT: 1.0 + +MODEL: + TYPE: cait + NAME: cait_s36_384 + TRANS: + PATCH_SIZE: 16 + EMBED_DIM: 384 + DEPTH: 36 + NUM_HEADS: 8 + INIT_VALUES: 1e-6 + DEPTH_TOKEN_ONLY: 2 diff --git a/image_classification/CaiT/configs/cait_xs24_384.yaml b/image_classification/CaiT/configs/cait_xs24_384.yaml new file mode 100644 index 00000000..1caf796d --- /dev/null +++ b/image_classification/CaiT/configs/cait_xs24_384.yaml @@ -0,0 +1,15 @@ +DATA: + IMAGE_SIZE: 384 + CROP_PCT: 1.0 +MODEL: + TYPE: cait + NAME: cait_xs24_384 + TRANS: + PATCH_SIZE: 16 + EMBED_DIM: 288 + DEPTH: 24 + NUM_HEADS: 6 + MLP_RATIO: 4.0 + QKV_BIAS: True + INIT_VALUES: 1e-5 + DEPTH_TOKEN_ONLY: 2 diff --git a/image_classification/CaiT/configs/cait_xxs24_384.yaml b/image_classification/CaiT/configs/cait_xxs24_384.yaml new file mode 100644 index 00000000..d83a8702 --- /dev/null +++ 
b/image_classification/CaiT/configs/cait_xxs24_384.yaml @@ -0,0 +1,15 @@ +DATA: + IMAGE_SIZE: 384 + CROP_PCT: 1.0 +MODEL: + TYPE: cait + NAME: cait_xxs24_384 + TRANS: + PATCH_SIZE: 16 + EMBED_DIM: 192 + DEPTH: 24 + NUM_HEADS: 4 + MLP_RATIO: 4.0 + QKV_BIAS: True + INIT_VALUES: 1e-5 + DEPTH_TOKEN_ONLY: 2 diff --git a/image_classification/CaiT/configs/cait_xxs36_224.yaml b/image_classification/CaiT/configs/cait_xxs36_224.yaml new file mode 100644 index 00000000..a12475f1 --- /dev/null +++ b/image_classification/CaiT/configs/cait_xxs36_224.yaml @@ -0,0 +1,15 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 1.0 +MODEL: + TYPE: cait + NAME: cait_xxs36_224 + TRANS: + PATCH_SIZE: 16 + EMBED_DIM: 192 + DEPTH: 36 + NUM_HEADS: 4 + MLP_RATIO: 4.0 + QKV_BIAS: True + INIT_VALUES: 1e-5 + DEPTH_TOKEN_ONLY: 2 diff --git a/image_classification/CaiT/configs/cait_xxs36_384.yaml b/image_classification/CaiT/configs/cait_xxs36_384.yaml new file mode 100644 index 00000000..44f907eb --- /dev/null +++ b/image_classification/CaiT/configs/cait_xxs36_384.yaml @@ -0,0 +1,15 @@ +DATA: + IMAGE_SIZE: 384 + CROP_PCT: 1.0 +MODEL: + TYPE: cait + NAME: cait_xxs36_384 + TRANS: + PATCH_SIZE: 16 + EMBED_DIM: 192 + DEPTH: 36 + NUM_HEADS: 4 + MLP_RATIO: 4.0 + QKV_BIAS: True + INIT_VALUES: 1e-5 + DEPTH_TOKEN_ONLY: 2 diff --git a/image_classification/CaiT/datasets.py b/image_classification/CaiT/datasets.py index 66afc611..e06767df 100644 --- a/image_classification/CaiT/datasets.py +++ b/image_classification/CaiT/datasets.py @@ -19,8 +19,15 @@ import os import math +from PIL import Image from paddle.io import Dataset, DataLoader, DistributedBatchSampler from paddle.vision import transforms, datasets, image_load +from augment import auto_augment_policy_original +from augment import AutoAugment +from augment import rand_augment_policy_original +from augment import RandAugment +from random_erasing import RandomErasing + class ImageNet2012Dataset(Dataset): """Build ImageNet2012 dataset @@ -60,7 +67,7 @@ def __len__(self): return len(self.label_list) def __getitem__(self, index): - data = image_load(self.img_path_list[index]).convert('RGB') + data = Image.open(self.img_path_list[index]).convert('RGB') data = self.transform(data) label = self.label_list[index] @@ -79,14 +86,36 @@ def get_train_transforms(config): Returns: transforms_train: training transforms """ - - transforms_train = transforms.Compose([ + aug_op_list = [] + # random crop and resize + aug_op_list.append( transforms.RandomResizedCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE), - scale=(0.05, 1.0)), - transforms.ToTensor(), - #transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]), - transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]), - ]) + scale=(0.05, 1.0))) + # auto_augment / color jitter + if config.TRAIN.AUTO_AUGMENT: + policy = auto_augment_policy_original() + auto_augment = AutoAugment(policy) + aug_op_list.append(auto_augment) + elif config.TRAIN.RAND_AUGMENT: + policy = rand_augment_policy_original() + rand_augment = RandAugment(policy) + aug_op_list.append(rand_augment) + else: + jitter = (float(config.TRAIN.COLOR_JITTER),) * 3 + aug_op_list.append(transforms.ColorJitter(*jitter)) + # other ops + aug_op_list.append(transforms.ToTensor()) + aug_op_list.append(transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, + std=config.DATA.IMAGENET_STD)) + # random erasing + if config.TRAIN.RANDOM_ERASE_PROB > 0.: + random_erasing = RandomErasing(prob=config.TRAIN.RANDOM_ERASE_PROB, + mode=config.TRAIN.RANDOM_ERASE_MODE, + 
max_count=config.TRAIN.RANDOM_ERASE_COUNT, + num_splits=config.TRAIN.RANDOM_ERASE_SPLIT) + aug_op_list.append(random_erasing) + + transforms_train = transforms.Compose(aug_op_list) return transforms_train @@ -109,8 +138,7 @@ def get_val_transforms(config): transforms.Resize(scale_size, 'bicubic'), transforms.CenterCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE)), transforms.ToTensor(), - #transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]), - transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]), + transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD), ]) return transforms_val diff --git a/image_classification/CaiT/losses.py b/image_classification/CaiT/losses.py new file mode 100644 index 00000000..04377eac --- /dev/null +++ b/image_classification/CaiT/losses.py @@ -0,0 +1,144 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" Implement Loss functions """ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class LabelSmoothingCrossEntropyLoss(nn.Layer): + """ cross entropy loss for label smoothing + Args: + smoothing: float, label smoothing rate + x: tensor, predictions (default is before softmax) with shape [N, num_classes] as default + target: tensor, target label with shape [N] as default + weight: tensor, optional, a manual rescaling weight given to each class + reduction: str, optional, indicate how to average the loss by batch_size, + default is ``'mean'``, the candicates are ``'none'`` | ``'mean'`` | ``'sum'`` + axis: int, optional, the index of dimension to perform softmax calculations, + default is ``-1``, if `axis` is not -1 -> the shape of x and target may not be default + use_softmax: bool, optional, if `use_softmax` is ``False``, ``x`` should be after softmax, + default is ``True``, the candicates are ``True`` | ``False`` + name: str, optional, the name of the operator, default is ``None``, + for more information, please refer to :ref:`api_guide_Name`. 
+ Return: + loss: float, cross entropy loss value + """ + def __init__(self, + smoothing=0.1, + weight=None, + reduction='mean', + axis=-1, + use_softmax=True, + name=None): + super().__init__() + assert 0 <= smoothing < 1.0 + self.smoothing = smoothing + self.weight = weight + self.reduction = reduction + self.axis = axis + self.use_softmax = use_softmax + self.name = name + + def forward(self, x, target): + target = paddle.nn.functional.one_hot(target, num_classes=x.shape[1]) + target = paddle.nn.functional.label_smooth(target, epsilon=self.smoothing) + loss = paddle.nn.functional.cross_entropy( + x, + target, + weight=self.weight, + reduction=self.reduction, + soft_label=True, + axis=self.axis, + use_softmax=self.use_softmax, + name=self.name) + return loss + + +class SoftTargetCrossEntropyLoss(nn.Layer): + """ cross entropy loss for soft target + Args: + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, soft target with shape [N, num_classes] + Returns: + loss: float, the mean loss value + """ + def __init__(self): + super().__init__() + + def forward(self, x, target): + loss = paddle.sum(-target * F.log_softmax(x, axis=-1), axis=-1) + return loss.mean() + + +class DistillationLoss(nn.Layer): + """Distillation loss function + This layer includes the orginal loss (criterion) and a extra + distillation loss (criterion), which computes the loss with + different type options, between current model and + a teacher model as its supervision. + + Args: + base_criterion: nn.Layer, the original criterion + teacher_model: nn.Layer, the teacher model as supervision + distillation_type: str, one of ['none', 'soft', 'hard'] + alpha: float, ratio of base loss (* (1-alpha)) + and distillation loss( * alpha) + tao: float, temperature in distillation + """ + def __init__(self, + base_criterion, + teacher_model, + distillation_type, + alpha, + tau): + super().__init__() + assert distillation_type in ['none', 'soft', 'hard'] + self.base_criterion = base_criterion + self.teacher_model = teacher_model + self.type = distillation_type + self.alpha = alpha + self.tau = tau + + def forward(self, inputs, outputs, targets): + """ + Args: + inputs: tensor, the orginal model inputs + outputs: tensor, the outputs of the model + outputds_kd: tensor, the distillation outputs of the model, + this is usually obtained by a separate branch + in the last layer of the model + targets: tensor, the labels for the base criterion + """ + outputs, outputs_kd = outputs[0], outputs[1] + base_loss = self.base_criterion(outputs, targets) + if self.type == 'none': + return base_loss + + with paddle.no_grad(): + teacher_outputs = self.teacher_model(inputs) + + if self.type == 'soft': + distillation_loss = F.kl_div( + F.log_softmax(outputs_kd / self.tau, axis=1), + F.log_softmax(teacher_outputs / self.tau, axis=1), + reduction='sum') * (self.tau * self.tau) / outputs_kd.numel() + elif self.type == 'hard': + distillation_loss = F.cross_entropy(outputs_kd, teacher_outputs.argmax(axis=1)) + + loss = base_loss * (1 - self.alpha) + distillation_loss * self.alpha + return loss + + diff --git a/image_classification/CaiT/main_multi_gpu.py b/image_classification/CaiT/main_multi_gpu.py index d14970f7..f274d0df 100644 --- a/image_classification/CaiT/main_multi_gpu.py +++ b/image_classification/CaiT/main_multi_gpu.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. 
# # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -25,52 +25,57 @@ import paddle.nn as nn import paddle.nn.functional as F import paddle.distributed as dist -from datasets import get_dataloader, get_dataset -from cait import build_cait as build_model +from datasets import get_dataloader +from datasets import get_dataset from utils import AverageMeter from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn from config import get_config from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from model_ema import ModelEma +from cait import build_cait as build_model -parser = argparse.ArgumentParser('CaiT') -parser.add_argument('-cfg', type=str, default=None) -parser.add_argument('-dataset', type=str, default=None) -parser.add_argument('-batch_size', type=int, default=None) -parser.add_argument('-image_size', type=int, default=None) -parser.add_argument('-data_path', type=str, default=None) -parser.add_argument('-ngpus', type=int, default=None) -parser.add_argument('-pretrained', type=str, default=None) -parser.add_argument('-resume', type=str, default=None) -parser.add_argument('-last_epoch', type=int, default=None) -parser.add_argument('-eval', action='store_true') -arguments = parser.parse_args() - - -log_format = "%(asctime)s %(message)s" -logging.basicConfig(stream=sys.stdout, level=logging.INFO, - format=log_format, datefmt="%m%d %I:%M:%S %p") - -# get default config -config = get_config() -# update config by arguments -config = update_config(config, arguments) - -# set output folder -if not config.EVAL: - config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) -else: - config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) - -if not os.path.exists(config.SAVE): - os.makedirs(config.SAVE, exist_ok=True) - -# set logging format -logger = logging.getLogger() -fh = logging.FileHandler(os.path.join(config.SAVE, 'log.txt')) -fh.setFormatter(logging.Formatter(log_format)) -logger.addHandler(fh) -logger.info(f'config= {config}') +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('CaiT') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + 
format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger def train(dataloader, @@ -78,83 +83,157 @@ def train(dataloader, criterion, optimizer, epoch, + total_epochs, total_batch, debug_steps=100, - accum_iter=1): + accum_iter=1, + model_ema=None, + mixup_fn=None, + amp=False, + local_logger=None, + master_logger=None): """Training for one epoch Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion epoch: int, current epoch - total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info - accum_iter: int, num of iters for accumulating gradients + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + model_ema: ModelEma, model moving average instance + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None Returns: - train_loss_meter.avg - train_acc_meter.avg - train_time + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + master_train_loss_meter.avg: float, average loss on all processes/gpus + master_train_acc_meter.avg: float, average top1 accuracy on all processes/gpus + train_time: float, training time """ model.train() train_loss_meter = AverageMeter() train_acc_meter = AverageMeter() + master_train_loss_meter = AverageMeter() + master_train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) time_st = time.time() for batch_id, data in enumerate(dataloader): image = data[0] label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() - output = model(image) - loss = criterion(output, label) + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() - #NOTE: division may be needed depending on the loss function - # Here no division is needed: - # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' - # - #loss = loss / accum_iter + if model_ema is not None and dist.get_rank() == 0: + model_ema.update(model) - loss.backward() + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) - if ((batch_id +1) % accum_iter == 0) or 
(batch_id + 1 == len(dataloader)): - optimizer.step() - optimizer.clear_grad() + batch_size = paddle.to_tensor(image.shape[0]) - pred = F.softmax(output) - acc = paddle.metric.accuracy(pred, label.unsqueeze(1)) + # sync from other gpus for overall loss and acc + master_loss = loss.clone() + master_acc = acc.clone() + master_batch_size = batch_size.clone() + dist.all_reduce(master_loss) + dist.all_reduce(master_acc) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc = master_acc / dist.get_world_size() + master_train_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_train_acc_meter.update(master_acc.numpy()[0], master_batch_size.numpy()[0]) - batch_size = image.shape[0] - train_loss_meter.update(loss.numpy()[0], batch_size) - train_acc_meter.update(acc.numpy()[0], batch_size) + train_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + train_acc_meter.update(acc.numpy()[0], batch_size.numpy()[0]) if batch_id % debug_steps == 0: - logger.info( - f"Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + - f"Step[{batch_id:04d}/{total_batch:04d}], " + - f"Avg Loss: {train_loss_meter.avg:.4f}, " + - f"Avg Acc: {train_acc_meter.avg:.4f}") + if local_logger: + local_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_train_loss_meter.avg:.4f}, " + + f"Avg Acc: {master_train_acc_meter.avg:.4f}") train_time = time.time() - time_st - return train_loss_meter.avg, train_acc_meter.avg, train_time - - -def validate(dataloader, model, criterion, total_batch, debug_steps=100): + return (train_loss_meter.avg, + train_acc_meter.avg, + master_train_loss_meter.avg, + master_train_acc_meter.avg, + train_time) + + +def validate(dataloader, + model, + criterion, + total_batch, + debug_steps=100, + local_logger=None, + master_logger=None): """Validation for whole dataset Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info + debug_steps: int, num of iters to log info, default: 100 + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None Returns: - val_loss_meter.avg - val_acc1_meter.avg - val_acc5_meter.avg - val_time + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + master_val_loss_meter.avg: float, average loss on all processes/gpus + master_val_acc1_meter.avg: float, average top1 accuracy on all processes/gpus + master_val_acc5_meter.avg: float, average top5 accuracy on all processes/gpus + val_time: float, validation time """ model.eval() val_loss_meter = AverageMeter() val_acc1_meter = AverageMeter() val_acc5_meter = AverageMeter() + master_val_loss_meter = AverageMeter() + master_val_acc1_meter = AverageMeter() + master_val_acc5_meter = AverageMeter() time_st = time.time() with paddle.no_grad(): @@ -169,56 +248,144 @@ def validate(dataloader, model, criterion, total_batch, debug_steps=100): acc1 = paddle.metric.accuracy(pred, 
label.unsqueeze(1)) acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) - dist.all_reduce(loss) - dist.all_reduce(acc1) - dist.all_reduce(acc5) - loss = loss / dist.get_world_size() - acc1 = acc1 / dist.get_world_size() - acc5 = acc5 / dist.get_world_size() - batch_size = paddle.to_tensor(image.shape[0]) - dist.all_reduce(batch_size) + + master_loss = loss.clone() + master_acc1 = acc1.clone() + master_acc5 = acc5.clone() + master_batch_size = batch_size.clone() + + dist.all_reduce(master_loss) + dist.all_reduce(master_acc1) + dist.all_reduce(master_acc5) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc1 = master_acc1 / dist.get_world_size() + master_acc5 = master_acc5 / dist.get_world_size() + + master_val_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc1_meter.update(master_acc1.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc5_meter.update(master_acc5.numpy()[0], master_batch_size.numpy()[0]) val_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) val_acc1_meter.update(acc1.numpy()[0], batch_size.numpy()[0]) val_acc5_meter.update(acc5.numpy()[0], batch_size.numpy()[0]) if batch_id % debug_steps == 0: - logger.info( - f"Val Step[{batch_id:04d}/{total_batch:04d}], " + - f"Avg Loss: {val_loss_meter.avg:.4f}, " + - f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + - f"Avg Acc@5: {val_acc5_meter.avg:.4f}") - + if local_logger: + local_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {master_val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {master_val_acc5_meter.avg:.4f}") val_time = time.time() - time_st - return val_loss_meter.avg, val_acc1_meter.avg, val_acc5_meter.avg, val_time + return (val_loss_meter.avg, + val_acc1_meter.avg, + val_acc5_meter.avg, + master_val_loss_meter.avg, + master_val_acc1_meter.avg, + master_val_acc5_meter.avg, + val_time) def main_worker(*args): - # 0. Preparation + # STEP 0: Preparation + config = args[0] dist.init_parallel_env() last_epoch = config.TRAIN.LAST_EPOCH - world_size = paddle.distributed.get_world_size() - local_rank = paddle.distributed.get_rank() - logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + world_size = dist.get_world_size() + local_rank = dist.get_rank() seed = config.SEED + local_rank paddle.seed(seed) np.random.seed(seed) random.seed(seed) - # 1. Create model + # logger for each process/gpu + local_logger = get_logger( + filename=os.path.join(config.SAVE, 'log_{}.txt'.format(local_rank)), + logger_name='local_logger') + # overall logger + if local_rank == 0: + master_logger = get_logger( + filename=os.path.join(config.SAVE, 'log.txt'), + logger_name='master_logger') + master_logger.info(f'\n{config}') + else: + master_logger = None + local_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + if local_rank == 0: + master_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + + # STEP 1: Create model model = build_model(config) + # define model ema + model_ema = None + if not config.EVAL and config.TRAIN.MODEL_EMA and local_rank == 0: + model_ema = ModelEma(model, decay=config.TRAIN.MODEL_EMA_DECAY) model = paddle.DataParallel(model) - # 2. 
Create train and val dataloader - dataset_train, dataset_val = args[0], args[1] - dataloader_train = get_dataloader(config, dataset_train, 'train', True) + + # STEP 2: Create train and val dataloader + dataset_train, dataset_val = args[1], args[2] + # Create training dataloader + if not config.EVAL: + dataloader_train = get_dataloader(config, dataset_train, 'train', True) + total_batch_train = len(dataloader_train) + local_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + if local_rank == 0: + master_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + # Create validation dataloader dataloader_val = get_dataloader(config, dataset_val, 'test', True) - total_batch_train = len(dataloader_train) total_batch_val = len(dataloader_val) - logging.info(f'----- Total # of train batch (single gpu): {total_batch_train}') - logging.info(f'----- Total # of val batch (single gpu): {total_batch_val}') - # 3. Define criterion - criterion = nn.CrossEntropyLoss() - # 4. Define optimizer and lr_scheduler + local_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + if local_rank == 0: + master_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + scheduler = None if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, @@ -240,7 +407,9 @@ def main_worker(*args): gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, last_epoch=last_epoch) else: - logging.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + local_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") 
raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") if config.TRAIN.OPTIMIZER.NAME == "SGD": @@ -267,79 +436,132 @@ def main_worker(*args): weight_decay=config.TRAIN.WEIGHT_DECAY, epsilon=config.TRAIN.OPTIMIZER.EPS, grad_clip=clip, - #apply_decay_param_fun=get_exclude_from_weight_decay_fn(['pos_embed', 'cls_token']), + apply_decay_param_fun=get_exclude_from_weight_decay_fn([ + 'absolute_pos_embed', 'relative_position_bias_table']), ) else: - logging.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + local_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") - # 5. Load pretrained model / load resumt model and optimizer states + # STEP 6: Load pretrained model / load resumt model and optimizer states if config.MODEL.PRETRAINED: if (config.MODEL.PRETRAINED).endswith('.pdparams'): raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') model.set_dict(model_state) - logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + local_logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + if local_rank == 0: + master_logger.info( + f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") if config.MODEL.RESUME: - assert os.path.isfile(config.MODEL.RESUME+'.pdparams') is True - assert os.path.isfile(config.MODEL.RESUME+'.pdopt') is True - model_state = paddle.load(config.MODEL.RESUME+'.pdparams') + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') model.set_dict(model_state) opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') optimizer.set_state_dict(opt_state) - logger.info( - f"----- Resume Training: Load model and optmizer states from {config.MODEL.RESUME}") + local_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + if local_rank == 0: + master_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + # load ema model + if model_ema is not None and os.path.isfile(config.MODEL.RESUME + '-EMA.pdparams'): + model_ema_state = paddle.load(config.MODEL.RESUME + '-EMA.pdparams') + model_ema.module.set_state_dict(model_ema_state) + local_logger.info(f'----- Load model ema from {config.MODEL.RESUME}-EMA.pdparams') + if local_rank == 0: + master_logger.info(f'----- Load model ema from {config.MODEL.RESUME}-EMA.pdparams') - # 6. 
Validation + # STEP 7: Validation (eval mode) if config.EVAL: - logger.info('----- Start Validating') - val_loss, val_acc1, val_acc5, val_time = validate( + local_logger.info('----- Start Validating') + if local_rank == 0: + master_logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=criterion, + criterion=criterion_val, total_batch=total_batch_val, - debug_steps=config.REPORT_FREQ) - logger.info(f"Validation Loss: {val_loss:.4f}, " + - f"Validation Acc@1: {val_acc1:.4f}, " + - f"Validation Acc@5: {val_acc5:.4f}, " + - f"time: {val_time:.2f}") + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") return - # 6. Start training and validation - logging.info(f"Start training from epoch {last_epoch+1}.") + # STEP 8: Start training and validation (train mode) + local_logger.info(f"Start training from epoch {last_epoch+1}.") + if local_rank == 0: + master_logger.info(f"Start training from epoch {last_epoch+1}.") for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): # train - logging.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") - train_loss, train_acc, train_time = train(dataloader=dataloader_train, - model=model, - criterion=criterion, - optimizer=optimizer, - epoch=epoch, - total_batch=total_batch_train, - debug_steps=config.REPORT_FREQ, - accum_iter=config.TRAIN.ACCUM_ITER) + local_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + if local_rank == 0: + master_logger.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") + train_loss, train_acc, avg_loss, avg_acc, train_time = train( + dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=total_batch_train, + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + model_ema=model_ema, + mixup_fn=mixup_fn, + amp=config.AMP, + local_logger=local_logger, + master_logger=master_logger) + scheduler.step() - logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + - f"Train Loss: {train_loss:.4f}, " + - f"Train Acc: {train_acc:.4f}, " + - f"time: {train_time:.2f}") + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {avg_loss:.4f}, " + + f"Train Acc: {avg_acc:.4f}, " + + f"time: {train_time:.2f}") + # validation if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: - logger.info(f'----- Validation after Epoch: {epoch}') - val_loss, val_acc1, val_acc5, val_time = validate( + local_logger.info(f'----- Validation after Epoch: {epoch}') + if local_rank == 0: + master_logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=criterion, + criterion=criterion_val, total_batch=total_batch_val, - debug_steps=config.REPORT_FREQ) - logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + - f"Validation Loss: {val_loss:.4f}, " + - f"Validation Acc@1: {val_acc1:.4f}, " + - f"Validation Acc@5: {val_acc5:.4f}, " + - f"time: {val_time:.2f}") + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") # model save if local_rank == 0: if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: @@ -347,15 +569,38 @@ def main_worker(*args): config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") paddle.save(model.state_dict(), model_path + '.pdparams') paddle.save(optimizer.state_dict(), model_path + '.pdopt') - logger.info(f"----- Save model: {model_path}.pdparams") - logger.info(f"----- Save optim: {model_path}.pdopt") + master_logger.info(f"----- Save model: {model_path}.pdparams") + master_logger.info(f"----- Save optim: {model_path}.pdopt") + if model_ema is not None: + model_ema_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}-EMA") + paddle.save(model_ema.state_dict(), model_ema_path + '.pdparams') + master_logger.info(f"----- Save ema model: {model_ema_path}.pdparams") def main(): - dataset_train = get_dataset(config, mode='train') + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + + # set output 
folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + + # get dataset and start DDP + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + else: + dataset_train = None dataset_val = get_dataset(config, mode='val') config.NGPUS = len(paddle.static.cuda_places()) if config.NGPUS == -1 else config.NGPUS - dist.spawn(main_worker, args=(dataset_train, dataset_val, ), nprocs=config.NGPUS) + dist.spawn(main_worker, args=(config, dataset_train, dataset_val, ), nprocs=config.NGPUS) if __name__ == "__main__": diff --git a/image_classification/CaiT/main_single_gpu.py b/image_classification/CaiT/main_single_gpu.py index 5432c23b..6d1cb7a0 100644 --- a/image_classification/CaiT/main_single_gpu.py +++ b/image_classification/CaiT/main_single_gpu.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -26,53 +26,54 @@ import paddle.nn.functional as F from datasets import get_dataloader from datasets import get_dataset -from cait import build_cait as build_model from utils import AverageMeter from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn from config import get_config from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from cait import build_cait as build_model -parser = argparse.ArgumentParser('CaiT') -parser.add_argument('-cfg', type=str, default=None) -parser.add_argument('-dataset', type=str, default=None) -parser.add_argument('-batch_size', type=int, default=None) -parser.add_argument('-image_size', type=int, default=None) -parser.add_argument('-data_path', type=str, default=None) -parser.add_argument('-ngpus', type=int, default=None) -parser.add_argument('-pretrained', type=str, default=None) -parser.add_argument('-resume', type=str, default=None) -parser.add_argument('-last_epoch', type=int, default=None) -parser.add_argument('-eval', action='store_true') -args = parser.parse_args() - - -log_format = "%(asctime)s %(message)s" -logging.basicConfig(stream=sys.stdout, level=logging.INFO, - format=log_format, datefmt="%m%d %I:%M:%S %p") - -# get default config -config = get_config() -# update config by arguments -config = update_config(config, args) - -# set output folder -if not config.EVAL: - config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) -else: - config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) - -#config.freeze() - -if not os.path.exists(config.SAVE): - os.makedirs(config.SAVE, exist_ok=True) - -# set logging format -logger = logging.getLogger() -fh = logging.FileHandler(os.path.join(config.SAVE, 'log.txt')) -fh.setFormatter(logging.Formatter(log_format)) -logger.addHandler(fh) -logger.info(f'config= {config}') +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('CaiT') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, 
default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger def train(dataloader, @@ -80,56 +81,87 @@ def train(dataloader, criterion, optimizer, epoch, + total_epochs, total_batch, debug_steps=100, - accum_iter=1): + accum_iter=1, + model_ema=None, + mixup_fn=None, + amp=False, + logger=None): """Training for one epoch Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion epoch: int, current epoch - total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info - accum_iter: int, num of iters for accumulating gradients + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + model_ema: ModelEma, model moving average instance + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + logger: logger for logging, default: None Returns: - train_loss_meter.avg - train_acc_meter.avg - train_time + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + train_time: float, training time """ model.train() train_loss_meter = AverageMeter() train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) time_st = time.time() for batch_id, data in enumerate(dataloader): image = data[0] label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() - output = 
model(image) - loss = criterion(output, label) - - #NOTE: division may be needed depending on the loss function - # Here no division is needed: - # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' - #loss = loss / accum_iter - - loss.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() - if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): - optimizer.step() - optimizer.clear_grad() + if model_ema is not None: + model_ema.update(model) pred = F.softmax(output) - acc = paddle.metric.accuracy(pred, label.unsqueeze(1)) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) batch_size = image.shape[0] train_loss_meter.update(loss.numpy()[0], batch_size) train_acc_meter.update(acc.numpy()[0], batch_size) - if batch_id % debug_steps == 0: + if logger and batch_id % debug_steps == 0: logger.info( - f"Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + f"Step[{batch_id:04d}/{total_batch:04d}], " + f"Avg Loss: {train_loss_meter.avg:.4f}, " + f"Avg Acc: {train_acc_meter.avg:.4f}") @@ -138,19 +170,20 @@ def train(dataloader, return train_loss_meter.avg, train_acc_meter.avg, train_time -def validate(dataloader, model, criterion, total_batch, debug_steps=100): +def validate(dataloader, model, criterion, total_batch, debug_steps=100, logger=None): """Validation for whole dataset Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion - total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + logger: logger for logging, default: None Returns: - val_loss_meter.avg - val_acc1_meter.avg - val_acc5_meter.avg - val_time + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + val_time: float, valitaion time """ model.eval() val_loss_meter = AverageMeter() @@ -175,7 +208,7 @@ def validate(dataloader, model, criterion, total_batch, debug_steps=100): val_acc1_meter.update(acc1.numpy()[0], batch_size) val_acc5_meter.update(acc5.numpy()[0], batch_size) - if batch_id % debug_steps == 0: + if logger and batch_id % debug_steps == 0: logger.info( f"Val Step[{batch_id:04d}/{total_batch:04d}], " + f"Avg Loss: {val_loss_meter.avg:.4f}, " + @@ -187,23 +220,81 @@ def validate(dataloader, model, criterion, total_batch, debug_steps=100): def main(): - # 0. Preparation + # STEP 0: Preparation + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) last_epoch = config.TRAIN.LAST_EPOCH seed = config.SEED paddle.seed(seed) np.random.seed(seed) random.seed(seed) - #paddle.set_device('gpu:0') - # 1. 
Create model + logger = get_logger(filename=os.path.join(config.SAVE, 'log.txt')) + logger.info(f'\n{config}') + + # STEP 1: Create model model = build_model(config) - # 2. Create train and val dataloader - dataset_train = get_dataset(config, mode='train') + # define model ema + model_ema = None + if not config.EVAL and config.TRAIN.MODEL_EMA: + model_ema = ModelEma(model, decay=config.TRAIN.MODEL_EMA_DECAY) + + # STEP 2: Create train and val dataloader + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + dataloader_train = get_dataloader(config, dataset_train, 'train', False) dataset_val = get_dataset(config, mode='val') - dataloader_train = get_dataloader(config, dataset_train, 'train', False) dataloader_val = get_dataloader(config, dataset_val, 'val', False) - # 3. Define criterion - criterion = nn.CrossEntropyLoss() - # 4. Define lr_scheduler + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + scheduler = None if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, @@ -212,8 +303,7 @@ def main(): end_lr=config.TRAIN.END_LR, warmup_epochs=config.TRAIN.WARMUP_EPOCHS, total_epochs=config.TRAIN.NUM_EPOCHS, - last_epoch=config.TRAIN.LAST_EPOCH, - ) + last_epoch=config.TRAIN.LAST_EPOCH) elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, T_max=config.TRAIN.NUM_EPOCHS, @@ -225,9 +315,9 @@ def main(): gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, last_epoch=last_epoch) else: - logging.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") - # 5. 
Define optimizer + if config.TRAIN.OPTIMIZER.NAME == "SGD": if config.TRAIN.GRAD_CLIP: clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) @@ -247,58 +337,76 @@ def main(): optimizer = paddle.optimizer.AdamW( parameters=model.parameters(), learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, - weight_decay=config.TRAIN.WEIGHT_DECAY, beta1=config.TRAIN.OPTIMIZER.BETAS[0], beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, epsilon=config.TRAIN.OPTIMIZER.EPS, - grad_clip=clip) + grad_clip=clip, + apply_decay_param_fun=get_exclude_from_weight_decay_fn([ + 'absolute_pos_embed', 'relative_position_bias_table']), + ) else: - logging.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") - # 6. Load pretrained model or load resume model and optimizer states + + # STEP 6: Load pretrained model or load resume model and optimizer states if config.MODEL.PRETRAINED: - assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') model.set_dict(model_state) logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") if config.MODEL.RESUME: - assert os.path.isfile(config.MODEL.RESUME+'.pdparams') is True - assert os.path.isfile(config.MODEL.RESUME+'.pdopt') is True - model_state = paddle.load(config.MODEL.RESUME+'.pdparams') + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') model.set_dict(model_state) opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') optimizer.set_state_dict(opt_state) logger.info( - f"----- Resume: Load model and optmizer from {config.MODEL.RESUME}") - # 7. Validation + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + # load ema model + if model_ema is not None and os.path.isfile(config.MODEL.RESUME + '-EMA.pdparams'): + model_ema_state = paddle.load(config.MODEL.RESUME + '-EMA.pdparams') + model_ema.module.set_state_dict(model_ema_state) + logger.info(f'----- Load model ema from {config.MODEL.RESUME}-EMA.pdparams') + + # STEP 7: Validation (eval mode) if config.EVAL: logger.info('----- Start Validating') val_loss, val_acc1, val_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=criterion, + criterion=criterion_val, total_batch=len(dataloader_val), - debug_steps=config.REPORT_FREQ) + debug_steps=config.REPORT_FREQ, + logger=logger) logger.info(f"Validation Loss: {val_loss:.4f}, " + f"Validation Acc@1: {val_acc1:.4f}, " + f"Validation Acc@5: {val_acc5:.4f}, " + f"time: {val_time:.2f}") return - # 8. Start training and validation - logging.info(f"Start training from epoch {last_epoch+1}.") + + # STEP 8: Start training and validation (train mode) + logger.info(f"Start training from epoch {last_epoch+1}.") for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): # train - logging.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + logger.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") train_loss, train_acc, train_time = train(dataloader=dataloader_train, model=model, criterion=criterion, optimizer=optimizer, epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, total_batch=len(dataloader_train), debug_steps=config.REPORT_FREQ, accum_iter=config.TRAIN.ACCUM_ITER, - ) + model_ema=model_ema, + mixup_fn=mixup_fn, + amp=config.AMP, + logger=logger) scheduler.step() logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + f"Train Loss: {train_loss:.4f}, " + @@ -310,9 +418,10 @@ def main(): val_loss, val_acc1, val_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=criterion, + criterion=criterion_val, total_batch=len(dataloader_val), - debug_steps=config.REPORT_FREQ) + debug_steps=config.REPORT_FREQ, + logger=logger) logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + f"Validation Loss: {val_loss:.4f}, " + f"Validation Acc@1: {val_acc1:.4f}, " + @@ -326,6 +435,11 @@ def main(): paddle.save(optimizer.state_dict(), model_path + '.pdopt') logger.info(f"----- Save model: {model_path}.pdparams") logger.info(f"----- Save optim: {model_path}.pdopt") + if model_ema is not None: + model_ema_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}-EMA") + paddle.save(model_ema.state_dict(), model_ema_path + '.pdparams') + logger.info(f"----- Save ema model: {model_ema_path}.pdparams") if __name__ == "__main__": diff --git a/image_classification/CaiT/mixup.py b/image_classification/CaiT/mixup.py new file mode 100644 index 00000000..1d2db493 --- /dev/null +++ b/image_classification/CaiT/mixup.py @@ -0,0 +1,225 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""mixup and cutmix for batch data""" +import numpy as np +import paddle + + +def rand_bbox(image_shape, lam, count=None): + """ CutMix bbox by lam value + Generate 1 random bbox by value lam. lam is the cut size rate. + The cut_size is computed by sqrt(1-lam) * image_size. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + count: int, number of bbox to generate + """ + image_h, image_w = image_shape[-2:] + cut_rate = np.sqrt(1. - lam) + cut_h = int(cut_rate * image_h) + cut_w = int(cut_rate * image_w) + + # get random bbox center + cy = np.random.randint(0, image_h, size=count) + cx = np.random.randint(0, image_w, size=count) + + # get bbox coords + bbox_x1 = np.clip(cx - cut_w // 2, 0, image_w) + bbox_y1 = np.clip(cy - cut_h // 2, 0, image_h) + bbox_x2 = np.clip(cx + cut_w // 2, 0, image_w) + bbox_y2 = np.clip(cy + cut_h // 2, 0, image_h) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # while in pytorch, it will return [] tensor + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def rand_bbox_minmax(image_shape, minmax, count=None): + """ CutMix bbox by min and max value + Generate 1 random bbox by min and max percentage values. 
+ Minmax is a tuple/list of min and max percentage vlaues + applied to the image width and height. + + Args: + image_shape: tuple/list, image height and width + minmax: tuple/list, min and max percentage values of image size + count: int, number of bbox to generate + """ + assert len(minmax) == 2 + image_h, image_w = image_shape[-2:] + min_ratio = minmax[0] + max_ratio = minmax[1] + cut_h = np.random.randint(int(image_h * min_ratio), int(image_h * max_ratio), size=count) + cut_w = np.random.randint(int(image_w * min_ratio), int(image_w * max_ratio), size=count) + + bbox_x1 = np.random.randint(0, image_w - cut_w, size=count) + bbox_y1 = np.random.randint(0, image_h - cut_h, size=count) + bbox_x2 = bbox_x1 + cut_w + bbox_y2 = bbox_y1 + cut_h + + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def cutmix_generate_bbox_adjust_lam(image_shape, lam, minmax=None, correct_lam=True, count=None): + """Generate bbox and apply correction for lambda + If the mimmax is None, apply the standard cutmix by lam value, + If the minmax is set, apply the cutmix by min and max percentage values. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + minmax: tuple/list, min and max percentage values of image size + correct_lam: bool, if True, correct the lam value by the generated bbox + count: int, number of bbox to generate + """ + if minmax is not None: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox_minmax(image_shape, minmax, count) + else: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox(image_shape, lam, count) + + if correct_lam or minmax is not None: + image_h, image_w = image_shape[-2:] + bbox_area = (bbox_y2 - bbox_y1) * (bbox_x2 - bbox_x1) + lam = 1. - bbox_area / float(image_h * image_w) + return (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam + + +def one_hot(x, num_classes, on_value=1., off_value=0.): + """ Generate one-hot vector for label smoothing + Args: + x: tensor, contains label/class indices + num_classes: int, num of classes (len of the one-hot vector) + on_value: float, the vector value at label index, default=1. + off_value: float, the vector value at non-label indices, default=0. + Returns: + one_hot: tensor, tensor with on value at label index and off value + at non-label indices. + """ + x = x.reshape_([-1, 1]) + x_smoothed = paddle.full((x.shape[0], num_classes), fill_value=off_value) + for i in range(x.shape[0]): + x_smoothed[i, x[i]] = on_value + return x_smoothed + + +def mixup_one_hot(label, num_classes, lam=1., smoothing=0.): + """ mixup and label smoothing in batch + label smoothing is firstly applied, then + mixup is applied by mixing the bacth and its flip, + with a mixup rate. + + Args: + label: tensor, label tensor with shape [N], contains the class indices + num_classes: int, num of all classes + lam: float, mixup rate, default=1.0 + smoothing: float, label smoothing rate + """ + off_value = smoothing / num_classes + on_value = 1. 
- smoothing + off_value + y1 = one_hot(label, num_classes, on_value, off_value) + y2 = one_hot(label.flip(axis=[0]), num_classes, on_value, off_value) + return y2 * (1 - lam) + y1 * lam + + +class Mixup: + """Mixup class + Args: + mixup_alpha: float, mixup alpha for beta distribution, default=1.0, + cutmix_alpha: float, cutmix alpha for beta distribution, default=0.0, + cutmix_minmax: list/tuple, min and max value for cutmix ratio, default=None, + prob: float, if random prob < prob, do not use mixup, default=1.0, + switch_prob: float, prob of switching mixup and cutmix, default=0.5, + mode: string, mixup up, now only 'batch' is supported, default='batch', + correct_lam: bool, if True, apply correction of lam, default=True, + label_smoothing: float, label smoothing rate, default=0.1, + num_classes: int, num of classes, default=1000 + """ + def __init__(self, + mixup_alpha=1.0, + cutmix_alpha=0.0, + cutmix_minmax=None, + prob=1.0, + switch_prob=0.5, + mode='batch', + correct_lam=True, + label_smoothing=0.1, + num_classes=1000): + self.mixup_alpha = mixup_alpha + self.cutmix_alpha = cutmix_alpha + self.cutmix_minmax = cutmix_minmax + if cutmix_minmax is not None: + assert len(cutmix_minmax) == 2 + self.cutmix_alpha = 1.0 + self.mix_prob = prob + self.switch_prob = switch_prob + self.label_smoothing = label_smoothing + self.num_classes = num_classes + self.mode = mode + self.correct_lam = correct_lam + assert mode == 'batch', 'Now only batch mode is supported!' + + def __call__(self, x, target): + assert x.shape[0] % 2 == 0, "Batch size should be even" + lam = self._mix_batch(x) + target = mixup_one_hot(target, self.num_classes, lam, self.label_smoothing) + return x, target + + def get_params(self): + """Decide to use cutmix or regular mixup by sampling and + sample lambda for mixup + """ + lam = 1. + use_cutmix = False + use_mixup = np.random.rand() < self.mix_prob + if use_mixup: + if self.mixup_alpha > 0. and self.cutmix_alpha > 0.: + use_cutmix = np.random.rand() < self.switch_prob + alpha = self.cutmix_alpha if use_cutmix else self.mixup_alpha + lam_mix = np.random.beta(alpha, alpha) + elif self.mixup_alpha == 0. and self.cutmix_alpha > 0.: + use_cutmix=True + lam_mix = np.random.beta(self.cutmix_alpha, self.cutmix_alpha) + elif self.mixup_alpha > 0. 
and self.cutmix_alpha == 0.: + lam_mix = np.random.beta(self.mixup_alpha, self.mixup_alpha) + else: + raise ValueError('mixup_alpha and cutmix_alpha cannot be all 0') + lam = float(lam_mix) + return lam, use_cutmix + + def _mix_batch(self, x): + """mixup/cutmix by adding batch data and its flipped version""" + lam, use_cutmix = self.get_params() + if lam == 1.: + return lam + if use_cutmix: + (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam = cutmix_generate_bbox_adjust_lam( + x.shape, + lam, + minmax=self.cutmix_minmax, + correct_lam=self.correct_lam) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # but in pytorch, it will return [] tensor without errors + if int(bbox_x1) != int(bbox_x2) and int(bbox_y1) != int(bbox_y2): + x[:, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] = x.flip(axis=[0])[ + :, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] + else: + x_flipped = x.flip(axis=[0]) + x_flipped = x_flipped * (1 - lam) + x.set_value(x * (lam) + x_flipped) + return lam diff --git a/image_classification/CaiT/model_ema.py b/image_classification/CaiT/model_ema.py new file mode 100644 index 00000000..8a636765 --- /dev/null +++ b/image_classification/CaiT/model_ema.py @@ -0,0 +1,61 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" Implement the Exponential Model Averaging +This is paddle hack from: +https://github.com/rwightman/pytorch-image-models/blob/master/timm/utils/model_ema.py +""" + +import copy +from collections import OrderedDict +import paddle +import paddle.nn as nn + + +class ModelEma: + """Model Ema + A moving average is kept of model weights and buffers. + Note that for multiple gpu, ema must be defined after mode init, + but before DataParallel. 
+ + Args: + model: nn.Layer, original modela with learnable params + decay: float, decay rate for each update, default: 0.999 + """ + def __init__(self, model, decay=0.999): + self.module = copy.deepcopy(model) + self.module.eval() + self.decay = decay + + @paddle.no_grad() + def _update(self, model, update_fn): + # update ema model parameters by model parameters + for (_, ema_param), (_, model_param) in zip( + self.module.named_parameters(), model.named_parameters()): + ema_param.set_value(copy.deepcopy(update_fn(ema_param, model_param))) + + # update ema model buffers by model buffers + for (_, ema_buf), (_, model_buf) in zip( + self.module.named_buffers(), model.named_buffers()): + ema_buf.set_value(copy.deepcopy(update_fn(ema_buf, model_buf))) + + def update(self, model): + self._update(model, update_fn=lambda e, m: self.decay * e + (1 - self.decay) * m) + + def set(self, model): + self._update(model, update_fn=lambda e, m: m) + + def state_dict(self): + return self.module.state_dict() + diff --git a/image_classification/CaiT/port_weights/__init__.py b/image_classification/CaiT/port_weights/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/image_classification/CaiT/port_weights/load_pytorch_weights.py b/image_classification/CaiT/port_weights/load_pytorch_weights.py new file mode 100644 index 00000000..6c4d1af3 --- /dev/null +++ b/image_classification/CaiT/port_weights/load_pytorch_weights.py @@ -0,0 +1,204 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
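A minimal sketch of how `ModelEma` from `model_ema.py` above is driven during training: create it right after the model is built (before `DataParallel`), call `update()` after each optimizer step, and checkpoint its state dict separately. The tiny linear model, the data, and the output file name below are made up for illustration only.

```python
import paddle
import paddle.nn as nn
# assumes the model_ema.py added in this diff is on the path
from model_ema import ModelEma

model = nn.Linear(16, 10)                  # stand-in for the CaiT model
model_ema = ModelEma(model, decay=0.999)   # define after model init, before DataParallel
optimizer = paddle.optimizer.SGD(learning_rate=0.1, parameters=model.parameters())
criterion = nn.CrossEntropyLoss()

for step in range(3):
    x = paddle.randn([8, 16])
    y = paddle.randint(0, 10, shape=[8])
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    optimizer.clear_grad()
    model_ema.update(model)                # EMA weights trail the live weights

# EMA weights are saved separately, mirroring the '-EMA.pdparams' files above
paddle.save(model_ema.state_dict(), 'toy-model-EMA.pdparams')
```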
+ +import argparse +import numpy as np +import paddle +import torch +import timm +from cait import * +from config import * + + +model_name = "cait_s24_224" +#model_name = "cait_xxs36_224" +sz = int(model_name[-3::]) + + +config = get_config() +parser = argparse.ArgumentParser('') +parser.add_argument('-cfg', type=str, default=f'./configs/{model_name}.yaml') +#parser.add_argument('-cfg', type=str, default='./configs/cait_m36_384.yaml') +parser.add_argument('-dataset', type=str, default=None) +parser.add_argument('-batch_size', type=int, default=None) +parser.add_argument('-image_size', type=int, default=None) +parser.add_argument('-data_path', type=str, default=None) +parser.add_argument('-ngpus', type=int, default=None) +parser.add_argument('-eval', action="store_true") +parser.add_argument('-pretrained', type=str, default=None) +parser.add_argument('-resume', type=str, default=None) +parser.add_argument('-last_epoch', type=int, default=None) +args = parser.parse_args() + +config = get_config() +config = update_config(config, args) +print(config) + + +def print_model_named_params(model): + for name, param in model.named_parameters(): + print(name, param.shape) + +def print_model_named_buffers(model): + for name, buff in model.named_buffers(): + print(name, buff.shape) + +def torch_to_paddle_mapping(): + mapping = [ + ('cls_token', 'cls_token'), + ('pos_embed', 'pos_embed'), + ('patch_embed.proj', f'patch_embed.patch_embed'), + ] + + num_layers = config.MODEL.TRANS.DEPTH + for idx in range(num_layers): + pp_prefix = f'blocks.{idx}' + th_prefix = f'blocks.{idx}' + layer_mapping = [ + (f'{th_prefix}.gamma_1', f'{pp_prefix}.gamma_1'), + (f'{th_prefix}.gamma_2', f'{pp_prefix}.gamma_2'), + (f'{th_prefix}.norm1', f'{pp_prefix}.norm1'), + (f'{th_prefix}.norm2', f'{pp_prefix}.norm2'), + (f'{th_prefix}.mlp.fc1', f'{pp_prefix}.mlp.fc1'), + (f'{th_prefix}.mlp.fc2', f'{pp_prefix}.mlp.fc2'), + (f'{th_prefix}.attn.qkv', f'{pp_prefix}.attn.qkv'), + (f'{th_prefix}.attn.proj', f'{pp_prefix}.attn.proj'), + (f'{th_prefix}.attn.proj_l', f'{pp_prefix}.attn.proj_l'), + (f'{th_prefix}.attn.proj_w', f'{pp_prefix}.attn.proj_w'), + ] + mapping.extend(layer_mapping) + + num_layers = config.MODEL.TRANS.DEPTH_TOKEN_ONLY + for idx in range(num_layers): + pp_prefix = f'blocks_token_only.{idx}' + th_prefix = f'blocks_token_only.{idx}' + layer_mapping = [ + (f'{th_prefix}.gamma_1', f'{pp_prefix}.gamma_1'), + (f'{th_prefix}.gamma_2', f'{pp_prefix}.gamma_2'), + (f'{th_prefix}.norm1', f'{pp_prefix}.norm1'), + (f'{th_prefix}.norm2', f'{pp_prefix}.norm2'), + (f'{th_prefix}.mlp.fc1', f'{pp_prefix}.mlp.fc1'), + (f'{th_prefix}.mlp.fc2', f'{pp_prefix}.mlp.fc2'), + (f'{th_prefix}.attn.q', f'{pp_prefix}.attn.q'), + (f'{th_prefix}.attn.k', f'{pp_prefix}.attn.k'), + (f'{th_prefix}.attn.v', f'{pp_prefix}.attn.v'), + (f'{th_prefix}.attn.proj', f'{pp_prefix}.attn.proj'), + ] + mapping.extend(layer_mapping) + + head_mapping = [ + ('norm', 'norm'), + ('head', 'head') + ] + mapping.extend(head_mapping) + + return mapping + + + +def convert(torch_model, paddle_model): + def _set_value(th_name, pd_name): + th_shape = th_params[th_name].shape + pd_shape = tuple(pd_params[pd_name].shape) # paddle shape default type is list + #assert th_shape == pd_shape, f'{th_shape} != {pd_shape}' + print(f'set {th_name} {th_shape} to {pd_name} {pd_shape}') + value = th_params[th_name].cpu().data.numpy() + if len(value.shape) == 2: + value = value.transpose((1, 0)) + pd_params[pd_name].set_value(value) + + # 1. 
get paddle and torch model parameters + pd_params = {} + th_params = {} + for name, param in paddle_model.named_parameters(): + pd_params[name] = param + for name, param in paddle_model.named_buffers(): + pd_params[name] = param + + for name, param in torch_model.named_parameters(): + th_params[name] = param + for name, param in torch_model.named_buffers(): + th_params[name] = param + + # 2. get name mapping pairs + mapping = torch_to_paddle_mapping() + # 3. set torch param values to paddle params: may needs transpose on weights + for th_name, pd_name in mapping: + if th_name in th_params.keys(): # nn.Parameters + _set_value(th_name, pd_name) + else: # weight & bias + th_name_w = f'{th_name}.weight' + pd_name_w = f'{pd_name}.weight' + _set_value(th_name_w, pd_name_w) + + th_name_b = f'{th_name}.bias' + pd_name_b = f'{pd_name}.bias' + _set_value(th_name_b, pd_name_b) + + return paddle_model + + +def main(): + + #paddle.set_device('cpu') + paddle_model = build_cait(config) + paddle_model.eval() + + print_model_named_params(paddle_model) + print('--------------') + print_model_named_buffers(paddle_model) + print('----------------------------------') + + #device = torch.device('cpu') + device = torch.device('cuda') + torch_model = timm.create_model(model_name, pretrained=True) + #torch_model = timm.create_model('cait_m36_384', pretrained=True) + torch_model = torch_model.to(device) + torch_model.eval() + + print_model_named_params(torch_model) + print('--------------') + print_model_named_buffers(torch_model) + print('----------------------------------') + + + # convert weights + paddle_model = convert(torch_model, paddle_model) + + # check correctness + x = np.random.randn(2, 3, sz, sz).astype('float32') + #x = np.random.randn(2, 3, 384, 384).astype('float32') + #x = np.random.randn(2, 3, 224, 224).astype('float32') + x_paddle = paddle.to_tensor(x) + x_torch = torch.Tensor(x).to(device) + + out_torch = torch_model(x_torch) + out_paddle = paddle_model(x_paddle) + + out_torch = out_torch.data.cpu().numpy() + out_paddle = out_paddle.cpu().numpy() + + print(out_torch.shape, out_paddle.shape) + print(out_torch[0, 0:100]) + print(out_paddle[0, 0:100]) + assert np.allclose(out_torch, out_paddle, atol = 1e-5) + + # save weights for paddle model + model_path = os.path.join(f'./{model_name}.pdparams') + #model_path = os.path.join('./cait_m36_384.pdparams') + paddle.save(paddle_model.state_dict(), model_path) + + +if __name__ == "__main__": + main() diff --git a/image_classification/CaiT/random_erasing.py b/image_classification/CaiT/random_erasing.py new file mode 100644 index 00000000..31eea465 --- /dev/null +++ b/image_classification/CaiT/random_erasing.py @@ -0,0 +1,118 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
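The 2-D transpose in `_set_value()` above is needed because `torch.nn.Linear` stores its weight as `[out_features, in_features]` while `paddle.nn.Linear` stores it as `[in_features, out_features]`. Below is a small self-contained check of that rule, assuming both torch and paddle are installed; the 4-by-8 layer size is arbitrary.

```python
import numpy as np
import paddle
import torch

th_linear = torch.nn.Linear(4, 8)
pd_linear = paddle.nn.Linear(4, 8)
print(tuple(th_linear.weight.shape))   # (8, 4): [out_features, in_features]
print(tuple(pd_linear.weight.shape))   # (4, 8): [in_features, out_features]

# copy torch -> paddle with the same transpose rule used in _set_value()
pd_linear.weight.set_value(th_linear.weight.detach().numpy().transpose((1, 0)))
pd_linear.bias.set_value(th_linear.bias.detach().numpy())

# both layers should now compute the same mapping
x = np.random.randn(2, 4).astype('float32')
out_th = th_linear(torch.tensor(x)).detach().numpy()
out_pd = pd_linear(paddle.to_tensor(x)).numpy()
assert np.allclose(out_th, out_pd, atol=1e-5)
```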
+ +"""Random Erasing for image tensor""" + +import random +import math +import paddle + + +def _get_pixels(per_pixel, rand_color, patch_size, dtype="float32"): + if per_pixel: + return paddle.normal(shape=patch_size).astype(dtype) + if rand_color: + return paddle.normal(shape=(patch_size[0], 1, 1)).astype(dtype) + return paddle.zeros((patch_size[0], 1, 1)).astype(dtype) + + +class RandomErasing(object): + """ + Args: + prob: probability of performing random erasing + min_area: Minimum percentage of erased area wrt input image area + max_area: Maximum percentage of erased area wrt input image area + min_aspect: Minimum aspect ratio of earsed area + max_aspect: Maximum aspect ratio of earsed area + mode: pixel color mode, in ['const', 'rand', 'pixel'] + 'const' - erase block is constant valued 0 for all channels + 'rand' - erase block is valued random color (same per-channel) + 'pixel' - erase block is vauled random color per pixel + min_count: Minimum # of ereasing blocks per image. + max_count: Maximum # of ereasing blocks per image. Area per box is scaled by count + per-image count is randomly chosen between min_count to max_count + """ + def __init__(self, prob=0.5, min_area=0.02, max_area=1/3, min_aspect=0.3, max_aspect=None, + mode='const', min_count=1, max_count=None, num_splits=0): + self.prob = prob + self.min_area = min_area + self.max_area = max_area + max_aspect = max_aspect or 1 / min_aspect + self.log_aspect_ratio = (math.log(min_aspect), math.log(max_aspect)) + self.min_count = min_count + self.max_count = max_count or min_count + self.num_splits = num_splits + mode = mode.lower() + self.rand_color = False + self.per_pixel = False + if mode == "rand": + self.rand_color = True + elif mode == "pixel": + self.per_pixel = True + else: + assert not mode or mode == "const" + + def _erase(self, img, chan, img_h, img_w, dtype): + if random.random() > self.prob: + return + area = img_h * img_w + count = self.min_count if self.min_count == self.max_count else \ + random.randint(self.min_count, self.max_count) + for _ in range(count): + for attempt in range(10): + target_area = random.uniform(self.min_area, self.max_area) * area / count + aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio)) + h = int(round(math.sqrt(target_area * aspect_ratio))) + w = int(round(math.sqrt(target_area / aspect_ratio))) + if w < img_w and h < img_h: + top = random.randint(0, img_h - h) + left = random.randint(0, img_w - w) + img[:, top:top+h, left:left+w] = _get_pixels( + self.per_pixel, self.rand_color, (chan, h, w), + dtype=dtype) + break + + def __call__(self, input): + if len(input.shape) == 3: + self._erase(input, *input.shape, input.dtype) + else: + batch_size, chan, img_h, img_w = input.shape + batch_start = batch_size // self.num_splits if self.num_splits > 1 else 0 + for i in range(batch_start, batch_size): + self._erase(input[i], chan, img_h, img_w, input.dtype) + return input + + + +#def main(): +# re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='rand') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='const') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='pixel') +# import PIL.Image as Image +# import numpy as np +# paddle.set_device('cpu') +# img = paddle.to_tensor(np.asarray(Image.open('./lenna.png'))).astype('float32') +# img = img / 255.0 +# img = paddle.transpose(img, [2, 0, 1]) +# new_img = re(img) +# new_img = new_img * 255.0 +# new_img = paddle.transpose(new_img, [1, 2, 0]) +# new_img = new_img.cpu().numpy() +# new_img = 
Image.fromarray(new_img.astype('uint8')) +# new_img.save('./res.png') +# +# +# +#if __name__ == "__main__": +# main() diff --git a/image_classification/CaiT/run_eval_multi.sh b/image_classification/CaiT/run_eval_multi.sh index e0732977..9cd2aa88 100644 --- a/image_classification/CaiT/run_eval_multi.sh +++ b/image_classification/CaiT/run_eval_multi.sh @@ -1,9 +1,9 @@ CUDA_VISIBLE_DEVICES=4,5,6,7 \ python main_multi_gpu.py \ --cfg='./configs/cait_xxs24_224.yaml' \ +-cfg='./configs/cait_s24_224.yaml' \ -dataset='imagenet2012' \ -batch_size=32 \ -data_path='/dataset/imagenet' \ -eval \ --pretrained='./cait_xxs24_224' \ +-pretrained='./cait_s24_224' \ -ngpus=4 diff --git a/image_classification/CaiT/run_eval_multi_tmp.sh b/image_classification/CaiT/run_eval_multi_tmp.sh new file mode 100644 index 00000000..efffcc84 --- /dev/null +++ b/image_classification/CaiT/run_eval_multi_tmp.sh @@ -0,0 +1,10 @@ +#CUDA_VISIBLE_DEVICES=0,1,2,3 \ +CUDA_VISIBLE_DEVICES=4,5,6,7 \ +python main_multi_gpu.py \ +-cfg='./configs/cait_xxs36_224.yaml' \ +-dataset='imagenet2012' \ +-batch_size=64 \ +-data_path='/dataset/imagenet' \ +-eval \ +-pretrained='./cait_xxs36_224' \ +-ngpus=4 diff --git a/image_classification/CaiT/run_train.sh b/image_classification/CaiT/run_train.sh index 369ada22..237aa5ef 100644 --- a/image_classification/CaiT/run_train.sh +++ b/image_classification/CaiT/run_train.sh @@ -3,4 +3,5 @@ python main_single_gpu.py \ -cfg='./configs/cait_xxs24_224.yaml' \ -dataset='imagenet2012' \ -batch_size=4 \ --data_path='/dataset/imagenet' +-data_path='/dataset/imagenet' \ +-amp diff --git a/image_classification/CaiT/run_train_multi.sh b/image_classification/CaiT/run_train_multi.sh index 33d4b09d..c64e92b4 100644 --- a/image_classification/CaiT/run_train_multi.sh +++ b/image_classification/CaiT/run_train_multi.sh @@ -2,5 +2,6 @@ CUDA_VISIBLE_DEVICES=4,5,6,7 \ python main_multi_gpu.py \ -cfg='./configs/cait_xxs24_224.yaml' \ -dataset='imagenet2012' \ --batch_size=32 \ +-batch_size=4 \ -data_path='/dataset/imagenet' \ +#-amp diff --git a/image_classification/CaiT/stat.py b/image_classification/CaiT/stat.py new file mode 100644 index 00000000..e88d3122 --- /dev/null +++ b/image_classification/CaiT/stat.py @@ -0,0 +1,64 @@ +import os +import glob +import paddle +from config import get_config +from cait import build_cait as build_model + +def count_gelu(layer, input, output): + activation_flops = 8 + x = input[0] + num = x.numel() + layer.total_ops += num * activation_flops + + +def count_softmax(layer, input, output): + softmax_flops = 5 # max/substract, exp, sum, divide + x = input[0] + num = x.numel() + layer.total_ops += num * softmax_flops + + +def count_layernorm(layer, input, output): + layer_norm_flops = 5 # get mean (sum), get variance (square and sum), scale(multiply) + x = input[0] + num = x.numel() + layer.total_ops += num * layer_norm_flops + + +for cfg in glob.glob('./configs/cait_s24_224.yaml'): + #cfg = './configs/pvtv2_b0.yaml' + #input_size = (1, 3, 512, 512) + #input_size = (1, 3, 448, 448) + #input_size = (1, 3, 384, 384) + input_size = (1, 3, 224, 224) + config = get_config(cfg) + model = build_model(config) + + custom_ops = {paddle.nn.GELU: count_gelu, + paddle.nn.LayerNorm: count_layernorm, + paddle.nn.Softmax: count_softmax, + } + print(os.path.basename(cfg)) + paddle.flops(model, + input_size=input_size, + custom_ops=custom_ops, + print_detail=False) + + +#for cfg in glob.glob('./configs/*.yaml'): +# #cfg = './configs/swin_base_patch4_window7_224.yaml' +# input_size = (1, 3, int(cfg[-8:-5]), 
int(cfg[-8:-5])) +# config = get_config(cfg) +# model = build_model(config) +# +# +# custom_ops = {paddle.nn.GELU: count_gelu, +# paddle.nn.LayerNorm: count_layernorm, +# paddle.nn.Softmax: count_softmax, +# } +# print(os.path.basename(cfg)) +# paddle.flops(model, +# input_size=input_size, +# custom_ops=custom_ops, +# print_detail=False) +# print('-----------') diff --git a/image_classification/ConvMLP/README.md b/image_classification/ConvMLP/README.md new file mode 100644 index 00000000..50fbc826 --- /dev/null +++ b/image_classification/ConvMLP/README.md @@ -0,0 +1,172 @@ +# ConvMLP: Hierarchical Convolutional MLPs for Vision, [arxiv](https://arxiv.org/abs/2109.04454) + +PaddlePaddle training/validation code and pretrained models for **ConvMLP**. + +The official and 3rd party pytorch implementation are [here](https://github.com/SHI-Labs/Convolutional-MLPs). + + +This implementation is developed by [PPViT](https://github.com/xperzy/PPViT/tree/master). + +

+<p align="center">
+<img src="./convmlp.png" alt="drawing"/>
+<h4 align="center">ConvMLP Model Overview</h4>
+</p>

+ + + +### Update +Update (2021-09-26): Code is released and ported weights are uploaded. + +## Models Zoo +| Model | Acc@1 | Acc@5 | #Params | FLOPs | Image Size | Crop_pct | Interpolation | Link | +|-------------------------------|-------|-------|---------|--------|------------|----------|---------------|--------------| +| convmlp_s | 76.76 | 93.40 | 9.0M | 2.4G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1D8kWVfQxOyyktqDixaZoGXB3wVspzjlc/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1WseHYALFB4Of3Dajmlt45g)(3jz3) | +| convmlp_m | 79.03 | 94.53 | 17.4M | 4.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1TqVlKHq-WRdT9KDoUpW3vNJTIRZvix_m/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1koipCAffG6REUyLYk0rGAQ)(vyp1) | +| convmlp_l | 80.15 | 95.00 | 42.7M | 10.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1KXxYogDh6lD3QGRtFBoX5agfz81RDN3l/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1f1aEeVoySzImI89gkjcaOA)(ne5x) | + + +> *The results are evaluated on ImageNet2012 validation set. +> +> Note: ConvMLP weights are ported from [here](https://github.com/SHI-Labs/Convolutional-MLPs) + + + +## Notebooks +We provide a few notebooks in aistudio to help you get started: + +**\*(coming soon)\*** + + +## Requirements +- Python>=3.6 +- yaml>=0.2.5 +- [PaddlePaddle](https://www.paddlepaddle.org.cn/documentation/docs/en/install/index_en.html)>=2.1.0 +- [yacs](https://github.com/rbgirshick/yacs)>=0.1.8 + +## Data +ImageNet2012 dataset is used in the following folder structure: +``` +│imagenet/ +├──train/ +│ ├── n01440764 +│ │ ├── n01440764_10026.JPEG +│ │ ├── n01440764_10027.JPEG +│ │ ├── ...... +│ ├── ...... +├──val/ +│ ├── n01440764 +│ │ ├── ILSVRC2012_val_00000293.JPEG +│ │ ├── ILSVRC2012_val_00002138.JPEG +│ │ ├── ...... +│ ├── ...... +``` + +## Usage +To use the model with pretrained weights, download the `.pdparam` weight file and change related file paths in the following python scripts. The model config files are located in `./configs/`. + +For example, assume the downloaded weight file is stored in `./convmlp_s.pdparams`, to use the `convmlp_s` model in python: +```python +from config import get_config +from convmlp import build_convmlp as build_model +# config files in ./configs/ +config = get_config('./configs/convmlp_s.yaml') +# build model +model = build_model(config) +# load pretrained weights +model_state_dict = paddle.load('./convmlp_s7.pdparams') +model.set_dict(model_state_dict) +``` + +## Evaluation +To evaluate ConvMLP model performance on ImageNet2012 with a single GPU, run the following script using command line: +```shell +sh run_eval.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg=./configs/convmlp_s.yaml \ + -dataset=imagenet2012 \ + -batch_size=16 \ + -data_path=/path/to/dataset/imagenet/val \ + -eval \ + -pretrained=/path/to/pretrained/model/convmlp_s # .pdparams is NOT needed +``` + +
+ + +Run evaluation using multi-GPUs: + + + +```shell +sh run_eval_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg=./configs/convmlp_s.yaml \ + -dataset=imagenet2012 \ + -batch_size=16 \ + -data_path=/path/to/dataset/imagenet/val \ + -eval \ + -pretrained=/path/to/pretrained/model/convmlp_s # .pdparams is NOT needed +``` + +
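+
+Besides the full validation run, a quick single-image sanity check can be done in python. The snippet below is a minimal sketch: the image path (`./demo.jpg`), the weight path and the preprocessing constants are assumptions taken from the defaults in `config.py` and `datasets.py`, not part of the official scripts.
+
+```python
+import paddle
+import paddle.nn.functional as F
+from PIL import Image
+from paddle.vision import transforms
+from config import get_config
+from convmlp import build_convmlp as build_model
+
+config = get_config('./configs/convmlp_s.yaml')
+model = build_model(config)
+model.set_dict(paddle.load('./convmlp_s.pdparams'))
+model.eval()
+
+# same preprocessing as get_val_transforms() in datasets.py (image_size=224, crop_pct=0.875)
+val_transforms = transforms.Compose([
+    transforms.Resize(256, interpolation='bicubic'),  # 224 / 0.875 = 256
+    transforms.CenterCrop((224, 224)),
+    transforms.ToTensor(),
+    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
+])
+
+image = Image.open('./demo.jpg').convert('RGB')  # any RGB image (assumed path)
+x = val_transforms(image).unsqueeze(0)           # shape: [1, 3, 224, 224]
+
+with paddle.no_grad():
+    prob = F.softmax(model(x), axis=-1)
+print('top-1 class id:', int(paddle.argmax(prob, axis=-1)[0]))
+```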
+
+## Training
+To train the ConvMLP model on ImageNet2012 with a single GPU, run the following script using command line:
+```shell
+sh run_train.sh
+```
+or
+```shell
+CUDA_VISIBLE_DEVICES=0 \
+python main_single_gpu.py \
+  -cfg=./configs/convmlp_s.yaml \
+  -dataset=imagenet2012 \
+  -batch_size=32 \
+  -data_path=/path/to/dataset/imagenet/train \
+```
+
+ + +Run training using multi-GPUs: + + + +```shell +sh run_train_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg=./configs/convmlp_s.yaml \ + -dataset=imagenet2012 \ + -batch_size=16 \ + -data_path=/path/to/dataset/imagenet/train \ +``` + +
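+
+The #Params and FLOPs values listed in the Models Zoo table can be reproduced with `paddle.flops`. A minimal sketch (the custom-op hooks used by the repo's stat scripts are omitted here, so GELU/LayerNorm/Softmax FLOPs are not counted):
+
+```python
+import paddle
+from config import get_config
+from convmlp import build_convmlp as build_model
+
+config = get_config('./configs/convmlp_s.yaml')
+model = build_model(config)
+# reports trainable parameters and FLOPs for a single 224x224 RGB image
+paddle.flops(model, input_size=(1, 3, 224, 224), print_detail=False)
+```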
+ + +## Visualization Attention Map +**(coming soon)** + +## Reference +``` +@article{li2021convmlp, + title={ConvMLP: Hierarchical Convolutional MLPs for Vision}, + author={Jiachen Li and Ali Hassani and Steven Walton and Humphrey Shi}, + year={2021}, + eprint={2109.04454}, + archivePrefix={arXiv}, + primaryClass={cs.CV} +} +``` diff --git a/image_classification/ConvMLP/__init__.py b/image_classification/ConvMLP/__init__.py new file mode 100644 index 00000000..e2cbd538 --- /dev/null +++ b/image_classification/ConvMLP/__init__.py @@ -0,0 +1 @@ +#init diff --git a/image_classification/ConvMLP/augment.py b/image_classification/ConvMLP/augment.py new file mode 100644 index 00000000..7a7f081c --- /dev/null +++ b/image_classification/ConvMLP/augment.py @@ -0,0 +1,285 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Augmentation""" +""" Rand Augmentation """ +# reference: RandAugment: Practical automated data augmentation with a reduced search space +# https://arxiv.org/abs/1909.13719 + +""" Auto Augmentation """ +# reference: AutoAugment: Learning Augmentation Policies from Data +# https://arxiv.org/abs/1805.09501 + +import random +import numpy as np +from PIL import Image, ImageEnhance, ImageOps + + +def auto_augment_policy_original(): + """25 types of augment policies in original paper""" + policy = [ + [('Posterize', 0.4, 8), ('Rotate', 0.6, 9)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + [('Posterize', 0.6, 7), ('Posterize', 0.6, 6)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Equalize', 0.4, 4), ('Rotate', 0.8, 8)], + [('Solarize', 0.6, 3), ('Equalize', 0.6, 7)], + [('Posterize', 0.8, 5), ('Equalize', 1.0, 2)], + [('Rotate', 0.2, 3), ('Solarize', 0.6, 8)], + [('Equalize', 0.6, 8), ('Posterize', 0.4, 6)], + [('Rotate', 0.8, 8), ('Color', 0.4, 0)], + [('Rotate', 0.4, 9), ('Equalize', 0.6, 2)], + [('Equalize', 0.0, 7), ('Equalize', 0.8, 8)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Rotate', 0.8, 8), ('Color', 1.0, 2)], + [('Color', 0.8, 8), ('Solarize', 0.8, 7)], + [('Sharpness', 0.4, 7), ('Invert', 0.6, 8)], + [('ShearX', 0.6, 5), ('Equalize', 1.0, 9)], + [('Color', 0.4, 0), ('Equalize', 0.6, 3)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + ] + policy = [[SubPolicy(*args) for args in subpolicy] for subpolicy in policy] + return policy + + +def rand_augment_policy_original(magnitude_idx=9): + """ + 14 types of augment policies in original paper + Args: + magnitude_idx: M + """ + policy = [ + ('Posterize', 1, magnitude_idx), ('Rotate', 1, magnitude_idx), + ('Solarize', 1, magnitude_idx), ('AutoContrast', 1, magnitude_idx), + ('Equalize', 1, magnitude_idx), ('Contrast', 1, magnitude_idx), + ('Color', 1, magnitude_idx), 
('Invert', 1, magnitude_idx), + ('Sharpness', 1, magnitude_idx), ('Brightness', 1, magnitude_idx), + ('ShearX', 1, magnitude_idx), ('ShearY', 1, magnitude_idx), + ('TranslateX', 1, magnitude_idx), ('TranslateY', 1, magnitude_idx), + ] + policy = [SubPolicy(*args) for args in policy] + return policy + + +class AutoAugment(): + """Auto Augment + Randomly choose a tuple of augment ops from a list of policy + Then apply the tuple of augment ops to input image + + Examples: + policy = auto_augment_policy_original() + augment = AutoAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy): + self.policy = policy + + def __call__(self, image, policy_idx=None): + if policy_idx is None: + policy_idx = random.randint(0, len(self.policy) - 1) + + sub_policy = self.policy[policy_idx] + for op in sub_policy: + image = op(image) + return image + + +class RandAugment(): + """Rand Augment + Randomly choose N augment ops from a list of K policies + Then apply the N ops to input image + + Examples: + policy = rand_augment_policy_original(magnitude_idx) + augment = RandAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy, num_layers=2): + """ + Args: + policy: list of SubPolicy + num_layers: int + """ + self.policy = policy + self.num_layers = num_layers + + def __call__(self, image): + selected_idx = np.random.choice(len(self.policy), self.num_layers) + + for policy_idx in selected_idx: + sub_policy = self.policy[policy_idx] + image = sub_policy(image) + return image + + +class SubPolicy: + """Subpolicy + Read augment name and magnitude, apply augment with probability + Args: + op_name: str, augment operation name + prob: float, if prob > random prob, apply augment + magnitude_idx: int, index of magnitude in preset magnitude ranges + """ + + def __init__(self, op_name, prob, magnitude_idx): + # ranges of operations' magnitude + ranges = { + 'ShearX': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'ShearY': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'TranslateX': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'TranslateY': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'Rotate': np.linspace(0, 30, 10), # [-30, 30] (by random negative) + 'Color': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Posterize': np.round(np.linspace(8, 4, 10), 0).astype(np.int), # [0, 4] + 'Solarize': np.linspace(256, 0, 10), # [0, 256] + 'Contrast': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Sharpness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Brightness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'AutoContrast': [0] * 10, # no range + 'Equalize': [0] * 10, # no range + 'Invert': [0] * 10, # no range + } + + # augmentation operations + # Lambda is not pickleable for DDP + # image_ops = { + # 'ShearX': lambda image, magnitude: shear_x(image, magnitude), + # 'ShearY': lambda image, magnitude: shear_y(image, magnitude), + # 'TranslateX': lambda image, magnitude: translate_x(image, magnitude), + # 'TranslateY': lambda image, magnitude: translate_y(image, magnitude), + # 'Rotate': lambda image, magnitude: rotate(image, magnitude), + # 'AutoContrast': lambda image, magnitude: auto_contrast(image, magnitude), + # 'Invert': lambda image, magnitude: invert(image, magnitude), + # 'Equalize': lambda image, magnitude: equalize(image, magnitude), + # 'Solarize': lambda image, magnitude: solarize(image, magnitude), + # 
'Posterize': lambda image, magnitude: posterize(image, magnitude), + # 'Contrast': lambda image, magnitude: contrast(image, magnitude), + # 'Color': lambda image, magnitude: color(image, magnitude), + # 'Brightness': lambda image, magnitude: brightness(image, magnitude), + # 'Sharpness': lambda image, magnitude: sharpness(image, magnitude), + # } + image_ops = { + 'ShearX': shear_x, + 'ShearY': shear_y, + 'TranslateX': translate_x_relative, + 'TranslateY': translate_y_relative, + 'Rotate': rotate, + 'AutoContrast': auto_contrast, + 'Invert': invert, + 'Equalize': equalize, + 'Solarize': solarize, + 'Posterize': posterize, + 'Contrast': contrast, + 'Color': color, + 'Brightness': brightness, + 'Sharpness': sharpness, + } + + self.prob = prob + self.magnitude = ranges[op_name][magnitude_idx] + self.op = image_ops[op_name] + + def __call__(self, image): + if self.prob > random.random(): + image = self.op(image, self.magnitude) + return image + + +# PIL Image transforms +# https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.transform +def shear_x(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, factor, 0, 0, 1, 0), fillcolor=fillcolor) + + +def shear_y(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, factor, 1, 0), fillcolor=fillcolor) + + +def translate_x_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, pixels, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, pixels), fillcolor=fillcolor) + + +def translate_x_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, magnitude, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, magnitude), fillcolor=fillcolor) + + +def rotate(image, magnitude): + rot = image.convert("RGBA").rotate(magnitude) + return Image.composite(rot, + Image.new('RGBA', rot.size, (128,) * 4), + rot).convert(image.mode) + + +def auto_contrast(image, magnitude=None): + return ImageOps.autocontrast(image) + + +def invert(image, magnitude=None): + return ImageOps.invert(image) + + +def equalize(image, magnitude=None): + return ImageOps.equalize(image) + + +def solarize(image, magnitude): + return ImageOps.solarize(image, magnitude) + + +def posterize(image, magnitude): + return ImageOps.posterize(image, magnitude) + + +def contrast(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Contrast(image).enhance(1 + magnitude) + + +def color(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Color(image).enhance(1 + magnitude) + + +def brightness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative 
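+    # enhancement factor 1 + magnitude: values > 1 brighten the image, values < 1 darken it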
+ return ImageEnhance.Brightness(image).enhance(1 + magnitude) + + +def sharpness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Sharpness(image).enhance(1 + magnitude) + diff --git a/image_classification/ConvMLP/config.py b/image_classification/ConvMLP/config.py new file mode 100644 index 00000000..fee70c77 --- /dev/null +++ b/image_classification/ConvMLP/config.py @@ -0,0 +1,178 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Configuration + +Configuration for data, model archtecture, and training, etc. +Config can be set by .yaml file or by argparser(limited usage) + + +""" + +import os +from yacs.config import CfgNode as CN +import yaml + +_C = CN() +_C.BASE = [''] + +# data settings +_C.DATA = CN() +_C.DATA.BATCH_SIZE = 8 #1024 batch_size for single GPU +_C.DATA.BATCH_SIZE_EVAL = 8 #1024 batch_size for single GPU +_C.DATA.DATA_PATH = '/dataset/imagenet/' # path to dataset +_C.DATA.DATASET = 'imagenet2012' # dataset name +_C.DATA.IMAGE_SIZE = 224 # input image size +_C.DATA.CROP_PCT = 0.9 # input image scale ratio, scale is applied before centercrop in eval mode +_C.DATA.NUM_WORKERS = 4 # number of data loading threads +_C.DATA.IMAGENET_MEAN = [0.485, 0.456, 0.406] # [0.5, 0.5, 0.5] +_C.DATA.IMAGENET_STD = [0.229, 0.224, 0.225] # [0.5, 0.5, 0.5] + +# model settings +_C.MODEL = CN() +_C.MODEL.TYPE = 'ConvMLP' +_C.MODEL.NAME = 'ConvMLP' +_C.MODEL.RESUME = None +_C.MODEL.PRETRAINED = None +_C.MODEL.NUM_CLASSES = 1000 +_C.MODEL.DROPOUT = 0.0 +_C.MODEL.ATTENTION_DROPOUT = 0.0 +_C.MODEL.DROP_PATH = 0.1 + +# transformer settings +_C.MODEL.MIXER = CN() +_C.MODEL.MIXER.BLOCKS = [2, 4, 2] +_C.MODEL.MIXER.DIMS = [128, 256, 512] +_C.MODEL.MIXER.MLP_RATIOS = [2, 2, 2] +_C.MODEL.MIXER.CHANNELS = 64 +_C.MODEL.MIXER.N_CONV_BLOCKS = 2 + +# training settings +_C.TRAIN = CN() +_C.TRAIN.LAST_EPOCH = 0 +_C.TRAIN.NUM_EPOCHS = 300 +_C.TRAIN.WARMUP_EPOCHS = 20 +_C.TRAIN.WEIGHT_DECAY = 0.05 +_C.TRAIN.BASE_LR = 0.001 +_C.TRAIN.WARMUP_START_LR = 5e-7 +_C.TRAIN.END_LR = 5e-6 +_C.TRAIN.GRAD_CLIP = 5.0 +_C.TRAIN.ACCUM_ITER = 1 +_C.TRAIN.LINEAR_SCALED_LR = None + +_C.TRAIN.LR_SCHEDULER = CN() +_C.TRAIN.LR_SCHEDULER.NAME = 'warmupcosine' +_C.TRAIN.LR_SCHEDULER.MILESTONES = "30, 60, 90" # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_EPOCHS = 30 # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_RATE = 0.1 # only used in StepLRScheduler + +_C.TRAIN.OPTIMIZER = CN() +_C.TRAIN.OPTIMIZER.NAME = 'AdamW' +_C.TRAIN.OPTIMIZER.EPS = 1e-8 +_C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999) # for adamW +_C.TRAIN.OPTIMIZER.MOMENTUM = 0.9 + +# train augmentation +_C.TRAIN.MIXUP_ALPHA = 0.8 +_C.TRAIN.CUTMIX_ALPHA = 1.0 +_C.TRAIN.CUTMIX_MINMAX = None +_C.TRAIN.MIXUP_PROB = 1.0 +_C.TRAIN.MIXUP_SWITCH_PROB = 0.5 +_C.TRAIN.MIXUP_MODE = 'batch' + +_C.TRAIN.SMOOTHING = 0.1 +_C.TRAIN.COLOR_JITTER = 0.4 +_C.TRAIN.AUTO_AUGMENT = False #'rand-m9-mstd0.5-inc1' +_C.TRAIN.RAND_AUGMENT = False + 
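+# random erasing settings (see random_erasing.py): PROB is the per-image erase probability,
+# MODE selects the fill ('const', 'rand' or 'pixel'), COUNT is the max number of erased
+# boxes per image, SPLIT is passed as num_splits to erase only part of each batch when > 1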
+_C.TRAIN.RANDOM_ERASE_PROB = 0.25 +_C.TRAIN.RANDOM_ERASE_MODE = 'pixel' +_C.TRAIN.RANDOM_ERASE_COUNT = 1 +_C.TRAIN.RANDOM_ERASE_SPLIT = False + +# misc +_C.SAVE = "./output" +_C.TAG = "default" +_C.SAVE_FREQ = 1 # freq to save chpt +_C.REPORT_FREQ = 50 # freq to logging info +_C.VALIDATE_FREQ = 20 # freq to do validation +_C.SEED = 0 +_C.EVAL = False # run evaluation only +_C.AMP = False # mix precision training +_C.LOCAL_RANK = 0 +_C.NGPUS = -1 + + +def _update_config_from_file(config, cfg_file): + config.defrost() + with open(cfg_file, 'r') as infile: + yaml_cfg = yaml.load(infile, Loader=yaml.FullLoader) + for cfg in yaml_cfg.setdefault('BASE', ['']): + if cfg: + _update_config_from_file( + config, os.path.join(os.path.dirname(cfg_file), cfg) + ) + print('merging config from {}'.format(cfg_file)) + config.merge_from_file(cfg_file) + config.freeze() + +def update_config(config, args): + """Update config by ArgumentParser + Args: + args: ArgumentParser contains options + Return: + config: updated config + """ + if args.cfg: + _update_config_from_file(config, args.cfg) + config.defrost() + if args.dataset: + config.DATA.DATASET = args.dataset + if args.batch_size: + config.DATA.BATCH_SIZE = args.batch_size + if args.image_size: + config.DATA.IMAGE_SIZE = args.image_size + if args.num_classes: + config.MODEL.NUM_CLASSES = args.num_classes + if args.data_path: + config.DATA.DATA_PATH = args.data_path + if args.output is not None: + config.SAVE = args.output + if args.ngpus: + config.NGPUS = args.ngpus + if args.eval: + config.EVAL = True + config.DATA.BATCH_SIZE_EVAL = args.batch_size + if args.pretrained: + config.MODEL.PRETRAINED = args.pretrained + if args.resume: + config.MODEL.RESUME = args.resume + if args.last_epoch: + config.TRAIN.LAST_EPOCH = args.last_epoch + if args.amp: # only during training + if config.EVAL is True: + config.AMP = False + else: + config.AMP = True + + #config.freeze() + return config + + +def get_config(cfg_file=None): + """Return a clone of config or load from yaml file""" + config = _C.clone() + if cfg_file: + _update_config_from_file(config, cfg_file) + return config diff --git a/image_classification/ConvMLP/configs/convmlp_l.yaml b/image_classification/ConvMLP/configs/convmlp_l.yaml new file mode 100644 index 00000000..cb472824 --- /dev/null +++ b/image_classification/ConvMLP/configs/convmlp_l.yaml @@ -0,0 +1,21 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: ConvMLP + NAME: convmlp_l + MIXER: + BLOCKS: [4, 8, 3] + DIMS: [192, 384, 768] + MLP_RATIOS: [3, 3, 3] + CHANNELS: 96 + N_CONV_BLOCKS: 3 + DROP_PATH: 0.1 +TRAIN: + NUM_EPOCHS: 300 + WARMUP_EPOCHS: 5 + WEIGHT_DECAY: 0.05 + BASE_LR: 4e-5 + WARMUP_START_LR: 1e-6 + END_LR: 1e-5 + LINEAR_SCALED_LR: 512 diff --git a/image_classification/ConvMLP/configs/convmlp_m.yaml b/image_classification/ConvMLP/configs/convmlp_m.yaml new file mode 100644 index 00000000..e47727c9 --- /dev/null +++ b/image_classification/ConvMLP/configs/convmlp_m.yaml @@ -0,0 +1,21 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: ConvMLP + NAME: convmlp_m + MIXER: + BLOCKS: [3, 6, 3] + DIMS: [128, 256, 512] + MLP_RATIOS: [3, 3, 3] + CHANNELS: 64 + N_CONV_BLOCKS: 3 + DROP_PATH: 0.1 +TRAIN: + NUM_EPOCHS: 300 + WARMUP_EPOCHS: 5 + WEIGHT_DECAY: 0.05 + BASE_LR: 4e-5 + WARMUP_START_LR: 1e-6 + END_LR: 1e-5 + LINEAR_SCALED_LR: 512 diff --git a/image_classification/ConvMLP/configs/convmlp_s.yaml b/image_classification/ConvMLP/configs/convmlp_s.yaml new file mode 100644 index 00000000..d8be7da8 --- /dev/null +++ 
b/image_classification/ConvMLP/configs/convmlp_s.yaml @@ -0,0 +1,23 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 + BATCH_SIZE: 128 +MODEL: + TYPE: ConvMLP + NAME: convmlp_s + MIXER: + BLOCKS: [2, 4, 2] + DIMS: [128, 256, 512] + MLP_RATIOS: [2, 2, 2] + CHANNELS: 64 + N_CONV_BLOCKS: 2 + DROP_PATH: 0.1 +TRAIN: + NUM_EPOCHS: 300 + WARMUP_EPOCHS: 5 + WEIGHT_DECAY: 0.05 + BASE_LR: 4e-5 + WARMUP_START_LR: 1e-6 + END_LR: 1e-5 + LINEAR_SCALED_LR: 512 + diff --git a/image_classification/ConvMLP/convmlp.png b/image_classification/ConvMLP/convmlp.png new file mode 100644 index 00000000..1459e796 Binary files /dev/null and b/image_classification/ConvMLP/convmlp.png differ diff --git a/image_classification/ConvMLP/convmlp.py b/image_classification/ConvMLP/convmlp.py new file mode 100644 index 00000000..92b40004 --- /dev/null +++ b/image_classification/ConvMLP/convmlp.py @@ -0,0 +1,342 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Implement MLP Class for ConvMLP +""" + +import paddle +import paddle.nn as nn +from droppath import DropPath + +trunc_normal_ = nn.initializer.TruncatedNormal(std=0.02) +zeros_ = nn.initializer.Constant(value=0.0) +ones_ = nn.initializer.Constant(value=1.0) +kaiming_normal_ = nn.initializer.KaimingNormal() + + +class Identity(nn.Layer): + def __init__(self): + super().__init__() + + def forward(self, inputs): + return inputs + + +class ConvTokenizer(nn.Layer): + def __init__(self, embedding_dim=64): + super(ConvTokenizer, self).__init__() + self.block = nn.Sequential( + nn.Conv2D( + 3, + embedding_dim // 2, + kernel_size=(3, 3), + stride=(2, 2), + padding=(1, 1), + bias_attr=False, + ), + nn.BatchNorm2D(embedding_dim // 2), + nn.ReLU(), + nn.Conv2D( + embedding_dim // 2, + embedding_dim // 2, + kernel_size=(3, 3), + stride=(1, 1), + padding=(1, 1), + bias_attr=False, + ), + nn.BatchNorm2D(embedding_dim // 2), + nn.ReLU(), + nn.Conv2D( + embedding_dim // 2, + embedding_dim, + kernel_size=(3, 3), + stride=(1, 1), + padding=(1, 1), + bias_attr=False, + ), + nn.BatchNorm2D(embedding_dim), + nn.ReLU(), + nn.MaxPool2D(kernel_size=(3, 3), stride=(2, 2), padding=(1, 1)), + ) + + def forward(self, x): + return self.block(x) + + +class ConvStage(nn.Layer): + def __init__(self, + num_blocks=2, + embedding_dim_in=64, + hidden_dim=128, + embedding_dim_out=128): + super().__init__() + self.conv_blocks = nn.LayerList() + for i in range(num_blocks): + block = nn.Sequential( + nn.Conv2D( + embedding_dim_in, + hidden_dim, + kernel_size=(1, 1), + stride=(1, 1), + padding=(0, 0), + bias_attr=False, + ), + nn.BatchNorm2D(hidden_dim), + nn.ReLU(), + nn.Conv2D( + hidden_dim, + hidden_dim, + kernel_size=(3, 3), + stride=(1, 1), + padding=(1, 1), + bias_attr=False, + ), + nn.BatchNorm2D(hidden_dim), + nn.ReLU(), + nn.Conv2D( + hidden_dim, + embedding_dim_in, + kernel_size=(1, 1), + stride=(1, 1), + padding=(0, 0), + bias_attr=False, + ), + nn.BatchNorm2D(embedding_dim_in), + nn.ReLU(), + ) + self.conv_blocks.append(block) + 
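+        # 3x3 stride-2 conv: halves the spatial resolution and projects channels to the next stage width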
self.downsample = nn.Conv2D( + embedding_dim_in, + embedding_dim_out, + kernel_size=(3, 3), + stride=(2, 2), + padding=(1, 1), + ) + + def forward(self, x): + for block in self.conv_blocks: + x = x + block(x) + return self.downsample(x) + + +class Mlp(nn.Layer): + def __init__(self, + embedding_dim_in, + hidden_dim=None, + embedding_dim_out=None, + activation=nn.GELU): + super().__init__() + hidden_dim = hidden_dim or embedding_dim_in + embedding_dim_out = embedding_dim_out or embedding_dim_in + self.fc1 = nn.Linear(embedding_dim_in, hidden_dim) + self.act = activation() + self.fc2 = nn.Linear(hidden_dim, embedding_dim_out) + + def forward(self, x): + return self.fc2(self.act(self.fc1(x))) + + +class ConvMLPStage(nn.Layer): + def __init__(self, embedding_dim, dim_feedforward=2048, stochastic_depth_rate=0.1): + super().__init__() + self.norm1 = nn.LayerNorm(embedding_dim) + self.channel_mlp1 = Mlp( + embedding_dim_in=embedding_dim, hidden_dim=dim_feedforward + ) + self.norm2 = nn.LayerNorm(embedding_dim) + self.connect = nn.Conv2D( + embedding_dim, + embedding_dim, + kernel_size=(3, 3), + stride=(1, 1), + padding=(1, 1), + groups=embedding_dim, + bias_attr=False, + ) + self.connect_norm = nn.LayerNorm(embedding_dim) + self.channel_mlp2 = Mlp( + embedding_dim_in=embedding_dim, hidden_dim=dim_feedforward + ) + self.drop_path = ( + DropPath(stochastic_depth_rate) if stochastic_depth_rate > 0 else Identity() + ) + + def forward(self, src): + src = src + self.drop_path(self.channel_mlp1(self.norm1(src))) + src = self.connect(self.connect_norm(src).transpose([0, 3, 1, 2])).transpose( + [0, 2, 3, 1] + ) + src = src + self.drop_path(self.channel_mlp2(self.norm2(src))) + return src + + +class ConvDownsample(nn.Layer): + def __init__(self, embedding_dim_in, embedding_dim_out): + super().__init__() + self.downsample = nn.Conv2D( + embedding_dim_in, + embedding_dim_out, + kernel_size=(3, 3), + stride=(2, 2), + padding=(1, 1), + ) + + def forward(self, x): + x = x.transpose([0, 3, 1, 2]) + x = self.downsample(x) + return x.transpose([0, 2, 3, 1]) + + +class BasicStage(nn.Layer): + def __init__(self, + num_blocks, + embedding_dims, + mlp_ratio=1, + stochastic_depth_rate=0.1, + downsample=True): + super().__init__() + self.blocks = nn.LayerList() + dpr = [x.item() for x in paddle.linspace(0, stochastic_depth_rate, num_blocks)] + for i in range(num_blocks): + block = ConvMLPStage( + embedding_dim=embedding_dims[0], + dim_feedforward=int(embedding_dims[0] * mlp_ratio), + stochastic_depth_rate=dpr[i], + ) + self.blocks.append(block) + + self.downsample_mlp = ( + ConvDownsample(embedding_dims[0], embedding_dims[1]) + if downsample + else Identity() + ) + + def forward(self, x): + for blk in self.blocks: + x = blk(x) + x = self.downsample_mlp(x) + return x + + +class ConvMLP(nn.Layer): + def __init__(self, + blocks, + dims, + mlp_ratios, + channels=64, + n_conv_blocks=3, + classifier_head=True, + num_classes=1000, + droppath=0., + *args, + **kwargs): + super().__init__() + assert ( + len(blocks) == len(mlp_ratios) == len(mlp_ratios) + ), f"blocks, dims and mlp_ratios must agree in size" + + self.tokenizer = ConvTokenizer(embedding_dim=channels) + self.conv_stages = ConvStage( + n_conv_blocks, + embedding_dim_in=channels, + hidden_dim=dims[0], + embedding_dim_out=dims[0], + ) + + self.stages = nn.LayerList() + for i,block in enumerate(blocks): + stage = BasicStage( + num_blocks=block, + embedding_dims=dims[i : i + 2], + mlp_ratio=mlp_ratios[i], + stochastic_depth_rate=droppath, + downsample=(i + 1 < 
len(blocks)), + ) + self.stages.append(stage) + if classifier_head: + self.norm = nn.LayerNorm(dims[-1]) + self.head = nn.Linear(dims[-1], num_classes) + else: + self.head = None + self.apply(self.init_weight) + + def forward(self, x): + x = self.tokenizer(x) + x = self.conv_stages(x) + x = x.transpose([0, 2, 3, 1]) + for stage in self.stages: + x = stage(x) + if self.head is None: + return x + B, _, _, C = x.shape + x = x.reshape([B, -1, C]) + x = self.norm(x) + x = x.mean(axis=1) + x = self.head(x) + return x + + def init_weight(self, m): + if isinstance(m, (nn.Linear, nn.Conv1D)): + trunc_normal_(m.weight) + if isinstance(m, (nn.Linear, nn.Conv1D)) and m.bias is not None: + zeros_(m.bias) + elif isinstance(m, nn.LayerNorm): + zeros_(m.bias) + ones_(m.weight) + elif isinstance(m, nn.Conv2D): + kaiming_normal_(m.weight) + elif isinstance(m, nn.BatchNorm2D): + ones_(m.weight) + zeros_(m.bias) + + +def build_convmlp(config): + model = ConvMLP( + blocks=config.MODEL.MIXER.BLOCKS, + dims=config.MODEL.MIXER.DIMS, + mlp_ratios=config.MODEL.MIXER.MLP_RATIOS, + channels=config.MODEL.MIXER.CHANNELS, + n_conv_blocks=config.MODEL.MIXER.N_CONV_BLOCKS, + classifier_head=True, + num_classes=config.MODEL.NUM_CLASSES, + droppath=config.MODEL.DROP_PATH, + ) + return model + + +def convmlp_m(**kwargs): + model = ConvMLP( + blocks=[3, 6, 3], + dims=[128, 256, 512], + mlp_ratios=[3, 3, 3], + channels=64, + n_conv_blocks=3, + classifier_head=True, + num_classes=1000, + ) + return model + + +def convmlp_l(**kwargs): + model = ConvMLP( + blocks=[4, 8, 3], + dims=[192, 384, 768], + mlp_ratios=[3, 3, 3], + channels=96, + n_conv_blocks=3, + classifier_head=True, + num_classes=1000, + ) + return model diff --git a/image_classification/ConvMLP/datasets.py b/image_classification/ConvMLP/datasets.py new file mode 100644 index 00000000..304df9a3 --- /dev/null +++ b/image_classification/ConvMLP/datasets.py @@ -0,0 +1,222 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Dataset related classes and methods for ViT training and validation +Cifar10, Cifar100 and ImageNet2012 are supported +""" + +import os +import math +from PIL import Image +from paddle.io import Dataset +from paddle.io import DataLoader +from paddle.io import DistributedBatchSampler +from paddle.vision import transforms +from paddle.vision import datasets +from paddle.vision import image_load +from augment import auto_augment_policy_original +from augment import AutoAugment +from augment import rand_augment_policy_original +from augment import RandAugment +from transforms import RandomHorizontalFlip +from random_erasing import RandomErasing + + +class ImageNet2012Dataset(Dataset): + """Build ImageNet2012 dataset + + This class gets train/val imagenet datasets, which loads transfomed data and labels. 
+ + Attributes: + file_folder: path where imagenet images are stored + transform: preprocessing ops to apply on image + img_path_list: list of full path of images in whole dataset + label_list: list of labels of whole dataset + """ + + def __init__(self, file_folder, mode="train", transform=None): + """Init ImageNet2012 Dataset with dataset file path, mode(train/val), and transform""" + super(ImageNet2012Dataset, self).__init__() + assert mode in ["train", "val"] + self.file_folder = file_folder + self.transform = transform + self.img_path_list = [] + self.label_list = [] + + if mode == "train": + self.list_file = os.path.join(self.file_folder, "train_list.txt") + else: + self.list_file = os.path.join(self.file_folder, "val_list.txt") + + with open(self.list_file, 'r') as infile: + for line in infile: + img_path = line.strip().split()[0] + img_label = int(line.strip().split()[1]) + self.img_path_list.append(os.path.join(self.file_folder, img_path)) + self.label_list.append(img_label) + print(f'----- Imagenet2012 image {mode} list len = {len(self.label_list)}') + + def __len__(self): + return len(self.label_list) + + def __getitem__(self, index): + data = image_load(self.img_path_list[index]).convert('RGB') + data = self.transform(data) + label = self.label_list[index] + + return data, label + + +def get_train_transforms(config): + """ Get training transforms + + For training, a RandomResizedCrop is applied, then normalization is applied with + [0.5, 0.5, 0.5] mean and std. The input pixel values must be rescaled to [0, 1.] + Outputs is converted to tensor + + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + + aug_op_list = [] + # STEP1: random crop and resize + aug_op_list.append( + transforms.RandomResizedCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE), + scale=(0.05, 1.0), interpolation='bicubic')) + # STEP2: auto_augment or color jitter + if config.TRAIN.AUTO_AUGMENT: + policy = auto_augment_policy_original() + auto_augment = AutoAugment(policy) + aug_op_list.append(auto_augment) + elif config.TRAIN.RAND_AUGMENT: + policy = rand_augment_policy_original() + rand_augment = RandAugment(policy) + aug_op_list.append(rand_augment) + else: + jitter = (float(config.TRAIN.COLOR_JITTER), ) * 3 + aug_op_list.append(transforms.ColorJitter(*jitter)) + # STEP3: other ops + aug_op_list.append(transforms.ToTensor()) + aug_op_list.append(transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, + std=config.DATA.IMAGENET_STD)) + # STEP4: random erasing + if config.TRAIN.RANDOM_ERASE_PROB > 0.: + random_erasing = RandomErasing(prob=config.TRAIN.RANDOM_ERASE_PROB, + mode=config.TRAIN.RANDOM_ERASE_MODE, + max_count=config.TRAIN.RANDOM_ERASE_COUNT, + num_splits=config.TRAIN.RANDOM_ERASE_SPLIT) + aug_op_list.append(random_erasing) + # Final: compose transforms and return + transforms_train = transforms.Compose(aug_op_list) + return transforms_train + + +def get_val_transforms(config): + """ Get training transforms + + For validation, image is first Resize then CenterCrop to image_size. + Then normalization is applied with [0.5, 0.5, 0.5] mean and std. + The input pixel values must be rescaled to [0, 1.] 
+ Outputs is converted to tensor + + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + + scale_size = int(math.floor(config.DATA.IMAGE_SIZE / config.DATA.CROP_PCT)) + transforms_val = transforms.Compose([ + transforms.Resize(scale_size, interpolation='bicubic'), + transforms.CenterCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE)), + transforms.ToTensor(), + transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD), + ]) + return transforms_val + + +def get_dataset(config, mode='train'): + """ Get dataset from config and mode (train/val) + + Returns the related dataset object according to configs and mode(train/val) + + Args: + config: configs contains dataset related settings. see config.py for details + Returns: + dataset: dataset object + """ + + assert mode in ['train', 'val'] + if config.DATA.DATASET == "cifar10": + if mode == 'train': + dataset = datasets.Cifar10(mode=mode, transform=get_train_transforms(config)) + else: + mode = 'test' + dataset = datasets.Cifar10(mode=mode, transform=get_val_transforms(config)) + elif config.DATA.DATASET == "cifar100": + if mode == 'train': + dataset = datasets.Cifar100(mode=mode, transform=get_train_transforms(config)) + else: + mode = 'test' + dataset = datasets.Cifar100(mode=mode, transform=get_val_transforms(config)) + elif config.DATA.DATASET == "imagenet2012": + if mode == 'train': + dataset = ImageNet2012Dataset(config.DATA.DATA_PATH, + mode=mode, + transform=get_train_transforms(config)) + else: + dataset = ImageNet2012Dataset(config.DATA.DATA_PATH, + mode=mode, + transform=get_val_transforms(config)) + else: + raise NotImplementedError( + "[{config.DATA.DATASET}] Only cifar10, cifar100, imagenet2012 are supported now") + return dataset + + +def get_dataloader(config, dataset, mode='train', multi_process=False): + """Get dataloader with config, dataset, mode as input, allows multiGPU settings. + + Multi-GPU loader is implements as distributedBatchSampler. + + Args: + config: see config.py for details + dataset: paddle.io.dataset object + mode: train/val + multi_process: if True, use DistributedBatchSampler to support multi-processing + Returns: + dataloader: paddle.io.DataLoader object. + """ + + if mode == 'train': + batch_size = config.DATA.BATCH_SIZE + else: + batch_size = config.DATA.BATCH_SIZE_EVAL + + if multi_process is True: + sampler = DistributedBatchSampler(dataset, + batch_size=batch_size, + shuffle=(mode == 'train')) + dataloader = DataLoader(dataset, + batch_sampler=sampler, + num_workers=config.DATA.NUM_WORKERS) + else: + dataloader = DataLoader(dataset, + batch_size=batch_size, + num_workers=config.DATA.NUM_WORKERS, + shuffle=(mode == 'train')) + return dataloader diff --git a/image_classification/ConvMLP/droppath.py b/image_classification/ConvMLP/droppath.py new file mode 100644 index 00000000..c8fe8048 --- /dev/null +++ b/image_classification/ConvMLP/droppath.py @@ -0,0 +1,50 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Droppath, reimplement from https://github.com/yueatsprograms/Stochastic_Depth +""" + +import paddle +import paddle.nn as nn + +def drop_path(inputs, drop_prob=0., training=False): + """drop path op + Args: + input: tensor with arbitrary shape + drop_prob: float number of drop path probability, default: 0.0 + training: bool, if current mode is training, default: False + Returns: + output: output tensor after drop path + """ + # if prob is 0 or eval mode, return original input + if drop_prob == 0. or not training: + return inputs + keep_prob = 1 - drop_prob + keep_prob = paddle.to_tensor(keep_prob) + shape = (inputs.shape[0], ) + (1, ) * (inputs.ndim - 1) # shape=(N, 1, 1, 1) + random_tensor = keep_prob + paddle.rand(shape, dtype=inputs.dtype) + random_tensor = random_tensor.floor() # mask + output = inputs.divide(keep_prob) * random_tensor # divide is to keep same output expectation + return output + + +class DropPath(nn.Layer): + """DropPath class""" + def __init__(self, drop_prob=None): + super(DropPath, self).__init__() + self.drop_prob = drop_prob + + def forward(self, inputs): + return drop_path(inputs, self.drop_prob, self.training) diff --git a/image_classification/ConvMLP/losses.py b/image_classification/ConvMLP/losses.py new file mode 100644 index 00000000..082467a3 --- /dev/null +++ b/image_classification/ConvMLP/losses.py @@ -0,0 +1,123 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +""" Implement Loss functions """ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class LabelSmoothingCrossEntropyLoss(nn.Layer): + """ cross entropy loss for label smoothing + Args: + smoothing: float, smoothing rate + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, target label with shape [N] + Return: + loss: float, cross entropy loss value + """ + def __init__(self, smoothing=0.1): + super().__init__() + assert 0 <= smoothing < 1.0 + self.smoothing = smoothing + self.confidence = 1 - smoothing + + def forward(self, x, target): + log_probs = F.log_softmax(x) # [N, num_classes] + # target_index is used to get prob for each of the N samples + target_index = paddle.zeros([x.shape[0], 2], dtype='int64') # [N, 2] + target_index[:, 0] = paddle.arange(x.shape[0]) + target_index[:, 1] = target + + nll_loss = -log_probs.gather_nd(index=target_index) # index: [N] + smooth_loss = -log_probs.mean(axis=-1) + loss = self.confidence * nll_loss + self.smoothing * smooth_loss + return loss.mean() + + +class SoftTargetCrossEntropyLoss(nn.Layer): + """ cross entropy loss for soft target + Args: + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, soft target with shape [N, num_classes] + Returns: + loss: float, the mean loss value + """ + def __init__(self): + super().__init__() + + def forward(self, x, target): + loss = paddle.sum(-target * F.log_softmax(x, axis=-1), axis=-1) + return loss.mean() + + +class DistillationLoss(nn.Layer): + """Distillation loss function + This layer includes the orginal loss (criterion) and a extra + distillation loss (criterion), which computes the loss with + different type options, between current model and + a teacher model as its supervision. 
+ + Args: + base_criterion: nn.Layer, the original criterion + teacher_model: nn.Layer, the teacher model as supervision + distillation_type: str, one of ['none', 'soft', 'hard'] + alpha: float, ratio of base loss (* (1-alpha)) + and distillation loss( * alpha) + tao: float, temperature in distillation + """ + def __init__(self, + base_criterion, + teacher_model, + distillation_type, + alpha, + tau): + super().__init__() + assert distillation_type in ['none', 'soft', 'hard'] + self.base_criterion = base_criterion + self.teacher_model = teacher_model + self.type = distillation_type + self.alpha = alpha + self.tau = tau + + def forward(self, inputs, outputs, targets): + """ + Args: + inputs: tensor, the orginal model inputs + outputs: tensor, the outputs of the model + outputds_kd: tensor, the distillation outputs of the model, + this is usually obtained by a separate branch + in the last layer of the model + targets: tensor, the labels for the base criterion + """ + outputs, outputs_kd = outputs[0], outputs[1] + base_loss = self.base_criterion(outputs, targets) + if self.type == 'none': + return base_loss + + with paddle.no_grad(): + teacher_outputs = self.teacher_model(inputs) + + if self.type == 'soft': + distillation_loss = F.kl_div( + F.log_softmax(outputs_kd / self.tau, axis=1), + F.log_softmax(teacher_outputs / self.tau, axis=1), + reduction='sum') * (self.tau * self.tau) / outputs_kd.numel() + elif self.type == 'hard': + distillation_loss = F.cross_entropy(outputs_kd, teacher_outputs.argmax(axis=1)) + + loss = base_loss * (1 - self.alpha) + distillation_loss * self.alpha + return loss + + diff --git a/image_classification/ConvMLP/main_multi_gpu.py b/image_classification/ConvMLP/main_multi_gpu.py new file mode 100644 index 00000000..e91d5efd --- /dev/null +++ b/image_classification/ConvMLP/main_multi_gpu.py @@ -0,0 +1,581 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +"""ConvMLP training/validation using multiple GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +import paddle.distributed as dist +from datasets import get_dataloader +from datasets import get_dataset +from utils import AverageMeter +from utils import WarmupCosineScheduler +from config import get_config +from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from convmlp import build_convmlp as build_model + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('ConvMLP') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + mixup_fn=None, + amp=False, + local_logger=None, + master_logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + train_loss_meter.avg + train_acc_meter.avg + train_time + """ + model.train() + train_loss_meter = AverageMeter() + train_acc_meter = AverageMeter() + master_train_loss_meter = AverageMeter() + master_train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] 
+ label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() + + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = paddle.to_tensor(image.shape[0]) + + # sync from other gpus for overall loss and acc + master_loss = loss.clone() + master_acc = acc.clone() + master_batch_size = batch_size.clone() + dist.all_reduce(master_loss) + dist.all_reduce(master_acc) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc = master_acc / dist.get_world_size() + master_train_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_train_acc_meter.update(master_acc.numpy()[0], master_batch_size.numpy()[0]) + + train_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + train_acc_meter.update(acc.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_train_loss_meter.avg:.4f}, " + + f"Avg Acc: {master_train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return (train_loss_meter.avg, + train_acc_meter.avg, + master_train_loss_meter.avg, + master_train_acc_meter.avg, + train_time) + + +def validate(dataloader, + model, + criterion, + total_batch, + debug_steps=100, + local_logger=None, + master_logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + total_epoch: int, total num of epoch, for logging + debug_steps: int, num of iters to log info, default: 100 + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + master_val_loss_meter.avg: float, average loss on all processes/gpus + master_val_acc1_meter.avg: float, average top1 accuracy on all processes/gpus + master_val_acc5_meter.avg: float, average top5 accuracy on all processes/gpus + val_time: float, validation time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = AverageMeter() + 
master_val_loss_meter = AverageMeter() + master_val_acc1_meter = AverageMeter() + master_val_acc5_meter = AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = paddle.to_tensor(image.shape[0]) + + master_loss = loss.clone() + master_acc1 = acc1.clone() + master_acc5 = acc5.clone() + master_batch_size = batch_size.clone() + + dist.all_reduce(master_loss) + dist.all_reduce(master_acc1) + dist.all_reduce(master_acc5) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc1 = master_acc1 / dist.get_world_size() + master_acc5 = master_acc5 / dist.get_world_size() + + master_val_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc1_meter.update(master_acc1.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc5_meter.update(master_acc5.numpy()[0], master_batch_size.numpy()[0]) + + val_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + val_acc1_meter.update(acc1.numpy()[0], batch_size.numpy()[0]) + val_acc5_meter.update(acc5.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {master_val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {master_val_acc5_meter.avg:.4f}") + val_time = time.time() - time_st + return (val_loss_meter.avg, + val_acc1_meter.avg, + val_acc5_meter.avg, + master_val_loss_meter.avg, + master_val_acc1_meter.avg, + master_val_acc5_meter.avg, + val_time) + + +def main_worker(*args): + # STEP 0: Preparation + config = args[0] + dist.init_parallel_env() + last_epoch = config.TRAIN.LAST_EPOCH + world_size = dist.get_world_size() + local_rank = dist.get_rank() + seed = config.SEED + local_rank + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + # logger for each process/gpu + local_logger = get_logger( + filename=os.path.join(config.SAVE, 'log_{}.txt'.format(local_rank)), + logger_name='local_logger') + # overall logger + if local_rank == 0: + master_logger = get_logger( + filename=os.path.join(config.SAVE, 'log.txt'), + logger_name='master_logger') + master_logger.info(f'\n{config}') + else: + master_logger = None + local_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + if local_rank == 0: + master_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + + # STEP 1: Create model + model = build_model(config) + model = paddle.DataParallel(model) + + # STEP 2: Create train and val dataloader + dataset_train, dataset_val = args[1], args[2] + # Create training dataloader + if not config.EVAL: + dataloader_train = get_dataloader(config, dataset_train, 'train', True) + total_batch_train = len(dataloader_train) + local_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + if local_rank == 0: + master_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + # 
Create validation dataloader + dataloader_val = get_dataloader(config, dataset_val, 'test', True) + total_batch_val = len(dataloader_val) + local_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + if local_rank == 0: + master_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH, + ) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + local_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if 
scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip, + #apply_decay_param_fun=get_exclude_from_weight_decay_fn([ + # 'absolute_pos_embed', 'relative_position_bias_table']), + ) + else: + local_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 6: Load pretrained model / load resumt model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + model.set_dict(model_state) + local_logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + if local_rank == 0: + master_logger.info( + f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') + optimizer.set_state_dict(opt_state) + local_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + if local_rank == 0: + master_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + + # STEP 7: Validation (eval mode) + if config.EVAL: + local_logger.info('----- Start Validating') + if local_rank == 0: + master_logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 8: Start training and validation (train mode) + local_logger.info(f"Start training from epoch {last_epoch+1}.") + if local_rank == 0: + master_logger.info(f"Start training from epoch {last_epoch+1}.") + for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + local_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + if local_rank == 0: + master_logger.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") + train_loss, train_acc, avg_loss, avg_acc, train_time = train( + dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=total_batch_train, + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + mixup_fn=mixup_fn, + amp=config.AMP, + local_logger=local_logger, + master_logger=master_logger) + + scheduler.step() + + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {avg_loss:.4f}, " + + f"Train Acc: {avg_acc:.4f}, " + + f"time: {train_time:.2f}") + + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + local_logger.info(f'----- Validation after Epoch: {epoch}') + if local_rank == 0: + master_logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if local_rank == 0: + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + master_logger.info(f"----- Save model: {model_path}.pdparams") + master_logger.info(f"----- Save optim: {model_path}.pdopt") + + +def main(): + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + + # get dataset and start DDP + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + else: + dataset_train = None + dataset_val = get_dataset(config, mode='val') + config.NGPUS = len(paddle.static.cuda_places()) if config.NGPUS == -1 else config.NGPUS + dist.spawn(main_worker, args=(config, dataset_train, dataset_val, ), nprocs=config.NGPUS) + + +if __name__ == "__main__": + main() diff --git a/image_classification/ConvMLP/main_single_gpu.py b/image_classification/ConvMLP/main_single_gpu.py new file mode 100644 index 00000000..27e8de97 --- /dev/null +++ b/image_classification/ConvMLP/main_single_gpu.py @@ -0,0 +1,423 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""ConvMLP training/validation using single GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from datasets import get_dataloader +from datasets import get_dataset +from utils import AverageMeter +from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn +from config import get_config +from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from convmlp import build_convmlp as build_model + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('ConvMLP') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + mixup_fn=None, + amp=False, + logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + 
logger: logger for logging, default: None + Returns: + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + train_time: float, training time + """ + model.train() + train_loss_meter = AverageMeter() + train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() + + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = image.shape[0] + train_loss_meter.update(loss.numpy()[0], batch_size) + train_acc_meter.update(acc.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return train_loss_meter.avg, train_acc_meter.avg, train_time + + +def validate(dataloader, model, criterion, total_batch, debug_steps=100, logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + logger: logger for logging, default: None + Returns: + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + val_time: float, valitaion time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = image.shape[0] + val_loss_meter.update(loss.numpy()[0], batch_size) + val_acc1_meter.update(acc1.numpy()[0], batch_size) + val_acc5_meter.update(acc5.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: 
{val_acc5_meter.avg:.4f}") + + val_time = time.time() - time_st + return val_loss_meter.avg, val_acc1_meter.avg, val_acc5_meter.avg, val_time + + +def main(): + # STEP 0: Preparation + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + last_epoch = config.TRAIN.LAST_EPOCH + seed = config.SEED + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + logger = get_logger(filename=os.path.join(config.SAVE, 'log.txt')) + logger.info(f'\n{config}') + + # STEP 1: Create model + model = build_model(config) + + # STEP 2: Create train and val dataloader + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + dataloader_train = get_dataloader(config, dataset_train, 'train', False) + dataset_val = get_dataset(config, mode='val') + dataloader_val = get_dataloader(config, dataset_val, 'val', False) + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + 
last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip) + else: + logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 6: Load pretrained model or load resume model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + model.set_dict(model_state) + logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME + '.pdopt') + optimizer.set_state_dict(opt_state) + logger.info( + f"----- Resume: Load model and optmizer from {config.MODEL.RESUME}") + + # STEP 7: Validation (eval mode) + if config.EVAL: + logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 8: Start training and validation (train mode) + logger.info(f"Start training from epoch {last_epoch+1}.") + for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + logger.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") + train_loss, train_acc, train_time = train(dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=len(dataloader_train), + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + mixup_fn=mixup_fn, + amp=config.AMP, + logger=logger) + scheduler.step() + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + logger.info(f"----- Save model: {model_path}.pdparams") + logger.info(f"----- Save optim: {model_path}.pdopt") + + +if __name__ == "__main__": + main() diff --git a/image_classification/ConvMLP/mixup.py b/image_classification/ConvMLP/mixup.py new file mode 100644 index 00000000..1d2db493 --- /dev/null +++ b/image_classification/ConvMLP/mixup.py @@ -0,0 +1,225 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""mixup and cutmix for batch data""" +import numpy as np +import paddle + + +def rand_bbox(image_shape, lam, count=None): + """ CutMix bbox by lam value + Generate 1 random bbox by value lam. lam is the cut size rate. + The cut_size is computed by sqrt(1-lam) * image_size. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + count: int, number of bbox to generate + """ + image_h, image_w = image_shape[-2:] + cut_rate = np.sqrt(1. 
- lam) + cut_h = int(cut_rate * image_h) + cut_w = int(cut_rate * image_w) + + # get random bbox center + cy = np.random.randint(0, image_h, size=count) + cx = np.random.randint(0, image_w, size=count) + + # get bbox coords + bbox_x1 = np.clip(cx - cut_w // 2, 0, image_w) + bbox_y1 = np.clip(cy - cut_h // 2, 0, image_h) + bbox_x2 = np.clip(cx + cut_w // 2, 0, image_w) + bbox_y2 = np.clip(cy + cut_h // 2, 0, image_h) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # while in pytorch, it will return [] tensor + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def rand_bbox_minmax(image_shape, minmax, count=None): + """ CutMix bbox by min and max value + Generate 1 random bbox by min and max percentage values. + Minmax is a tuple/list of min and max percentage vlaues + applied to the image width and height. + + Args: + image_shape: tuple/list, image height and width + minmax: tuple/list, min and max percentage values of image size + count: int, number of bbox to generate + """ + assert len(minmax) == 2 + image_h, image_w = image_shape[-2:] + min_ratio = minmax[0] + max_ratio = minmax[1] + cut_h = np.random.randint(int(image_h * min_ratio), int(image_h * max_ratio), size=count) + cut_w = np.random.randint(int(image_w * min_ratio), int(image_w * max_ratio), size=count) + + bbox_x1 = np.random.randint(0, image_w - cut_w, size=count) + bbox_y1 = np.random.randint(0, image_h - cut_h, size=count) + bbox_x2 = bbox_x1 + cut_w + bbox_y2 = bbox_y1 + cut_h + + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def cutmix_generate_bbox_adjust_lam(image_shape, lam, minmax=None, correct_lam=True, count=None): + """Generate bbox and apply correction for lambda + If the mimmax is None, apply the standard cutmix by lam value, + If the minmax is set, apply the cutmix by min and max percentage values. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + minmax: tuple/list, min and max percentage values of image size + correct_lam: bool, if True, correct the lam value by the generated bbox + count: int, number of bbox to generate + """ + if minmax is not None: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox_minmax(image_shape, minmax, count) + else: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox(image_shape, lam, count) + + if correct_lam or minmax is not None: + image_h, image_w = image_shape[-2:] + bbox_area = (bbox_y2 - bbox_y1) * (bbox_x2 - bbox_x1) + lam = 1. - bbox_area / float(image_h * image_w) + return (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam + + +def one_hot(x, num_classes, on_value=1., off_value=0.): + """ Generate one-hot vector for label smoothing + Args: + x: tensor, contains label/class indices + num_classes: int, num of classes (len of the one-hot vector) + on_value: float, the vector value at label index, default=1. + off_value: float, the vector value at non-label indices, default=0. + Returns: + one_hot: tensor, tensor with on value at label index and off value + at non-label indices. + """ + x = x.reshape_([-1, 1]) + x_smoothed = paddle.full((x.shape[0], num_classes), fill_value=off_value) + for i in range(x.shape[0]): + x_smoothed[i, x[i]] = on_value + return x_smoothed + + +def mixup_one_hot(label, num_classes, lam=1., smoothing=0.): + """ mixup and label smoothing in batch + label smoothing is firstly applied, then + mixup is applied by mixing the bacth and its flip, + with a mixup rate. 
+ + Args: + label: tensor, label tensor with shape [N], contains the class indices + num_classes: int, num of all classes + lam: float, mixup rate, default=1.0 + smoothing: float, label smoothing rate + """ + off_value = smoothing / num_classes + on_value = 1. - smoothing + off_value + y1 = one_hot(label, num_classes, on_value, off_value) + y2 = one_hot(label.flip(axis=[0]), num_classes, on_value, off_value) + return y2 * (1 - lam) + y1 * lam + + +class Mixup: + """Mixup class + Args: + mixup_alpha: float, mixup alpha for beta distribution, default=1.0, + cutmix_alpha: float, cutmix alpha for beta distribution, default=0.0, + cutmix_minmax: list/tuple, min and max value for cutmix ratio, default=None, + prob: float, if random prob < prob, do not use mixup, default=1.0, + switch_prob: float, prob of switching mixup and cutmix, default=0.5, + mode: string, mixup up, now only 'batch' is supported, default='batch', + correct_lam: bool, if True, apply correction of lam, default=True, + label_smoothing: float, label smoothing rate, default=0.1, + num_classes: int, num of classes, default=1000 + """ + def __init__(self, + mixup_alpha=1.0, + cutmix_alpha=0.0, + cutmix_minmax=None, + prob=1.0, + switch_prob=0.5, + mode='batch', + correct_lam=True, + label_smoothing=0.1, + num_classes=1000): + self.mixup_alpha = mixup_alpha + self.cutmix_alpha = cutmix_alpha + self.cutmix_minmax = cutmix_minmax + if cutmix_minmax is not None: + assert len(cutmix_minmax) == 2 + self.cutmix_alpha = 1.0 + self.mix_prob = prob + self.switch_prob = switch_prob + self.label_smoothing = label_smoothing + self.num_classes = num_classes + self.mode = mode + self.correct_lam = correct_lam + assert mode == 'batch', 'Now only batch mode is supported!' + + def __call__(self, x, target): + assert x.shape[0] % 2 == 0, "Batch size should be even" + lam = self._mix_batch(x) + target = mixup_one_hot(target, self.num_classes, lam, self.label_smoothing) + return x, target + + def get_params(self): + """Decide to use cutmix or regular mixup by sampling and + sample lambda for mixup + """ + lam = 1. + use_cutmix = False + use_mixup = np.random.rand() < self.mix_prob + if use_mixup: + if self.mixup_alpha > 0. and self.cutmix_alpha > 0.: + use_cutmix = np.random.rand() < self.switch_prob + alpha = self.cutmix_alpha if use_cutmix else self.mixup_alpha + lam_mix = np.random.beta(alpha, alpha) + elif self.mixup_alpha == 0. and self.cutmix_alpha > 0.: + use_cutmix=True + lam_mix = np.random.beta(self.cutmix_alpha, self.cutmix_alpha) + elif self.mixup_alpha > 0. 
and self.cutmix_alpha == 0.: + lam_mix = np.random.beta(self.mixup_alpha, self.mixup_alpha) + else: + raise ValueError('mixup_alpha and cutmix_alpha cannot be all 0') + lam = float(lam_mix) + return lam, use_cutmix + + def _mix_batch(self, x): + """mixup/cutmix by adding batch data and its flipped version""" + lam, use_cutmix = self.get_params() + if lam == 1.: + return lam + if use_cutmix: + (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam = cutmix_generate_bbox_adjust_lam( + x.shape, + lam, + minmax=self.cutmix_minmax, + correct_lam=self.correct_lam) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # but in pytorch, it will return [] tensor without errors + if int(bbox_x1) != int(bbox_x2) and int(bbox_y1) != int(bbox_y2): + x[:, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] = x.flip(axis=[0])[ + :, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] + else: + x_flipped = x.flip(axis=[0]) + x_flipped = x_flipped * (1 - lam) + x.set_value(x * (lam) + x_flipped) + return lam diff --git a/image_classification/ConvMLP/model_ema.py b/image_classification/ConvMLP/model_ema.py new file mode 100644 index 00000000..e5ea7480 --- /dev/null +++ b/image_classification/ConvMLP/model_ema.py @@ -0,0 +1,61 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" Implement the Exponential Model Averaging +This is paddle hack from: +https://github.com/rwightman/pytorch-image-models/blob/master/timm/utils/model_ema.py +""" + +import copy +from collections import OrderedDict +import paddle +import paddle.nn as nn + + +class ModelEma: + """Model Ema + A moving average is kept of model weights and buffers. + Note that for multiple gpu, ema must be defined after mode init, + but before DataParallel. 
+ + Args: + model: nn.Layer, original modela with learnable params + decay: float, decay rate for each update, default: 0.999 + """ + def __init__(self, model, decay=0.999): + self.module = copy.deepcopy(model) + self.module.eval() + self.module.to('cpu') + self.decay = decay + + @paddle.no_grad() + def _update(self, model, update_fn): + # update ema model parameters by model parameters + for (_, ema_param), (_, model_param) in zip( + self.module.named_parameters(), model.named_parameters()): + ema_param.set_value(copy.deepcopy(update_fn(ema_param, model_param))) + + # update ema model buffers by model buffers + for (_, ema_buf), (_, model_buf) in zip( + self.module.named_buffers(), model.named_buffers()): + ema_buf.set_value(copy.deepcopy(update_fn(ema_buf, model_buf))) + + def update(self, model): + self._update(model, update_fn=lambda e, m: self.decay * e + (1 - self.decay) * m) + + def set(self, model): + self._update(model, update_fn=lambda e, m: m) + + def state_dict(self): + return self.module.state_dict() diff --git a/image_classification/ConvMLP/random_erasing.py b/image_classification/ConvMLP/random_erasing.py new file mode 100644 index 00000000..31eea465 --- /dev/null +++ b/image_classification/ConvMLP/random_erasing.py @@ -0,0 +1,118 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Random Erasing for image tensor""" + +import random +import math +import paddle + + +def _get_pixels(per_pixel, rand_color, patch_size, dtype="float32"): + if per_pixel: + return paddle.normal(shape=patch_size).astype(dtype) + if rand_color: + return paddle.normal(shape=(patch_size[0], 1, 1)).astype(dtype) + return paddle.zeros((patch_size[0], 1, 1)).astype(dtype) + + +class RandomErasing(object): + """ + Args: + prob: probability of performing random erasing + min_area: Minimum percentage of erased area wrt input image area + max_area: Maximum percentage of erased area wrt input image area + min_aspect: Minimum aspect ratio of earsed area + max_aspect: Maximum aspect ratio of earsed area + mode: pixel color mode, in ['const', 'rand', 'pixel'] + 'const' - erase block is constant valued 0 for all channels + 'rand' - erase block is valued random color (same per-channel) + 'pixel' - erase block is vauled random color per pixel + min_count: Minimum # of ereasing blocks per image. + max_count: Maximum # of ereasing blocks per image. 
Area per box is scaled by count + per-image count is randomly chosen between min_count to max_count + """ + def __init__(self, prob=0.5, min_area=0.02, max_area=1/3, min_aspect=0.3, max_aspect=None, + mode='const', min_count=1, max_count=None, num_splits=0): + self.prob = prob + self.min_area = min_area + self.max_area = max_area + max_aspect = max_aspect or 1 / min_aspect + self.log_aspect_ratio = (math.log(min_aspect), math.log(max_aspect)) + self.min_count = min_count + self.max_count = max_count or min_count + self.num_splits = num_splits + mode = mode.lower() + self.rand_color = False + self.per_pixel = False + if mode == "rand": + self.rand_color = True + elif mode == "pixel": + self.per_pixel = True + else: + assert not mode or mode == "const" + + def _erase(self, img, chan, img_h, img_w, dtype): + if random.random() > self.prob: + return + area = img_h * img_w + count = self.min_count if self.min_count == self.max_count else \ + random.randint(self.min_count, self.max_count) + for _ in range(count): + for attempt in range(10): + target_area = random.uniform(self.min_area, self.max_area) * area / count + aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio)) + h = int(round(math.sqrt(target_area * aspect_ratio))) + w = int(round(math.sqrt(target_area / aspect_ratio))) + if w < img_w and h < img_h: + top = random.randint(0, img_h - h) + left = random.randint(0, img_w - w) + img[:, top:top+h, left:left+w] = _get_pixels( + self.per_pixel, self.rand_color, (chan, h, w), + dtype=dtype) + break + + def __call__(self, input): + if len(input.shape) == 3: + self._erase(input, *input.shape, input.dtype) + else: + batch_size, chan, img_h, img_w = input.shape + batch_start = batch_size // self.num_splits if self.num_splits > 1 else 0 + for i in range(batch_start, batch_size): + self._erase(input[i], chan, img_h, img_w, input.dtype) + return input + + + +#def main(): +# re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='rand') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='const') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='pixel') +# import PIL.Image as Image +# import numpy as np +# paddle.set_device('cpu') +# img = paddle.to_tensor(np.asarray(Image.open('./lenna.png'))).astype('float32') +# img = img / 255.0 +# img = paddle.transpose(img, [2, 0, 1]) +# new_img = re(img) +# new_img = new_img * 255.0 +# new_img = paddle.transpose(new_img, [1, 2, 0]) +# new_img = new_img.cpu().numpy() +# new_img = Image.fromarray(new_img.astype('uint8')) +# new_img.save('./res.png') +# +# +# +#if __name__ == "__main__": +# main() diff --git a/image_classification/ConvMLP/run_eval.sh b/image_classification/ConvMLP/run_eval.sh new file mode 100644 index 00000000..33ff0d36 --- /dev/null +++ b/image_classification/ConvMLP/run_eval.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ +-cfg='./configs/convmlp_s.yaml' \ +-dataset='imagenet2012' \ +-batch_size=32 \ +-data_path='/dataset/imagenet' \ +-eval \ +-pretrained='./convmlp_s' diff --git a/image_classification/ConvMLP/run_eval_multi.sh b/image_classification/ConvMLP/run_eval_multi.sh new file mode 100644 index 00000000..266661d9 --- /dev/null +++ b/image_classification/ConvMLP/run_eval_multi.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ +-cfg='./configs/convmlp_l.yaml' \ +-dataset='imagenet2012' \ +-batch_size=32 \ +-data_path='/dataset/imagenet' \ +-eval \ +-pretrained='./convmlp_l' diff --git a/image_classification/ConvMLP/run_train.sh 
b/image_classification/ConvMLP/run_train.sh new file mode 100644 index 00000000..5d9eb788 --- /dev/null +++ b/image_classification/ConvMLP/run_train.sh @@ -0,0 +1,6 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ +-cfg='./configs/convmlp_s.yaml' \ +-dataset='imagenet2012' \ +-batch_size=8 \ +-data_path='/dataset/imagenet' \ diff --git a/image_classification/ConvMLP/run_train_multi.sh b/image_classification/ConvMLP/run_train_multi.sh new file mode 100644 index 00000000..87f70083 --- /dev/null +++ b/image_classification/ConvMLP/run_train_multi.sh @@ -0,0 +1,6 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ +-cfg='./configs/convmlp_s.yaml' \ +-dataset='imagenet2012' \ +-batch_size=8 \ +-data_path='/dataset/imagenet' \ diff --git a/image_classification/ConvMLP/stat.py b/image_classification/ConvMLP/stat.py new file mode 100644 index 00000000..4e65f3bd --- /dev/null +++ b/image_classification/ConvMLP/stat.py @@ -0,0 +1,65 @@ +import os +import glob +import paddle +from config import get_config +from convmlp import build_convmlp as build_model + +def count_gelu(layer, inputs, output): + activation_flops = 8 + x = inputs[0] + num = x.numel() + layer.total_ops += num * activation_flops + + +def count_softmax(layer, inputs, output): + softmax_flops = 5 # max/substract, exp, sum, divide + x = inputs[0] + num = x.numel() + layer.total_ops += num * softmax_flops + + +def count_layernorm(layer, inputs, output): + layer_norm_flops = 5 # get mean (sum), get variance (square and sum), scale(multiply) + x = inputs[0] + num = x.numel() + layer.total_ops += num * layer_norm_flops + + +for cfg in glob.glob('./configs/*.yaml'): + #cfg = './configs/pvtv2_b0.yaml' + #input_size = (1, 3, 512, 512) + #input_size = (1, 3, 448, 448) + #input_size = (1, 3, 384, 384) + #input_size = (1, 3, 256, 256) + input_size = (1, 3, 224, 224) + config = get_config(cfg) + model = build_model(config) + + custom_ops = {paddle.nn.GELU: count_gelu, + paddle.nn.LayerNorm: count_layernorm, + paddle.nn.Softmax: count_softmax, + } + print(os.path.basename(cfg)) + paddle.flops(model, + input_size=input_size, + custom_ops=custom_ops, + print_detail=False) + + +#for cfg in glob.glob('./configs/*.yaml'): +# #cfg = './configs/swin_base_patch4_window7_224.yaml' +# input_size = (1, 3, int(cfg[-8:-5]), int(cfg[-8:-5])) +# config = get_config(cfg) +# model = build_model(config) +# +# +# custom_ops = {paddle.nn.GELU: count_gelu, +# paddle.nn.LayerNorm: count_layernorm, +# paddle.nn.Softmax: count_softmax, +# } +# print(os.path.basename(cfg)) +# paddle.flops(model, +# input_size=input_size, +# custom_ops=custom_ops, +# print_detail=False) +# print('-----------') diff --git a/image_classification/ConvMLP/transforms.py b/image_classification/ConvMLP/transforms.py new file mode 100644 index 00000000..5a046912 --- /dev/null +++ b/image_classification/ConvMLP/transforms.py @@ -0,0 +1,14 @@ +import random +import paddle +import paddle.nn +import paddle.vision.transforms as T + + +class RandomHorizontalFlip(): + def __init__(self, p=0.5): + self.p = p + + def __call__(self, image): + if random.random() < self.p: + return T.hflip(image) + return image diff --git a/image_classification/ConvMLP/utils.py b/image_classification/ConvMLP/utils.py new file mode 100644 index 00000000..44800527 --- /dev/null +++ b/image_classification/ConvMLP/utils.py @@ -0,0 +1,120 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""utils for ViT + +Contains AverageMeter for monitoring, get_exclude_from_decay_fn for training +and WarmupCosineScheduler for training + +""" + +import math +from paddle.optimizer.lr import LRScheduler + + +class AverageMeter(): + """ Meter for monitoring losses""" + def __init__(self): + self.avg = 0 + self.sum = 0 + self.cnt = 0 + self.reset() + + def reset(self): + """reset all values to zeros""" + self.avg = 0 + self.sum = 0 + self.cnt = 0 + + def update(self, val, n=1): + """update avg by val and n, where val is the avg of n values""" + self.sum += val * n + self.cnt += n + self.avg = self.sum / self.cnt + + + +def get_exclude_from_weight_decay_fn(exclude_list=[]): + """ Set params with no weight decay during the training + + For certain params, e.g., positional encoding in ViT, weight decay + may not needed during the learning, this method is used to find + these params. + + Args: + exclude_list: a list of params names which need to exclude + from weight decay. + Returns: + exclude_from_weight_decay_fn: a function returns True if param + will be excluded from weight decay + """ + if len(exclude_list) == 0: + exclude_from_weight_decay_fn = None + else: + def exclude_fn(param): + for name in exclude_list: + if param.endswith(name): + return False + return True + exclude_from_weight_decay_fn = exclude_fn + return exclude_from_weight_decay_fn + + +class WarmupCosineScheduler(LRScheduler): + """Warmup Cosine Scheduler + + First apply linear warmup, then apply cosine decay schedule. + Linearly increase learning rate from "warmup_start_lr" to "start_lr" over "warmup_epochs" + Cosinely decrease learning rate from "start_lr" to "end_lr" over remaining + "total_epochs - warmup_epochs" + + Attributes: + learning_rate: the starting learning rate (without warmup), not used here! + warmup_start_lr: warmup starting learning rate + start_lr: the starting learning rate (without warmup) + end_lr: the ending learning rate after whole loop + warmup_epochs: # of epochs for warmup + total_epochs: # of total epochs (include warmup) + """ + def __init__(self, + learning_rate, + warmup_start_lr, + start_lr, + end_lr, + warmup_epochs, + total_epochs, + cycles=0.5, + last_epoch=-1, + verbose=False): + """init WarmupCosineScheduler """ + self.warmup_epochs = warmup_epochs + self.total_epochs = total_epochs + self.warmup_start_lr = warmup_start_lr + self.start_lr = start_lr + self.end_lr = end_lr + self.cycles = cycles + super(WarmupCosineScheduler, self).__init__(learning_rate, last_epoch, verbose) + + def get_lr(self): + """ return lr value """ + if self.last_epoch < self.warmup_epochs: + val = (self.start_lr - self.warmup_start_lr) * float( + self.last_epoch)/float(self.warmup_epochs) + self.warmup_start_lr + return val + + progress = float(self.last_epoch - self.warmup_epochs) / float( + max(1, self.total_epochs - self.warmup_epochs)) + val = max(0.0, 0.5 * (1. 
+ math.cos(math.pi * float(self.cycles) * 2.0 * progress))) + val = max(0.0, val * (self.start_lr - self.end_lr) + self.end_lr) + return val diff --git a/image_classification/ConvMixer/README.md b/image_classification/ConvMixer/README.md new file mode 100644 index 00000000..6797ef02 --- /dev/null +++ b/image_classification/ConvMixer/README.md @@ -0,0 +1,155 @@ +# ConvMixer: Patches Are All You Need? 🤷, [OpenReview](https://openreview.net/forum?id=TVHS5Y4dNvM) + +PaddlePaddle training/validation code and pretrained models for **ConvMixer**. + +The official pytorch implementation is [here](https://github.com/tmp-iclr/convmixer). + +This implementation is developed by [PaddleViT](https://github.com/BR-IDL/PaddleViT.git). + +

+<img src="./convmixer.png" alt="drawing" />
+<h4 align="center">ConvMixer Model Overview</h4>

+ + + +### Update +- Update (2021-11-04): Model weights are updated. +- Update (2021-10-13): Code is released and ported weights are uploaded. + +## Models Zoo +| Model | Acc@1 | Acc@5 | #Params | FLOPs | Image Size | Crop_pct | Interpolation | Link | +|-------------------------------|-------|-------|---------|--------|------------|----------|---------------|--------------| +| convmixer_1024_20 | 76.94 | 93.35 | 24.5M | 9.5G | 224 | 0.96 | bicubic | [google](https://drive.google.com/file/d/1R7zUSl6_6NFFdNOe8tTfoR9VYQtGfD7F/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1DgGA3qYu4deH4woAkvjaBw)(qpn9) | +| convmixer_768_32 | 80.16 | 95.08 | 21.2M | 20.8G | 224 | 0.96 | bicubic | [google](https://drive.google.com/file/d/196Lg_Eet-hRj733BYASj22g51wdyaW2a/view?usp=sharing)/[baidu](https://pan.baidu.com/s/17CbRNzY2Sy_Cu7cxNAkWmQ)(m5s5) | +| convmixer_1536_20 | 81.37 | 95.62 | 51.8M | 72.4G | 224 | 0.96 | bicubic | [google](https://drive.google.com/file/d/1-LlAlADiu0SXDQmE34GN2GBhqI-RYRqO/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1R-gSzhzQNfkuZVxsaE4vEw)(xqty) | +> *The results are evaluated on ImageNet2012 validation set. +## Notebooks +We provide a few notebooks in aistudio to help you get started: + +**\*(coming soon)\*** + + +## Requirements +- Python>=3.6 +- yaml>=0.2.5 +- [PaddlePaddle](https://www.paddlepaddle.org.cn/documentation/docs/en/install/index_en.html)>=2.1.0 +- [yacs](https://github.com/rbgirshick/yacs)>=0.1.8 + +## Data +ImageNet2012 dataset is used in the following folder structure: +``` +│imagenet/ +├──train/ +│ ├── n01440764 +│ │ ├── n01440764_10026.JPEG +│ │ ├── n01440764_10027.JPEG +│ │ ├── ...... +│ ├── ...... +├──val/ +│ ├── n01440764 +│ │ ├── ILSVRC2012_val_00000293.JPEG +│ │ ├── ILSVRC2012_val_00002138.JPEG +│ │ ├── ...... +│ ├── ...... +``` + +## Usage +To use the model with pretrained weights, download the `.pdparam` weight file and change related file paths in the following python scripts. The model config files are located in `./configs/`. + +For example, assume the downloaded weight file is stored in `./convmixer_768_32.pdparams`, to use the `convmixer_768_32` model in python: +```python +from config import get_config +from convmixer import build_convmixer as build_model +# config files in ./configs/ +config = get_config('./configs/convmixer_768_32.yaml') +# build model +model = build_model(config) +# load pretrained weights +model_state_dict = paddle.load('./convmixer_768_32.pdparams') +model.set_dict(model_state_dict) +``` + +## Evaluation +To evaluate ConvMixer model performance on ImageNet2012 with a single GPU, run the following script using command line: +```shell +sh run_eval.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg=./configs/convmixer_768_32.yaml \ + -dataset=imagenet2012 \ + -batch_size=16 \ + -data_path=/path/to/dataset/imagenet/val \ + -eval \ + -pretrained=/path/to/pretrained/model/convmixer_768_32 # .pdparams is NOT needed +``` + +
+ + +Run evaluation using multi-GPUs: + + + +```shell +sh run_eval_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg=./configs/convmixer_768_32.yaml \ + -dataset=imagenet2012 \ + -batch_size=16 \ + -data_path=/path/to/dataset/imagenet/val \ + -eval \ + -pretrained=/path/to/pretrained/model/convmixer_768_32 # .pdparams is NOT needed +``` + +
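+You can also run the same evaluation from a Python session instead of the shell scripts. The snippet below is a minimal sketch (it is not one of the provided run scripts); the dataset path and weight file location are placeholders, and it assumes the ImageNet `val_list.txt` file and the downloaded `.pdparams` weights are available:
+
+```python
+import paddle
+import paddle.nn.functional as F
+from config import get_config
+from convmixer import build_convmixer
+from datasets import get_dataset, get_dataloader
+
+config = get_config('./configs/convmixer_768_32.yaml')
+config.defrost()
+config.DATA.DATA_PATH = '/path/to/dataset/imagenet'  # placeholder, adjust to your setup
+config.freeze()
+
+# build the model and load the ported weights
+model = build_convmixer(config)
+model.set_dict(paddle.load('./convmixer_768_32.pdparams'))
+model.eval()
+
+# validation pipeline: Resize -> CenterCrop -> ToTensor -> Normalize (see get_val_transforms)
+dataset_val = get_dataset(config, mode='val')
+dataloader_val = get_dataloader(config, dataset_val, mode='val', multi_process=False)
+
+with paddle.no_grad():
+    image, label = next(iter(dataloader_val))
+    acc1 = paddle.metric.accuracy(F.softmax(model(image)), label.unsqueeze(1))  # top-1 on one batch
+```
+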
+ + +## Training +To train the ConvMixer Transformer model on ImageNet2012 with single GPU, run the following script using command line: +```shell +sh run_train.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg=./configs/convmixer_768_32.yaml \ + -dataset=imagenet2012 \ + -batch_size=16 \ + -data_path=/path/to/dataset/imagenet/train \ +``` + +
+ + +Run training using multi-GPUs: + + + +```shell +sh run_train_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ +python main_multi_gpu.py \ + -cfg=./configs/convmixer_768_32.yaml \ + -dataset=imagenet2012 \ + -batch_size=32 \ + -data_path=/path/to/dataset/imagenet/train \ +``` + +
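+The training scripts above schedule the learning rate with the `WarmupCosineScheduler` defined in `utils.py`: the rate grows linearly from `WARMUP_START_LR` to `BASE_LR` over the warmup epochs, then decays along a cosine curve towards `END_LR` for the remaining epochs. Below is a minimal sketch of how it is built from the defaults in `config.py` (note that the provided yaml configs override some of these values, e.g. `WARMUP_EPOCHS: 0`):
+
+```python
+from utils import WarmupCosineScheduler
+
+scheduler = WarmupCosineScheduler(learning_rate=0.01,    # TRAIN.BASE_LR
+                                  warmup_start_lr=5e-7,  # TRAIN.WARMUP_START_LR
+                                  start_lr=0.01,         # TRAIN.BASE_LR
+                                  end_lr=5e-6,           # TRAIN.END_LR
+                                  warmup_epochs=20,      # TRAIN.WARMUP_EPOCHS
+                                  total_epochs=150)      # TRAIN.NUM_EPOCHS
+
+# pass `scheduler` as the optimizer's learning_rate and call step() once per epoch
+for epoch in range(1, 151):
+    # ... train one epoch ...
+    scheduler.step()
+```
+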
+ + +## Visualization Attention Map +**(coming soon)** + diff --git a/image_classification/ConvMixer/__init__.py b/image_classification/ConvMixer/__init__.py new file mode 100644 index 00000000..e2cbd538 --- /dev/null +++ b/image_classification/ConvMixer/__init__.py @@ -0,0 +1 @@ +#init diff --git a/image_classification/ConvMixer/augment.py b/image_classification/ConvMixer/augment.py new file mode 100644 index 00000000..7a7f081c --- /dev/null +++ b/image_classification/ConvMixer/augment.py @@ -0,0 +1,285 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Augmentation""" +""" Rand Augmentation """ +# reference: RandAugment: Practical automated data augmentation with a reduced search space +# https://arxiv.org/abs/1909.13719 + +""" Auto Augmentation """ +# reference: AutoAugment: Learning Augmentation Policies from Data +# https://arxiv.org/abs/1805.09501 + +import random +import numpy as np +from PIL import Image, ImageEnhance, ImageOps + + +def auto_augment_policy_original(): + """25 types of augment policies in original paper""" + policy = [ + [('Posterize', 0.4, 8), ('Rotate', 0.6, 9)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + [('Posterize', 0.6, 7), ('Posterize', 0.6, 6)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Equalize', 0.4, 4), ('Rotate', 0.8, 8)], + [('Solarize', 0.6, 3), ('Equalize', 0.6, 7)], + [('Posterize', 0.8, 5), ('Equalize', 1.0, 2)], + [('Rotate', 0.2, 3), ('Solarize', 0.6, 8)], + [('Equalize', 0.6, 8), ('Posterize', 0.4, 6)], + [('Rotate', 0.8, 8), ('Color', 0.4, 0)], + [('Rotate', 0.4, 9), ('Equalize', 0.6, 2)], + [('Equalize', 0.0, 7), ('Equalize', 0.8, 8)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Rotate', 0.8, 8), ('Color', 1.0, 2)], + [('Color', 0.8, 8), ('Solarize', 0.8, 7)], + [('Sharpness', 0.4, 7), ('Invert', 0.6, 8)], + [('ShearX', 0.6, 5), ('Equalize', 1.0, 9)], + [('Color', 0.4, 0), ('Equalize', 0.6, 3)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + ] + policy = [[SubPolicy(*args) for args in subpolicy] for subpolicy in policy] + return policy + + +def rand_augment_policy_original(magnitude_idx=9): + """ + 14 types of augment policies in original paper + Args: + magnitude_idx: M + """ + policy = [ + ('Posterize', 1, magnitude_idx), ('Rotate', 1, magnitude_idx), + ('Solarize', 1, magnitude_idx), ('AutoContrast', 1, magnitude_idx), + ('Equalize', 1, magnitude_idx), ('Contrast', 1, magnitude_idx), + ('Color', 1, magnitude_idx), ('Invert', 1, magnitude_idx), + ('Sharpness', 1, magnitude_idx), ('Brightness', 1, magnitude_idx), + ('ShearX', 1, magnitude_idx), ('ShearY', 1, magnitude_idx), + ('TranslateX', 1, magnitude_idx), ('TranslateY', 1, magnitude_idx), + ] + policy = [SubPolicy(*args) 
for args in policy] + return policy + + +class AutoAugment(): + """Auto Augment + Randomly choose a tuple of augment ops from a list of policy + Then apply the tuple of augment ops to input image + + Examples: + policy = auto_augment_policy_original() + augment = AutoAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy): + self.policy = policy + + def __call__(self, image, policy_idx=None): + if policy_idx is None: + policy_idx = random.randint(0, len(self.policy) - 1) + + sub_policy = self.policy[policy_idx] + for op in sub_policy: + image = op(image) + return image + + +class RandAugment(): + """Rand Augment + Randomly choose N augment ops from a list of K policies + Then apply the N ops to input image + + Examples: + policy = rand_augment_policy_original(magnitude_idx) + augment = RandAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy, num_layers=2): + """ + Args: + policy: list of SubPolicy + num_layers: int + """ + self.policy = policy + self.num_layers = num_layers + + def __call__(self, image): + selected_idx = np.random.choice(len(self.policy), self.num_layers) + + for policy_idx in selected_idx: + sub_policy = self.policy[policy_idx] + image = sub_policy(image) + return image + + +class SubPolicy: + """Subpolicy + Read augment name and magnitude, apply augment with probability + Args: + op_name: str, augment operation name + prob: float, if prob > random prob, apply augment + magnitude_idx: int, index of magnitude in preset magnitude ranges + """ + + def __init__(self, op_name, prob, magnitude_idx): + # ranges of operations' magnitude + ranges = { + 'ShearX': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'ShearY': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'TranslateX': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'TranslateY': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'Rotate': np.linspace(0, 30, 10), # [-30, 30] (by random negative) + 'Color': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Posterize': np.round(np.linspace(8, 4, 10), 0).astype(np.int), # [0, 4] + 'Solarize': np.linspace(256, 0, 10), # [0, 256] + 'Contrast': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Sharpness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Brightness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'AutoContrast': [0] * 10, # no range + 'Equalize': [0] * 10, # no range + 'Invert': [0] * 10, # no range + } + + # augmentation operations + # Lambda is not pickleable for DDP + # image_ops = { + # 'ShearX': lambda image, magnitude: shear_x(image, magnitude), + # 'ShearY': lambda image, magnitude: shear_y(image, magnitude), + # 'TranslateX': lambda image, magnitude: translate_x(image, magnitude), + # 'TranslateY': lambda image, magnitude: translate_y(image, magnitude), + # 'Rotate': lambda image, magnitude: rotate(image, magnitude), + # 'AutoContrast': lambda image, magnitude: auto_contrast(image, magnitude), + # 'Invert': lambda image, magnitude: invert(image, magnitude), + # 'Equalize': lambda image, magnitude: equalize(image, magnitude), + # 'Solarize': lambda image, magnitude: solarize(image, magnitude), + # 'Posterize': lambda image, magnitude: posterize(image, magnitude), + # 'Contrast': lambda image, magnitude: contrast(image, magnitude), + # 'Color': lambda image, magnitude: color(image, magnitude), + # 'Brightness': lambda image, magnitude: brightness(image, 
magnitude), + # 'Sharpness': lambda image, magnitude: sharpness(image, magnitude), + # } + image_ops = { + 'ShearX': shear_x, + 'ShearY': shear_y, + 'TranslateX': translate_x_relative, + 'TranslateY': translate_y_relative, + 'Rotate': rotate, + 'AutoContrast': auto_contrast, + 'Invert': invert, + 'Equalize': equalize, + 'Solarize': solarize, + 'Posterize': posterize, + 'Contrast': contrast, + 'Color': color, + 'Brightness': brightness, + 'Sharpness': sharpness, + } + + self.prob = prob + self.magnitude = ranges[op_name][magnitude_idx] + self.op = image_ops[op_name] + + def __call__(self, image): + if self.prob > random.random(): + image = self.op(image, self.magnitude) + return image + + +# PIL Image transforms +# https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.transform +def shear_x(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, factor, 0, 0, 1, 0), fillcolor=fillcolor) + + +def shear_y(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, factor, 1, 0), fillcolor=fillcolor) + + +def translate_x_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, pixels, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, pixels), fillcolor=fillcolor) + + +def translate_x_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, magnitude, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, magnitude), fillcolor=fillcolor) + + +def rotate(image, magnitude): + rot = image.convert("RGBA").rotate(magnitude) + return Image.composite(rot, + Image.new('RGBA', rot.size, (128,) * 4), + rot).convert(image.mode) + + +def auto_contrast(image, magnitude=None): + return ImageOps.autocontrast(image) + + +def invert(image, magnitude=None): + return ImageOps.invert(image) + + +def equalize(image, magnitude=None): + return ImageOps.equalize(image) + + +def solarize(image, magnitude): + return ImageOps.solarize(image, magnitude) + + +def posterize(image, magnitude): + return ImageOps.posterize(image, magnitude) + + +def contrast(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Contrast(image).enhance(1 + magnitude) + + +def color(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Color(image).enhance(1 + magnitude) + + +def brightness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Brightness(image).enhance(1 + magnitude) + + +def sharpness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Sharpness(image).enhance(1 + magnitude) + diff --git 
a/image_classification/ConvMixer/config.py b/image_classification/ConvMixer/config.py new file mode 100644 index 00000000..8e54481a --- /dev/null +++ b/image_classification/ConvMixer/config.py @@ -0,0 +1,177 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Configuration + +Configuration for data, model archtecture, and training, etc. +Config can be set by .yaml file or by argparser(limited usage) + + +""" + +import os +from yacs.config import CfgNode as CN +import yaml +from paddle import nn + +_C = CN() +_C.BASE = [''] + +# data settings +_C.DATA = CN() +_C.DATA.BATCH_SIZE = 8 #1024 batch_size for single GPU +_C.DATA.BATCH_SIZE_EVAL = 8 #1024 batch_size for single GPU +_C.DATA.DATA_PATH = '/dataset/imagenet/' # path to dataset +_C.DATA.DATASET = 'imagenet2012' # dataset name +_C.DATA.IMAGE_SIZE = 224 # input image size +_C.DATA.CROP_PCT = 0.9 # input image scale ratio, scale is applied before centercrop in eval mode +_C.DATA.NUM_WORKERS = 4 # number of data loading threads +_C.DATA.IMAGENET_MEAN = [0.485, 0.456, 0.406] # [0.5, 0.5, 0.5] +_C.DATA.IMAGENET_STD = [0.229, 0.224, 0.225] # [0.5, 0.5, 0.5] + +# model settings +_C.MODEL = CN() +_C.MODEL.TYPE = 'convmixer' +_C.MODEL.NAME = 'convmixer_768_32' +_C.MODEL.RESUME = None +_C.MODEL.PRETRAINED = None +_C.MODEL.NUM_CLASSES = 1000 + +# cnn settings +_C.MODEL.CNN = CN() +_C.MODEL.CNN.DIM = 1536 +_C.MODEL.CNN.DEPTH = 20 +_C.MODEL.CNN.KERNEL_SIZE = 9 +_C.MODEL.CNN.PATCH_SIZE = 7 +_C.MODEL.CNN.ACTIVATION = 'GELU' + +# training settings +# training settings +_C.TRAIN = CN() +_C.TRAIN.LAST_EPOCH = 0 +_C.TRAIN.NUM_EPOCHS = 150 +_C.TRAIN.WARMUP_EPOCHS = 20 +_C.TRAIN.WEIGHT_DECAY = 0.05 +_C.TRAIN.BASE_LR = 0.01 +_C.TRAIN.WARMUP_START_LR = 5e-7 +_C.TRAIN.END_LR = 5e-6 +_C.TRAIN.GRAD_CLIP = 5.0 +_C.TRAIN.ACCUM_ITER = 1 +_C.TRAIN.LINEAR_SCALED_LR = None + +_C.TRAIN.LR_SCHEDULER = CN() +_C.TRAIN.LR_SCHEDULER.NAME = 'warmupcosine' +_C.TRAIN.LR_SCHEDULER.MILESTONES = "30, 60, 90" # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_EPOCHS = 30 # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_RATE = 0.1 # only used in StepLRScheduler + +_C.TRAIN.OPTIMIZER = CN() +_C.TRAIN.OPTIMIZER.NAME = 'AdamW' +_C.TRAIN.OPTIMIZER.EPS = 1e-8 +_C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999) # for adamW +_C.TRAIN.OPTIMIZER.MOMENTUM = 0.9 + +# train augmentation +_C.TRAIN.MIXUP_ALPHA = 0.8 +_C.TRAIN.CUTMIX_ALPHA = 1.0 +_C.TRAIN.CUTMIX_MINMAX = None +_C.TRAIN.MIXUP_PROB = 1.0 +_C.TRAIN.MIXUP_SWITCH_PROB = 0.5 +_C.TRAIN.MIXUP_MODE = 'batch' + +_C.TRAIN.SMOOTHING = 0.1 +_C.TRAIN.COLOR_JITTER = 0.4 +_C.TRAIN.AUTO_AUGMENT = False #'rand-m9-mstd0.5-inc1' +_C.TRAIN.RAND_AUGMENT = False + +_C.TRAIN.RANDOM_ERASE_PROB = 0.25 +_C.TRAIN.RANDOM_ERASE_MODE = 'pixel' +_C.TRAIN.RANDOM_ERASE_COUNT = 1 +_C.TRAIN.RANDOM_ERASE_SPLIT = False + +# misc +_C.SAVE = "./output" +_C.TAG = "default" +_C.SAVE_FREQ = 1 # freq to save chpt +_C.REPORT_FREQ = 50 # freq to logging info +_C.VALIDATE_FREQ = 20 # freq to do validation 
+_C.SEED = 0 +_C.EVAL = False # run evaluation only +_C.AMP = False # mix precision training +_C.LOCAL_RANK = 0 +_C.NGPUS = -1 + + +def _update_config_from_file(config, cfg_file): + config.defrost() + with open(cfg_file, 'r') as infile: + yaml_cfg = yaml.load(infile, Loader=yaml.FullLoader) + for cfg in yaml_cfg.setdefault('BASE', ['']): + if cfg: + _update_config_from_file( + config, os.path.join(os.path.dirname(cfg_file), cfg) + ) + print('merging config from {}'.format(cfg_file)) + config.merge_from_file(cfg_file) + config.freeze() + +def update_config(config, args): + """Update config by ArgumentParser + Args: + args: ArgumentParser contains options + Return: + config: updated config + """ + if args.cfg: + _update_config_from_file(config, args.cfg) + config.defrost() + if args.dataset: + config.DATA.DATASET = args.dataset + if args.batch_size: + config.DATA.BATCH_SIZE = args.batch_size + if args.image_size: + config.DATA.IMAGE_SIZE = args.image_size + if args.num_classes: + config.MODEL.NUM_CLASSES = args.num_classes + if args.data_path: + config.DATA.DATA_PATH = args.data_path + if args.output is not None: + config.SAVE = args.output + if args.ngpus: + config.NGPUS = args.ngpus + if args.eval: + config.EVAL = True + config.DATA.BATCH_SIZE_EVAL = args.batch_size + if args.pretrained: + config.MODEL.PRETRAINED = args.pretrained + if args.resume: + config.MODEL.RESUME = args.resume + if args.last_epoch: + config.TRAIN.LAST_EPOCH = args.last_epoch + if args.amp: # only during training + if config.EVAL is True: + config.AMP = False + else: + config.AMP = True + + #config.freeze() + return config + + +def get_config(cfg_file=None): + """Return a clone of config or load from yaml file""" + config = _C.clone() + if cfg_file: + _update_config_from_file(config, cfg_file) + return config diff --git a/image_classification/ConvMixer/configs/convmixer_1024_20.yaml b/image_classification/ConvMixer/configs/convmixer_1024_20.yaml new file mode 100644 index 00000000..aa80ff9d --- /dev/null +++ b/image_classification/ConvMixer/configs/convmixer_1024_20.yaml @@ -0,0 +1,22 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.96 +MODEL: + TYPE: convmixer + NAME: convmixer_1024_20 + CNN: + DIM: 1024 + DEPTH: 20 + KERNEL_SIZE: 9 + PATCH_SIZE: 14 + ACTIVATION: 'GELU' +TRAIN: + BASE_LR: 0.01 + WARMUP_EPOCHS: 0 + CUTMIX_ALPHA: 0.5 + MIXUP_ALPHA: 0.5 + RANDOM_ERASE_PROB: 0.25 + GRAD_CLIP: 1.0 + OPTIMIZER: + EPS: 1e-3 + AUTO_AUGMENT: True diff --git a/image_classification/ConvMixer/configs/convmixer_1536_20.yaml b/image_classification/ConvMixer/configs/convmixer_1536_20.yaml new file mode 100644 index 00000000..cc413aba --- /dev/null +++ b/image_classification/ConvMixer/configs/convmixer_1536_20.yaml @@ -0,0 +1,22 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.96 +MODEL: + TYPE: convmixer + NAME: convmixer_1536_20 + CNN: + DIM: 1536 + DEPTH: 20 + KERNEL_SIZE: 9 + PATCH_SIZE: 7 + ACTIVATION: 'GELU' +TRAIN: + BASE_LR: 0.01 + WARMUP_EPOCHS: 0 + CUTMIX_ALPHA: 0.5 + MIXUP_ALPHA: 0.5 + RANDOM_ERASE_PROB: 0.25 + GRAD_CLIP: 1.0 + OPTIMIZER: + EPS: 1e-3 + AUTO_AUGMENT: True diff --git a/image_classification/ConvMixer/configs/convmixer_768_32.yaml b/image_classification/ConvMixer/configs/convmixer_768_32.yaml new file mode 100644 index 00000000..53c8525f --- /dev/null +++ b/image_classification/ConvMixer/configs/convmixer_768_32.yaml @@ -0,0 +1,22 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.96 +MODEL: + TYPE: convmixer + NAME: convmixer_768_32 + CNN: + DIM: 768 + DEPTH: 32 + KERNEL_SIZE: 7 + PATCH_SIZE: 7 + ACTIVATION: 'ReLU' +TRAIN: + 
BASE_LR: 0.01 + WARMUP_EPOCHS: 0 + CUTMIX_ALPHA: 0.5 + MIXUP_ALPHA: 0.5 + RANDOM_ERASE_PROB: 0.25 + GRAD_CLIP: 1.0 + OPTIMIZER: + EPS: 1e-3 + AUTO_AUGMENT: True diff --git a/image_classification/ConvMixer/convmixer.png b/image_classification/ConvMixer/convmixer.png new file mode 100644 index 00000000..8accf133 Binary files /dev/null and b/image_classification/ConvMixer/convmixer.png differ diff --git a/image_classification/ConvMixer/convmixer.py b/image_classification/ConvMixer/convmixer.py new file mode 100644 index 00000000..4e0394d9 --- /dev/null +++ b/image_classification/ConvMixer/convmixer.py @@ -0,0 +1,72 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Implement CNN Class for ConvMixer +""" + +import paddle +import paddle.nn as nn + + +class Residual(nn.Layer): + def __init__(self, fn): + super().__init__() + self.fn = fn + + def forward(self, x): + return self.fn(x) + x + + +def ConvMixer( + dim, depth, kernel_size=9, patch_size=7, num_classes=1000, activation='GELU'): + if activation == 'ReLU': + convmixer_act = nn.ReLU() + else: + convmixer_act = nn.GELU() + return nn.Sequential( + nn.Conv2D(3, dim, kernel_size=patch_size, stride=patch_size), + convmixer_act, + nn.BatchNorm2D(dim), + *[ + nn.Sequential( + Residual( + nn.Sequential( + nn.Conv2D(dim, dim, kernel_size, groups=dim, padding="same"), + convmixer_act, + nn.BatchNorm2D(dim), + ) + ), + nn.Conv2D(dim, dim, kernel_size=1), + convmixer_act, + nn.BatchNorm2D(dim), + ) + for i in range(depth) + ], + nn.AdaptiveAvgPool2D((1, 1)), + nn.Flatten(), + nn.Linear(dim, num_classes) + ) + + +def build_convmixer(config): + model = ConvMixer( + dim=config.MODEL.CNN.DIM, + depth=config.MODEL.CNN.DEPTH, + kernel_size=config.MODEL.CNN.KERNEL_SIZE, + patch_size=config.MODEL.CNN.PATCH_SIZE, + num_classes=config.MODEL.NUM_CLASSES, + activation=config.MODEL.CNN.ACTIVATION, + ) + return model diff --git a/image_classification/ConvMixer/datasets.py b/image_classification/ConvMixer/datasets.py new file mode 100644 index 00000000..304df9a3 --- /dev/null +++ b/image_classification/ConvMixer/datasets.py @@ -0,0 +1,222 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +""" +Dataset related classes and methods for ViT training and validation +Cifar10, Cifar100 and ImageNet2012 are supported +""" + +import os +import math +from PIL import Image +from paddle.io import Dataset +from paddle.io import DataLoader +from paddle.io import DistributedBatchSampler +from paddle.vision import transforms +from paddle.vision import datasets +from paddle.vision import image_load +from augment import auto_augment_policy_original +from augment import AutoAugment +from augment import rand_augment_policy_original +from augment import RandAugment +from transforms import RandomHorizontalFlip +from random_erasing import RandomErasing + + +class ImageNet2012Dataset(Dataset): + """Build ImageNet2012 dataset + + This class gets train/val imagenet datasets, which loads transfomed data and labels. + + Attributes: + file_folder: path where imagenet images are stored + transform: preprocessing ops to apply on image + img_path_list: list of full path of images in whole dataset + label_list: list of labels of whole dataset + """ + + def __init__(self, file_folder, mode="train", transform=None): + """Init ImageNet2012 Dataset with dataset file path, mode(train/val), and transform""" + super(ImageNet2012Dataset, self).__init__() + assert mode in ["train", "val"] + self.file_folder = file_folder + self.transform = transform + self.img_path_list = [] + self.label_list = [] + + if mode == "train": + self.list_file = os.path.join(self.file_folder, "train_list.txt") + else: + self.list_file = os.path.join(self.file_folder, "val_list.txt") + + with open(self.list_file, 'r') as infile: + for line in infile: + img_path = line.strip().split()[0] + img_label = int(line.strip().split()[1]) + self.img_path_list.append(os.path.join(self.file_folder, img_path)) + self.label_list.append(img_label) + print(f'----- Imagenet2012 image {mode} list len = {len(self.label_list)}') + + def __len__(self): + return len(self.label_list) + + def __getitem__(self, index): + data = image_load(self.img_path_list[index]).convert('RGB') + data = self.transform(data) + label = self.label_list[index] + + return data, label + + +def get_train_transforms(config): + """ Get training transforms + + For training, a RandomResizedCrop is applied, then normalization is applied with + [0.5, 0.5, 0.5] mean and std. The input pixel values must be rescaled to [0, 1.] 
+ Outputs is converted to tensor + + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + + aug_op_list = [] + # STEP1: random crop and resize + aug_op_list.append( + transforms.RandomResizedCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE), + scale=(0.05, 1.0), interpolation='bicubic')) + # STEP2: auto_augment or color jitter + if config.TRAIN.AUTO_AUGMENT: + policy = auto_augment_policy_original() + auto_augment = AutoAugment(policy) + aug_op_list.append(auto_augment) + elif config.TRAIN.RAND_AUGMENT: + policy = rand_augment_policy_original() + rand_augment = RandAugment(policy) + aug_op_list.append(rand_augment) + else: + jitter = (float(config.TRAIN.COLOR_JITTER), ) * 3 + aug_op_list.append(transforms.ColorJitter(*jitter)) + # STEP3: other ops + aug_op_list.append(transforms.ToTensor()) + aug_op_list.append(transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, + std=config.DATA.IMAGENET_STD)) + # STEP4: random erasing + if config.TRAIN.RANDOM_ERASE_PROB > 0.: + random_erasing = RandomErasing(prob=config.TRAIN.RANDOM_ERASE_PROB, + mode=config.TRAIN.RANDOM_ERASE_MODE, + max_count=config.TRAIN.RANDOM_ERASE_COUNT, + num_splits=config.TRAIN.RANDOM_ERASE_SPLIT) + aug_op_list.append(random_erasing) + # Final: compose transforms and return + transforms_train = transforms.Compose(aug_op_list) + return transforms_train + + +def get_val_transforms(config): + """ Get training transforms + + For validation, image is first Resize then CenterCrop to image_size. + Then normalization is applied with [0.5, 0.5, 0.5] mean and std. + The input pixel values must be rescaled to [0, 1.] + Outputs is converted to tensor + + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + + scale_size = int(math.floor(config.DATA.IMAGE_SIZE / config.DATA.CROP_PCT)) + transforms_val = transforms.Compose([ + transforms.Resize(scale_size, interpolation='bicubic'), + transforms.CenterCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE)), + transforms.ToTensor(), + transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD), + ]) + return transforms_val + + +def get_dataset(config, mode='train'): + """ Get dataset from config and mode (train/val) + + Returns the related dataset object according to configs and mode(train/val) + + Args: + config: configs contains dataset related settings. 
see config.py for details + Returns: + dataset: dataset object + """ + + assert mode in ['train', 'val'] + if config.DATA.DATASET == "cifar10": + if mode == 'train': + dataset = datasets.Cifar10(mode=mode, transform=get_train_transforms(config)) + else: + mode = 'test' + dataset = datasets.Cifar10(mode=mode, transform=get_val_transforms(config)) + elif config.DATA.DATASET == "cifar100": + if mode == 'train': + dataset = datasets.Cifar100(mode=mode, transform=get_train_transforms(config)) + else: + mode = 'test' + dataset = datasets.Cifar100(mode=mode, transform=get_val_transforms(config)) + elif config.DATA.DATASET == "imagenet2012": + if mode == 'train': + dataset = ImageNet2012Dataset(config.DATA.DATA_PATH, + mode=mode, + transform=get_train_transforms(config)) + else: + dataset = ImageNet2012Dataset(config.DATA.DATA_PATH, + mode=mode, + transform=get_val_transforms(config)) + else: + raise NotImplementedError( + "[{config.DATA.DATASET}] Only cifar10, cifar100, imagenet2012 are supported now") + return dataset + + +def get_dataloader(config, dataset, mode='train', multi_process=False): + """Get dataloader with config, dataset, mode as input, allows multiGPU settings. + + Multi-GPU loader is implements as distributedBatchSampler. + + Args: + config: see config.py for details + dataset: paddle.io.dataset object + mode: train/val + multi_process: if True, use DistributedBatchSampler to support multi-processing + Returns: + dataloader: paddle.io.DataLoader object. + """ + + if mode == 'train': + batch_size = config.DATA.BATCH_SIZE + else: + batch_size = config.DATA.BATCH_SIZE_EVAL + + if multi_process is True: + sampler = DistributedBatchSampler(dataset, + batch_size=batch_size, + shuffle=(mode == 'train')) + dataloader = DataLoader(dataset, + batch_sampler=sampler, + num_workers=config.DATA.NUM_WORKERS) + else: + dataloader = DataLoader(dataset, + batch_size=batch_size, + num_workers=config.DATA.NUM_WORKERS, + shuffle=(mode == 'train')) + return dataloader diff --git a/image_classification/ConvMixer/droppath.py b/image_classification/ConvMixer/droppath.py new file mode 100644 index 00000000..c8fe8048 --- /dev/null +++ b/image_classification/ConvMixer/droppath.py @@ -0,0 +1,50 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Droppath, reimplement from https://github.com/yueatsprograms/Stochastic_Depth +""" + +import paddle +import paddle.nn as nn + +def drop_path(inputs, drop_prob=0., training=False): + """drop path op + Args: + input: tensor with arbitrary shape + drop_prob: float number of drop path probability, default: 0.0 + training: bool, if current mode is training, default: False + Returns: + output: output tensor after drop path + """ + # if prob is 0 or eval mode, return original input + if drop_prob == 0. 
or not training: + return inputs + keep_prob = 1 - drop_prob + keep_prob = paddle.to_tensor(keep_prob) + shape = (inputs.shape[0], ) + (1, ) * (inputs.ndim - 1) # shape=(N, 1, 1, 1) + random_tensor = keep_prob + paddle.rand(shape, dtype=inputs.dtype) + random_tensor = random_tensor.floor() # mask + output = inputs.divide(keep_prob) * random_tensor # divide is to keep same output expectation + return output + + +class DropPath(nn.Layer): + """DropPath class""" + def __init__(self, drop_prob=None): + super(DropPath, self).__init__() + self.drop_prob = drop_prob + + def forward(self, inputs): + return drop_path(inputs, self.drop_prob, self.training) diff --git a/image_classification/ConvMixer/losses.py b/image_classification/ConvMixer/losses.py new file mode 100644 index 00000000..082467a3 --- /dev/null +++ b/image_classification/ConvMixer/losses.py @@ -0,0 +1,123 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" Implement Loss functions """ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class LabelSmoothingCrossEntropyLoss(nn.Layer): + """ cross entropy loss for label smoothing + Args: + smoothing: float, smoothing rate + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, target label with shape [N] + Return: + loss: float, cross entropy loss value + """ + def __init__(self, smoothing=0.1): + super().__init__() + assert 0 <= smoothing < 1.0 + self.smoothing = smoothing + self.confidence = 1 - smoothing + + def forward(self, x, target): + log_probs = F.log_softmax(x) # [N, num_classes] + # target_index is used to get prob for each of the N samples + target_index = paddle.zeros([x.shape[0], 2], dtype='int64') # [N, 2] + target_index[:, 0] = paddle.arange(x.shape[0]) + target_index[:, 1] = target + + nll_loss = -log_probs.gather_nd(index=target_index) # index: [N] + smooth_loss = -log_probs.mean(axis=-1) + loss = self.confidence * nll_loss + self.smoothing * smooth_loss + return loss.mean() + + +class SoftTargetCrossEntropyLoss(nn.Layer): + """ cross entropy loss for soft target + Args: + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, soft target with shape [N, num_classes] + Returns: + loss: float, the mean loss value + """ + def __init__(self): + super().__init__() + + def forward(self, x, target): + loss = paddle.sum(-target * F.log_softmax(x, axis=-1), axis=-1) + return loss.mean() + + +class DistillationLoss(nn.Layer): + """Distillation loss function + This layer includes the orginal loss (criterion) and a extra + distillation loss (criterion), which computes the loss with + different type options, between current model and + a teacher model as its supervision. 
+ + Args: + base_criterion: nn.Layer, the original criterion + teacher_model: nn.Layer, the teacher model as supervision + distillation_type: str, one of ['none', 'soft', 'hard'] + alpha: float, ratio of base loss (* (1-alpha)) + and distillation loss( * alpha) + tao: float, temperature in distillation + """ + def __init__(self, + base_criterion, + teacher_model, + distillation_type, + alpha, + tau): + super().__init__() + assert distillation_type in ['none', 'soft', 'hard'] + self.base_criterion = base_criterion + self.teacher_model = teacher_model + self.type = distillation_type + self.alpha = alpha + self.tau = tau + + def forward(self, inputs, outputs, targets): + """ + Args: + inputs: tensor, the orginal model inputs + outputs: tensor, the outputs of the model + outputds_kd: tensor, the distillation outputs of the model, + this is usually obtained by a separate branch + in the last layer of the model + targets: tensor, the labels for the base criterion + """ + outputs, outputs_kd = outputs[0], outputs[1] + base_loss = self.base_criterion(outputs, targets) + if self.type == 'none': + return base_loss + + with paddle.no_grad(): + teacher_outputs = self.teacher_model(inputs) + + if self.type == 'soft': + distillation_loss = F.kl_div( + F.log_softmax(outputs_kd / self.tau, axis=1), + F.log_softmax(teacher_outputs / self.tau, axis=1), + reduction='sum') * (self.tau * self.tau) / outputs_kd.numel() + elif self.type == 'hard': + distillation_loss = F.cross_entropy(outputs_kd, teacher_outputs.argmax(axis=1)) + + loss = base_loss * (1 - self.alpha) + distillation_loss * self.alpha + return loss + + diff --git a/image_classification/ConvMixer/main_multi_gpu.py b/image_classification/ConvMixer/main_multi_gpu.py new file mode 100644 index 00000000..91ad7c7a --- /dev/null +++ b/image_classification/ConvMixer/main_multi_gpu.py @@ -0,0 +1,581 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +"""ConvMixer training/validation using multiple GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +import paddle.distributed as dist +from datasets import get_dataloader +from datasets import get_dataset +from utils import AverageMeter +from utils import WarmupCosineScheduler +from config import get_config +from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from convmixer import build_convmixer as build_model + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('ConvMixer') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + mixup_fn=None, + amp=False, + local_logger=None, + master_logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + train_loss_meter.avg + train_acc_meter.avg + train_time + """ + model.train() + train_loss_meter = AverageMeter() + train_acc_meter = AverageMeter() + master_train_loss_meter = AverageMeter() + master_train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + for batch_id, data in enumerate(dataloader): + image = data[0] + label = 
data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() + + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = paddle.to_tensor(image.shape[0]) + + # sync from other gpus for overall loss and acc + master_loss = loss.clone() + master_acc = acc.clone() + master_batch_size = batch_size.clone() + dist.all_reduce(master_loss) + dist.all_reduce(master_acc) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc = master_acc / dist.get_world_size() + master_train_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_train_acc_meter.update(master_acc.numpy()[0], master_batch_size.numpy()[0]) + + train_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + train_acc_meter.update(acc.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_train_loss_meter.avg:.4f}, " + + f"Avg Acc: {master_train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return (train_loss_meter.avg, + train_acc_meter.avg, + master_train_loss_meter.avg, + master_train_acc_meter.avg, + train_time) + + +def validate(dataloader, + model, + criterion, + total_batch, + debug_steps=100, + local_logger=None, + master_logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + total_epoch: int, total num of epoch, for logging + debug_steps: int, num of iters to log info, default: 100 + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + master_val_loss_meter.avg: float, average loss on all processes/gpus + master_val_acc1_meter.avg: float, average top1 accuracy on all processes/gpus + master_val_acc5_meter.avg: float, average top5 accuracy on all processes/gpus + val_time: float, validation time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = AverageMeter() + 
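+    # the master_* meters below aggregate loss/accuracy over all GPUs via dist.all_reduce,
+    # while the val_* meters above track only the current process/GPU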
master_val_loss_meter = AverageMeter() + master_val_acc1_meter = AverageMeter() + master_val_acc5_meter = AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = paddle.to_tensor(image.shape[0]) + + master_loss = loss.clone() + master_acc1 = acc1.clone() + master_acc5 = acc5.clone() + master_batch_size = batch_size.clone() + + dist.all_reduce(master_loss) + dist.all_reduce(master_acc1) + dist.all_reduce(master_acc5) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc1 = master_acc1 / dist.get_world_size() + master_acc5 = master_acc5 / dist.get_world_size() + + master_val_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc1_meter.update(master_acc1.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc5_meter.update(master_acc5.numpy()[0], master_batch_size.numpy()[0]) + + val_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + val_acc1_meter.update(acc1.numpy()[0], batch_size.numpy()[0]) + val_acc5_meter.update(acc5.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {master_val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {master_val_acc5_meter.avg:.4f}") + val_time = time.time() - time_st + return (val_loss_meter.avg, + val_acc1_meter.avg, + val_acc5_meter.avg, + master_val_loss_meter.avg, + master_val_acc1_meter.avg, + master_val_acc5_meter.avg, + val_time) + + +def main_worker(*args): + # STEP 0: Preparation + config = args[0] + dist.init_parallel_env() + last_epoch = config.TRAIN.LAST_EPOCH + world_size = dist.get_world_size() + local_rank = dist.get_rank() + seed = config.SEED + local_rank + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + # logger for each process/gpu + local_logger = get_logger( + filename=os.path.join(config.SAVE, 'log_{}.txt'.format(local_rank)), + logger_name='local_logger') + # overall logger + if local_rank == 0: + master_logger = get_logger( + filename=os.path.join(config.SAVE, 'log.txt'), + logger_name='master_logger') + master_logger.info(f'\n{config}') + else: + master_logger = None + local_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + if local_rank == 0: + master_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + + # STEP 1: Create model + model = build_model(config) + model = paddle.DataParallel(model) + + # STEP 2: Create train and val dataloader + dataset_train, dataset_val = args[1], args[2] + # Create training dataloader + if not config.EVAL: + dataloader_train = get_dataloader(config, dataset_train, 'train', True) + total_batch_train = len(dataloader_train) + local_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + if local_rank == 0: + master_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + # 
Create validation dataloader + dataloader_val = get_dataloader(config, dataset_val, 'test', True) + total_batch_val = len(dataloader_val) + local_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + if local_rank == 0: + master_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH, + ) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + local_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if 
scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip, + #apply_decay_param_fun=get_exclude_from_weight_decay_fn([ + # 'absolute_pos_embed', 'relative_position_bias_table']), + ) + else: + local_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 6: Load pretrained model / load resumt model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + model.set_dict(model_state) + local_logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + if local_rank == 0: + master_logger.info( + f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') + optimizer.set_state_dict(opt_state) + local_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + if local_rank == 0: + master_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + + # STEP 7: Validation (eval mode) + if config.EVAL: + local_logger.info('----- Start Validating') + if local_rank == 0: + master_logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 8: Start training and validation (train mode) + local_logger.info(f"Start training from epoch {last_epoch+1}.") + if local_rank == 0: + master_logger.info(f"Start training from epoch {last_epoch+1}.") + for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + local_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + if local_rank == 0: + master_logger.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") + train_loss, train_acc, avg_loss, avg_acc, train_time = train( + dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=total_batch_train, + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + mixup_fn=mixup_fn, + amp=config.AMP, + local_logger=local_logger, + master_logger=master_logger) + + scheduler.step() + + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {avg_loss:.4f}, " + + f"Train Acc: {avg_acc:.4f}, " + + f"time: {train_time:.2f}") + + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + local_logger.info(f'----- Validation after Epoch: {epoch}') + if local_rank == 0: + master_logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if local_rank == 0: + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + master_logger.info(f"----- Save model: {model_path}.pdparams") + master_logger.info(f"----- Save optim: {model_path}.pdopt") + + +def main(): + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + + # get dataset and start DDP + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + else: + dataset_train = None + dataset_val = get_dataset(config, mode='val') + config.NGPUS = len(paddle.static.cuda_places()) if config.NGPUS == -1 else config.NGPUS + dist.spawn(main_worker, args=(config, dataset_train, dataset_val, ), nprocs=config.NGPUS) + + +if __name__ == "__main__": + main() diff --git a/image_classification/ConvMixer/main_single_gpu.py b/image_classification/ConvMixer/main_single_gpu.py new file mode 100644 index 00000000..92e246e2 --- /dev/null +++ b/image_classification/ConvMixer/main_single_gpu.py @@ -0,0 +1,423 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""ConvMixer training/validation using single GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from datasets import get_dataloader +from datasets import get_dataset +from utils import AverageMeter +from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn +from config import get_config +from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from convmixer import build_convmixer as build_model + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('ConvMixer') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + mixup_fn=None, + amp=False, + logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False 
+ logger: logger for logging, default: None + Returns: + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + train_time: float, training time + """ + model.train() + train_loss_meter = AverageMeter() + train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() + + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = image.shape[0] + train_loss_meter.update(loss.numpy()[0], batch_size) + train_acc_meter.update(acc.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return train_loss_meter.avg, train_acc_meter.avg, train_time + + +def validate(dataloader, model, criterion, total_batch, debug_steps=100, logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + logger: logger for logging, default: None + Returns: + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + val_time: float, valitaion time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = image.shape[0] + val_loss_meter.update(loss.numpy()[0], batch_size) + val_acc1_meter.update(acc1.numpy()[0], batch_size) + val_acc5_meter.update(acc5.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: 
{val_acc5_meter.avg:.4f}") + + val_time = time.time() - time_st + return val_loss_meter.avg, val_acc1_meter.avg, val_acc5_meter.avg, val_time + + +def main(): + # STEP 0: Preparation + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + last_epoch = config.TRAIN.LAST_EPOCH + seed = config.SEED + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + logger = get_logger(filename=os.path.join(config.SAVE, 'log.txt')) + logger.info(f'\n{config}') + + # STEP 1: Create model + model = build_model(config) + + # STEP 2: Create train and val dataloader + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + dataloader_train = get_dataloader(config, dataset_train, 'train', False) + dataset_val = get_dataset(config, mode='val') + dataloader_val = get_dataloader(config, dataset_val, 'val', False) + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + 
last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip) + else: + logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 6: Load pretrained model or load resume model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + model.set_dict(model_state) + logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME + '.pdopt') + optimizer.set_state_dict(opt_state) + logger.info( + f"----- Resume: Load model and optmizer from {config.MODEL.RESUME}") + + # STEP 7: Validation (eval mode) + if config.EVAL: + logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 8: Start training and validation (train mode) + logger.info(f"Start training from epoch {last_epoch+1}.") + for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + logger.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") + train_loss, train_acc, train_time = train(dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=len(dataloader_train), + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + mixup_fn=mixup_fn, + amp=config.AMP, + logger=logger) + scheduler.step() + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + logger.info(f"----- Save model: {model_path}.pdparams") + logger.info(f"----- Save optim: {model_path}.pdopt") + + +if __name__ == "__main__": + main() diff --git a/image_classification/ConvMixer/mixup.py b/image_classification/ConvMixer/mixup.py new file mode 100644 index 00000000..1d2db493 --- /dev/null +++ b/image_classification/ConvMixer/mixup.py @@ -0,0 +1,225 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""mixup and cutmix for batch data""" +import numpy as np +import paddle + + +def rand_bbox(image_shape, lam, count=None): + """ CutMix bbox by lam value + Generate 1 random bbox by value lam. lam is the cut size rate. + The cut_size is computed by sqrt(1-lam) * image_size. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + count: int, number of bbox to generate + """ + image_h, image_w = image_shape[-2:] + cut_rate = np.sqrt(1. 
- lam) + cut_h = int(cut_rate * image_h) + cut_w = int(cut_rate * image_w) + + # get random bbox center + cy = np.random.randint(0, image_h, size=count) + cx = np.random.randint(0, image_w, size=count) + + # get bbox coords + bbox_x1 = np.clip(cx - cut_w // 2, 0, image_w) + bbox_y1 = np.clip(cy - cut_h // 2, 0, image_h) + bbox_x2 = np.clip(cx + cut_w // 2, 0, image_w) + bbox_y2 = np.clip(cy + cut_h // 2, 0, image_h) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # while in pytorch, it will return [] tensor + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def rand_bbox_minmax(image_shape, minmax, count=None): + """ CutMix bbox by min and max value + Generate 1 random bbox by min and max percentage values. + Minmax is a tuple/list of min and max percentage vlaues + applied to the image width and height. + + Args: + image_shape: tuple/list, image height and width + minmax: tuple/list, min and max percentage values of image size + count: int, number of bbox to generate + """ + assert len(minmax) == 2 + image_h, image_w = image_shape[-2:] + min_ratio = minmax[0] + max_ratio = minmax[1] + cut_h = np.random.randint(int(image_h * min_ratio), int(image_h * max_ratio), size=count) + cut_w = np.random.randint(int(image_w * min_ratio), int(image_w * max_ratio), size=count) + + bbox_x1 = np.random.randint(0, image_w - cut_w, size=count) + bbox_y1 = np.random.randint(0, image_h - cut_h, size=count) + bbox_x2 = bbox_x1 + cut_w + bbox_y2 = bbox_y1 + cut_h + + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def cutmix_generate_bbox_adjust_lam(image_shape, lam, minmax=None, correct_lam=True, count=None): + """Generate bbox and apply correction for lambda + If the mimmax is None, apply the standard cutmix by lam value, + If the minmax is set, apply the cutmix by min and max percentage values. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + minmax: tuple/list, min and max percentage values of image size + correct_lam: bool, if True, correct the lam value by the generated bbox + count: int, number of bbox to generate + """ + if minmax is not None: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox_minmax(image_shape, minmax, count) + else: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox(image_shape, lam, count) + + if correct_lam or minmax is not None: + image_h, image_w = image_shape[-2:] + bbox_area = (bbox_y2 - bbox_y1) * (bbox_x2 - bbox_x1) + lam = 1. - bbox_area / float(image_h * image_w) + return (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam + + +def one_hot(x, num_classes, on_value=1., off_value=0.): + """ Generate one-hot vector for label smoothing + Args: + x: tensor, contains label/class indices + num_classes: int, num of classes (len of the one-hot vector) + on_value: float, the vector value at label index, default=1. + off_value: float, the vector value at non-label indices, default=0. + Returns: + one_hot: tensor, tensor with on value at label index and off value + at non-label indices. + """ + x = x.reshape_([-1, 1]) + x_smoothed = paddle.full((x.shape[0], num_classes), fill_value=off_value) + for i in range(x.shape[0]): + x_smoothed[i, x[i]] = on_value + return x_smoothed + + +def mixup_one_hot(label, num_classes, lam=1., smoothing=0.): + """ mixup and label smoothing in batch + label smoothing is firstly applied, then + mixup is applied by mixing the bacth and its flip, + with a mixup rate. 
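+    i.e., mixed = lam * one_hot(label) + (1 - lam) * one_hot(label.flip(axis=[0])),
+    where one_hot already includes the label smoothing (this matches the value returned below).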
+ + Args: + label: tensor, label tensor with shape [N], contains the class indices + num_classes: int, num of all classes + lam: float, mixup rate, default=1.0 + smoothing: float, label smoothing rate + """ + off_value = smoothing / num_classes + on_value = 1. - smoothing + off_value + y1 = one_hot(label, num_classes, on_value, off_value) + y2 = one_hot(label.flip(axis=[0]), num_classes, on_value, off_value) + return y2 * (1 - lam) + y1 * lam + + +class Mixup: + """Mixup class + Args: + mixup_alpha: float, mixup alpha for beta distribution, default=1.0, + cutmix_alpha: float, cutmix alpha for beta distribution, default=0.0, + cutmix_minmax: list/tuple, min and max value for cutmix ratio, default=None, + prob: float, if random prob < prob, do not use mixup, default=1.0, + switch_prob: float, prob of switching mixup and cutmix, default=0.5, + mode: string, mixup up, now only 'batch' is supported, default='batch', + correct_lam: bool, if True, apply correction of lam, default=True, + label_smoothing: float, label smoothing rate, default=0.1, + num_classes: int, num of classes, default=1000 + """ + def __init__(self, + mixup_alpha=1.0, + cutmix_alpha=0.0, + cutmix_minmax=None, + prob=1.0, + switch_prob=0.5, + mode='batch', + correct_lam=True, + label_smoothing=0.1, + num_classes=1000): + self.mixup_alpha = mixup_alpha + self.cutmix_alpha = cutmix_alpha + self.cutmix_minmax = cutmix_minmax + if cutmix_minmax is not None: + assert len(cutmix_minmax) == 2 + self.cutmix_alpha = 1.0 + self.mix_prob = prob + self.switch_prob = switch_prob + self.label_smoothing = label_smoothing + self.num_classes = num_classes + self.mode = mode + self.correct_lam = correct_lam + assert mode == 'batch', 'Now only batch mode is supported!' + + def __call__(self, x, target): + assert x.shape[0] % 2 == 0, "Batch size should be even" + lam = self._mix_batch(x) + target = mixup_one_hot(target, self.num_classes, lam, self.label_smoothing) + return x, target + + def get_params(self): + """Decide to use cutmix or regular mixup by sampling and + sample lambda for mixup + """ + lam = 1. + use_cutmix = False + use_mixup = np.random.rand() < self.mix_prob + if use_mixup: + if self.mixup_alpha > 0. and self.cutmix_alpha > 0.: + use_cutmix = np.random.rand() < self.switch_prob + alpha = self.cutmix_alpha if use_cutmix else self.mixup_alpha + lam_mix = np.random.beta(alpha, alpha) + elif self.mixup_alpha == 0. and self.cutmix_alpha > 0.: + use_cutmix=True + lam_mix = np.random.beta(self.cutmix_alpha, self.cutmix_alpha) + elif self.mixup_alpha > 0. 
and self.cutmix_alpha == 0.: + lam_mix = np.random.beta(self.mixup_alpha, self.mixup_alpha) + else: + raise ValueError('mixup_alpha and cutmix_alpha cannot be all 0') + lam = float(lam_mix) + return lam, use_cutmix + + def _mix_batch(self, x): + """mixup/cutmix by adding batch data and its flipped version""" + lam, use_cutmix = self.get_params() + if lam == 1.: + return lam + if use_cutmix: + (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam = cutmix_generate_bbox_adjust_lam( + x.shape, + lam, + minmax=self.cutmix_minmax, + correct_lam=self.correct_lam) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # but in pytorch, it will return [] tensor without errors + if int(bbox_x1) != int(bbox_x2) and int(bbox_y1) != int(bbox_y2): + x[:, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] = x.flip(axis=[0])[ + :, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] + else: + x_flipped = x.flip(axis=[0]) + x_flipped = x_flipped * (1 - lam) + x.set_value(x * (lam) + x_flipped) + return lam diff --git a/image_classification/ConvMixer/random_erasing.py b/image_classification/ConvMixer/random_erasing.py new file mode 100644 index 00000000..31eea465 --- /dev/null +++ b/image_classification/ConvMixer/random_erasing.py @@ -0,0 +1,118 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Random Erasing for image tensor""" + +import random +import math +import paddle + + +def _get_pixels(per_pixel, rand_color, patch_size, dtype="float32"): + if per_pixel: + return paddle.normal(shape=patch_size).astype(dtype) + if rand_color: + return paddle.normal(shape=(patch_size[0], 1, 1)).astype(dtype) + return paddle.zeros((patch_size[0], 1, 1)).astype(dtype) + + +class RandomErasing(object): + """ + Args: + prob: probability of performing random erasing + min_area: Minimum percentage of erased area wrt input image area + max_area: Maximum percentage of erased area wrt input image area + min_aspect: Minimum aspect ratio of earsed area + max_aspect: Maximum aspect ratio of earsed area + mode: pixel color mode, in ['const', 'rand', 'pixel'] + 'const' - erase block is constant valued 0 for all channels + 'rand' - erase block is valued random color (same per-channel) + 'pixel' - erase block is vauled random color per pixel + min_count: Minimum # of ereasing blocks per image. + max_count: Maximum # of ereasing blocks per image. 
Area per box is scaled by count + per-image count is randomly chosen between min_count to max_count + """ + def __init__(self, prob=0.5, min_area=0.02, max_area=1/3, min_aspect=0.3, max_aspect=None, + mode='const', min_count=1, max_count=None, num_splits=0): + self.prob = prob + self.min_area = min_area + self.max_area = max_area + max_aspect = max_aspect or 1 / min_aspect + self.log_aspect_ratio = (math.log(min_aspect), math.log(max_aspect)) + self.min_count = min_count + self.max_count = max_count or min_count + self.num_splits = num_splits + mode = mode.lower() + self.rand_color = False + self.per_pixel = False + if mode == "rand": + self.rand_color = True + elif mode == "pixel": + self.per_pixel = True + else: + assert not mode or mode == "const" + + def _erase(self, img, chan, img_h, img_w, dtype): + if random.random() > self.prob: + return + area = img_h * img_w + count = self.min_count if self.min_count == self.max_count else \ + random.randint(self.min_count, self.max_count) + for _ in range(count): + for attempt in range(10): + target_area = random.uniform(self.min_area, self.max_area) * area / count + aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio)) + h = int(round(math.sqrt(target_area * aspect_ratio))) + w = int(round(math.sqrt(target_area / aspect_ratio))) + if w < img_w and h < img_h: + top = random.randint(0, img_h - h) + left = random.randint(0, img_w - w) + img[:, top:top+h, left:left+w] = _get_pixels( + self.per_pixel, self.rand_color, (chan, h, w), + dtype=dtype) + break + + def __call__(self, input): + if len(input.shape) == 3: + self._erase(input, *input.shape, input.dtype) + else: + batch_size, chan, img_h, img_w = input.shape + batch_start = batch_size // self.num_splits if self.num_splits > 1 else 0 + for i in range(batch_start, batch_size): + self._erase(input[i], chan, img_h, img_w, input.dtype) + return input + + + +#def main(): +# re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='rand') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='const') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='pixel') +# import PIL.Image as Image +# import numpy as np +# paddle.set_device('cpu') +# img = paddle.to_tensor(np.asarray(Image.open('./lenna.png'))).astype('float32') +# img = img / 255.0 +# img = paddle.transpose(img, [2, 0, 1]) +# new_img = re(img) +# new_img = new_img * 255.0 +# new_img = paddle.transpose(new_img, [1, 2, 0]) +# new_img = new_img.cpu().numpy() +# new_img = Image.fromarray(new_img.astype('uint8')) +# new_img.save('./res.png') +# +# +# +#if __name__ == "__main__": +# main() diff --git a/image_classification/ConvMixer/run_eval.sh b/image_classification/ConvMixer/run_eval.sh new file mode 100644 index 00000000..9c2dfe50 --- /dev/null +++ b/image_classification/ConvMixer/run_eval.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ +-cfg='./configs/convmixer_1536_20.yaml' \ +-dataset='imagenet2012' \ +-batch_size=32 \ +-data_path='/dataset/imagenet' \ +-eval \ +-pretrained='./convmixer_1536_20' diff --git a/image_classification/ConvMixer/run_eval_multi.sh b/image_classification/ConvMixer/run_eval_multi.sh new file mode 100644 index 00000000..5c49d3d1 --- /dev/null +++ b/image_classification/ConvMixer/run_eval_multi.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=4,5,6,7 \ +python main_multi_gpu.py \ +-cfg='./configs/convmixer_768_32.yaml' \ +-dataset='imagenet2012' \ +-batch_size=32 \ +-data_path='/dataset/imagenet' \ +-eval \ +-pretrained='./convmixer_768_32' diff --git 
a/image_classification/ConvMixer/run_train.sh b/image_classification/ConvMixer/run_train.sh new file mode 100644 index 00000000..5a998596 --- /dev/null +++ b/image_classification/ConvMixer/run_train.sh @@ -0,0 +1,6 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ +-cfg='./configs/convmixer_1024_20.yaml' \ +-dataset='imagenet2012' \ +-batch_size=4 \ +-data_path='/dataset/imagenet' \ diff --git a/image_classification/ConvMixer/run_train_multi.sh b/image_classification/ConvMixer/run_train_multi.sh new file mode 100644 index 00000000..abade786 --- /dev/null +++ b/image_classification/ConvMixer/run_train_multi.sh @@ -0,0 +1,6 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ +-cfg='./configs/convmixer_1024_20.yaml' \ +-dataset='imagenet2012' \ +-batch_size=4 \ +-data_path='/dataset/imagenet' \ diff --git a/image_classification/ConvMixer/stat.py b/image_classification/ConvMixer/stat.py new file mode 100644 index 00000000..efc38267 --- /dev/null +++ b/image_classification/ConvMixer/stat.py @@ -0,0 +1,64 @@ +import os +import glob +import paddle +from config import get_config +from convmixer import build_convmixer as build_model + +def count_gelu(layer, inputs, output): + activation_flops = 8 + x = inputs[0] + num = x.numel() + layer.total_ops += num * activation_flops + + +def count_softmax(layer, inputs, output): + softmax_flops = 5 # max/substract, exp, sum, divide + x = inputs[0] + num = x.numel() + layer.total_ops += num * softmax_flops + + +def count_layernorm(layer, inputs, output): + layer_norm_flops = 5 # get mean (sum), get variance (square and sum), scale(multiply) + x = inputs[0] + num = x.numel() + layer.total_ops += num * layer_norm_flops + + +for cfg in glob.glob('./configs/*.yaml'): + #cfg = './configs/pvtv2_b0.yaml' + #input_size = (1, 3, 512, 512) + #input_size = (1, 3, 448, 448) + #input_size = (1, 3, 384, 384) + input_size = (1, 3, 224, 224) + config = get_config(cfg) + model = build_model(config) + + custom_ops = {paddle.nn.GELU: count_gelu, + paddle.nn.LayerNorm: count_layernorm, + paddle.nn.Softmax: count_softmax, + } + print(os.path.basename(cfg)) + paddle.flops(model, + input_size=input_size, + custom_ops=custom_ops, + print_detail=False) + + +#for cfg in glob.glob('./configs/*.yaml'): +# #cfg = './configs/swin_base_patch4_window7_224.yaml' +# input_size = (1, 3, int(cfg[-8:-5]), int(cfg[-8:-5])) +# config = get_config(cfg) +# model = build_model(config) +# +# +# custom_ops = {paddle.nn.GELU: count_gelu, +# paddle.nn.LayerNorm: count_layernorm, +# paddle.nn.Softmax: count_softmax, +# } +# print(os.path.basename(cfg)) +# paddle.flops(model, +# input_size=input_size, +# custom_ops=custom_ops, +# print_detail=False) +# print('-----------') diff --git a/image_classification/ConvMixer/tests/__init__.py b/image_classification/ConvMixer/tests/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/image_classification/ConvMixer/tests/test_onecyclelr.py b/image_classification/ConvMixer/tests/test_onecyclelr.py new file mode 100644 index 00000000..37fba8e9 --- /dev/null +++ b/image_classification/ConvMixer/tests/test_onecyclelr.py @@ -0,0 +1,55 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import unittest +from utils import OneCycleLRScheduler + + +class TestUM(unittest.TestCase): + + def setUp(self): + pass + + def test_one_cycle_lr(self): + total_steps = 1000 + max_lr = 1e-5 + scheduler = OneCycleLRScheduler(learning_rate=1, max_lr=max_lr, total_steps=total_steps) + lr_list = list() + for i in range(total_steps): + lr_list.append(scheduler.get_lr()) + scheduler.step() + + self.assertAlmostEqual(max_lr, lr_list[int(total_steps * 0.4)], places=8) + self.assertAlmostEqual(max_lr / 20, lr_list[0], places=8) + self.assertAlmostEqual(max_lr / 20, lr_list[int(total_steps * 0.8)], places=8) + self.assertAlmostEqual(0, lr_list[-1], places=8) + + def test_one_cycle_lr_with_last_epoch(self): + total_steps = 1000 + max_lr = 1e-5 + last_epoch = 399 + scheduler = OneCycleLRScheduler(learning_rate=1, max_lr=max_lr, total_steps=total_steps, + last_epoch=last_epoch) + lr_list = list() + for i in range(total_steps - last_epoch): + lr_list.append(scheduler.get_lr()) + scheduler.step() + + self.assertAlmostEqual(0, lr_list[-1], places=8) + self.assertAlmostEqual(max_lr, lr_list[0], places=8) + self.assertAlmostEqual(max_lr / 20, lr_list[int(total_steps * 0.4)], places=8) + + +if __name__ == '__main__': + unittest.main() diff --git a/image_classification/ConvMixer/transforms.py b/image_classification/ConvMixer/transforms.py new file mode 100644 index 00000000..5a046912 --- /dev/null +++ b/image_classification/ConvMixer/transforms.py @@ -0,0 +1,14 @@ +import random +import paddle +import paddle.nn +import paddle.vision.transforms as T + + +class RandomHorizontalFlip(): + def __init__(self, p=0.5): + self.p = p + + def __call__(self, image): + if random.random() < self.p: + return T.hflip(image) + return image diff --git a/image_classification/ConvMixer/utils.py b/image_classification/ConvMixer/utils.py new file mode 100644 index 00000000..44800527 --- /dev/null +++ b/image_classification/ConvMixer/utils.py @@ -0,0 +1,120 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +"""utils for ViT + +Contains AverageMeter for monitoring, get_exclude_from_decay_fn for training +and WarmupCosineScheduler for training + +""" + +import math +from paddle.optimizer.lr import LRScheduler + + +class AverageMeter(): + """ Meter for monitoring losses""" + def __init__(self): + self.avg = 0 + self.sum = 0 + self.cnt = 0 + self.reset() + + def reset(self): + """reset all values to zeros""" + self.avg = 0 + self.sum = 0 + self.cnt = 0 + + def update(self, val, n=1): + """update avg by val and n, where val is the avg of n values""" + self.sum += val * n + self.cnt += n + self.avg = self.sum / self.cnt + + + +def get_exclude_from_weight_decay_fn(exclude_list=[]): + """ Set params with no weight decay during the training + + For certain params, e.g., positional encoding in ViT, weight decay + may not needed during the learning, this method is used to find + these params. + + Args: + exclude_list: a list of params names which need to exclude + from weight decay. + Returns: + exclude_from_weight_decay_fn: a function returns True if param + will be excluded from weight decay + """ + if len(exclude_list) == 0: + exclude_from_weight_decay_fn = None + else: + def exclude_fn(param): + for name in exclude_list: + if param.endswith(name): + return False + return True + exclude_from_weight_decay_fn = exclude_fn + return exclude_from_weight_decay_fn + + +class WarmupCosineScheduler(LRScheduler): + """Warmup Cosine Scheduler + + First apply linear warmup, then apply cosine decay schedule. + Linearly increase learning rate from "warmup_start_lr" to "start_lr" over "warmup_epochs" + Cosinely decrease learning rate from "start_lr" to "end_lr" over remaining + "total_epochs - warmup_epochs" + + Attributes: + learning_rate: the starting learning rate (without warmup), not used here! + warmup_start_lr: warmup starting learning rate + start_lr: the starting learning rate (without warmup) + end_lr: the ending learning rate after whole loop + warmup_epochs: # of epochs for warmup + total_epochs: # of total epochs (include warmup) + """ + def __init__(self, + learning_rate, + warmup_start_lr, + start_lr, + end_lr, + warmup_epochs, + total_epochs, + cycles=0.5, + last_epoch=-1, + verbose=False): + """init WarmupCosineScheduler """ + self.warmup_epochs = warmup_epochs + self.total_epochs = total_epochs + self.warmup_start_lr = warmup_start_lr + self.start_lr = start_lr + self.end_lr = end_lr + self.cycles = cycles + super(WarmupCosineScheduler, self).__init__(learning_rate, last_epoch, verbose) + + def get_lr(self): + """ return lr value """ + if self.last_epoch < self.warmup_epochs: + val = (self.start_lr - self.warmup_start_lr) * float( + self.last_epoch)/float(self.warmup_epochs) + self.warmup_start_lr + return val + + progress = float(self.last_epoch - self.warmup_epochs) / float( + max(1, self.total_epochs - self.warmup_epochs)) + val = max(0.0, 0.5 * (1. + math.cos(math.pi * float(self.cycles) * 2.0 * progress))) + val = max(0.0, val * (self.start_lr - self.end_lr) + self.end_lr) + return val diff --git a/image_classification/CrossViT/README.md b/image_classification/CrossViT/README.md new file mode 100755 index 00000000..83a13329 --- /dev/null +++ b/image_classification/CrossViT/README.md @@ -0,0 +1,174 @@ +# CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification, [arxiv](https://arxiv.org/abs/2103.14899) + +PaddlePaddle training/validation code and pretrained models for **CrossViT**. + +The official pytorch implementation is [here](https://github.com/IBM/CrossViT). 
+
+This implementation is developed by [PPViT](https://github.com/BR-IDL/PaddleViT).
+
+*(Figure: CrossViT Model Overview)*
+ +### Update +- Update (2021-09-27): Model FLOPs and # params are uploaded. +- Update (2021-09-16): Code is released and ported weights are uploaded. +- Update (2021-09-22): Support more models eval. + +## Models Zoo +| Model | Acc@1 | Acc@5 | #Params | FLOPs | Image Size | Crop_pct | Interpolation | Link | +|-------------------------------|-------|-------|---------|--------|------------|----------|---------------|--------------| +| cross_vit_tiny_224 | 73.20 | 91.90 | 6.9M | 1.3G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1ILTVwQtetcb_hdRjki2ZbR26p-8j5LUp/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1byeUsM34_gFL0jVr5P5GAw)(scvb) | +| cross_vit_small_224 | 81.01 | 95.33 | 26.7M | 5.2G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1ViOJiwbOxTbk1V2Go7PlCbDbWPbjWPJH/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1I9CrpdPU_D5LniqIVBoIPQ)(32us) | +| cross_vit_base_224 | 82.12 | 95.87 | 104.7M | 20.2G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1vTorkc63O4JE9cYUMHBRxFMDOFoC-iK7/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1TR_aBHQ2n1J0RgHFoVh_bw)(jj2q) | +| cross_vit_9_224 | 73.78 | 91.93 | 8.5M | 1.6G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1UCX9_mJSx2kDAmEd_xDXyd4e6-Mg3RPf/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1M8r5vqMHJ-rFwBoW1uL2qQ)(mjcb) | +| cross_vit_15_224 | 81.51 | 95.72 | 27.4M | 5.2G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1HwkLWdz6A3Nz-dVbw4ZUcCkxUbPXgHwM/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1wiO_Gjk4fvSq08Ud8xKwVw)(n55b) | +| cross_vit_18_224 | 82.29 | 96.00 | 43.1M | 8.3G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1C4b_a_6ia8NCEXSUEMDdCEFzedr0RB_m/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1w7VJ7DNqq6APuY7PdlKEjA)(xese) | +| cross_vit_9_dagger_224 | 76.92 | 93.61 | 8.7M | 1.7G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1_cXQ0M8Hr9UyugZk07DrsBl8dwwCA6br/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1F1tRSaG4EfCV_WiTEwXxBw)(58ah) | +| cross_vit_15_dagger_224 | 82.23 | 95.93 | 28.1M | 5.6G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1cCgBoozh2WFtSz42LwEUUPPyC5KmkAFg/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1xJ4P2zy3r9RcNFSMtzvZgg)(qwup) | +| cross_vit_18_dagger_224 | 82.51 | 96.03 | 44.1M | 8.7G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1sdAbWxKL5k3QIo1zdgHzasIOtpy_Ogpw/view?usp=sharing)/[baidu](https://pan.baidu.com/s/15qYHgt0iRxdhtXoC_ct2Jg)(qtw4) | +| cross_vit_15_dagger_384 | 83.75 | 96.75 | 28.1M | 16.4G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/12LQjYbs9-LyrY1YeRt46x9BTB3NJuhpJ/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1d-BAm03azLP_CyEHF3c7ZQ)(w71e) | +| cross_vit_18_dagger_384 | 84.17 | 96.82 | 44.1M | 25.8G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1CeGwB6Tv0oL8QtL0d7Ar-d02Lg_PqACr/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1l_6PTldZ3IDB7XWgjM6LhA)(99b6) | +| + + +> *The results are evaluated on ImageNet2012 validation set. 
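+
+The `Crop_pct` and `Interpolation` columns describe the eval-time preprocessing behind the reported accuracies: they typically mean the image is resized to `image_size / crop_pct` with the listed interpolation and then center-cropped to `image_size`. Below is a minimal sketch of such a validation transform; the concrete numbers are taken from the `cross_vit_base_224` row above as an example, and the mean/std follow the ImageNet defaults in `config.py`:
+
+```python
+import paddle.vision.transforms as T
+
+image_size = 224                          # "Image Size" column
+crop_pct = 0.875                          # "Crop_pct" column
+scale_size = int(image_size / crop_pct)   # 256 = 224 / 0.875
+
+val_transforms = T.Compose([
+    T.Resize(scale_size, interpolation='bicubic'),  # "Interpolation" column
+    T.CenterCrop(image_size),
+    T.ToTensor(),
+    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
+])
+```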
+
+## Notebooks
+We provide a few notebooks in AI Studio to help you get started:
+
+**\*(coming soon)\***
+
+
+## Requirements
+- Python>=3.6
+- yaml>=0.2.5
+- [PaddlePaddle](https://www.paddlepaddle.org.cn/documentation/docs/en/install/index_en.html)>=2.1.0
+- [yacs](https://github.com/rbgirshick/yacs)>=0.1.8
+
+## Data
+ImageNet2012 dataset is used in the following folder structure:
+```
+│imagenet/
+├──train/
+│  ├── n01440764
+│  │   ├── n01440764_10026.JPEG
+│  │   ├── n01440764_10027.JPEG
+│  │   ├── ......
+│  ├── ......
+├──val/
+│  ├── n01440764
+│  │   ├── ILSVRC2012_val_00000293.JPEG
+│  │   ├── ILSVRC2012_val_00002138.JPEG
+│  │   ├── ......
+│  ├── ......
+```
+
+## Usage
+To use the model with pretrained weights, download the `.pdparams` weight file and change the related file paths in the Python snippet below. The model config files are located in `./configs/`.
+
+For example, assuming the downloaded weight file is stored in `./crossvit_base_224.pdparams`, use the `crossvit_base_224` model in Python as follows:
+```python
+import paddle
+from config import get_config
+from crossvit import build_crossvit as build_model
+# config files in ./configs/
+config = get_config('./configs/crossvit_base_224.yaml')
+# build model
+model = build_model(config)
+# load pretrained weights
+model_state_dict = paddle.load('./crossvit_base_224.pdparams')
+model.set_dict(model_state_dict)
+```
+
+## Evaluation
+To evaluate CrossViT model performance on ImageNet2012 with a single GPU, run the following script from the command line:
+```shell
+sh run_eval.sh
+```
+or
+```shell
+CUDA_VISIBLE_DEVICES=0 \
+python main_single_gpu.py \
+  -cfg=./configs/crossvit_base_224.yaml \
+  -dataset=imagenet2012 \
+  -batch_size=16 \
+  -data_path=/path/to/dataset/imagenet/val \
+  -eval \
+  -pretrained=/path/to/pretrained/model/crossvit_base_224  # .pdparams is NOT needed
+```
+
+ + +Run evaluation using multi-GPUs: + + + +```shell +sh run_eval_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg=./configs/crossvit_base_224.yaml \ + -dataset=imagenet2012 \ + -batch_size=16 \ + -data_path=/path/to/dataset/imagenet/val \ + -eval \ + -pretrained=/path/to/pretrained/model/crossvit_base_224 # .pdparams is NOT needed +``` + +
+ + +## Training +To train the CrossViT Transformer model on ImageNet2012 with single GPU, run the following script using command line: +```shell +sh run_train.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg=./configs/crossvit_base_224.yaml \ + -dataset=imagenet2012 \ + -batch_size=16 \ + -data_path=/path/to/dataset/imagenet/train \ +``` + +
+ + +Run training using multi-GPUs: + + + +```shell +sh run_train_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ +python main_multi_gpu.py \ + -cfg=./configs/crossvit_base_224.yaml \ + -dataset=imagenet2012 \ + -batch_size=32 \ + -data_path=/path/to/dataset/imagenet/train \ +``` + +
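+
+In this repo's training entry points (see the ConvMixer scripts earlier in this diff), checkpoints are saved as a pair of files sharing one prefix: `<prefix>.pdparams` (model weights) and `<prefix>.pdopt` (optimizer state). A minimal sketch of restoring both to continue an interrupted run is shown below; the checkpoint prefix is hypothetical, and `model` / `optimizer` are assumed to be already built:
+
+```python
+import paddle
+
+# hypothetical checkpoint prefix, without the .pdparams/.pdopt extension
+ckpt_prefix = './output/train-xxx/CrossViT-Epoch-100-Loss-2.5'
+model.set_dict(paddle.load(ckpt_prefix + '.pdparams'))          # restore model weights
+optimizer.set_state_dict(paddle.load(ckpt_prefix + '.pdopt'))   # restore optimizer state
+```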
+ + +## Visualization Attention Map +**(coming soon)** + +## Reference +``` +@article{chen2021crossvit, + title={Crossvit: Cross-attention multi-scale vision transformer for image classification}, + author={Chen, Chun-Fu and Fan, Quanfu and Panda, Rameswar}, + journal={arXiv preprint arXiv:2103.14899}, + year={2021} +} +``` \ No newline at end of file diff --git a/image_classification/CrossViT/__init__.py b/image_classification/CrossViT/__init__.py new file mode 100644 index 00000000..e2cbd538 --- /dev/null +++ b/image_classification/CrossViT/__init__.py @@ -0,0 +1 @@ +#init diff --git a/image_classification/CrossViT/augment.py b/image_classification/CrossViT/augment.py new file mode 100644 index 00000000..7a7f081c --- /dev/null +++ b/image_classification/CrossViT/augment.py @@ -0,0 +1,285 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Augmentation""" +""" Rand Augmentation """ +# reference: RandAugment: Practical automated data augmentation with a reduced search space +# https://arxiv.org/abs/1909.13719 + +""" Auto Augmentation """ +# reference: AutoAugment: Learning Augmentation Policies from Data +# https://arxiv.org/abs/1805.09501 + +import random +import numpy as np +from PIL import Image, ImageEnhance, ImageOps + + +def auto_augment_policy_original(): + """25 types of augment policies in original paper""" + policy = [ + [('Posterize', 0.4, 8), ('Rotate', 0.6, 9)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + [('Posterize', 0.6, 7), ('Posterize', 0.6, 6)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Equalize', 0.4, 4), ('Rotate', 0.8, 8)], + [('Solarize', 0.6, 3), ('Equalize', 0.6, 7)], + [('Posterize', 0.8, 5), ('Equalize', 1.0, 2)], + [('Rotate', 0.2, 3), ('Solarize', 0.6, 8)], + [('Equalize', 0.6, 8), ('Posterize', 0.4, 6)], + [('Rotate', 0.8, 8), ('Color', 0.4, 0)], + [('Rotate', 0.4, 9), ('Equalize', 0.6, 2)], + [('Equalize', 0.0, 7), ('Equalize', 0.8, 8)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Rotate', 0.8, 8), ('Color', 1.0, 2)], + [('Color', 0.8, 8), ('Solarize', 0.8, 7)], + [('Sharpness', 0.4, 7), ('Invert', 0.6, 8)], + [('ShearX', 0.6, 5), ('Equalize', 1.0, 9)], + [('Color', 0.4, 0), ('Equalize', 0.6, 3)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + ] + policy = [[SubPolicy(*args) for args in subpolicy] for subpolicy in policy] + return policy + + +def rand_augment_policy_original(magnitude_idx=9): + """ + 14 types of augment policies in original paper + Args: + magnitude_idx: M + """ + policy = [ + ('Posterize', 1, magnitude_idx), ('Rotate', 1, magnitude_idx), + ('Solarize', 1, magnitude_idx), ('AutoContrast', 1, magnitude_idx), + ('Equalize', 1, magnitude_idx), ('Contrast', 1, magnitude_idx), + 
('Color', 1, magnitude_idx), ('Invert', 1, magnitude_idx), + ('Sharpness', 1, magnitude_idx), ('Brightness', 1, magnitude_idx), + ('ShearX', 1, magnitude_idx), ('ShearY', 1, magnitude_idx), + ('TranslateX', 1, magnitude_idx), ('TranslateY', 1, magnitude_idx), + ] + policy = [SubPolicy(*args) for args in policy] + return policy + + +class AutoAugment(): + """Auto Augment + Randomly choose a tuple of augment ops from a list of policy + Then apply the tuple of augment ops to input image + + Examples: + policy = auto_augment_policy_original() + augment = AutoAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy): + self.policy = policy + + def __call__(self, image, policy_idx=None): + if policy_idx is None: + policy_idx = random.randint(0, len(self.policy) - 1) + + sub_policy = self.policy[policy_idx] + for op in sub_policy: + image = op(image) + return image + + +class RandAugment(): + """Rand Augment + Randomly choose N augment ops from a list of K policies + Then apply the N ops to input image + + Examples: + policy = rand_augment_policy_original(magnitude_idx) + augment = RandAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy, num_layers=2): + """ + Args: + policy: list of SubPolicy + num_layers: int + """ + self.policy = policy + self.num_layers = num_layers + + def __call__(self, image): + selected_idx = np.random.choice(len(self.policy), self.num_layers) + + for policy_idx in selected_idx: + sub_policy = self.policy[policy_idx] + image = sub_policy(image) + return image + + +class SubPolicy: + """Subpolicy + Read augment name and magnitude, apply augment with probability + Args: + op_name: str, augment operation name + prob: float, if prob > random prob, apply augment + magnitude_idx: int, index of magnitude in preset magnitude ranges + """ + + def __init__(self, op_name, prob, magnitude_idx): + # ranges of operations' magnitude + ranges = { + 'ShearX': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'ShearY': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'TranslateX': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'TranslateY': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'Rotate': np.linspace(0, 30, 10), # [-30, 30] (by random negative) + 'Color': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Posterize': np.round(np.linspace(8, 4, 10), 0).astype(np.int), # [0, 4] + 'Solarize': np.linspace(256, 0, 10), # [0, 256] + 'Contrast': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Sharpness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Brightness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'AutoContrast': [0] * 10, # no range + 'Equalize': [0] * 10, # no range + 'Invert': [0] * 10, # no range + } + + # augmentation operations + # Lambda is not pickleable for DDP + # image_ops = { + # 'ShearX': lambda image, magnitude: shear_x(image, magnitude), + # 'ShearY': lambda image, magnitude: shear_y(image, magnitude), + # 'TranslateX': lambda image, magnitude: translate_x(image, magnitude), + # 'TranslateY': lambda image, magnitude: translate_y(image, magnitude), + # 'Rotate': lambda image, magnitude: rotate(image, magnitude), + # 'AutoContrast': lambda image, magnitude: auto_contrast(image, magnitude), + # 'Invert': lambda image, magnitude: invert(image, magnitude), + # 'Equalize': lambda image, magnitude: equalize(image, magnitude), + # 'Solarize': lambda image, magnitude: 
solarize(image, magnitude), + # 'Posterize': lambda image, magnitude: posterize(image, magnitude), + # 'Contrast': lambda image, magnitude: contrast(image, magnitude), + # 'Color': lambda image, magnitude: color(image, magnitude), + # 'Brightness': lambda image, magnitude: brightness(image, magnitude), + # 'Sharpness': lambda image, magnitude: sharpness(image, magnitude), + # } + image_ops = { + 'ShearX': shear_x, + 'ShearY': shear_y, + 'TranslateX': translate_x_relative, + 'TranslateY': translate_y_relative, + 'Rotate': rotate, + 'AutoContrast': auto_contrast, + 'Invert': invert, + 'Equalize': equalize, + 'Solarize': solarize, + 'Posterize': posterize, + 'Contrast': contrast, + 'Color': color, + 'Brightness': brightness, + 'Sharpness': sharpness, + } + + self.prob = prob + self.magnitude = ranges[op_name][magnitude_idx] + self.op = image_ops[op_name] + + def __call__(self, image): + if self.prob > random.random(): + image = self.op(image, self.magnitude) + return image + + +# PIL Image transforms +# https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.transform +def shear_x(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, factor, 0, 0, 1, 0), fillcolor=fillcolor) + + +def shear_y(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, factor, 1, 0), fillcolor=fillcolor) + + +def translate_x_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, pixels, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, pixels), fillcolor=fillcolor) + + +def translate_x_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, magnitude, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, magnitude), fillcolor=fillcolor) + + +def rotate(image, magnitude): + rot = image.convert("RGBA").rotate(magnitude) + return Image.composite(rot, + Image.new('RGBA', rot.size, (128,) * 4), + rot).convert(image.mode) + + +def auto_contrast(image, magnitude=None): + return ImageOps.autocontrast(image) + + +def invert(image, magnitude=None): + return ImageOps.invert(image) + + +def equalize(image, magnitude=None): + return ImageOps.equalize(image) + + +def solarize(image, magnitude): + return ImageOps.solarize(image, magnitude) + + +def posterize(image, magnitude): + return ImageOps.posterize(image, magnitude) + + +def contrast(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Contrast(image).enhance(1 + magnitude) + + +def color(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Color(image).enhance(1 + magnitude) + + +def brightness(image, magnitude): + magnitude = magnitude * 
random.choice([-1, 1]) # random negative + return ImageEnhance.Brightness(image).enhance(1 + magnitude) + + +def sharpness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Sharpness(image).enhance(1 + magnitude) + diff --git a/image_classification/CrossViT/config.py b/image_classification/CrossViT/config.py new file mode 100644 index 00000000..b1c51aa4 --- /dev/null +++ b/image_classification/CrossViT/config.py @@ -0,0 +1,193 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Configuration + +Configuration for data, model archtecture, and training, etc. +Config can be set by .yaml file or by argparser(limited usage) + + +""" +import os +from yacs.config import CfgNode as CN +import yaml + +_C = CN() +_C.BASE = [''] + +# data settings +_C.DATA = CN() +_C.DATA.BATCH_SIZE = 256 #256 # train batch_size for single GPU +_C.DATA.BATCH_SIZE_EVAL = 8 #64 # val batch_size for single GPU +_C.DATA.DATA_PATH = '/dataset/imagenet/' # path to dataset +_C.DATA.DATASET = 'imagenet2012' # dataset name +_C.DATA.IMAGE_SIZE = 224 # input image size: 224 for pretrain, 384 for finetune +_C.DATA.CROP_PCT = 0.875 # input image scale ratio, scale is applied before centercrop in eval mode +_C.DATA.NUM_WORKERS = 2 # number of data loading threads +_C.DATA.IMAGENET_MEAN = [0.485, 0.456, 0.406] # [0.5, 0.5, 0.5] +_C.DATA.IMAGENET_STD = [0.229, 0.224, 0.225] # [0.5, 0.5, 0.5] + +# model settings +_C.MODEL = CN() +_C.MODEL.TYPE = 'CrossViT' +_C.MODEL.NAME = 'CrossViT' +_C.MODEL.RESUME = None +_C.MODEL.PRETRAINED = None +_C.MODEL.NUM_CLASSES = 1000 +_C.MODEL.DROPOUT = 0.0 +_C.MODEL.DROPPATH = 0.1 +_C.MODEL.ATTENTION_DROPOUT = 0.0 + +# transformer settings +_C.MODEL.TRANS = CN() +# IMG_SIZE: [240, 224] +# PATCH_SIZE: [12, 16] +# EMBED_DIM: [384, 768] +# DEPTH: [[1, 4, 0], [1, 4, 0], [1, 4, 0]] +# NUM_HEADS: [12, 12] +# MLP_RATIO: [4, 4, 1] +# QKV_BIAS: True +# MULTI_CONV: False +_C.MODEL.TRANS.IMG_SIZE=[240, 224] +_C.MODEL.TRANS.PATCH_SIZE = [12, 16] +_C.MODEL.TRANS.EMBED_DIM = [384, 768] +_C.MODEL.TRANS.MLP_RATIO= [4, 4, 1] +_C.MODEL.TRANS.NUM_HEADS = [12, 12] +_C.MODEL.TRANS.DEPTH = [[1, 4, 0], [1, 4, 0], [1, 4, 0]] +_C.MODEL.TRANS.QKV_BIAS = True +_C.MODEL.TRANS.MULTI_CONV=False +# training settings +_C.TRAIN = CN() +_C.TRAIN.LAST_EPOCH = 0 +_C.TRAIN.NUM_EPOCHS = 300 +_C.TRAIN.WARMUP_EPOCHS = 30 +_C.TRAIN.WEIGHT_DECAY = 0.05 +_C.TRAIN.BASE_LR = 0.004 +_C.TRAIN.WARMUP_START_LR = 1e-6 +_C.TRAIN.END_LR = 1e-5 +_C.TRAIN.GRAD_CLIP = None +_C.TRAIN.ACCUM_ITER = 1 +_C.TRAIN.MODEL_EMA = True +_C.TRAIN.MODEL_EMA_DECAY = 0.99996 +_C.TRAIN.LINEAR_SCALED_LR = None + +_C.TRAIN.LR_SCHEDULER = CN() +_C.TRAIN.LR_SCHEDULER.NAME = 'warmupcosine' +_C.TRAIN.LR_SCHEDULER.MILESTONES = "30, 60, 90" # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_EPOCHS = 30 # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_RATE = 0.1 # only used in StepLRScheduler + +_C.TRAIN.OPTIMIZER = CN() +_C.TRAIN.OPTIMIZER.NAME = 
'AdamW' +_C.TRAIN.OPTIMIZER.EPS = 1e-8 +_C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999) # for adamW +_C.TRAIN.OPTIMIZER.MOMENTUM = 0.9 + +# train augmentation +_C.TRAIN.MIXUP_ALPHA = 0.8 # mixup alpha, enabled if >0 +_C.TRAIN.CUTMIX_ALPHA = 1.0 # cutmix alpha, enabled if >0 +_C.TRAIN.CUTMIX_MINMAX = None # cutmix min/max ratio, overrides alpha +_C.TRAIN.MIXUP_PROB = 1.0 # prob of mixup or cutmix when either/both is enabled +_C.TRAIN.MIXUP_SWITCH_PROB = 0.5 # prob of switching cutmix when both mixup and cutmix enabled +_C.TRAIN.MIXUP_MODE = 'batch' # how to apply mixup/cutmix params, per 'batch', 'pair' or 'elem' + +_C.TRAIN.SMOOTHING = 0.1 +_C.TRAIN.COLOR_JITTER = 0.4 # color jitter factor +_C.TRAIN.AUTO_AUGMENT = False #'rand-m9-mstd0.5-inc1' +_C.TRAIN.RAND_AUGMENT = True + +_C.TRAIN.RANDOM_ERASE_PROB = 0.25 # random erase prob +_C.TRAIN.RANDOM_ERASE_MODE = 'pixel' # random erase mode +_C.TRAIN.RANDOM_ERASE_COUNT = 1 # random erase count +_C.TRAIN.RANDOM_ERASE_SPLIT = False + +_C.TRAIN.DISTILLATION_TYPE = 'hard' # hard, soft, none +_C.TRAIN.DISTILLATION_ALPHA = 0.5 +_C.TRAIN.DISTILLATION_TAU = 1.0 + +# misc +_C.SAVE = "./output" +_C.TAG = "default" +_C.SAVE_FREQ = 10 # freq to save chpt +_C.REPORT_FREQ = 1 # freq to logging info +_C.VALIDATE_FREQ = 100 # freq to do validation +_C.SEED = 0 +_C.EVAL = False # run evaluation only +_C.AMP = False # mix precision training +_C.LOCAL_RANK = 0 +_C.NGPUS = -1 + + +def _update_config_from_file(config, cfg_file): + config.defrost() + with open(cfg_file, 'r') as infile: + yaml_cfg = yaml.load(infile, Loader=yaml.FullLoader) + for cfg in yaml_cfg.setdefault('BASE', ['']): + if cfg: + _update_config_from_file( + config, os.path.join(os.path.dirname(cfg_file), cfg) + ) + print('merging config from {}'.format(cfg_file)) + config.merge_from_file(cfg_file) + config.freeze() + +def update_config(config, args): + """Update config by ArgumentParser + Args: + args: ArgumentParser contains options + Return: + config: updated config + """ + if args.cfg: + _update_config_from_file(config, args.cfg) + config.defrost() + if args.dataset: + config.DATA.DATASET = args.dataset + if args.batch_size: + config.DATA.BATCH_SIZE = args.batch_size + if args.image_size: + config.DATA.IMAGE_SIZE = args.image_size + if args.num_classes: + config.MODEL.NUM_CLASSES = args.num_classes + if args.data_path: + config.DATA.DATA_PATH = args.data_path + if args.output is not None: + config.SAVE = args.output + if args.ngpus: + config.NGPUS = args.ngpus + if args.eval: + config.EVAL = True + config.DATA.BATCH_SIZE_EVAL = args.batch_size + if args.pretrained: + config.MODEL.PRETRAINED = args.pretrained + if args.resume: + config.MODEL.RESUME = args.resume + if args.last_epoch: + config.TRAIN.LAST_EPOCH = args.last_epoch + if args.amp: # only during training + if config.EVAL is True: + config.AMP = False + else: + config.AMP = True + + #config.freeze() + return config + + +def get_config(cfg_file=None): + """Return a clone of config or load from yaml file""" + config = _C.clone() + if cfg_file: + _update_config_from_file(config, cfg_file) + return config diff --git a/image_classification/CrossViT/configs/crossvit_15_224.yaml b/image_classification/CrossViT/configs/crossvit_15_224.yaml new file mode 100644 index 00000000..db3718f0 --- /dev/null +++ b/image_classification/CrossViT/configs/crossvit_15_224.yaml @@ -0,0 +1,15 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: CrossViT + NAME: crossvit_vit_15_224 + TRANS: + IMG_SIZE: [240, 224] + PATCH_SIZE: [12, 16] + EMBED_DIM: [192, 384] + 
DEPTH: [[1, 5, 0], [1, 5, 0], [1, 5, 0]]
+    NUM_HEADS: [6, 6]
+    MLP_RATIO: [3, 3, 1]
+    QKV_BIAS: True
+    MULTI_CONV: False
\ No newline at end of file
diff --git a/image_classification/CrossViT/configs/crossvit_15_dagger_224.yaml b/image_classification/CrossViT/configs/crossvit_15_dagger_224.yaml
new file mode 100644
index 00000000..eaf84b6d
--- /dev/null
+++ b/image_classification/CrossViT/configs/crossvit_15_dagger_224.yaml
@@ -0,0 +1,15 @@
+DATA:
+  IMAGE_SIZE: 224
+  CROP_PCT: 0.875
+MODEL:
+  TYPE: CrossViT
+  NAME: crossvit_vit_15_dagger_224
+  TRANS:
+    IMG_SIZE: [240, 224]
+    PATCH_SIZE: [12, 16]
+    EMBED_DIM: [192, 384]
+    DEPTH: [[1, 5, 0], [1, 5, 0], [1, 5, 0]]
+    NUM_HEADS: [6, 6]
+    MLP_RATIO: [3, 3, 1]
+    QKV_BIAS: True
+    MULTI_CONV: True
\ No newline at end of file
diff --git a/image_classification/CrossViT/configs/crossvit_15_dagger_384.yaml b/image_classification/CrossViT/configs/crossvit_15_dagger_384.yaml
new file mode 100644
index 00000000..8d449d05
--- /dev/null
+++ b/image_classification/CrossViT/configs/crossvit_15_dagger_384.yaml
@@ -0,0 +1,15 @@
+DATA:
+  IMAGE_SIZE: 384
+  CROP_PCT: 1.0
+MODEL:
+  TYPE: CrossViT
+  NAME: crossvit_vit_15_dagger_384
+  TRANS:
+    IMG_SIZE: [408, 384]
+    PATCH_SIZE: [12, 16]
+    EMBED_DIM: [192, 384]
+    DEPTH: [[1, 5, 0], [1, 5, 0], [1, 5, 0]]
+    NUM_HEADS: [6, 6]
+    MLP_RATIO: [3, 3, 1]
+    QKV_BIAS: True
+    MULTI_CONV: True
\ No newline at end of file
diff --git a/image_classification/CrossViT/configs/crossvit_18_224.yaml b/image_classification/CrossViT/configs/crossvit_18_224.yaml
new file mode 100644
index 00000000..4018c8cb
--- /dev/null
+++ b/image_classification/CrossViT/configs/crossvit_18_224.yaml
@@ -0,0 +1,15 @@
+DATA:
+  IMAGE_SIZE: 224
+  CROP_PCT: 0.875
+MODEL:
+  TYPE: CrossViT
+  NAME: crossvit_vit_18_224
+  TRANS:
+    IMG_SIZE: [240, 224]
+    PATCH_SIZE: [12, 16]
+    EMBED_DIM: [224, 448]
+    DEPTH: [[1, 6, 0], [1, 6, 0], [1, 6, 0]]
+    NUM_HEADS: [7, 7]
+    MLP_RATIO: [3, 3, 1]
+    QKV_BIAS: True
+    MULTI_CONV: False
\ No newline at end of file
diff --git a/image_classification/CrossViT/configs/crossvit_18_dagger_224.yaml b/image_classification/CrossViT/configs/crossvit_18_dagger_224.yaml
new file mode 100644
index 00000000..f17659fc
--- /dev/null
+++ b/image_classification/CrossViT/configs/crossvit_18_dagger_224.yaml
@@ -0,0 +1,15 @@
+DATA:
+  IMAGE_SIZE: 224
+  CROP_PCT: 0.875
+MODEL:
+  TYPE: CrossViT
+  NAME: crossvit_vit_18_dagger_224
+  TRANS:
+    IMG_SIZE: [240, 224]
+    PATCH_SIZE: [12, 16]
+    EMBED_DIM: [224, 448]
+    DEPTH: [[1, 6, 0], [1, 6, 0], [1, 6, 0]]
+    NUM_HEADS: [7, 7]
+    MLP_RATIO: [3, 3, 1]
+    QKV_BIAS: True
+    MULTI_CONV: True
\ No newline at end of file
diff --git a/image_classification/CrossViT/configs/crossvit_18_dagger_384.yaml b/image_classification/CrossViT/configs/crossvit_18_dagger_384.yaml
new file mode 100644
index 00000000..3c9814b2
--- /dev/null
+++ b/image_classification/CrossViT/configs/crossvit_18_dagger_384.yaml
@@ -0,0 +1,15 @@
+DATA:
+  IMAGE_SIZE: 384
+  CROP_PCT: 1.0
+MODEL:
+  TYPE: CrossViT
+  NAME: crossvit_vit_18_dagger_384
+  TRANS:
+    IMG_SIZE: [408, 384]
+    PATCH_SIZE: [12, 16]
+    EMBED_DIM: [224, 448]
+    DEPTH: [[1, 6, 0], [1, 6, 0], [1, 6, 0]]
+    NUM_HEADS: [7, 7]
+    MLP_RATIO: [3, 3, 1]
+    QKV_BIAS: True
+    MULTI_CONV: True
\ No newline at end of file
diff --git a/image_classification/CrossViT/configs/crossvit_9_224.yaml b/image_classification/CrossViT/configs/crossvit_9_224.yaml
new file mode 100644
index 00000000..e3eaf809
--- /dev/null
+++ b/image_classification/CrossViT/configs/crossvit_9_224.yaml
@@ -0,0 +1,15 @@
+DATA:
+  IMAGE_SIZE: 224
+ CROP_PCT: 0.875 +MODEL: + TYPE: CrossViT + NAME: crossvit_vit_9_224 + TRANS: + IMG_SIZE: [240, 224] + PATCH_SIZE: [12, 16] + EMBED_DIM: [128, 256] + DEPTH: [[1, 3, 0], [1, 3, 0], [1, 3, 0]] + NUM_HEADS: [4, 4] + MLP_RATIO: [3, 3, 1] + QKV_BIAS: True + MULTI_CONV: False \ No newline at end of file diff --git a/image_classification/CrossViT/configs/crossvit_9_dagger_224.yaml b/image_classification/CrossViT/configs/crossvit_9_dagger_224.yaml new file mode 100644 index 00000000..a201957c --- /dev/null +++ b/image_classification/CrossViT/configs/crossvit_9_dagger_224.yaml @@ -0,0 +1,15 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: CrossViT + NAME: crossvit_vit_9_dagger_224 + TRANS: + IMG_SIZE: [240, 224] + PATCH_SIZE: [12, 16] + EMBED_DIM: [128, 256] + DEPTH: [[1, 3, 0], [1, 3, 0], [1, 3, 0]] + NUM_HEADS: [4, 4] + MLP_RATIO: [3, 3, 1] + QKV_BIAS: True + MULTI_CONV: True \ No newline at end of file diff --git a/image_classification/CrossViT/configs/crossvit_base_224.yaml b/image_classification/CrossViT/configs/crossvit_base_224.yaml new file mode 100644 index 00000000..e14818e8 --- /dev/null +++ b/image_classification/CrossViT/configs/crossvit_base_224.yaml @@ -0,0 +1,15 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: CrossViT + NAME: crossvit_vit_base_224 + TRANS: + IMG_SIZE: [240, 224] + PATCH_SIZE: [12, 16] + EMBED_DIM: [384, 768] + DEPTH: [[1, 4, 0], [1, 4, 0], [1, 4, 0]] + NUM_HEADS: [12, 12] + MLP_RATIO: [4, 4, 1] + QKV_BIAS: True + MULTI_CONV: False \ No newline at end of file diff --git a/image_classification/CrossViT/configs/crossvit_small_224.yaml b/image_classification/CrossViT/configs/crossvit_small_224.yaml new file mode 100644 index 00000000..120c016e --- /dev/null +++ b/image_classification/CrossViT/configs/crossvit_small_224.yaml @@ -0,0 +1,15 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: CrossViT + NAME: crossvit_vit_small_224 + TRANS: + IMG_SIZE: [240, 224] + PATCH_SIZE: [12, 16] + EMBED_DIM: [192, 384] + DEPTH: [[1, 4, 0], [1, 4, 0], [1, 4, 0]] + NUM_HEADS: [6, 6] + MLP_RATIO: [4, 4, 1] + QKV_BIAS: True + MULTI_CONV: False \ No newline at end of file diff --git a/image_classification/CrossViT/configs/crossvit_tiny_224.yaml b/image_classification/CrossViT/configs/crossvit_tiny_224.yaml new file mode 100644 index 00000000..88b7f7fe --- /dev/null +++ b/image_classification/CrossViT/configs/crossvit_tiny_224.yaml @@ -0,0 +1,15 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: CrossViT + NAME: crossvit_vit_tiny_224 + TRANS: + IMG_SIZE: [240, 224] + PATCH_SIZE: [12, 16] + EMBED_DIM: [96, 192] + DEPTH: [[1, 4, 0], [1, 4, 0], [1, 4, 0]] + NUM_HEADS: [3, 3] + MLP_RATIO: [4, 4, 1] + QKV_BIAS: True + MULTI_CONV: False \ No newline at end of file diff --git a/image_classification/CrossViT/crossvit.png b/image_classification/CrossViT/crossvit.png new file mode 100644 index 00000000..f60dc539 Binary files /dev/null and b/image_classification/CrossViT/crossvit.png differ diff --git a/image_classification/CrossViT/crossvit.py b/image_classification/CrossViT/crossvit.py new file mode 100755 index 00000000..f59976b3 --- /dev/null +++ b/image_classification/CrossViT/crossvit.py @@ -0,0 +1,442 @@ +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Cross ViT Class""" + + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from functools import partial +from t2t import T2T, get_sinusoid_encoding +from crossvit_utils import * + +class PatchEmbed(nn.Layer): + """ Image to Patch Embedding + """ + def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768, multi_conv=False): + super().__init__() + img_size = to_2tuple(img_size) + patch_size = to_2tuple(patch_size) + num_patches = (img_size[1] // patch_size[1]) * (img_size[0] // patch_size[0]) + self.img_size = img_size + self.patch_size = patch_size + self.num_patches = num_patches + if multi_conv: + if patch_size[0] == 12: + self.proj = nn.Sequential( + nn.Conv2D(in_chans, embed_dim // 4, kernel_size=7, stride=4, padding=3), + nn.ReLU(), + nn.Conv2D(embed_dim // 4, embed_dim // 2, kernel_size=3, stride=3, padding=0), + nn.ReLU(), + nn.Conv2D(embed_dim // 2, embed_dim, kernel_size=3, stride=1, padding=1), + ) + elif patch_size[0] == 16: + self.proj = nn.Sequential( + nn.Conv2D(in_chans, embed_dim // 4, kernel_size=7, stride=4, padding=3), + nn.ReLU(), + nn.Conv2D(embed_dim // 4, embed_dim // 2, kernel_size=3, stride=2, padding=1), + nn.ReLU(), + nn.Conv2D(embed_dim // 2, embed_dim, kernel_size=3, stride=2, padding=1), + ) + else: + self.proj = nn.Conv2D(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size) + + def forward(self, x): + B, C, H, W = x.shape + # FIXME look at relaxing size constraints + assert H == self.img_size[0] and W == self.img_size[1], \ + f"Input image size ({H}*{W}) doesn't match ({self.img_size[0]}*{self.img_size[1]})." 
+ x = self.proj(x).flatten(2).transpose((0, 2, 1)) + + return x + + +class CrossAttention(nn.Layer): + def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0.): + super().__init__() + self.num_heads = num_heads + head_dim = dim // num_heads + self.scale = qk_scale or head_dim ** -0.5 + + w_attr_1, b_attr_1 = self._init_weights() + self.wq = nn.Linear(dim, dim, weight_attr=w_attr_1, bias_attr=b_attr_1) + w_attr_2, b_attr_2 = self._init_weights() + self.wk = nn.Linear(dim, dim, weight_attr=w_attr_2, bias_attr=b_attr_2) + w_attr_3, b_attr_3 = self._init_weights() + self.wv = nn.Linear(dim, dim, weight_attr=w_attr_3, bias_attr=b_attr_3) + self.attn_drop = nn.Dropout(attn_drop) + w_attr_4, b_attr_4 = self._init_weights() + self.proj = nn.Linear(dim, dim, weight_attr=w_attr_4, bias_attr=b_attr_4) + self.proj_drop = nn.Dropout(proj_drop) + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + + def forward(self, x): + B, N, C = x.shape + q = self.wq(x[:, 0:1, :]).reshape([B, 1, self.num_heads, C // self.num_heads]).transpose( + (0, 2, 1, 3)) # B1C -> B1H(C/H) -> BH1(C/H) + k = self.wk(x).reshape([B, N, self.num_heads, C // self.num_heads]).transpose( + (0, 2, 1, 3)) # BNC -> BNH(C/H) -> BHN(C/H) + v = self.wv(x).reshape([B, N, self.num_heads, C // self.num_heads]).transpose( + (0, 2, 1, 3)) # BNC -> BNH(C/H) -> BHN(C/H) + + attn = (q @ k.transpose((0, 1, 3, 2))) * self.scale # BH1(C/H) @ BH(C/H)N -> BH1N + attn = F.softmax(attn, axis=-1) + attn = self.attn_drop(attn) + + # (BH1N @ BHN(C/H)) -> BH1(C/H) -> B1H(C/H) -> B1C + x = (attn @ v).transpose((0, 2, 1, 3)).reshape([B, 1, C]) + x = self.proj(x) + x = self.proj_drop(x) + return x + + +class CrossAttentionBlock(nn.Layer): + def __init__(self, + dim, + num_heads, + mlp_ratio=4., + qkv_bias=False, + qk_scale=None, + drop=0., + attn_drop=0., + drop_path=0., + has_mlp=True): + super(CrossAttentionBlock, self).__init__() + w_attr_1, b_attr_1 = self._init_weights() + self.norm1 = nn.LayerNorm(dim, weight_attr=w_attr_1, bias_attr=b_attr_1, epsilon=1e-6) + self.attn = CrossAttention(dim, + num_heads=num_heads, + qkv_bias=qkv_bias, + qk_scale=qk_scale, + attn_drop=attn_drop, + proj_drop=drop) + self.drop_path = DropPath(drop_path) if drop_path > 0. 
else Identity() + self.has_mlp = has_mlp + if has_mlp: + self.norm2 = norm_layer(dim) + mlp_hidden_dim = int(dim * mlp_ratio) + self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, dropout=drop) + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(1.0)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + + def forward(self, x): + x = x[:, 0:1, :] + self.drop_path(self.attn(self.norm1(x))) + if self.has_mlp: + x = x + self.drop_path(self.mlp(self.norm2(x))) + + return x + + +class MultiScaleBlock(nn.Layer): + def __init__(self, + dim, + patches, + depth, + num_heads, + mlp_ratio, + qkv_bias=False, + qk_scale=None, + drop=0., + attn_drop=0., + drop_path=[]): + super().__init__() + + num_branches = len(dim) + self.num_branches = num_branches + # different branch could have different embedding size, the first one is the base + self.blocks = nn.LayerList() + for d in range(num_branches): + tmp = [] + for i in range(depth[d]): + tmp.append( + Block(dim=dim[d], + num_heads=num_heads[d], + mlp_ratio=mlp_ratio[d], + qkv_bias=qkv_bias, + qk_scale=qk_scale, + dropout=drop, + attention_dropout=attn_drop, + droppath=drop_path[i])) + if len(tmp) != 0: + self.blocks.append(nn.Sequential(*tmp)) + + if len(self.blocks) == 0: + self.blocks = None + + self.projs = nn.LayerList() + for d in range(num_branches): + if dim[d] == dim[(d + 1) % num_branches] and False: + tmp = [Identity()] + else: + w_attr_1, b_attr_1 = self._init_weights_norm() + w_attr_2, b_attr_2 = self._init_weights_linear() + tmp = [nn.LayerNorm(dim[d], weight_attr=w_attr_1, bias_attr=b_attr_1, epsilon=1e-6), + nn.GELU(), + nn.Linear(dim[d], + dim[(d + 1) % num_branches], + weight_attr=w_attr_2, + bias_attr=b_attr_2)] + self.projs.append(nn.Sequential(*tmp)) + + self.fusion = nn.LayerList() + for d in range(num_branches): + d_ = (d + 1) % num_branches + nh = num_heads[d_] + if depth[-1] == 0: # backward capability: + self.fusion.append( + CrossAttentionBlock(dim=dim[d_], + num_heads=nh, + mlp_ratio=mlp_ratio[d], + qkv_bias=qkv_bias, + qk_scale=qk_scale, + drop=drop, + attn_drop=attn_drop, + drop_path=drop_path[-1], + has_mlp=False)) + else: + tmp = [] + for _ in range(depth[-1]): + tmp.append(CrossAttentionBlock(dim=dim[d_], + num_heads=nh, + mlp_ratio=mlp_ratio[d], + qkv_bias=qkv_bias, + qk_scale=qk_scale, + drop=drop, + attn_drop=attn_drop, + drop_path=drop_path[-1], + has_mlp=False)) + self.fusion.append(nn.Sequential(*tmp)) + + self.revert_projs = nn.LayerList() + for d in range(num_branches): + if dim[(d + 1) % num_branches] == dim[d] and False: + tmp = [Identity()] + else: + w_attr_1, b_attr_1 = self._init_weights_norm() + w_attr_2, b_attr_2 = self._init_weights_linear() + tmp = [nn.LayerNorm(dim[(d + 1) % num_branches], + weight_attr=w_attr_1, + bias_attr=w_attr_1), + nn.GELU(), + nn.Linear(dim[(d + 1) % num_branches], + dim[d], + weight_attr=w_attr_2, + bias_attr=b_attr_2)] + self.revert_projs.append(nn.Sequential(*tmp)) + + def _init_weights_norm(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(1.0)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + + def _init_weights_linear(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + + def forward(self, x): + 
outs_b = [block(x_) for x_, block in zip(x, self.blocks)] + # only take the cls token out + proj_cls_token = [proj(x[:, 0:1]) for x, proj in zip(outs_b, self.projs)] + # cross attention + outs = [] + for i in range(self.num_branches): + tmp = paddle.concat((proj_cls_token[i], outs_b[(i + 1) % self.num_branches][:, 1:, :]), axis=1) + tmp = self.fusion[i](tmp) + reverted_proj_cls_token = self.revert_projs[i](tmp[:, 0:1, :]) + tmp = paddle.concat((reverted_proj_cls_token, outs_b[i][:, 1:, :]), axis=1) + outs.append(tmp) + return outs + + +def _compute_num_patches(img_size, patches): + return [i // p * i // p for i, p in zip(img_size, patches)] + + +class VisionTransformer(nn.Layer): + """ Vision Transformer with support for patch or hybrid CNN input stage + """ + def __init__(self, + img_size=(224, 224), + patch_size=(8, 16), + in_chans=3, + num_classes=1000, + embed_dim=(192, 384), + depth=([1, 3, 1], [1, 3, 1], [1, 3, 1]), + num_heads=(6, 12), + mlp_ratio=(2., 2., 4.), + qkv_bias=False, + qk_scale=None, + drop_rate=0., + attn_drop_rate=0., + drop_path_rate=0., + hybrid_backbone=None, + multi_conv=False): + super().__init__() + + self.num_classes = num_classes + if not isinstance(img_size, list): + img_size = to_2tuple(img_size) + self.img_size = img_size + + num_patches = _compute_num_patches(img_size, patch_size) + self.num_branches = len(patch_size) + self.patch_embed = nn.LayerList() + if hybrid_backbone is None: + self.pos_embed = nn.ParameterList( + [paddle.create_parameter( + shape=[1, 1 + num_patches[i], embed_dim[i]], + dtype='float32', + default_initializer=nn.initializer.Constant( + 0.0)) for i in range(self.num_branches)]) + + for im_s, p, d in zip(img_size, patch_size, embed_dim): + self.patch_embed.append( + PatchEmbed(img_size=im_s, + patch_size=p, + in_chans=in_chans, + embed_dim=d, + multi_conv=multi_conv)) + else: + self.pos_embed = nn.ParameterList() + tokens_type = 'transformer' if hybrid_backbone == 't2t' else 'performer' + for idx, (im_s, p, d) in enumerate(zip(img_size, patch_size, embed_dim)): + self.patch_embed.append( + T2T(im_s, tokens_type=tokens_type, patch_size=p, embed_dim=d)) + self.pos_embed.append( + paddle.to_tensor(data=get_sinusoid_encoding(n_position=1 + num_patches[idx], + d_hid=embed_dim[idx]), + dtype='flaot32', + stop_gradient=False)) + + del self.pos_embed + self.pos_embed = nn.ParameterList( + [paddle.to_tensor( + paddle.zeros(1, 1 + num_patches[i], embed_dim[i]), + dtype='float32', + stop_gradient=False) for i in range(self.num_branches)]) + + self.cls_token = nn.ParameterList( + [paddle.create_parameter( + shape=[1, 1, embed_dim[i]], dtype='float32') for i in range(self.num_branches)]) + self.pos_drop = nn.Dropout(p=drop_rate) + + total_depth = sum([sum(x[-2:]) for x in depth]) + dpr = [x.item() for x in paddle.linspace(0, drop_path_rate, total_depth)] + dpr_ptr = 0 + self.blocks = nn.LayerList() + for idx, block_cfg in enumerate(depth): + curr_depth = max(block_cfg[:-1]) + block_cfg[-1] + dpr_ = dpr[dpr_ptr:dpr_ptr + curr_depth] + blk = MultiScaleBlock(embed_dim, + num_patches, + block_cfg, + num_heads=num_heads, + mlp_ratio=mlp_ratio, + qkv_bias=qkv_bias, + qk_scale=qk_scale, + drop=drop_rate, + attn_drop=attn_drop_rate, + drop_path=dpr_) + dpr_ptr += curr_depth + self.blocks.append(blk) + + + w_attr_1, b_attr_1 = self._init_weights_norm() + w_attr_2, b_attr_2 = self._init_weights_linear() + self.norm = nn.LayerList([nn.LayerNorm(embed_dim[i], + weight_attr=w_attr_1, bias_attr=b_attr_1, epsilon=1e-6) for i in range(self.num_branches)]) + 
self.head = nn.LayerList( + [nn.Linear(embed_dim[i], + num_classes, + weight_attr=w_attr_2, + bias_attr=b_attr_2) if num_classes > 0 else Identity() for i in range(self.num_branches)]) + + def _init_weights_norm(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(1.0)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + + def _init_weights_linear(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + + def no_weight_decay(self): + out = {'cls_token'} + if self.pos_embed[0].requires_grad: + out.add('pos_embed') + return out + + def get_classifier(self): + return self.head + + def reset_classifier(self, num_classes, global_pool=''): + self.num_classes = num_classes + self.head = nn.Linear(self.embed_dim, num_classes) if num_classes > 0 else Identity() + + def forward_features(self, x): + B, C, H, W = x.shape + xs = [] + for i in range(self.num_branches): + x_ = paddle.nn.functional.interpolate( + x, size=(self.img_size[i], + self.img_size[i]), + mode='bicubic') if H != self.img_size[i] else x + tmp = self.patch_embed[i](x_) + cls_tokens = self.cls_token[i].expand([B, -1, -1]) # stole cls_tokens impl from Phil Wang, thanks + # print(cls_tokens.shape,tmp.shape) + tmp = paddle.concat((cls_tokens, tmp), axis=1) + # print(tmp.shape,self.pos_embed[i].shape) + tmp = tmp+self.pos_embed[i] + tmp = self.pos_drop(tmp) + xs.append(tmp) + + for blk in self.blocks: + xs = blk(xs) + # print(xs.shape) + + # NOTE: was before branch token section, move to here to assure all branch token are before layer norm + xs = [self.norm[i](x) for i, x in enumerate(xs)] + out = [x[:, 0] for x in xs] + + return out + + def forward(self, x): + xs = self.forward_features(x) + ce_logits = [self.head[i](x) for i, x in enumerate(xs)] + ce_logits = paddle.mean(paddle.stack(ce_logits, axis=0), axis=0) + return ce_logits + + +def build_crossvit(config, **kwargs): + model = VisionTransformer(img_size=config.MODEL.TRANS.IMG_SIZE, + num_classes=config.MODEL.NUM_CLASSES, + patch_size=config.MODEL.TRANS.PATCH_SIZE, + embed_dim=config.MODEL.TRANS.EMBED_DIM, + depth=config.MODEL.TRANS.DEPTH, + num_heads=config.MODEL.TRANS.NUM_HEADS, + mlp_ratio=config.MODEL.TRANS.MLP_RATIO, + qkv_bias=config.MODEL.TRANS.QKV_BIAS, + multi_conv=config.MODEL.TRANS.MULTI_CONV, + drop_rate=config.MODEL.DROPOUT, + attn_drop_rate=config.MODEL.ATTENTION_DROPOUT, + drop_path_rate=config.MODEL.DROPPATH, + **kwargs) + return model diff --git a/image_classification/CrossViT/crossvit_utils.py b/image_classification/CrossViT/crossvit_utils.py new file mode 100755 index 00000000..2064ddfe --- /dev/null +++ b/image_classification/CrossViT/crossvit_utils.py @@ -0,0 +1,326 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
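A toy illustration (not part of the files in this patch) of how `VisionTransformer.forward` combines the two branches: each branch contributes only its CLS token, a per-branch `nn.Linear` head maps it to class logits, and the logits are averaged, mirroring the `paddle.mean(paddle.stack(ce_logits, axis=0), axis=0)` step above. The embedding widths 192/384 are borrowed from the small/15 configs; the batch size and class count are made up for the sketch.

```python
import paddle
import paddle.nn as nn

# Stand-ins for the per-branch CLS tokens returned by forward_features()
cls_small_patch = paddle.randn([4, 192])   # branch 0, embed_dim 192
cls_large_patch = paddle.randn([4, 384])   # branch 1, embed_dim 384

num_classes = 10
head_small = nn.Linear(192, num_classes)   # plays the role of self.head[0]
head_large = nn.Linear(384, num_classes)   # plays the role of self.head[1]

# Average the per-branch logits, as in VisionTransformer.forward()
logits = paddle.mean(
    paddle.stack([head_small(cls_small_patch), head_large(cls_large_patch)], axis=0),
    axis=0)
print(logits.shape)   # [4, 10]
```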
+ +import collections.abc +import math +import warnings +from itertools import repeat + +import paddle +import paddle.nn as nn +from scipy import special + +# https://github.com/xperzy/PPViT/blob/91ad6dd625cd39ebb854352eeb95991ec438575d/image_classification/ViT/droppath.py +class DropPath(nn.Layer): + """DropPath class""" + + def __init__(self, drop_prob=None): + super(DropPath, self).__init__() + self.drop_prob = drop_prob + + def drop_path(self, inputs): + """drop path op + Args: + input: tensor with arbitrary shape + drop_prob: float number of drop path probability, default: 0.0 + training: bool, if current mode is training, default: False + Returns: + output: output tensor after drop path + """ + # if prob is 0 or eval mode, return original input + if self.drop_prob == 0. or not self.training: + return inputs + keep_prob = 1 - self.drop_prob + keep_prob = paddle.to_tensor(keep_prob, dtype='float32') + shape = (inputs.shape[0],) + (1,) * (inputs.ndim - 1) # shape=(N, 1, 1, 1) + random_tensor = keep_prob + paddle.rand(shape, dtype=inputs.dtype) + random_tensor = random_tensor.floor() # mask + output = inputs.divide(keep_prob) * random_tensor # divide is to keep same output expectation + return output + + def forward(self, inputs): + return self.drop_path(inputs) + + +def _ntuple(n): + def parse(x): + if isinstance(x, collections.abc.Iterable): + return x + return tuple(repeat(x, n)) + + return parse + + +to_2tuple = _ntuple(2) + + +# https://github.com/xperzy/PPViT/blob/91ad6dd625cd39ebb854352eeb95991ec438575d/gan/transGAN/utils.py +def _no_grad_trunc_normal_(tensor, mean, std, a, b): + # Cut & paste from PyTorch official master until it's in a few official releases - RW + # Method based on https://people.sc.fsu.edu/~jburkardt/presentations/truncated_normal.pdf + def norm_cdf(x): + # Computes standard normal cumulative distribution function + return (1. + math.erf(x / math.sqrt(2.))) / 2. + + if (mean < a - 2 * std) or (mean > b + 2 * std): + warnings.warn("mean is more than 2 std from [a, b] in nn.init.trunc_normal_. " + "The distribution of values may be incorrect.", + stacklevel=2) + + with paddle.no_grad(): + # Values are generated by using a truncated uniform distribution and + # then using the inverse CDF for the normal distribution. + # Get upper and lower cdf values + l = norm_cdf((a - mean) / std) + u = norm_cdf((b - mean) / std) + + # Uniformly fill tensor with values from [l, u], then translate to + # [2l-1, 2u-1]. + tensor = paddle.uniform(tensor.shape, min=(2 * l - 1), max=(2 * u - 1)) + + # Use inverse cdf transform for normal distribution to get truncated + # standard normal + tensor = paddle.to_tensor(special.erfinv(tensor.numpy())) + + # Transform to proper mean, std + tensor = paddle.multiply(tensor, paddle.to_tensor(std * math.sqrt(2.))) + tensor = paddle.add(tensor, paddle.to_tensor(mean)) + + # Clamp to ensure it's in the proper range + tensor = paddle.clip(tensor, min=a, max=b) + return tensor + + +def trunc_normal_(tensor, mean=0., std=1., a=-2., b=2.): + # type: (Tensor, float, float, float, float) -> Tensor + r"""Fills the input Tensor with values drawn from a truncated + normal distribution. The values are effectively drawn from the + normal distribution :math:`\mathcal{N}(\text{mean}, \text{std}^2)` + with values outside :math:`[a, b]` redrawn until they are within + the bounds. The method used for generating the random values works + best when :math:`a \leq \text{mean} \leq b`. 
+ Args: + tensor: an n-dimensional `paddle.Tensor` + mean: the mean of the normal distribution + std: the standard deviation of the normal distribution + a: the minimum cutoff value + b: the maximum cutoff value + Examples: + >>> w = paddle.empty(3, 5) + >>> trunc_normal_(w) + """ + return _no_grad_trunc_normal_(tensor, mean, std, a, b) + + +IMAGENET_INCEPTION_MEAN = (0.5, 0.5, 0.5) +IMAGENET_INCEPTION_STD = (0.5, 0.5, 0.5) + + +def _cfg(url='', **kwargs): + return { + 'url': url, + 'num_classes': 1000, 'input_size': (3, 224, 224), 'pool_size': None, + 'crop_pct': .9, 'interpolation': 'bicubic', 'fixed_input_size': True, + 'mean': IMAGENET_INCEPTION_MEAN, 'std': IMAGENET_INCEPTION_STD, + 'first_conv': 'patch_embed.proj', 'classifier': 'head', + **kwargs + } + + +# https://github.com/xperzy/PPViT/blob/91ad6dd625cd39ebb854352eeb95991ec438575d/image_classification/MLP-Mixer/mlp_mixer.py +class Mlp(nn.Layer): + """ MLP module + Impl using nn.Linear and activation is GELU, dropout is applied. + Ops: fc -> act -> dropout -> fc -> dropout + Attributes: + fc1: nn.Linear + fc2: nn.Linear + act: GELU + dropout1: dropout after fc1 + dropout2: dropout after fc2 + """ + + def __init__(self, in_features, hidden_features, dropout): + super(Mlp, self).__init__() + w_attr_1, b_attr_1 = self._init_weights() + self.fc1 = nn.Linear(in_features, + hidden_features, + weight_attr=w_attr_1, + bias_attr=b_attr_1) + + w_attr_2, b_attr_2 = self._init_weights() + self.fc2 = nn.Linear(hidden_features, + in_features, + weight_attr=w_attr_2, + bias_attr=b_attr_2) + self.act = nn.GELU() + self.dropout = nn.Dropout(dropout) + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.XavierUniform()) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Normal(std=1e-6)) + return weight_attr, bias_attr + + def forward(self, x): + x = self.fc1(x) + x = self.act(x) + x = self.dropout(x) + x = self.fc2(x) + x = self.dropout(x) + return x + + +# https://github.com/xperzy/PPViT/blob/develop/image_classification/ViT/transformer.py +class Attention(nn.Layer): + """ Attention module + Attention module for ViT, here q, k, v are assumed the same. + The qkv mappings are stored as one single param. 
+ Attributes: + num_heads: number of heads + attn_head_size: feature dim of single head + all_head_size: feature dim of all heads + qkv: a nn.Linear for q, k, v mapping + scales: 1 / sqrt(single_head_feature_dim) + out: projection of multi-head attention + attn_dropout: dropout for attention + proj_dropout: final dropout before output + softmax: softmax op for attention + """ + + def __init__(self, + embed_dim, + num_heads, + qkv_bias=True, + qk_scale=None, + dropout=0., + attention_dropout=0.): + super().__init__() + self.num_heads = num_heads + self.attn_head_size = int(embed_dim / self.num_heads) + self.all_head_size = self.attn_head_size * self.num_heads + + w_attr_1, b_attr_1 = self._init_weights() + self.qkv = nn.Linear(embed_dim, + self.all_head_size * 3, # weights for q, k, and v + weight_attr=w_attr_1, + bias_attr=b_attr_1 if qkv_bias else False) + + self.scales = self.attn_head_size ** -0.5 + + w_attr_2, b_attr_2 = self._init_weights() + self.out = nn.Linear(embed_dim, + embed_dim, + weight_attr=w_attr_2, + bias_attr=b_attr_2) + + self.attn_dropout = nn.Dropout(attention_dropout) + self.proj_dropout = nn.Dropout(dropout) + self.softmax = nn.Softmax(axis=-1) + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform()) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform()) + return weight_attr, bias_attr + + def transpose_multihead(self, x): + new_shape = x.shape[:-1] + [self.num_heads, self.attn_head_size] + x = x.reshape(new_shape) + x = x.transpose([0, 2, 1, 3]) + return x + + def forward(self, x): + qkv = self.qkv(x).chunk(3, axis=-1) + q, k, v = map(self.transpose_multihead, qkv) + + attn = paddle.matmul(q, k, transpose_y=True) + attn = attn * self.scales + attn = self.softmax(attn) + attn_weights = attn + attn = self.attn_dropout(attn) + + z = paddle.matmul(attn, v) + z = z.transpose([0, 2, 1, 3]) + new_shape = z.shape[:-2] + [self.all_head_size] + z = z.reshape(new_shape) + # reshape + z = self.out(z) + z = self.proj_dropout(z) + return z + + +# https://github.com/xperzy/PPViT/blob/91ad6dd625cd39ebb854352eeb95991ec438575d/image_classification/T2T_ViT/t2t_vit.py +class Identity(nn.Layer): + """ Identity layer + The output of this layer is the input without any change. + Use this layer to avoid using 'if' condition in forward methods + """ + + def __init__(self): + super(Identity, self).__init__() + + def forward(self, x): + return x + + +class Block(nn.Layer): + """ Transformer block layers + Transformer block layers contains regular self-attention layers, + mlp layers, norms layers and residual blocks. + Args: + dim: int, all heads dimension + num_heads: int, num of heads + mlp_ratio: ratio to multiply on mlp input dim as mlp hidden dim, default: 4. + qkv_bias: bool, if True, qkv linear layer is using bias, default: False + qk_scale: float, scale factor to replace dim_head ** -0.5, default: None + dropout: float, dropout rate for projection dropout, default: 0. + attention_dropout: float, dropout rate for attention dropout, default: 0. + droppath: float, drop path rate, default: 0. + """ + + def __init__(self, + dim, + num_heads, + mlp_ratio=4., + qkv_bias=False, + qk_scale=None, + dropout=0., + attention_dropout=0., + droppath=0.): + super().__init__() + self.norm1 = nn.LayerNorm(dim, epsilon=1e-6) + self.attn = Attention(dim, + num_heads=num_heads, + qkv_bias=qkv_bias, + qk_scale=qk_scale, + dropout=dropout, + attention_dropout=attention_dropout) + self.drop_path = DropPath(droppath) if droppath > 0. 
else Identity() + self.norm2 = nn.LayerNorm(dim, epsilon=1e-6) + self.mlp = Mlp(in_features=dim, + hidden_features=int(dim * mlp_ratio), + dropout=dropout) + + def forward(self, x): + h = x + x = self.norm1(x) + x = self.attn(x) + x = self.drop_path(x) + x = h + x + + h = x + x = self.norm2(x) + x = self.mlp(x) + x = self.drop_path(x) + x = h + x + return x diff --git a/image_classification/CrossViT/datasets.py b/image_classification/CrossViT/datasets.py new file mode 100644 index 00000000..984e1fcf --- /dev/null +++ b/image_classification/CrossViT/datasets.py @@ -0,0 +1,221 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Dataset related classes and methods for ViT training and validation +Cifar10, Cifar100 and ImageNet2012 are supported +""" + +import os +import math +from PIL import Image +from paddle.io import Dataset +from paddle.io import DataLoader +from paddle.io import DistributedBatchSampler +from paddle.vision import transforms +from paddle.vision import datasets +from paddle.vision import image_load +from augment import auto_augment_policy_original +from augment import AutoAugment +from augment import rand_augment_policy_original +from augment import RandAugment +from transforms import RandomHorizontalFlip +from random_erasing import RandomErasing + + +class ImageNet2012Dataset(Dataset): + """Build ImageNet2012 dataset + + This class gets train/val imagenet datasets, which loads transfomed data and labels. + + Attributes: + file_folder: path where imagenet images are stored + transform: preprocessing ops to apply on image + img_path_list: list of full path of images in whole dataset + label_list: list of labels of whole dataset + """ + + def __init__(self, file_folder, mode="train", transform=None): + """Init ImageNet2012 Dataset with dataset file path, mode(train/val), and transform""" + super(ImageNet2012Dataset, self).__init__() + assert mode in ["train", "val"] + self.file_folder = file_folder + self.transform = transform + self.img_path_list = [] + self.label_list = [] + + if mode == "train": + self.list_file = os.path.join(self.file_folder, "train_list.txt") + else: + self.list_file = os.path.join(self.file_folder, "val_list.txt") + + with open(self.list_file, 'r') as infile: + for line in infile: + img_path = line.strip().split()[0] + img_label = int(line.strip().split()[1]) + self.img_path_list.append(os.path.join(self.file_folder, img_path)) + self.label_list.append(img_label) + print(f'----- Imagenet2012 image {mode} list len = {len(self.label_list)}') + + def __len__(self): + return len(self.label_list) + + def __getitem__(self, index): + data = Image.open(self.img_path_list[index]).convert('RGB') + data = self.transform(data) + label = self.label_list[index] + + return data, label + + +def get_train_transforms(config): + """ Get training transforms + + For training, a RandomResizedCrop is applied, then normalization is applied with + [0.5, 0.5, 0.5] mean and std. 
The input pixel values must be rescaled to [0, 1.] + Outputs is converted to tensor + + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + aug_op_list = [] + # random crop and resize + aug_op_list.append( + transforms.RandomResizedCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE), + scale=(0.05, 1.0))) + # auto_augment / color jitter + if config.TRAIN.AUTO_AUGMENT: + policy = auto_augment_policy_original() + auto_augment = AutoAugment(policy) + aug_op_list.append(auto_augment) + elif config.TRAIN.RAND_AUGMENT: + policy = rand_augment_policy_original() + rand_augment = RandAugment(policy) + aug_op_list.append(rand_augment) + else: + jitter = (float(config.TRAIN.COLOR_JITTER),) * 3 + aug_op_list.append(transforms.ColorJitter(*jitter)) + # other ops + aug_op_list.append(transforms.ToTensor()) + aug_op_list.append(transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, + std=config.DATA.IMAGENET_STD)) + # random erasing + if config.TRAIN.RANDOM_ERASE_PROB > 0.: + random_erasing = RandomErasing(prob=config.TRAIN.RANDOM_ERASE_PROB, + mode=config.TRAIN.RANDOM_ERASE_MODE, + max_count=config.TRAIN.RANDOM_ERASE_COUNT, + num_splits=config.TRAIN.RANDOM_ERASE_SPLIT) + aug_op_list.append(random_erasing) + + transforms_train = transforms.Compose(aug_op_list) + return transforms_train + + +def get_val_transforms(config): + """ Get training transforms + + For validation, image is first Resize then CenterCrop to image_size. + Then normalization is applied with [0.5, 0.5, 0.5] mean and std. + The input pixel values must be rescaled to [0, 1.] + Outputs is converted to tensor + + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + + scale_size = int(math.floor(config.DATA.IMAGE_SIZE / config.DATA.CROP_PCT)) + transforms_val = transforms.Compose([ + transforms.Resize(scale_size, 'bicubic'), + transforms.CenterCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE)), + transforms.ToTensor(), + transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD), + ]) + return transforms_val + + +def get_dataset(config, mode='train'): + """ Get dataset from config and mode (train/val) + + Returns the related dataset object according to configs and mode(train/val) + + Args: + config: configs contains dataset related settings. 
see config.py for details + Returns: + dataset: dataset object + """ + + assert mode in ['train', 'val'] + if config.DATA.DATASET == "cifar10": + if mode == 'train': + dataset = datasets.Cifar10(mode=mode, transform=get_train_transforms(config)) + else: + mode = 'test' + dataset = datasets.Cifar10(mode=mode, transform=get_val_transforms(config)) + elif config.DATA.DATASET == "cifar100": + if mode == 'train': + dataset = datasets.Cifar100(mode=mode, transform=get_train_transforms(config)) + else: + mode = 'test' + dataset = datasets.Cifar100(mode=mode, transform=get_val_transforms(config)) + elif config.DATA.DATASET == "imagenet2012": + if mode == 'train': + dataset = ImageNet2012Dataset(config.DATA.DATA_PATH, + mode=mode, + transform=get_train_transforms(config)) + else: + dataset = ImageNet2012Dataset(config.DATA.DATA_PATH, + mode=mode, + transform=get_val_transforms(config)) + else: + raise NotImplementedError( + "[{config.DATA.DATASET}] Only cifar10, cifar100, imagenet2012 are supported now") + return dataset + + +def get_dataloader(config, dataset, mode='train', multi_process=False): + """Get dataloader with config, dataset, mode as input, allows multiGPU settings. + + Multi-GPU loader is implements as distributedBatchSampler. + + Args: + config: see config.py for details + dataset: paddle.io.dataset object + mode: train/val + multi_process: if True, use DistributedBatchSampler to support multi-processing + Returns: + dataloader: paddle.io.DataLoader object. + """ + + if mode == 'train': + batch_size = config.DATA.BATCH_SIZE + else: + batch_size = config.DATA.BATCH_SIZE_EVAL + + if multi_process is True: + sampler = DistributedBatchSampler(dataset, + batch_size=batch_size, + shuffle=(mode == 'train')) + dataloader = DataLoader(dataset, + batch_sampler=sampler, + num_workers=config.DATA.NUM_WORKERS) + else: + dataloader = DataLoader(dataset, + batch_size=batch_size, + num_workers=config.DATA.NUM_WORKERS, + shuffle=(mode == 'train')) + return dataloader diff --git a/image_classification/CrossViT/losses.py b/image_classification/CrossViT/losses.py new file mode 100644 index 00000000..04377eac --- /dev/null +++ b/image_classification/CrossViT/losses.py @@ -0,0 +1,144 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
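The dataset helpers above are driven by the yacs config. The following is a hypothetical usage sketch, not code from this patch: it assumes an ImageNet-style folder containing `val_list.txt` and that it is run from the CrossViT directory so `config` and `datasets` import locally.

```python
from config import get_config
from datasets import get_dataset, get_dataloader

config = get_config()                          # defaults from config.py (imagenet2012, 224x224)
config.defrost()
config.DATA.DATA_PATH = '/dataset/imagenet/'   # placeholder path, point this at your data
config.freeze()

dataset_val = get_dataset(config, mode='val')
dataloader_val = get_dataloader(config, dataset_val, mode='val', multi_process=False)

for images, labels in dataloader_val:
    # images: [BATCH_SIZE_EVAL, 3, IMAGE_SIZE, IMAGE_SIZE] after resize/center-crop/normalize
    print(images.shape, labels.shape)
    break
```

With `multi_process=True`, the same helper wraps the dataset in a `DistributedBatchSampler`, which is the path used by the multi-GPU training script.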
+ +""" Implement Loss functions """ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class LabelSmoothingCrossEntropyLoss(nn.Layer): + """ cross entropy loss for label smoothing + Args: + smoothing: float, label smoothing rate + x: tensor, predictions (default is before softmax) with shape [N, num_classes] as default + target: tensor, target label with shape [N] as default + weight: tensor, optional, a manual rescaling weight given to each class + reduction: str, optional, indicate how to average the loss by batch_size, + default is ``'mean'``, the candicates are ``'none'`` | ``'mean'`` | ``'sum'`` + axis: int, optional, the index of dimension to perform softmax calculations, + default is ``-1``, if `axis` is not -1 -> the shape of x and target may not be default + use_softmax: bool, optional, if `use_softmax` is ``False``, ``x`` should be after softmax, + default is ``True``, the candicates are ``True`` | ``False`` + name: str, optional, the name of the operator, default is ``None``, + for more information, please refer to :ref:`api_guide_Name`. + Return: + loss: float, cross entropy loss value + """ + def __init__(self, + smoothing=0.1, + weight=None, + reduction='mean', + axis=-1, + use_softmax=True, + name=None): + super().__init__() + assert 0 <= smoothing < 1.0 + self.smoothing = smoothing + self.weight = weight + self.reduction = reduction + self.axis = axis + self.use_softmax = use_softmax + self.name = name + + def forward(self, x, target): + target = paddle.nn.functional.one_hot(target, num_classes=x.shape[1]) + target = paddle.nn.functional.label_smooth(target, epsilon=self.smoothing) + loss = paddle.nn.functional.cross_entropy( + x, + target, + weight=self.weight, + reduction=self.reduction, + soft_label=True, + axis=self.axis, + use_softmax=self.use_softmax, + name=self.name) + return loss + + +class SoftTargetCrossEntropyLoss(nn.Layer): + """ cross entropy loss for soft target + Args: + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, soft target with shape [N, num_classes] + Returns: + loss: float, the mean loss value + """ + def __init__(self): + super().__init__() + + def forward(self, x, target): + loss = paddle.sum(-target * F.log_softmax(x, axis=-1), axis=-1) + return loss.mean() + + +class DistillationLoss(nn.Layer): + """Distillation loss function + This layer includes the orginal loss (criterion) and a extra + distillation loss (criterion), which computes the loss with + different type options, between current model and + a teacher model as its supervision. 
+ + Args: + base_criterion: nn.Layer, the original criterion + teacher_model: nn.Layer, the teacher model as supervision + distillation_type: str, one of ['none', 'soft', 'hard'] + alpha: float, ratio of base loss (* (1-alpha)) + and distillation loss( * alpha) + tao: float, temperature in distillation + """ + def __init__(self, + base_criterion, + teacher_model, + distillation_type, + alpha, + tau): + super().__init__() + assert distillation_type in ['none', 'soft', 'hard'] + self.base_criterion = base_criterion + self.teacher_model = teacher_model + self.type = distillation_type + self.alpha = alpha + self.tau = tau + + def forward(self, inputs, outputs, targets): + """ + Args: + inputs: tensor, the orginal model inputs + outputs: tensor, the outputs of the model + outputds_kd: tensor, the distillation outputs of the model, + this is usually obtained by a separate branch + in the last layer of the model + targets: tensor, the labels for the base criterion + """ + outputs, outputs_kd = outputs[0], outputs[1] + base_loss = self.base_criterion(outputs, targets) + if self.type == 'none': + return base_loss + + with paddle.no_grad(): + teacher_outputs = self.teacher_model(inputs) + + if self.type == 'soft': + distillation_loss = F.kl_div( + F.log_softmax(outputs_kd / self.tau, axis=1), + F.log_softmax(teacher_outputs / self.tau, axis=1), + reduction='sum') * (self.tau * self.tau) / outputs_kd.numel() + elif self.type == 'hard': + distillation_loss = F.cross_entropy(outputs_kd, teacher_outputs.argmax(axis=1)) + + loss = base_loss * (1 - self.alpha) + distillation_loss * self.alpha + return loss + + diff --git a/image_classification/CrossViT/main_multi_gpu.py b/image_classification/CrossViT/main_multi_gpu.py new file mode 100644 index 00000000..73932db0 --- /dev/null +++ b/image_classification/CrossViT/main_multi_gpu.py @@ -0,0 +1,608 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
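For the `soft` branch of `DistillationLoss` above, the blended objective is `loss = base_loss * (1 - alpha) + distillation_loss * alpha`, where the distillation term is a temperature-scaled KL divergence against the teacher. Below is a toy numerical sketch of that formula with random tensors standing in for student and teacher outputs; it mirrors the computation in the class rather than reusing it, and all sizes here are illustrative.

```python
import paddle
import paddle.nn.functional as F

tau, alpha = 1.0, 0.5
outputs = paddle.randn([4, 1000])         # stand-in for the student's classification logits
outputs_kd = paddle.randn([4, 1000])      # stand-in for the student's distillation logits
teacher_outputs = paddle.randn([4, 1000]) # stand-in for the (frozen) teacher's logits
targets = paddle.randint(0, 1000, [4])    # hard labels for the base criterion

base_loss = F.cross_entropy(outputs, targets)
distillation_loss = F.kl_div(
    F.log_softmax(outputs_kd / tau, axis=1),
    F.log_softmax(teacher_outputs / tau, axis=1),
    reduction='sum') * (tau * tau) / outputs_kd.numel()

loss = base_loss * (1 - alpha) + distillation_loss * alpha
```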
+ +"""CrossViT training/validation using multiple GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +import paddle.distributed as dist +from datasets import get_dataloader +from datasets import get_dataset +from utils import AverageMeter +from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn +from config import get_config +from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from model_ema import ModelEma +from crossvit import build_crossvit as build_model + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('CrossViT') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + model_ema=None, + mixup_fn=None, + amp=False, + local_logger=None, + master_logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + model_ema: ModelEma, model moving average instance + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + master_train_loss_meter.avg: float, average loss on all processes/gpus + master_train_acc_meter.avg: float, average 
top1 accuracy on all processes/gpus + train_time: float, training time + """ + model.train() + train_loss_meter = AverageMeter() + train_acc_meter = AverageMeter() + master_train_loss_meter = AverageMeter() + master_train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() + + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + if model_ema is not None and dist.get_rank() == 0: + model_ema.update(model) + + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = paddle.to_tensor(image.shape[0]) + + # sync from other gpus for overall loss and acc + master_loss = loss.clone() + master_acc = acc.clone() + master_batch_size = batch_size.clone() + dist.all_reduce(master_loss) + dist.all_reduce(master_acc) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc = master_acc / dist.get_world_size() + master_train_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_train_acc_meter.update(master_acc.numpy()[0], master_batch_size.numpy()[0]) + + train_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + train_acc_meter.update(acc.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_train_loss_meter.avg:.4f}, " + + f"Avg Acc: {master_train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return (train_loss_meter.avg, + train_acc_meter.avg, + master_train_loss_meter.avg, + master_train_acc_meter.avg, + train_time) + + +def validate(dataloader, + model, + criterion, + total_batch, + debug_steps=100, + local_logger=None, + master_logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + total_epoch: int, total num of epoch, for logging + debug_steps: int, num of iters to log info, default: 100 + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current 
process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + master_val_loss_meter.avg: float, average loss on all processes/gpus + master_val_acc1_meter.avg: float, average top1 accuracy on all processes/gpus + master_val_acc5_meter.avg: float, average top5 accuracy on all processes/gpus + val_time: float, validation time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = AverageMeter() + master_val_loss_meter = AverageMeter() + master_val_acc1_meter = AverageMeter() + master_val_acc5_meter = AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = paddle.to_tensor(image.shape[0]) + + master_loss = loss.clone() + master_acc1 = acc1.clone() + master_acc5 = acc5.clone() + master_batch_size = batch_size.clone() + + dist.all_reduce(master_loss) + dist.all_reduce(master_acc1) + dist.all_reduce(master_acc5) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc1 = master_acc1 / dist.get_world_size() + master_acc5 = master_acc5 / dist.get_world_size() + + master_val_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc1_meter.update(master_acc1.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc5_meter.update(master_acc5.numpy()[0], master_batch_size.numpy()[0]) + + val_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + val_acc1_meter.update(acc1.numpy()[0], batch_size.numpy()[0]) + val_acc5_meter.update(acc5.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {master_val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {master_val_acc5_meter.avg:.4f}") + val_time = time.time() - time_st + return (val_loss_meter.avg, + val_acc1_meter.avg, + val_acc5_meter.avg, + master_val_loss_meter.avg, + master_val_acc1_meter.avg, + master_val_acc5_meter.avg, + val_time) + + +def main_worker(*args): + # STEP 0: Preparation + config = args[0] + dist.init_parallel_env() + last_epoch = config.TRAIN.LAST_EPOCH + world_size = dist.get_world_size() + local_rank = dist.get_rank() + seed = config.SEED + local_rank + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + # logger for each process/gpu + local_logger = get_logger( + filename=os.path.join(config.SAVE, 'log_{}.txt'.format(local_rank)), + logger_name='local_logger') + # overall logger + if local_rank == 0: + master_logger = get_logger( + filename=os.path.join(config.SAVE, 'log.txt'), + logger_name='master_logger') + master_logger.info(f'\n{config}') + else: + master_logger = None + local_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + if local_rank == 0: + master_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + + # STEP 1: Create model + model = build_model(config) + # define model ema + 
model_ema = None + if not config.EVAL and config.TRAIN.MODEL_EMA and local_rank == 0: + model_ema = ModelEma(model, decay=config.TRAIN.MODEL_EMA_DECAY) + model = paddle.DataParallel(model) + + # STEP 2: Create train and val dataloader + dataset_train, dataset_val = args[1], args[2] + # Create training dataloader + if not config.EVAL: + dataloader_train = get_dataloader(config, dataset_train, 'train', True) + total_batch_train = len(dataloader_train) + local_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + if local_rank == 0: + master_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + # Create validation dataloader + dataloader_val = get_dataloader(config, dataset_val, 'test', True) + total_batch_val = len(dataloader_val) + local_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + if local_rank == 0: + master_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH, + ) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = 
paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + local_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip, + #apply_decay_param_fun=get_exclude_from_weight_decay_fn([ + # 'absolute_pos_embed', 'relative_position_bias_table']), + ) + else: + local_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 6: Load pretrained model / load resumt model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + model.set_dict(model_state) + local_logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + if local_rank == 0: + master_logger.info( + f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') + optimizer.set_state_dict(opt_state) + local_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + if local_rank == 0: + master_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + # load ema model + if model_ema is not None and os.path.isfile(config.MODEL.RESUME + '-EMA.pdparams'): + model_ema_state = paddle.load(config.MODEL.RESUME + '-EMA.pdparams') + model_ema.module.set_state_dict(model_ema_state) + local_logger.info(f'----- Load model ema from {config.MODEL.RESUME}-EMA.pdparams') + if local_rank == 0: + master_logger.info(f'----- Load model ema from {config.MODEL.RESUME}-EMA.pdparams') + + # STEP 7: Validation (eval mode) + if config.EVAL: + local_logger.info('----- Start Validating') + if local_rank == 0: + master_logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, 
avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 8: Start training and validation (train mode) + local_logger.info(f"Start training from epoch {last_epoch+1}.") + if local_rank == 0: + master_logger.info(f"Start training from epoch {last_epoch+1}.") + for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + local_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + if local_rank == 0: + master_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + train_loss, train_acc, avg_loss, avg_acc, train_time = train( + dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=total_batch_train, + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + model_ema=model_ema, + mixup_fn=mixup_fn, + amp=config.AMP, + local_logger=local_logger, + master_logger=master_logger) + + scheduler.step() + + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {avg_loss:.4f}, " + + f"Train Acc: {avg_acc:.4f}, " + + f"time: {train_time:.2f}") + + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + local_logger.info(f'----- Validation after Epoch: {epoch}') + if local_rank == 0: + master_logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if local_rank == 0: + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + master_logger.info(f"----- Save model: {model_path}.pdparams") + master_logger.info(f"----- Save optim: {model_path}.pdopt") + if model_ema is not None: + model_ema_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}-EMA") + 
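+                    # EMA weights are saved to a separate '-EMA.pdparams' file; when
+                    # MODEL.RESUME is set, STEP 6 above reloads them into model_ema.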
paddle.save(model_ema.state_dict(), model_ema_path + '.pdparams') + master_logger.info(f"----- Save ema model: {model_ema_path}.pdparams") + + +def main(): + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + + # get dataset and start DDP + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + else: + dataset_train = None + dataset_val = get_dataset(config, mode='val') + config.NGPUS = len(paddle.static.cuda_places()) if config.NGPUS == -1 else config.NGPUS + dist.spawn(main_worker, args=(config, dataset_train, dataset_val, ), nprocs=config.NGPUS) + + +if __name__ == "__main__": + main() diff --git a/image_classification/CrossViT/main_single_gpu.py b/image_classification/CrossViT/main_single_gpu.py new file mode 100644 index 00000000..c8ad8bc4 --- /dev/null +++ b/image_classification/CrossViT/main_single_gpu.py @@ -0,0 +1,453 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
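+
+# Illustrative single-GPU launch (the config path, dataset name and data path below are
+# placeholders, assuming an ImageNet-style folder and a config under ./configs/):
+#   python main_single_gpu.py -cfg ./configs/crossvit_base_224.yaml \
+#       -dataset imagenet2012 -batch_size 32 -data_path /path/to/imagenet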
+
+"""CrossViT training/validation using single GPU """
+
+import sys
+import os
+import time
+import logging
+import copy
+import argparse
+import random
+import numpy as np
+import paddle
+import paddle.nn as nn
+import paddle.nn.functional as F
+from datasets import get_dataloader
+from datasets import get_dataset
+from crossvit import build_crossvit as build_model
+from utils import AverageMeter
+from utils import WarmupCosineScheduler
+from utils import get_exclude_from_weight_decay_fn
+from config import get_config
+from config import update_config
+from mixup import Mixup
+from losses import LabelSmoothingCrossEntropyLoss
+from losses import SoftTargetCrossEntropyLoss
+from model_ema import ModelEma
+
+
+def get_arguments():
+    """return arguments, this will overwrite the config after loading yaml file"""
+    parser = argparse.ArgumentParser('CrossViT')
+    parser.add_argument('-cfg', type=str, default=None)
+    parser.add_argument('-dataset', type=str, default=None)
+    parser.add_argument('-batch_size', type=int, default=None)
+    parser.add_argument('-image_size', type=int, default=None)
+    parser.add_argument('-data_path', type=str, default=None)
+    parser.add_argument('-output', type=str, default=None)
+    parser.add_argument('-ngpus', type=int, default=None)
+    parser.add_argument('-num_classes', type=int, default=None)
+    parser.add_argument('-pretrained', type=str, default=None)
+    parser.add_argument('-resume', type=str, default=None)
+    parser.add_argument('-last_epoch', type=int, default=None)
+    parser.add_argument('-eval', action='store_true')
+    parser.add_argument('-amp', action='store_true')
+    arguments = parser.parse_args()
+    return arguments
+
+
+def get_logger(filename, logger_name=None):
+    """set logging file and format
+    Args:
+        filename: str, full path of the logger file to write
+        logger_name: str, the logger name, e.g., 'master_logger', 'local_logger'
+    Return:
+        logger: python logger
+    """
+    log_format = "%(asctime)s %(message)s"
+    logging.basicConfig(stream=sys.stdout, level=logging.INFO,
+                        format=log_format, datefmt="%m%d %I:%M:%S %p")
+    # different name is needed when creating multiple logger in one process
+    logger = logging.getLogger(logger_name)
+    fh = logging.FileHandler(os.path.join(filename))
+    fh.setFormatter(logging.Formatter(log_format))
+    logger.addHandler(fh)
+    return logger
+
+
+def train(dataloader,
+          model,
+          criterion,
+          optimizer,
+          epoch,
+          total_epochs,
+          total_batch,
+          debug_steps=100,
+          accum_iter=1,
+          model_ema=None,
+          mixup_fn=None,
+          amp=False,
+          logger=None):
+    """Training for one epoch
+    Args:
+        dataloader: paddle.io.DataLoader, dataloader instance
+        model: nn.Layer, a ViT model
+        criterion: nn.criterion
+        epoch: int, current epoch
+        total_epochs: int, total num of epochs
+        total_batch: int, total num of batches for one epoch
+        debug_steps: int, num of iters to log info, default: 100
+        accum_iter: int, num of iters for accumulating gradients, default: 1
+        model_ema: ModelEma, model moving average instance
+        mixup_fn: Mixup, mixup instance, default: None
+        amp: bool, if True, use mix precision training, default: False
+        logger: logger for logging, default: None
+    Returns:
+        train_loss_meter.avg: float, average loss on current process/gpu
+        train_acc_meter.avg: float, average top1 accuracy on current process/gpu
+        train_time: float, training time
+    """
+    model.train()
+    train_loss_meter = AverageMeter()
+    train_acc_meter = AverageMeter()
+
+    if amp is True:
+        scaler = 
paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) # output[0]: class_token, output[1]: distill_token + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() + + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + if model_ema is not None: + model_ema.update(model) + + # average of output and kd_output, like model eval mode + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = image.shape[0] + train_loss_meter.update(loss.numpy()[0], batch_size) + train_acc_meter.update(acc.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return train_loss_meter.avg, train_acc_meter.avg, train_time + + +def validate(dataloader, model, criterion, total_batch, debug_steps=100, logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + logger: logger for logging, default: None + Returns: + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + val_time: float, valitaion time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = image.shape[0] + val_loss_meter.update(loss.numpy()[0], batch_size) + val_acc1_meter.update(acc1.numpy()[0], batch_size) + val_acc5_meter.update(acc5.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + + val_time = time.time() - time_st + return val_loss_meter.avg, val_acc1_meter.avg, val_acc5_meter.avg, val_time + + +def main(): + # STEP 0: Preparation + # 
config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + last_epoch = config.TRAIN.LAST_EPOCH + seed = config.SEED + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + logger = get_logger(filename=os.path.join(config.SAVE, 'log.txt')) + logger.info(f'\n{config}') + + # STEP 1: Create model + model = build_model(config) + # define model ema + model_ema = None + if not config.EVAL and config.TRAIN.MODEL_EMA: + model_ema = ModelEma(model, decay=config.TRAIN.MODEL_EMA_DECAY) + + # STEP 2: Create train and val dataloader + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + dataloader_train = get_dataloader(config, dataset_train, 'train', False) + dataset_val = get_dataset(config, mode='val') + dataloader_val = get_dataloader(config, dataset_val, 'val', False) + + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from official code) + + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH, + ) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == 
"multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip, + #apply_decay_param_fun=get_exclude_from_weight_decay_fn([ + # 'absolute_pos_embed', 'relative_position_bias_table']), + ) + else: + logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 6: Load pretrained model or load resume model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + model.set_dict(model_state) + logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME + '.pdopt') + optimizer.set_state_dict(opt_state) + logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + # load ema model + if model_ema is not None and os.path.isfile(config.MODEL.RESUME + '-EMA.pdparams'): + model_ema_state = paddle.load(config.MODEL.RESUME + '-EMA.pdparams') + model_ema.module.set_state_dict(model_ema_state) + logger.info(f'----- Load model ema from {config.MODEL.RESUME}-EMA.pdparams') + + # STEP 7: Validation (eval mode) + if config.EVAL: + logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 8: Start training and validation (train mode) + logger.info(f"Start training from epoch {last_epoch+1}.") + for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + 
logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + train_loss, train_acc, train_time = train(dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=len(dataloader_train), + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + model_ema=model_ema, + mixup_fn=mixup_fn, + amp=config.AMP, + logger=logger) + scheduler.step() + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + logger.info(f"----- Save model: {model_path}.pdparams") + logger.info(f"----- Save optim: {model_path}.pdopt") + if model_ema is not None: + model_ema_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}-EMA") + paddle.save(model_ema.state_dict(), model_ema_path + '.pdparams') + logger.info(f"----- Save ema model: {model_ema_path}.pdparams") + + +if __name__ == "__main__": + main() diff --git a/image_classification/CrossViT/mixup.py b/image_classification/CrossViT/mixup.py new file mode 100644 index 00000000..1d2db493 --- /dev/null +++ b/image_classification/CrossViT/mixup.py @@ -0,0 +1,225 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""mixup and cutmix for batch data""" +import numpy as np +import paddle + + +def rand_bbox(image_shape, lam, count=None): + """ CutMix bbox by lam value + Generate 1 random bbox by value lam. lam is the cut size rate. + The cut_size is computed by sqrt(1-lam) * image_size. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + count: int, number of bbox to generate + """ + image_h, image_w = image_shape[-2:] + cut_rate = np.sqrt(1. 
- lam) + cut_h = int(cut_rate * image_h) + cut_w = int(cut_rate * image_w) + + # get random bbox center + cy = np.random.randint(0, image_h, size=count) + cx = np.random.randint(0, image_w, size=count) + + # get bbox coords + bbox_x1 = np.clip(cx - cut_w // 2, 0, image_w) + bbox_y1 = np.clip(cy - cut_h // 2, 0, image_h) + bbox_x2 = np.clip(cx + cut_w // 2, 0, image_w) + bbox_y2 = np.clip(cy + cut_h // 2, 0, image_h) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # while in pytorch, it will return [] tensor + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def rand_bbox_minmax(image_shape, minmax, count=None): + """ CutMix bbox by min and max value + Generate 1 random bbox by min and max percentage values. + Minmax is a tuple/list of min and max percentage vlaues + applied to the image width and height. + + Args: + image_shape: tuple/list, image height and width + minmax: tuple/list, min and max percentage values of image size + count: int, number of bbox to generate + """ + assert len(minmax) == 2 + image_h, image_w = image_shape[-2:] + min_ratio = minmax[0] + max_ratio = minmax[1] + cut_h = np.random.randint(int(image_h * min_ratio), int(image_h * max_ratio), size=count) + cut_w = np.random.randint(int(image_w * min_ratio), int(image_w * max_ratio), size=count) + + bbox_x1 = np.random.randint(0, image_w - cut_w, size=count) + bbox_y1 = np.random.randint(0, image_h - cut_h, size=count) + bbox_x2 = bbox_x1 + cut_w + bbox_y2 = bbox_y1 + cut_h + + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def cutmix_generate_bbox_adjust_lam(image_shape, lam, minmax=None, correct_lam=True, count=None): + """Generate bbox and apply correction for lambda + If the mimmax is None, apply the standard cutmix by lam value, + If the minmax is set, apply the cutmix by min and max percentage values. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + minmax: tuple/list, min and max percentage values of image size + correct_lam: bool, if True, correct the lam value by the generated bbox + count: int, number of bbox to generate + """ + if minmax is not None: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox_minmax(image_shape, minmax, count) + else: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox(image_shape, lam, count) + + if correct_lam or minmax is not None: + image_h, image_w = image_shape[-2:] + bbox_area = (bbox_y2 - bbox_y1) * (bbox_x2 - bbox_x1) + lam = 1. - bbox_area / float(image_h * image_w) + return (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam + + +def one_hot(x, num_classes, on_value=1., off_value=0.): + """ Generate one-hot vector for label smoothing + Args: + x: tensor, contains label/class indices + num_classes: int, num of classes (len of the one-hot vector) + on_value: float, the vector value at label index, default=1. + off_value: float, the vector value at non-label indices, default=0. + Returns: + one_hot: tensor, tensor with on value at label index and off value + at non-label indices. + """ + x = x.reshape_([-1, 1]) + x_smoothed = paddle.full((x.shape[0], num_classes), fill_value=off_value) + for i in range(x.shape[0]): + x_smoothed[i, x[i]] = on_value + return x_smoothed + + +def mixup_one_hot(label, num_classes, lam=1., smoothing=0.): + """ mixup and label smoothing in batch + label smoothing is firstly applied, then + mixup is applied by mixing the bacth and its flip, + with a mixup rate. 
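+    Worked example (illustrative): with num_classes=5, smoothing=0.1 and lam=0.7,
+    off_value = 0.1 / 5 = 0.02, on_value = 1 - 0.1 + 0.02 = 0.92, and the returned
+    target equals 0.7 * one_hot(label) + 0.3 * one_hot(label flipped along the batch).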
+ + Args: + label: tensor, label tensor with shape [N], contains the class indices + num_classes: int, num of all classes + lam: float, mixup rate, default=1.0 + smoothing: float, label smoothing rate + """ + off_value = smoothing / num_classes + on_value = 1. - smoothing + off_value + y1 = one_hot(label, num_classes, on_value, off_value) + y2 = one_hot(label.flip(axis=[0]), num_classes, on_value, off_value) + return y2 * (1 - lam) + y1 * lam + + +class Mixup: + """Mixup class + Args: + mixup_alpha: float, mixup alpha for beta distribution, default=1.0, + cutmix_alpha: float, cutmix alpha for beta distribution, default=0.0, + cutmix_minmax: list/tuple, min and max value for cutmix ratio, default=None, + prob: float, if random prob < prob, do not use mixup, default=1.0, + switch_prob: float, prob of switching mixup and cutmix, default=0.5, + mode: string, mixup up, now only 'batch' is supported, default='batch', + correct_lam: bool, if True, apply correction of lam, default=True, + label_smoothing: float, label smoothing rate, default=0.1, + num_classes: int, num of classes, default=1000 + """ + def __init__(self, + mixup_alpha=1.0, + cutmix_alpha=0.0, + cutmix_minmax=None, + prob=1.0, + switch_prob=0.5, + mode='batch', + correct_lam=True, + label_smoothing=0.1, + num_classes=1000): + self.mixup_alpha = mixup_alpha + self.cutmix_alpha = cutmix_alpha + self.cutmix_minmax = cutmix_minmax + if cutmix_minmax is not None: + assert len(cutmix_minmax) == 2 + self.cutmix_alpha = 1.0 + self.mix_prob = prob + self.switch_prob = switch_prob + self.label_smoothing = label_smoothing + self.num_classes = num_classes + self.mode = mode + self.correct_lam = correct_lam + assert mode == 'batch', 'Now only batch mode is supported!' + + def __call__(self, x, target): + assert x.shape[0] % 2 == 0, "Batch size should be even" + lam = self._mix_batch(x) + target = mixup_one_hot(target, self.num_classes, lam, self.label_smoothing) + return x, target + + def get_params(self): + """Decide to use cutmix or regular mixup by sampling and + sample lambda for mixup + """ + lam = 1. + use_cutmix = False + use_mixup = np.random.rand() < self.mix_prob + if use_mixup: + if self.mixup_alpha > 0. and self.cutmix_alpha > 0.: + use_cutmix = np.random.rand() < self.switch_prob + alpha = self.cutmix_alpha if use_cutmix else self.mixup_alpha + lam_mix = np.random.beta(alpha, alpha) + elif self.mixup_alpha == 0. and self.cutmix_alpha > 0.: + use_cutmix=True + lam_mix = np.random.beta(self.cutmix_alpha, self.cutmix_alpha) + elif self.mixup_alpha > 0. 
and self.cutmix_alpha == 0.: + lam_mix = np.random.beta(self.mixup_alpha, self.mixup_alpha) + else: + raise ValueError('mixup_alpha and cutmix_alpha cannot be all 0') + lam = float(lam_mix) + return lam, use_cutmix + + def _mix_batch(self, x): + """mixup/cutmix by adding batch data and its flipped version""" + lam, use_cutmix = self.get_params() + if lam == 1.: + return lam + if use_cutmix: + (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam = cutmix_generate_bbox_adjust_lam( + x.shape, + lam, + minmax=self.cutmix_minmax, + correct_lam=self.correct_lam) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # but in pytorch, it will return [] tensor without errors + if int(bbox_x1) != int(bbox_x2) and int(bbox_y1) != int(bbox_y2): + x[:, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] = x.flip(axis=[0])[ + :, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] + else: + x_flipped = x.flip(axis=[0]) + x_flipped = x_flipped * (1 - lam) + x.set_value(x * (lam) + x_flipped) + return lam diff --git a/image_classification/CrossViT/model_ema.py b/image_classification/CrossViT/model_ema.py new file mode 100644 index 00000000..8a636765 --- /dev/null +++ b/image_classification/CrossViT/model_ema.py @@ -0,0 +1,61 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" Implement the Exponential Model Averaging +This is paddle hack from: +https://github.com/rwightman/pytorch-image-models/blob/master/timm/utils/model_ema.py +""" + +import copy +from collections import OrderedDict +import paddle +import paddle.nn as nn + + +class ModelEma: + """Model Ema + A moving average is kept of model weights and buffers. + Note that for multiple gpu, ema must be defined after mode init, + but before DataParallel. 
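+    Usage sketch (single process, using only names defined in this class):
+        model_ema = ModelEma(model, decay=0.999)
+        # after each optimizer.step():
+        model_ema.update(model)
+        # evaluate or checkpoint the averaged weights via
+        # model_ema.module / model_ema.state_dict()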
+ + Args: + model: nn.Layer, original modela with learnable params + decay: float, decay rate for each update, default: 0.999 + """ + def __init__(self, model, decay=0.999): + self.module = copy.deepcopy(model) + self.module.eval() + self.decay = decay + + @paddle.no_grad() + def _update(self, model, update_fn): + # update ema model parameters by model parameters + for (_, ema_param), (_, model_param) in zip( + self.module.named_parameters(), model.named_parameters()): + ema_param.set_value(copy.deepcopy(update_fn(ema_param, model_param))) + + # update ema model buffers by model buffers + for (_, ema_buf), (_, model_buf) in zip( + self.module.named_buffers(), model.named_buffers()): + ema_buf.set_value(copy.deepcopy(update_fn(ema_buf, model_buf))) + + def update(self, model): + self._update(model, update_fn=lambda e, m: self.decay * e + (1 - self.decay) * m) + + def set(self, model): + self._update(model, update_fn=lambda e, m: m) + + def state_dict(self): + return self.module.state_dict() + diff --git a/image_classification/CrossViT/port_weights/__init__.py b/image_classification/CrossViT/port_weights/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/image_classification/CrossViT/port_weights/demo.py b/image_classification/CrossViT/port_weights/demo.py new file mode 100644 index 00000000..6016bd67 --- /dev/null +++ b/image_classification/CrossViT/port_weights/demo.py @@ -0,0 +1,81 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse + +import cv2 +import numpy as np +import paddle +from config import get_config +from config import update_config +from crossvit import build_crossvit + + +def print_model_named_params(model): + """ + model params print + """ + print('----------------------------------') + for name, param in model.named_parameters(): + print(name, param.shape) + print('----------------------------------') + + +def print_model_named_buffers(model): + """ + buffer params print + """ + print('----------------------------------') + for name, param in model.named_buffers(): + print(name, param.shape) + print('----------------------------------') + + +def main(): + """ + build model from config, + image data pre-process,at here we don't sub image-net mean and divided std,but it doesn't effect the final result + zerbra.jpg predict id will be 340,if there nothing wrong. 
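+    Assumed invocation (run from the CrossViT folder, with the converted weights at
+    port_weights/pd_crossvit_base_224.pdparams and zerbra.jpeg in the working directory):
+        python port_weights/demo.py -cfg configs/crossvit_base_224.yaml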
+ """ + parser = argparse.ArgumentParser('CrossViT') + parser.add_argument('-cfg', type=str, default="configs/crossvit_base_224.yaml") + args = parser.parse_args() + config = get_config() + config = update_config(config, args) + + paddle.set_device('cpu') + + paddle_model = build_crossvit(config) + state_dict = paddle.load('port_weights/pd_crossvit_base_224.pdparams') + paddle_model.load_dict(state_dict) + paddle_model.eval() + print_model_named_params(paddle_model) + print_model_named_buffers(paddle_model) + + print('+++++++++++++++++++++++++++++++++++') + + image_x = cv2.imread('zerbra.jpeg') + resize_x = cv2.resize(image_x, (224, 224)) / 255. + resize_x = resize_x.transpose((2, 0, 1)) + resize_x = np.expand_dims(resize_x, axis=0).astype('float32') + print(resize_x.shape) + x_paddle = paddle.to_tensor(resize_x) + print(x_paddle.shape) + out_paddle = paddle_model(x_paddle) + out_paddle = out_paddle.cpu().numpy() + print('========================================================') + print(np.argmax(out_paddle)) + print('done!') + + +main() diff --git a/image_classification/CrossViT/port_weights/google_weights.txt b/image_classification/CrossViT/port_weights/google_weights.txt new file mode 100644 index 00000000..d3bb9904 --- /dev/null +++ b/image_classification/CrossViT/port_weights/google_weights.txt @@ -0,0 +1,10 @@ +pd_crossvit_base_224.pdparams https://drive.google.com/file/d/1tQE0gR2yKFhLFy6_CohER4E0JG_c1HF3/view?usp=sharing +pd_crossvit_tiny_224.pdparams https://drive.google.com/file/d/1asOr6WqwOv-XN2fLROhHXz-McxS8paet/view?usp=sharing +pd_crossvit_small_224.pdparams https://drive.google.com/file/d/1KjYBH89gCEfT5xrjKkQ_o7_9amC-Zsp4/view?usp=sharing +pd_crossvit_18_dagger_224.pdparams https://drive.google.com/file/d/1xJTRDpbjG84RThm1x3mOOTcQbc3edg_C/view?usp=sharing +pd_crossvit_18_224.pdparams https://drive.google.com/file/d/1tDwJzitlyxp7Pp0TaCKeFKxt1NxHLysX/view?usp=sharing +pd_crossvit_15_dagger_384.pdparams https://drive.google.com/file/d/1jOtXNAS2qpD4UXH_XK6w7KDSZZwjAK83/view?usp=sharing +pd_crossvit_15_dagger_224.pdparams https://drive.google.com/file/d/1oBO6kFJ2PcE-ifcPkOyv1GbZJTTCEm_d/view?usp=sharing +pd_crossvit_15_224.pdparams https://drive.google.com/file/d/1l_ty0u5Tak18U4DtQOKLj9elZ4sY7yjd/view?usp=sharing +pd_crossvit_9_dagger_224.pdparams https://drive.google.com/file/d/13DFAwtc2-AlubypCsEI6JaOgEaVVBzDI/view?usp=sharing +pd_crossvit_9_224.pdparams https://drive.google.com/file/d/1VaiE40DGaXtzIVLkCp7v2Vy0p4MbrO-R/view?usp=sharing \ No newline at end of file diff --git a/image_classification/CrossViT/port_weights/load_pytorch_weights.py b/image_classification/CrossViT/port_weights/load_pytorch_weights.py new file mode 100755 index 00000000..90c93058 --- /dev/null +++ b/image_classification/CrossViT/port_weights/load_pytorch_weights.py @@ -0,0 +1,138 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
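+
+# How the conversion below pairs weights (summary of the code in this file): paddle and
+# torch parameters are matched purely by their order in named_parameters(), and 2-D
+# weights are transposed because torch nn.Linear stores [out_features, in_features]
+# while paddle nn.Linear stores [in_features, out_features]. A minimal sketch of the
+# transpose step, assuming a torch tensor `w_th` and a paddle parameter `w_pd`:
+#
+#     value = w_th.detach().numpy()
+#     if value.ndim == 2:
+#         value = value.transpose((1, 0))  # [out, in] -> [in, out]
+#     w_pd.set_value(value)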
+ + +from image_classification.CrossViT.models.crossvit import * +import os +import torch +import numpy as np +from image_classification.CrossViT.crossvit import * + +def print_model_named_params(model): + print('----------------------------------') + for name, param in model.named_parameters(): + print(name, param.shape) + print('----------------------------------') + + +def print_model_named_buffers(model): + print('----------------------------------') + for name, param in model.named_buffers(): + print(name, param.shape) + print('----------------------------------') + + +def perpare_mapping(paddle_model,torch_model): + mapping=[] + for (name, param),(name2, param2) in zip(paddle_model.named_parameters(),torch_model.named_parameters()): + layer_mapping = [ + (name2, name) + ] + mapping.extend(layer_mapping) + return mapping + +def convert(torch_model, paddle_model): + def _set_value(th_name, pd_name, transpose=True): + th_shape = th_params[th_name].shape + pd_shape = tuple(pd_params[pd_name].shape) # paddle shape default type is list + # assert th_shape == pd_shape, f'{th_shape} != {pd_shape}' + print(f'**SET** {th_name} {th_shape} **TO** {pd_name} {pd_shape}') + if isinstance(th_params[th_name], torch.nn.parameter.Parameter): + value = th_params[th_name].data.numpy() + else: + value = th_params[th_name].numpy() + + if len(value.shape) == 2 and transpose: + value = value.transpose((1, 0)) + pd_params[pd_name].set_value(value) + + # 1. get paddle and torch model parameters + pd_params = {} + th_params = {} + for name, param in paddle_model.named_parameters(): + pd_params[name] = param + for name, param in torch_model.named_parameters(): + th_params[name] = param + + for name, param in paddle_model.named_buffers(): + pd_params[name] = param + for name, param in torch_model.named_buffers(): + th_params[name] = param + + # 2. get name mapping pairs + mapping = perpare_mapping(paddle_model,torch_model) + + # 3. 
set torch param values to paddle params: may needs transpose on weights + for th_name, pd_name in mapping: + if th_name in th_params.keys(): # nn.Parameters + _set_value(th_name, pd_name) + else: # weight & bias + th_name_w = f'{th_name}.weight' + pd_name_w = f'{pd_name}.weight' + _set_value(th_name_w, pd_name_w) + + if f'{th_name}.bias' in th_params.keys(): + th_name_b = f'{th_name}.bias' + pd_name_b = f'{pd_name}.bias' + _set_value(th_name_b, pd_name_b) + + return paddle_model + + +def main(): + paddle.set_device('cpu') + paddle_model = pd_crossvit_base_224() + paddle_model.eval() + print_model_named_params(paddle_model) + print_model_named_buffers(paddle_model) + + print('+++++++++++++++++++++++++++++++++++') + device = torch.device('cpu') + torch_model =crossvit_base_224(pretrained=True) + torch_model = torch_model.to(device) + torch_model.eval() + print_model_named_params(torch_model) + print_model_named_buffers(torch_model) + + # convert weights + paddle_model = convert(torch_model, paddle_model) + + print("model convert done...") + + # check correctness + x = np.random.randn(2, 3, 224, 224).astype('float32') + x_paddle = paddle.to_tensor(x) + x_torch = torch.Tensor(x).to(device) + + out_torch = torch_model(x_torch) + print('torch infer done...') + out_paddle = paddle_model(x_paddle) + + out_torch = out_torch.data.cpu().numpy() + out_paddle = out_paddle.cpu().numpy() + + print(out_torch.shape, out_paddle.shape) + print(out_torch[0, 0:100]) + print('========================================================') + print(out_paddle[0, 0:100]) + assert np.allclose(out_torch, out_paddle, atol=1e-4) + + # save weights for paddle model + model_path = os.path.join('./pd_crossvit_base_224.pdparams') + paddle.save(paddle_model.state_dict(), model_path) + print('all done') + + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/image_classification/CrossViT/port_weights/load_pytorch_weights_15_224.py b/image_classification/CrossViT/port_weights/load_pytorch_weights_15_224.py new file mode 100644 index 00000000..a336b427 --- /dev/null +++ b/image_classification/CrossViT/port_weights/load_pytorch_weights_15_224.py @@ -0,0 +1,138 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ + +from image_classification.CrossViT.models.crossvit import * +import os +import torch +import numpy as np +from image_classification.CrossViT.crossvit import * + +def print_model_named_params(model): + print('----------------------------------') + for name, param in model.named_parameters(): + print(name, param.shape) + print('----------------------------------') + + +def print_model_named_buffers(model): + print('----------------------------------') + for name, param in model.named_buffers(): + print(name, param.shape) + print('----------------------------------') + + +def perpare_mapping(paddle_model,torch_model): + mapping=[] + for (name, param),(name2, param2) in zip(paddle_model.named_parameters(),torch_model.named_parameters()): + layer_mapping = [ + (name2, name) + ] + mapping.extend(layer_mapping) + return mapping + +def convert(torch_model, paddle_model): + def _set_value(th_name, pd_name, transpose=True): + th_shape = th_params[th_name].shape + pd_shape = tuple(pd_params[pd_name].shape) # paddle shape default type is list + # assert th_shape == pd_shape, f'{th_shape} != {pd_shape}' + print(f'**SET** {th_name} {th_shape} **TO** {pd_name} {pd_shape}') + if isinstance(th_params[th_name], torch.nn.parameter.Parameter): + value = th_params[th_name].data.numpy() + else: + value = th_params[th_name].numpy() + + if len(value.shape) == 2 and transpose: + value = value.transpose((1, 0)) + pd_params[pd_name].set_value(value) + + # 1. get paddle and torch model parameters + pd_params = {} + th_params = {} + for name, param in paddle_model.named_parameters(): + pd_params[name] = param + for name, param in torch_model.named_parameters(): + th_params[name] = param + + for name, param in paddle_model.named_buffers(): + pd_params[name] = param + for name, param in torch_model.named_buffers(): + th_params[name] = param + + # 2. get name mapping pairs + mapping = perpare_mapping(paddle_model,torch_model) + + # 3. 
set torch param values to paddle params: may needs transpose on weights + for th_name, pd_name in mapping: + if th_name in th_params.keys(): # nn.Parameters + _set_value(th_name, pd_name) + else: # weight & bias + th_name_w = f'{th_name}.weight' + pd_name_w = f'{pd_name}.weight' + _set_value(th_name_w, pd_name_w) + + if f'{th_name}.bias' in th_params.keys(): + th_name_b = f'{th_name}.bias' + pd_name_b = f'{pd_name}.bias' + _set_value(th_name_b, pd_name_b) + + return paddle_model + + +def main(): + paddle.set_device('cpu') + paddle_model = pd_crossvit_15_224() + paddle_model.eval() + print_model_named_params(paddle_model) + print_model_named_buffers(paddle_model) + + print('+++++++++++++++++++++++++++++++++++') + device = torch.device('cpu') + torch_model =crossvit_15_224(pretrained=True) + torch_model = torch_model.to(device) + torch_model.eval() + print_model_named_params(torch_model) + print_model_named_buffers(torch_model) + + # convert weights + paddle_model = convert(torch_model, paddle_model) + + print("model convert done...") + + # check correctness + x = np.random.randn(2, 3, 224, 224).astype('float32') + x_paddle = paddle.to_tensor(x) + x_torch = torch.Tensor(x).to(device) + + out_torch = torch_model(x_torch) + print('torch infer done...') + out_paddle = paddle_model(x_paddle) + + out_torch = out_torch.data.cpu().numpy() + out_paddle = out_paddle.cpu().numpy() + + print(out_torch.shape, out_paddle.shape) + print(out_torch[0, 0:100]) + print('========================================================') + print(out_paddle[0, 0:100]) + assert np.allclose(out_torch, out_paddle, atol=1e-3) + + # save weights for paddle model + model_path = os.path.join('./pd_crossvit_15_224.pdparams') + paddle.save(paddle_model.state_dict(), model_path) + print('all done') + + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/image_classification/CrossViT/port_weights/load_pytorch_weights_15_dagger_224.py b/image_classification/CrossViT/port_weights/load_pytorch_weights_15_dagger_224.py new file mode 100644 index 00000000..b4d094f7 --- /dev/null +++ b/image_classification/CrossViT/port_weights/load_pytorch_weights_15_dagger_224.py @@ -0,0 +1,138 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ + +from image_classification.CrossViT.models.crossvit import * +import os +import torch +import numpy as np +from image_classification.CrossViT.crossvit import * + +def print_model_named_params(model): + print('----------------------------------') + for name, param in model.named_parameters(): + print(name, param.shape) + print('----------------------------------') + + +def print_model_named_buffers(model): + print('----------------------------------') + for name, param in model.named_buffers(): + print(name, param.shape) + print('----------------------------------') + + +def perpare_mapping(paddle_model,torch_model): + mapping=[] + for (name, param),(name2, param2) in zip(paddle_model.named_parameters(),torch_model.named_parameters()): + layer_mapping = [ + (name2, name) + ] + mapping.extend(layer_mapping) + return mapping + +def convert(torch_model, paddle_model): + def _set_value(th_name, pd_name, transpose=True): + th_shape = th_params[th_name].shape + pd_shape = tuple(pd_params[pd_name].shape) # paddle shape default type is list + # assert th_shape == pd_shape, f'{th_shape} != {pd_shape}' + print(f'**SET** {th_name} {th_shape} **TO** {pd_name} {pd_shape}') + if isinstance(th_params[th_name], torch.nn.parameter.Parameter): + value = th_params[th_name].data.numpy() + else: + value = th_params[th_name].numpy() + + if len(value.shape) == 2 and transpose: + value = value.transpose((1, 0)) + pd_params[pd_name].set_value(value) + + # 1. get paddle and torch model parameters + pd_params = {} + th_params = {} + for name, param in paddle_model.named_parameters(): + pd_params[name] = param + for name, param in torch_model.named_parameters(): + th_params[name] = param + + for name, param in paddle_model.named_buffers(): + pd_params[name] = param + for name, param in torch_model.named_buffers(): + th_params[name] = param + + # 2. get name mapping pairs + mapping = perpare_mapping(paddle_model,torch_model) + + # 3. 
set torch param values to paddle params: may needs transpose on weights + for th_name, pd_name in mapping: + if th_name in th_params.keys(): # nn.Parameters + _set_value(th_name, pd_name) + else: # weight & bias + th_name_w = f'{th_name}.weight' + pd_name_w = f'{pd_name}.weight' + _set_value(th_name_w, pd_name_w) + + if f'{th_name}.bias' in th_params.keys(): + th_name_b = f'{th_name}.bias' + pd_name_b = f'{pd_name}.bias' + _set_value(th_name_b, pd_name_b) + + return paddle_model + + +def main(): + paddle.set_device('cpu') + paddle_model = pd_crossvit_15_dagger_224() + paddle_model.eval() + print_model_named_params(paddle_model) + print_model_named_buffers(paddle_model) + + print('+++++++++++++++++++++++++++++++++++') + device = torch.device('cpu') + torch_model =crossvit_15_dagger_224(pretrained=True) + torch_model = torch_model.to(device) + torch_model.eval() + print_model_named_params(torch_model) + print_model_named_buffers(torch_model) + + # convert weights + paddle_model = convert(torch_model, paddle_model) + + print("model convert done...") + + # check correctness + x = np.random.randn(2, 3, 224, 224).astype('float32') + x_paddle = paddle.to_tensor(x) + x_torch = torch.Tensor(x).to(device) + + out_torch = torch_model(x_torch) + print('torch infer done...') + out_paddle = paddle_model(x_paddle) + + out_torch = out_torch.data.cpu().numpy() + out_paddle = out_paddle.cpu().numpy() + + print(out_torch.shape, out_paddle.shape) + print(out_torch[0, 0:100]) + print('========================================================') + print(out_paddle[0, 0:100]) + assert np.allclose(out_torch, out_paddle, atol=1e-3) + + # save weights for paddle model + model_path = os.path.join('./pd_crossvit_15_dagger_224.pdparams') + paddle.save(paddle_model.state_dict(), model_path) + print('all done') + + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/image_classification/CrossViT/port_weights/load_pytorch_weights_15_dagger_384.py b/image_classification/CrossViT/port_weights/load_pytorch_weights_15_dagger_384.py new file mode 100644 index 00000000..e20be77f --- /dev/null +++ b/image_classification/CrossViT/port_weights/load_pytorch_weights_15_dagger_384.py @@ -0,0 +1,138 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ + +from image_classification.CrossViT.models.crossvit import * +import os +import torch +import numpy as np +from image_classification.CrossViT.crossvit import * + +def print_model_named_params(model): + print('----------------------------------') + for name, param in model.named_parameters(): + print(name, param.shape) + print('----------------------------------') + + +def print_model_named_buffers(model): + print('----------------------------------') + for name, param in model.named_buffers(): + print(name, param.shape) + print('----------------------------------') + + +def perpare_mapping(paddle_model,torch_model): + mapping=[] + for (name, param),(name2, param2) in zip(paddle_model.named_parameters(),torch_model.named_parameters()): + layer_mapping = [ + (name2, name) + ] + mapping.extend(layer_mapping) + return mapping + +def convert(torch_model, paddle_model): + def _set_value(th_name, pd_name, transpose=True): + th_shape = th_params[th_name].shape + pd_shape = tuple(pd_params[pd_name].shape) # paddle shape default type is list + # assert th_shape == pd_shape, f'{th_shape} != {pd_shape}' + print(f'**SET** {th_name} {th_shape} **TO** {pd_name} {pd_shape}') + if isinstance(th_params[th_name], torch.nn.parameter.Parameter): + value = th_params[th_name].data.numpy() + else: + value = th_params[th_name].numpy() + + if len(value.shape) == 2 and transpose: + value = value.transpose((1, 0)) + pd_params[pd_name].set_value(value) + + # 1. get paddle and torch model parameters + pd_params = {} + th_params = {} + for name, param in paddle_model.named_parameters(): + pd_params[name] = param + for name, param in torch_model.named_parameters(): + th_params[name] = param + + for name, param in paddle_model.named_buffers(): + pd_params[name] = param + for name, param in torch_model.named_buffers(): + th_params[name] = param + + # 2. get name mapping pairs + mapping = perpare_mapping(paddle_model,torch_model) + + # 3. 
set torch param values to paddle params: may needs transpose on weights + for th_name, pd_name in mapping: + if th_name in th_params.keys(): # nn.Parameters + _set_value(th_name, pd_name) + else: # weight & bias + th_name_w = f'{th_name}.weight' + pd_name_w = f'{pd_name}.weight' + _set_value(th_name_w, pd_name_w) + + if f'{th_name}.bias' in th_params.keys(): + th_name_b = f'{th_name}.bias' + pd_name_b = f'{pd_name}.bias' + _set_value(th_name_b, pd_name_b) + + return paddle_model + + +def main(): + paddle.set_device('cpu') + paddle_model = pd_crossvit_15_dagger_384() + paddle_model.eval() + print_model_named_params(paddle_model) + print_model_named_buffers(paddle_model) + + print('+++++++++++++++++++++++++++++++++++') + device = torch.device('cpu') + torch_model =crossvit_15_dagger_384(pretrained=True) + torch_model = torch_model.to(device) + torch_model.eval() + print_model_named_params(torch_model) + print_model_named_buffers(torch_model) + + # convert weights + paddle_model = convert(torch_model, paddle_model) + + print("model convert done...") + + # check correctness + x = np.random.randn(2, 3, 224, 224).astype('float32') + x_paddle = paddle.to_tensor(x) + x_torch = torch.Tensor(x).to(device) + + out_torch = torch_model(x_torch) + print('torch infer done...') + out_paddle = paddle_model(x_paddle) + + out_torch = out_torch.data.cpu().numpy() + out_paddle = out_paddle.cpu().numpy() + + print(out_torch.shape, out_paddle.shape) + print(out_torch[0, 0:100]) + print('========================================================') + print(out_paddle[0, 0:100]) + assert np.allclose(out_torch, out_paddle, atol=1e-2) + + # save weights for paddle model + model_path = os.path.join('./pd_crossvit_15_dagger_384.pdparams') + paddle.save(paddle_model.state_dict(), model_path) + print('all done') + + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/image_classification/CrossViT/port_weights/load_pytorch_weights_18_224.py b/image_classification/CrossViT/port_weights/load_pytorch_weights_18_224.py new file mode 100644 index 00000000..a82cec4a --- /dev/null +++ b/image_classification/CrossViT/port_weights/load_pytorch_weights_18_224.py @@ -0,0 +1,138 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
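+ +"""Port the torch crossvit_18_224 pretrained weights into the Paddle model, check the outputs match, and save ./pd_crossvit_18_224.pdparams."""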
+ + +from image_classification.CrossViT.models.crossvit import * +import os +import torch +import numpy as np +from image_classification.CrossViT.crossvit import * + +def print_model_named_params(model): + print('----------------------------------') + for name, param in model.named_parameters(): + print(name, param.shape) + print('----------------------------------') + + +def print_model_named_buffers(model): + print('----------------------------------') + for name, param in model.named_buffers(): + print(name, param.shape) + print('----------------------------------') + + +def perpare_mapping(paddle_model,torch_model): + mapping=[] + for (name, param),(name2, param2) in zip(paddle_model.named_parameters(),torch_model.named_parameters()): + layer_mapping = [ + (name2, name) + ] + mapping.extend(layer_mapping) + return mapping + +def convert(torch_model, paddle_model): + def _set_value(th_name, pd_name, transpose=True): + th_shape = th_params[th_name].shape + pd_shape = tuple(pd_params[pd_name].shape) # paddle shape default type is list + # assert th_shape == pd_shape, f'{th_shape} != {pd_shape}' + print(f'**SET** {th_name} {th_shape} **TO** {pd_name} {pd_shape}') + if isinstance(th_params[th_name], torch.nn.parameter.Parameter): + value = th_params[th_name].data.numpy() + else: + value = th_params[th_name].numpy() + + if len(value.shape) == 2 and transpose: + value = value.transpose((1, 0)) + pd_params[pd_name].set_value(value) + + # 1. get paddle and torch model parameters + pd_params = {} + th_params = {} + for name, param in paddle_model.named_parameters(): + pd_params[name] = param + for name, param in torch_model.named_parameters(): + th_params[name] = param + + for name, param in paddle_model.named_buffers(): + pd_params[name] = param + for name, param in torch_model.named_buffers(): + th_params[name] = param + + # 2. get name mapping pairs + mapping = perpare_mapping(paddle_model,torch_model) + + # 3. 
set torch param values to paddle params: may needs transpose on weights + for th_name, pd_name in mapping: + if th_name in th_params.keys(): # nn.Parameters + _set_value(th_name, pd_name) + else: # weight & bias + th_name_w = f'{th_name}.weight' + pd_name_w = f'{pd_name}.weight' + _set_value(th_name_w, pd_name_w) + + if f'{th_name}.bias' in th_params.keys(): + th_name_b = f'{th_name}.bias' + pd_name_b = f'{pd_name}.bias' + _set_value(th_name_b, pd_name_b) + + return paddle_model + + +def main(): + paddle.set_device('cpu') + paddle_model = pd_crossvit_18_224() + paddle_model.eval() + print_model_named_params(paddle_model) + print_model_named_buffers(paddle_model) + + print('+++++++++++++++++++++++++++++++++++') + device = torch.device('cpu') + torch_model =crossvit_18_224(pretrained=True) + torch_model = torch_model.to(device) + torch_model.eval() + print_model_named_params(torch_model) + print_model_named_buffers(torch_model) + + # convert weights + paddle_model = convert(torch_model, paddle_model) + + print("model convert done...") + + # check correctness + x = np.random.randn(2, 3, 224, 224).astype('float32') + x_paddle = paddle.to_tensor(x) + x_torch = torch.Tensor(x).to(device) + + out_torch = torch_model(x_torch) + print('torch infer done...') + out_paddle = paddle_model(x_paddle) + + out_torch = out_torch.data.cpu().numpy() + out_paddle = out_paddle.cpu().numpy() + + print(out_torch.shape, out_paddle.shape) + print(out_torch[0, 0:100]) + print('========================================================') + print(out_paddle[0, 0:100]) + assert np.allclose(out_torch, out_paddle, atol=1e-3) + + # save weights for paddle model + model_path = os.path.join('./pd_crossvit_18_224.pdparams') + paddle.save(paddle_model.state_dict(), model_path) + print('all done') + + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/image_classification/CrossViT/port_weights/load_pytorch_weights_18_dagger_224.py b/image_classification/CrossViT/port_weights/load_pytorch_weights_18_dagger_224.py new file mode 100644 index 00000000..02bb47c2 --- /dev/null +++ b/image_classification/CrossViT/port_weights/load_pytorch_weights_18_dagger_224.py @@ -0,0 +1,138 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
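+ +"""Port the torch crossvit_18_dagger_224 pretrained weights into the Paddle model, check the outputs match, and save ./pd_crossvit_18_dagger_224.pdparams."""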
+ + +from image_classification.CrossViT.models.crossvit import * +import os +import torch +import numpy as np +from image_classification.CrossViT.crossvit import * + +def print_model_named_params(model): + print('----------------------------------') + for name, param in model.named_parameters(): + print(name, param.shape) + print('----------------------------------') + + +def print_model_named_buffers(model): + print('----------------------------------') + for name, param in model.named_buffers(): + print(name, param.shape) + print('----------------------------------') + + +def perpare_mapping(paddle_model,torch_model): + mapping=[] + for (name, param),(name2, param2) in zip(paddle_model.named_parameters(),torch_model.named_parameters()): + layer_mapping = [ + (name2, name) + ] + mapping.extend(layer_mapping) + return mapping + +def convert(torch_model, paddle_model): + def _set_value(th_name, pd_name, transpose=True): + th_shape = th_params[th_name].shape + pd_shape = tuple(pd_params[pd_name].shape) # paddle shape default type is list + # assert th_shape == pd_shape, f'{th_shape} != {pd_shape}' + print(f'**SET** {th_name} {th_shape} **TO** {pd_name} {pd_shape}') + if isinstance(th_params[th_name], torch.nn.parameter.Parameter): + value = th_params[th_name].data.numpy() + else: + value = th_params[th_name].numpy() + + if len(value.shape) == 2 and transpose: + value = value.transpose((1, 0)) + pd_params[pd_name].set_value(value) + + # 1. get paddle and torch model parameters + pd_params = {} + th_params = {} + for name, param in paddle_model.named_parameters(): + pd_params[name] = param + for name, param in torch_model.named_parameters(): + th_params[name] = param + + for name, param in paddle_model.named_buffers(): + pd_params[name] = param + for name, param in torch_model.named_buffers(): + th_params[name] = param + + # 2. get name mapping pairs + mapping = perpare_mapping(paddle_model,torch_model) + + # 3. 
set torch param values to paddle params: may needs transpose on weights + for th_name, pd_name in mapping: + if th_name in th_params.keys(): # nn.Parameters + _set_value(th_name, pd_name) + else: # weight & bias + th_name_w = f'{th_name}.weight' + pd_name_w = f'{pd_name}.weight' + _set_value(th_name_w, pd_name_w) + + if f'{th_name}.bias' in th_params.keys(): + th_name_b = f'{th_name}.bias' + pd_name_b = f'{pd_name}.bias' + _set_value(th_name_b, pd_name_b) + + return paddle_model + + +def main(): + paddle.set_device('cpu') + paddle_model = pd_crossvit_18_dagger_224() + paddle_model.eval() + print_model_named_params(paddle_model) + print_model_named_buffers(paddle_model) + + print('+++++++++++++++++++++++++++++++++++') + device = torch.device('cpu') + torch_model =crossvit_18_dagger_224(pretrained=True) + torch_model = torch_model.to(device) + torch_model.eval() + print_model_named_params(torch_model) + print_model_named_buffers(torch_model) + + # convert weights + paddle_model = convert(torch_model, paddle_model) + + print("model convert done...") + + # check correctness + x = np.random.randn(2, 3, 224, 224).astype('float32') + x_paddle = paddle.to_tensor(x) + x_torch = torch.Tensor(x).to(device) + + out_torch = torch_model(x_torch) + print('torch infer done...') + out_paddle = paddle_model(x_paddle) + + out_torch = out_torch.data.cpu().numpy() + out_paddle = out_paddle.cpu().numpy() + + print(out_torch.shape, out_paddle.shape) + print(out_torch[0, 0:100]) + print('========================================================') + print(out_paddle[0, 0:100]) + assert np.allclose(out_torch, out_paddle, atol=1e-3) + + # save weights for paddle model + model_path = os.path.join('./pd_crossvit_18_dagger_224.pdparams') + paddle.save(paddle_model.state_dict(), model_path) + print('all done') + + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/image_classification/CrossViT/port_weights/load_pytorch_weights_9_224.py b/image_classification/CrossViT/port_weights/load_pytorch_weights_9_224.py new file mode 100644 index 00000000..4bf5c931 --- /dev/null +++ b/image_classification/CrossViT/port_weights/load_pytorch_weights_9_224.py @@ -0,0 +1,138 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
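+ +"""Port the torch crossvit_9_224 pretrained weights into the Paddle model, check the outputs match, and save ./pd_crossvit_9_224.pdparams."""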
+ + +from image_classification.CrossViT.models.crossvit import * +import os +import torch +import numpy as np +from image_classification.CrossViT.crossvit import * + +def print_model_named_params(model): + print('----------------------------------') + for name, param in model.named_parameters(): + print(name, param.shape) + print('----------------------------------') + + +def print_model_named_buffers(model): + print('----------------------------------') + for name, param in model.named_buffers(): + print(name, param.shape) + print('----------------------------------') + + +def perpare_mapping(paddle_model,torch_model): + mapping=[] + for (name, param),(name2, param2) in zip(paddle_model.named_parameters(),torch_model.named_parameters()): + layer_mapping = [ + (name2, name) + ] + mapping.extend(layer_mapping) + return mapping + +def convert(torch_model, paddle_model): + def _set_value(th_name, pd_name, transpose=True): + th_shape = th_params[th_name].shape + pd_shape = tuple(pd_params[pd_name].shape) # paddle shape default type is list + # assert th_shape == pd_shape, f'{th_shape} != {pd_shape}' + print(f'**SET** {th_name} {th_shape} **TO** {pd_name} {pd_shape}') + if isinstance(th_params[th_name], torch.nn.parameter.Parameter): + value = th_params[th_name].data.numpy() + else: + value = th_params[th_name].numpy() + + if len(value.shape) == 2 and transpose: + value = value.transpose((1, 0)) + pd_params[pd_name].set_value(value) + + # 1. get paddle and torch model parameters + pd_params = {} + th_params = {} + for name, param in paddle_model.named_parameters(): + pd_params[name] = param + for name, param in torch_model.named_parameters(): + th_params[name] = param + + for name, param in paddle_model.named_buffers(): + pd_params[name] = param + for name, param in torch_model.named_buffers(): + th_params[name] = param + + # 2. get name mapping pairs + mapping = perpare_mapping(paddle_model,torch_model) + + # 3. 
set torch param values to paddle params: may needs transpose on weights + for th_name, pd_name in mapping: + if th_name in th_params.keys(): # nn.Parameters + _set_value(th_name, pd_name) + else: # weight & bias + th_name_w = f'{th_name}.weight' + pd_name_w = f'{pd_name}.weight' + _set_value(th_name_w, pd_name_w) + + if f'{th_name}.bias' in th_params.keys(): + th_name_b = f'{th_name}.bias' + pd_name_b = f'{pd_name}.bias' + _set_value(th_name_b, pd_name_b) + + return paddle_model + + +def main(): + paddle.set_device('cpu') + paddle_model = pd_crossvit_9_224() + paddle_model.eval() + print_model_named_params(paddle_model) + print_model_named_buffers(paddle_model) + + print('+++++++++++++++++++++++++++++++++++') + device = torch.device('cpu') + torch_model =crossvit_9_224(pretrained=True) + torch_model = torch_model.to(device) + torch_model.eval() + print_model_named_params(torch_model) + print_model_named_buffers(torch_model) + + # convert weights + paddle_model = convert(torch_model, paddle_model) + + print("model convert done...") + + # check correctness + x = np.random.randn(2, 3, 224, 224).astype('float32') + x_paddle = paddle.to_tensor(x) + x_torch = torch.Tensor(x).to(device) + + out_torch = torch_model(x_torch) + print('torch infer done...') + out_paddle = paddle_model(x_paddle) + + out_torch = out_torch.data.cpu().numpy() + out_paddle = out_paddle.cpu().numpy() + + print(out_torch.shape, out_paddle.shape) + print(out_torch[0, 0:100]) + print('========================================================') + print(out_paddle[0, 0:100]) + assert np.allclose(out_torch, out_paddle, atol=1e-2) + + # save weights for paddle model + model_path = os.path.join('./pd_crossvit_9_224.pdparams') + paddle.save(paddle_model.state_dict(), model_path) + print('all done') + + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/image_classification/CrossViT/port_weights/load_pytorch_weights_9_dagger_224.py b/image_classification/CrossViT/port_weights/load_pytorch_weights_9_dagger_224.py new file mode 100644 index 00000000..dcdbfe75 --- /dev/null +++ b/image_classification/CrossViT/port_weights/load_pytorch_weights_9_dagger_224.py @@ -0,0 +1,138 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
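+ +"""Port the torch crossvit_9_dagger_224 pretrained weights into the Paddle model, check the outputs match, and save ./pd_crossvit_9_dagger_224.pdparams."""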
+ + +from image_classification.CrossViT.models.crossvit import * +import os +import torch +import numpy as np +from image_classification.CrossViT.crossvit import * + +def print_model_named_params(model): + print('----------------------------------') + for name, param in model.named_parameters(): + print(name, param.shape) + print('----------------------------------') + + +def print_model_named_buffers(model): + print('----------------------------------') + for name, param in model.named_buffers(): + print(name, param.shape) + print('----------------------------------') + + +def perpare_mapping(paddle_model,torch_model): + mapping=[] + for (name, param),(name2, param2) in zip(paddle_model.named_parameters(),torch_model.named_parameters()): + layer_mapping = [ + (name2, name) + ] + mapping.extend(layer_mapping) + return mapping + +def convert(torch_model, paddle_model): + def _set_value(th_name, pd_name, transpose=True): + th_shape = th_params[th_name].shape + pd_shape = tuple(pd_params[pd_name].shape) # paddle shape default type is list + # assert th_shape == pd_shape, f'{th_shape} != {pd_shape}' + print(f'**SET** {th_name} {th_shape} **TO** {pd_name} {pd_shape}') + if isinstance(th_params[th_name], torch.nn.parameter.Parameter): + value = th_params[th_name].data.numpy() + else: + value = th_params[th_name].numpy() + + if len(value.shape) == 2 and transpose: + value = value.transpose((1, 0)) + pd_params[pd_name].set_value(value) + + # 1. get paddle and torch model parameters + pd_params = {} + th_params = {} + for name, param in paddle_model.named_parameters(): + pd_params[name] = param + for name, param in torch_model.named_parameters(): + th_params[name] = param + + for name, param in paddle_model.named_buffers(): + pd_params[name] = param + for name, param in torch_model.named_buffers(): + th_params[name] = param + + # 2. get name mapping pairs + mapping = perpare_mapping(paddle_model,torch_model) + + # 3. 
set torch param values to paddle params: may needs transpose on weights + for th_name, pd_name in mapping: + if th_name in th_params.keys(): # nn.Parameters + _set_value(th_name, pd_name) + else: # weight & bias + th_name_w = f'{th_name}.weight' + pd_name_w = f'{pd_name}.weight' + _set_value(th_name_w, pd_name_w) + + if f'{th_name}.bias' in th_params.keys(): + th_name_b = f'{th_name}.bias' + pd_name_b = f'{pd_name}.bias' + _set_value(th_name_b, pd_name_b) + + return paddle_model + + +def main(): + paddle.set_device('cpu') + paddle_model = pd_crossvit_9_dagger_224() + paddle_model.eval() + print_model_named_params(paddle_model) + print_model_named_buffers(paddle_model) + + print('+++++++++++++++++++++++++++++++++++') + device = torch.device('cpu') + torch_model =crossvit_9_dagger_224(pretrained=True) + torch_model = torch_model.to(device) + torch_model.eval() + print_model_named_params(torch_model) + print_model_named_buffers(torch_model) + + # convert weights + paddle_model = convert(torch_model, paddle_model) + + print("model convert done...") + + # check correctness + x = np.random.randn(2, 3, 224, 224).astype('float32') + x_paddle = paddle.to_tensor(x) + x_torch = torch.Tensor(x).to(device) + + out_torch = torch_model(x_torch) + print('torch infer done...') + out_paddle = paddle_model(x_paddle) + + out_torch = out_torch.data.cpu().numpy() + out_paddle = out_paddle.cpu().numpy() + + print(out_torch.shape, out_paddle.shape) + print(out_torch[0, 0:100]) + print('========================================================') + print(out_paddle[0, 0:100]) + assert np.allclose(out_torch, out_paddle, atol=1e-2) + + # save weights for paddle model + model_path = os.path.join('./pd_crossvit_9_dagger_224.pdparams') + paddle.save(paddle_model.state_dict(), model_path) + print('all done') + + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/image_classification/CrossViT/port_weights/load_pytorch_weights_multi_test.py b/image_classification/CrossViT/port_weights/load_pytorch_weights_multi_test.py new file mode 100644 index 00000000..c6a4a385 --- /dev/null +++ b/image_classification/CrossViT/port_weights/load_pytorch_weights_multi_test.py @@ -0,0 +1,110 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
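+ +"""Quick structural check for the Paddle CrossViT variants: pd_gen_model_test() builds one variant (the others are left commented out) and prints its named parameters and buffers; the torch-to-paddle conversion helpers are kept for reference, but only pd_gen_model_test() runs under __main__."""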
+ +from image_classification.CrossViT.models.crossvit import * +from image_classification.CrossViT.crossvit import * + + +def print_model_named_params(model): + print('----------------------------------') + for name, param in model.named_parameters(): + print(name, param.shape) + print('----------------------------------') + + +def print_model_named_buffers(model): + print('----------------------------------') + for name, param in model.named_buffers(): + print(name, param.shape) + print('----------------------------------') + + +def perpare_mapping(paddle_model, torch_model): + mapping = [] + for (name, param), (name2, param2) in zip(paddle_model.named_parameters(), torch_model.named_parameters()): + layer_mapping = [ + (name2, name) + ] + mapping.extend(layer_mapping) + return mapping + + +def convert(torch_model, paddle_model): + def _set_value(th_name, pd_name, transpose=True): + th_shape = th_params[th_name].shape + pd_shape = tuple(pd_params[pd_name].shape) # paddle shape default type is list + # assert th_shape == pd_shape, f'{th_shape} != {pd_shape}' + print(f'**SET** {th_name} {th_shape} **TO** {pd_name} {pd_shape}') + if isinstance(th_params[th_name], torch.nn.parameter.Parameter): + value = th_params[th_name].data.numpy() + else: + value = th_params[th_name].numpy() + + if len(value.shape) == 2 and transpose: + value = value.transpose((1, 0)) + pd_params[pd_name].set_value(value) + + # 1. get paddle and torch model parameters + pd_params = {} + th_params = {} + for name, param in paddle_model.named_parameters(): + pd_params[name] = param + for name, param in torch_model.named_parameters(): + th_params[name] = param + + for name, param in paddle_model.named_buffers(): + pd_params[name] = param + for name, param in torch_model.named_buffers(): + th_params[name] = param + + # 2. get name mapping pairs + mapping = perpare_mapping(paddle_model, torch_model) + + # 3. set torch param values to paddle params: may needs transpose on weights + for th_name, pd_name in mapping: + if th_name in th_params.keys(): # nn.Parameters + _set_value(th_name, pd_name) + else: # weight & bias + th_name_w = f'{th_name}.weight' + pd_name_w = f'{pd_name}.weight' + _set_value(th_name_w, pd_name_w) + + if f'{th_name}.bias' in th_params.keys(): + th_name_b = f'{th_name}.bias' + pd_name_b = f'{pd_name}.bias' + _set_value(th_name_b, pd_name_b) + + return paddle_model + + +def pd_gen_model_test(): + paddle.set_device('cpu') + # paddle_model = pd_crossvit_tiny_224() + # paddle_model = pd_crossvit_small_224() + # paddle_model = pd_crossvit_9_224() + # paddle_model = pd_crossvit_15_224() + # paddle_model = pd_crossvit_18_224() + # paddle_model = pd_crossvit_9_dagger_224() + # paddle_model = pd_crossvit_15_dagger_224() + # paddle_model = pd_crossvit_15_dagger_384() + # paddle_model = pd_crossvit_18_dagger_224() + paddle_model = pd_crossvit_18_dagger_384() + paddle_model.eval() + print_model_named_params(paddle_model) + print_model_named_buffers(paddle_model) + + +if __name__ == "__main__": + # main() + pd_gen_model_test() diff --git a/image_classification/CrossViT/port_weights/load_pytorch_weights_small_224.py b/image_classification/CrossViT/port_weights/load_pytorch_weights_small_224.py new file mode 100644 index 00000000..99b04ac0 --- /dev/null +++ b/image_classification/CrossViT/port_weights/load_pytorch_weights_small_224.py @@ -0,0 +1,138 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +from image_classification.CrossViT.models.crossvit import * +import os +import torch +import numpy as np +from image_classification.CrossViT.crossvit import * + +def print_model_named_params(model): + print('----------------------------------') + for name, param in model.named_parameters(): + print(name, param.shape) + print('----------------------------------') + + +def print_model_named_buffers(model): + print('----------------------------------') + for name, param in model.named_buffers(): + print(name, param.shape) + print('----------------------------------') + + +def perpare_mapping(paddle_model,torch_model): + mapping=[] + for (name, param),(name2, param2) in zip(paddle_model.named_parameters(),torch_model.named_parameters()): + layer_mapping = [ + (name2, name) + ] + mapping.extend(layer_mapping) + return mapping + +def convert(torch_model, paddle_model): + def _set_value(th_name, pd_name, transpose=True): + th_shape = th_params[th_name].shape + pd_shape = tuple(pd_params[pd_name].shape) # paddle shape default type is list + # assert th_shape == pd_shape, f'{th_shape} != {pd_shape}' + print(f'**SET** {th_name} {th_shape} **TO** {pd_name} {pd_shape}') + if isinstance(th_params[th_name], torch.nn.parameter.Parameter): + value = th_params[th_name].data.numpy() + else: + value = th_params[th_name].numpy() + + if len(value.shape) == 2 and transpose: + value = value.transpose((1, 0)) + pd_params[pd_name].set_value(value) + + # 1. get paddle and torch model parameters + pd_params = {} + th_params = {} + for name, param in paddle_model.named_parameters(): + pd_params[name] = param + for name, param in torch_model.named_parameters(): + th_params[name] = param + + for name, param in paddle_model.named_buffers(): + pd_params[name] = param + for name, param in torch_model.named_buffers(): + th_params[name] = param + + # 2. get name mapping pairs + mapping = perpare_mapping(paddle_model,torch_model) + + # 3. 
set torch param values to paddle params: may needs transpose on weights + for th_name, pd_name in mapping: + if th_name in th_params.keys(): # nn.Parameters + _set_value(th_name, pd_name) + else: # weight & bias + th_name_w = f'{th_name}.weight' + pd_name_w = f'{pd_name}.weight' + _set_value(th_name_w, pd_name_w) + + if f'{th_name}.bias' in th_params.keys(): + th_name_b = f'{th_name}.bias' + pd_name_b = f'{pd_name}.bias' + _set_value(th_name_b, pd_name_b) + + return paddle_model + + +def main(): + paddle.set_device('cpu') + paddle_model = pd_crossvit_small_224() + paddle_model.eval() + print_model_named_params(paddle_model) + print_model_named_buffers(paddle_model) + + print('+++++++++++++++++++++++++++++++++++') + device = torch.device('cpu') + torch_model =crossvit_small_224(pretrained=True) + torch_model = torch_model.to(device) + torch_model.eval() + print_model_named_params(torch_model) + print_model_named_buffers(torch_model) + + # convert weights + paddle_model = convert(torch_model, paddle_model) + + print("model convert done...") + + # check correctness + x = np.random.randn(2, 3, 224, 224).astype('float32') + x_paddle = paddle.to_tensor(x) + x_torch = torch.Tensor(x).to(device) + + out_torch = torch_model(x_torch) + print('torch infer done...') + out_paddle = paddle_model(x_paddle) + + out_torch = out_torch.data.cpu().numpy() + out_paddle = out_paddle.cpu().numpy() + + print(out_torch.shape, out_paddle.shape) + print(out_torch[0, 0:100]) + print('========================================================') + print(out_paddle[0, 0:100]) + assert np.allclose(out_torch, out_paddle, atol=1e-3) + + # save weights for paddle model + model_path = os.path.join('./pd_crossvit_small_224.pdparams') + paddle.save(paddle_model.state_dict(), model_path) + print('all done') + + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/image_classification/CrossViT/port_weights/load_pytorch_weights_tiny_224.py b/image_classification/CrossViT/port_weights/load_pytorch_weights_tiny_224.py new file mode 100644 index 00000000..8fc706eb --- /dev/null +++ b/image_classification/CrossViT/port_weights/load_pytorch_weights_tiny_224.py @@ -0,0 +1,138 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
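+ +"""Port the torch crossvit_tiny_224 pretrained weights into the Paddle model, check the outputs match, and save ./pd_crossvit_tiny_224.pdparams."""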
+ + +from image_classification.CrossViT.models.crossvit import * +import os +import torch +import numpy as np +from image_classification.CrossViT.crossvit import * + +def print_model_named_params(model): + print('----------------------------------') + for name, param in model.named_parameters(): + print(name, param.shape) + print('----------------------------------') + + +def print_model_named_buffers(model): + print('----------------------------------') + for name, param in model.named_buffers(): + print(name, param.shape) + print('----------------------------------') + + +def perpare_mapping(paddle_model,torch_model): + mapping=[] + for (name, param),(name2, param2) in zip(paddle_model.named_parameters(),torch_model.named_parameters()): + layer_mapping = [ + (name2, name) + ] + mapping.extend(layer_mapping) + return mapping + +def convert(torch_model, paddle_model): + def _set_value(th_name, pd_name, transpose=True): + th_shape = th_params[th_name].shape + pd_shape = tuple(pd_params[pd_name].shape) # paddle shape default type is list + # assert th_shape == pd_shape, f'{th_shape} != {pd_shape}' + print(f'**SET** {th_name} {th_shape} **TO** {pd_name} {pd_shape}') + if isinstance(th_params[th_name], torch.nn.parameter.Parameter): + value = th_params[th_name].data.numpy() + else: + value = th_params[th_name].numpy() + + if len(value.shape) == 2 and transpose: + value = value.transpose((1, 0)) + pd_params[pd_name].set_value(value) + + # 1. get paddle and torch model parameters + pd_params = {} + th_params = {} + for name, param in paddle_model.named_parameters(): + pd_params[name] = param + for name, param in torch_model.named_parameters(): + th_params[name] = param + + for name, param in paddle_model.named_buffers(): + pd_params[name] = param + for name, param in torch_model.named_buffers(): + th_params[name] = param + + # 2. get name mapping pairs + mapping = perpare_mapping(paddle_model,torch_model) + + # 3. 
set torch param values to paddle params: may needs transpose on weights + for th_name, pd_name in mapping: + if th_name in th_params.keys(): # nn.Parameters + _set_value(th_name, pd_name) + else: # weight & bias + th_name_w = f'{th_name}.weight' + pd_name_w = f'{pd_name}.weight' + _set_value(th_name_w, pd_name_w) + + if f'{th_name}.bias' in th_params.keys(): + th_name_b = f'{th_name}.bias' + pd_name_b = f'{pd_name}.bias' + _set_value(th_name_b, pd_name_b) + + return paddle_model + + +def main(): + paddle.set_device('cpu') + paddle_model = pd_crossvit_tiny_224() + paddle_model.eval() + print_model_named_params(paddle_model) + print_model_named_buffers(paddle_model) + + print('+++++++++++++++++++++++++++++++++++') + device = torch.device('cpu') + torch_model =crossvit_tiny_224(pretrained=True) + torch_model = torch_model.to(device) + torch_model.eval() + print_model_named_params(torch_model) + print_model_named_buffers(torch_model) + + # convert weights + paddle_model = convert(torch_model, paddle_model) + + print("model convert done...") + + # check correctness + x = np.random.randn(2, 3, 224, 224).astype('float32') + x_paddle = paddle.to_tensor(x) + x_torch = torch.Tensor(x).to(device) + + out_torch = torch_model(x_torch) + print('torch infer done...') + out_paddle = paddle_model(x_paddle) + + out_torch = out_torch.data.cpu().numpy() + out_paddle = out_paddle.cpu().numpy() + + print(out_torch.shape, out_paddle.shape) + print(out_torch[0, 0:100]) + print('========================================================') + print(out_paddle[0, 0:100]) + assert np.allclose(out_torch, out_paddle, atol=1e-2) + + # save weights for paddle model + model_path = os.path.join('./pd_crossvit_tiny_224.pdparams') + paddle.save(paddle_model.state_dict(), model_path) + print('all done') + + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/image_classification/CrossViT/random_erasing.py b/image_classification/CrossViT/random_erasing.py new file mode 100644 index 00000000..31eea465 --- /dev/null +++ b/image_classification/CrossViT/random_erasing.py @@ -0,0 +1,118 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
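+ +# Typical use: apply to a normalized CHW image tensor during training, e.g. RandomErasing(prob=0.25, mode='pixel')(image_tensor).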
+ +"""Random Erasing for image tensor""" + +import random +import math +import paddle + + +def _get_pixels(per_pixel, rand_color, patch_size, dtype="float32"): + if per_pixel: + return paddle.normal(shape=patch_size).astype(dtype) + if rand_color: + return paddle.normal(shape=(patch_size[0], 1, 1)).astype(dtype) + return paddle.zeros((patch_size[0], 1, 1)).astype(dtype) + + +class RandomErasing(object): + """ + Args: + prob: probability of performing random erasing + min_area: Minimum percentage of erased area wrt input image area + max_area: Maximum percentage of erased area wrt input image area + min_aspect: Minimum aspect ratio of earsed area + max_aspect: Maximum aspect ratio of earsed area + mode: pixel color mode, in ['const', 'rand', 'pixel'] + 'const' - erase block is constant valued 0 for all channels + 'rand' - erase block is valued random color (same per-channel) + 'pixel' - erase block is vauled random color per pixel + min_count: Minimum # of ereasing blocks per image. + max_count: Maximum # of ereasing blocks per image. Area per box is scaled by count + per-image count is randomly chosen between min_count to max_count + """ + def __init__(self, prob=0.5, min_area=0.02, max_area=1/3, min_aspect=0.3, max_aspect=None, + mode='const', min_count=1, max_count=None, num_splits=0): + self.prob = prob + self.min_area = min_area + self.max_area = max_area + max_aspect = max_aspect or 1 / min_aspect + self.log_aspect_ratio = (math.log(min_aspect), math.log(max_aspect)) + self.min_count = min_count + self.max_count = max_count or min_count + self.num_splits = num_splits + mode = mode.lower() + self.rand_color = False + self.per_pixel = False + if mode == "rand": + self.rand_color = True + elif mode == "pixel": + self.per_pixel = True + else: + assert not mode or mode == "const" + + def _erase(self, img, chan, img_h, img_w, dtype): + if random.random() > self.prob: + return + area = img_h * img_w + count = self.min_count if self.min_count == self.max_count else \ + random.randint(self.min_count, self.max_count) + for _ in range(count): + for attempt in range(10): + target_area = random.uniform(self.min_area, self.max_area) * area / count + aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio)) + h = int(round(math.sqrt(target_area * aspect_ratio))) + w = int(round(math.sqrt(target_area / aspect_ratio))) + if w < img_w and h < img_h: + top = random.randint(0, img_h - h) + left = random.randint(0, img_w - w) + img[:, top:top+h, left:left+w] = _get_pixels( + self.per_pixel, self.rand_color, (chan, h, w), + dtype=dtype) + break + + def __call__(self, input): + if len(input.shape) == 3: + self._erase(input, *input.shape, input.dtype) + else: + batch_size, chan, img_h, img_w = input.shape + batch_start = batch_size // self.num_splits if self.num_splits > 1 else 0 + for i in range(batch_start, batch_size): + self._erase(input[i], chan, img_h, img_w, input.dtype) + return input + + + +#def main(): +# re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='rand') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='const') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='pixel') +# import PIL.Image as Image +# import numpy as np +# paddle.set_device('cpu') +# img = paddle.to_tensor(np.asarray(Image.open('./lenna.png'))).astype('float32') +# img = img / 255.0 +# img = paddle.transpose(img, [2, 0, 1]) +# new_img = re(img) +# new_img = new_img * 255.0 +# new_img = paddle.transpose(new_img, [1, 2, 0]) +# new_img = new_img.cpu().numpy() +# new_img = 
Image.fromarray(new_img.astype('uint8')) +# new_img.save('./res.png') +# +# +# +#if __name__ == "__main__": +# main() diff --git a/image_classification/CrossViT/run_eval_15_224.sh b/image_classification/CrossViT/run_eval_15_224.sh new file mode 100644 index 00000000..52b326c4 --- /dev/null +++ b/image_classification/CrossViT/run_eval_15_224.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ +-cfg='./configs/crossvit_15_224.yaml' \ +-dataset='imagenet2012' \ +-batch_size=128 \ +-data_path='/dataset/imagenet' \ +-eval \ +-pretrained='./crossvit_15_224' diff --git a/image_classification/CrossViT/run_eval_15_dagger_224.sh b/image_classification/CrossViT/run_eval_15_dagger_224.sh new file mode 100644 index 00000000..f3a132a4 --- /dev/null +++ b/image_classification/CrossViT/run_eval_15_dagger_224.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ +-cfg='./configs/crossvit_15_dagger_224.yaml' \ +-dataset='imagenet2012' \ +-batch_size=128 \ +-data_path='/dataset/imagenet' \ +-eval \ +-pretrained='./crossvit_15_dagger_224' diff --git a/image_classification/CrossViT/run_eval_15_dagger_384.sh b/image_classification/CrossViT/run_eval_15_dagger_384.sh new file mode 100644 index 00000000..d8573d8b --- /dev/null +++ b/image_classification/CrossViT/run_eval_15_dagger_384.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ +-cfg='./configs/crossvit_15_dagger_384.yaml' \ +-dataset='imagenet2012' \ +-batch_size=128 \ +-data_path='/dataset/imagenet' \ +-eval \ +-pretrained='./crossvit_15_dagger_384' diff --git a/image_classification/CrossViT/run_eval_18_224.sh b/image_classification/CrossViT/run_eval_18_224.sh new file mode 100644 index 00000000..bc78b730 --- /dev/null +++ b/image_classification/CrossViT/run_eval_18_224.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ +-cfg='./configs/crossvit_18_224.yaml' \ +-dataset='imagenet2012' \ +-batch_size=128 \ +-data_path='/dataset/imagenet' \ +-eval \ +-pretrained='./crossvit_18_224' diff --git a/image_classification/CrossViT/run_eval_18_dagger_224.sh b/image_classification/CrossViT/run_eval_18_dagger_224.sh new file mode 100644 index 00000000..ac43e893 --- /dev/null +++ b/image_classification/CrossViT/run_eval_18_dagger_224.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ +-cfg='./configs/crossvit_18_dagger_224.yaml' \ +-dataset='imagenet2012' \ +-batch_size=128 \ +-data_path='/dataset/imagenet' \ +-eval \ +-pretrained='./crossvit_18_dagger_224' diff --git a/image_classification/CrossViT/run_eval_18_dagger_384.sh b/image_classification/CrossViT/run_eval_18_dagger_384.sh new file mode 100644 index 00000000..bb08ade2 --- /dev/null +++ b/image_classification/CrossViT/run_eval_18_dagger_384.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ +-cfg='./configs/crossvit_18_dagger_384.yaml' \ +-dataset='imagenet2012' \ +-batch_size=128 \ +-data_path='/dataset/imagenet' \ +-eval \ +-pretrained='./crossvit_18_dagger_384' diff --git a/image_classification/CrossViT/run_eval_9_224.sh b/image_classification/CrossViT/run_eval_9_224.sh new file mode 100644 index 00000000..5bdcdb5e --- /dev/null +++ b/image_classification/CrossViT/run_eval_9_224.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ +-cfg='./configs/crossvit_9_224.yaml' \ +-dataset='imagenet2012' \ +-batch_size=128 \ +-data_path='/dataset/imagenet' \ +-eval \ +-pretrained='./crossvit_9_224' diff --git a/image_classification/CrossViT/run_eval_9_dagger_224.sh 
b/image_classification/CrossViT/run_eval_9_dagger_224.sh new file mode 100644 index 00000000..245d8537 --- /dev/null +++ b/image_classification/CrossViT/run_eval_9_dagger_224.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ +-cfg='./configs/crossvit_9_dagger_224.yaml' \ +-dataset='imagenet2012' \ +-batch_size=128 \ +-data_path='/dataset/imagenet' \ +-eval \ +-pretrained='./crossvit_9_dagger_224' diff --git a/image_classification/CrossViT/run_eval_base_224.sh b/image_classification/CrossViT/run_eval_base_224.sh new file mode 100644 index 00000000..9306c8d5 --- /dev/null +++ b/image_classification/CrossViT/run_eval_base_224.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ +-cfg='./configs/crossvit_base_224.yaml' \ +-dataset='imagenet2012' \ +-batch_size=128 \ +-data_path='/dataset/imagenet' \ +-eval \ +-pretrained='./crossvit_base_224' diff --git a/image_classification/CrossViT/run_eval_small_224.sh b/image_classification/CrossViT/run_eval_small_224.sh new file mode 100644 index 00000000..159bde39 --- /dev/null +++ b/image_classification/CrossViT/run_eval_small_224.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ +-cfg='./configs/crossvit_small_224.yaml' \ +-dataset='imagenet2012' \ +-batch_size=128 \ +-data_path='/dataset/imagenet' \ +-eval \ +-pretrained='./crossvit_small_224' diff --git a/image_classification/CrossViT/run_eval_tiny_224.sh b/image_classification/CrossViT/run_eval_tiny_224.sh new file mode 100644 index 00000000..c4f211c3 --- /dev/null +++ b/image_classification/CrossViT/run_eval_tiny_224.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ +-cfg='./configs/crossvit_tiny_224.yaml' \ +-dataset='imagenet2012' \ +-batch_size=128 \ +-data_path='/dataset/imagenet' \ +-eval \ +-pretrained='./crossvit_tiny_224' diff --git a/image_classification/CrossViT/run_train_multi_tiny_224.sh b/image_classification/CrossViT/run_train_multi_tiny_224.sh new file mode 100644 index 00000000..c97a88b7 --- /dev/null +++ b/image_classification/CrossViT/run_train_multi_tiny_224.sh @@ -0,0 +1,7 @@ +CUDA_VISIBLE_DEVICES=4,5,6,7 \ +python main_multi_gpu.py \ +-cfg='./configs/crossvit_tiny_224.yaml' \ +-dataset='imagenet2012' \ +-batch_size=8 \ +-data_path='/dataset/imagenet' \ +#-amp diff --git a/image_classification/CrossViT/run_train_tiny_224.sh b/image_classification/CrossViT/run_train_tiny_224.sh new file mode 100644 index 00000000..08ffef5c --- /dev/null +++ b/image_classification/CrossViT/run_train_tiny_224.sh @@ -0,0 +1,7 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ +-cfg='./configs/crossvit_tiny_224.yaml' \ +-dataset='imagenet2012' \ +-batch_size=8 \ +-data_path='/dataset/imagenet' \ +-amp diff --git a/image_classification/CrossViT/t2t.py b/image_classification/CrossViT/t2t.py new file mode 100755 index 00000000..516abfa5 --- /dev/null +++ b/image_classification/CrossViT/t2t.py @@ -0,0 +1,335 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +import math +import numpy as np +import paddle +import paddle.nn as nn +from crossvit_utils import DropPath, Identity, to_2tuple + +def get_sinusoid_encoding(n_position, d_hid): + ''' Sinusoid position encoding table ''' + + def get_position_angle_vec(position): + return [position / np.power(10000, 2 * (hid_j // 2) / d_hid) for hid_j in range(d_hid)] + + sinusoid_table = np.array([get_position_angle_vec(pos_i) for pos_i in range(n_position)]) + sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2]) # dim 2i + sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2]) # dim 2i+1 + + return paddle.to_tensor(sinusoid_table).unsqueeze(0) + + +class Token_performer(nn.Layer): + def __init__(self, dim, in_dim, head_cnt=1, kernel_ratio=0.5, dp1=0.1, dp2=0.1): + # def __init__(self, dim, in_dim, head_cnt=1, kernel_ratio=0.5, dp1=0.0, dp2=0.0): + super().__init__() + self.emb = in_dim * head_cnt # we use 1, so it is no need here + w_attr_1, b_attr_1 = self._init_weights() + self.kqv = nn.Linear(dim, 3 * self.emb, weight_attr=w_attr_1, bias_attr=b_attr_1) + self.dp = nn.Dropout(dp1) + w_attr_2, b_attr_2 = self._init_weights() + self.proj = nn.Linear(self.emb, self.emb, weight_attr=w_attr_2, bias_attr=b_attr_2) + self.head_cnt = head_cnt + w_attr_3, b_attr_3 = self._init_weights_norm() + w_attr_4, b_attr_4 = self._init_weights_norm() + self.norm1 = nn.LayerNorm(dim, weight_attr=w_attr_3, bias_attr=b_attr_3) + self.norm2 = nn.LayerNorm(self.emb, weight_attr=w_attr_4, bias_attr=b_attr_4) + self.epsilon = 1e-8 # for stable in division + + w_attr_5, b_attr_5 = self._init_weights() + w_attr_6, b_attr_6 = self._init_weights() + self.mlp = nn.Sequential( + nn.Linear(self.emb, 1 * self.emb, weight_attr=w_attr_5, bias_attr=b_attr_5), + nn.GELU(), + nn.Linear(1 * self.emb, self.emb, weight_attr=w_attr_6, bias_attr=b_attr_6), + nn.Dropout(dp2), + ) + + self.m = int(self.emb * kernel_ratio) + self.w = paddle.randn(self.m, self.emb) + # todo wait implement + # self.w = nn.Parameter(nn.init.orthogonal_(self.w) * math.sqrt(self.m), requires_grad=False) + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + + def _init_weights_norm(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(1.0)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + + def prm_exp(self, x): + xd = ((x * x).sum(dim=-1, keepdim=True)).repeat(1, 1, self.m) / 2 + wtx = paddle.matmul(x.float(), self.w, transpose_y=True) + #wtx = paddlenlp.ops.einsum('bti,mi->btm', x.float(), self.w) + + return paddle.exp(wtx - xd) / math.sqrt(self.m) + + def single_attn(self, x): + k, q, v = paddle.split(self.kqv(x), self.emb, axis=-1) + kp, qp = self.prm_exp(k), self.prm_exp(q) + D = paddle.matmul(qp, kp.sum(dim=1)).unsqueeze(dim=2) + #D = paddlenlp.ops.einsum('bti,bi->bt', qp, kp.sum(dim=1)).unsqueeze(dim=2) + kptv = paddle.matmul(v.float(), kp, transpose_x=True) + #kptv = paddlenlp.ops.einsum('bin,bim->bnm', v.float(), kp) # (B, emb, m) + y = paddle.matmul(qp, kptv, transpose_y=True) / (D.repeat(1, 1, self.emb) + self.epsilon) + #y = paddlenlp.ops.einsum('bti,bni->btn', qp, kptv) / (D.repeat(1, 1, self.emb) + self.epsilon) + # skip connection + y = v + self.dp(self.proj(y)) + + return y 
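+ +    # Note: prm_exp() maps queries and keys to positive random features, so single_attn() computes a linear-complexity, Performer-style approximation of softmax attention; self.w is the random projection matrix (orthogonal init is still a todo, see __init__).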
+ + def forward(self, x): + x = self.single_attn(self.norm1(x)) + x = x + self.mlp(self.norm2(x)) + return x + + +class Mlp(nn.Layer): + def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.): + super().__init__() + out_features = out_features or in_features + hidden_features = hidden_features or in_features + w_attr_1, b_attr_1 = self._init_weights() + self.fc1 = nn.Linear(in_features, hidden_features, weight_attr=w_attr_1, bias_attr=b_attr_1) + self.act = act_layer() + w_attr_2, b_attr_2 = self._init_weights() + self.fc2 = nn.Linear(hidden_features, out_features, weight_attr=w_attr_2, bias_attr=b_attr_2) + self.drop = nn.Dropout(drop) + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + + def forward(self, x): + x = self.fc1(x) + x = self.act(x) + x = self.drop(x) + x = self.fc2(x) + x = self.drop(x) + return x + + +class Attention(nn.Layer): + def __init__(self, dim, num_heads=8, in_dim=None, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0.): + super().__init__() + self.num_heads = num_heads + self.in_dim = in_dim + head_dim = dim // num_heads + self.scale = qk_scale or head_dim ** -0.5 + + w_attr_1, b_attr_1 = self._init_weights() + self.qkv = nn.Linear(dim, in_dim * 3, weight_attr=w_attr_1, bias_attr=b_attr_1) + + self.attn_drop = nn.Dropout(attn_drop) + w_attr_2, b_attr_2 = self._init_weights() + self.proj = nn.Linear(in_dim, in_dim, weight_attr=w_attr_2, bias_attr=b_attr_2) + + self.proj_drop = nn.Dropout(proj_drop) + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + + def forward(self, x): + B, N, C = x.shape + + qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.in_dim).permute(2, 0, 3, 1, 4) + q, k, v = qkv[0], qkv[1], qkv[2] + + attn = (q @ k.transpose(-2, -1)) * self.scale + attn = attn.softmax(dim=-1) + attn = self.attn_drop(attn) + + x = (attn @ v).transpose(1, 2).reshape(B, N, self.in_dim) + x = self.proj(x) + x = self.proj_drop(x) + + # skip connection + x = v.squeeze(1) + x + + return x + + +class Token_transformer(nn.Layer): + + def __init__(self, dim, in_dim, num_heads, mlp_ratio=1., qkv_bias=False, qk_scale=None, drop=0., attn_drop=0., + drop_path=0., act_layer=nn.GELU, norm_layer=nn.LayerNorm): + super().__init__() + w_attr_1, b_attr_1 = self._init_weights_norm() + self.norm1 = norm_layer(dim, weight_attr=w_attr_1, bias_attr=b_attr_1) + self.attn = Attention(dim, in_dim=in_dim, num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale, + attn_drop=attn_drop, proj_drop=drop) + self.drop_path = DropPath(drop_path) if drop_path > 0. 
else Identity() + self.norm2 = norm_layer(in_dim) + self.mlp = Mlp(in_features=in_dim, hidden_features=int(in_dim * mlp_ratio), out_features=in_dim, + act_layer=act_layer, drop=drop) + + def _init_weights_norm(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(1.0)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + + def forward(self, x): + x = self.attn(self.norm1(x)) + x = x + self.drop_path(self.mlp(self.norm2(x))) + return x + + +class T2T(nn.Layer): + """ + Tokens-to-Token encoding module + """ + + def __init__(self, img_size=224, patch_size=16, tokens_type='transformer', in_chans=3, embed_dim=768, token_dim=64): + super().__init__() + + if patch_size == 12: + kernel_size = ((7, 4, 2), (3, 3, 1), (3, 1, 1)) + elif patch_size == 16: + kernel_size = ((7, 4, 2), (3, 2, 1), (3, 2, 1)) + else: + raise ValueError(f"Unknown patch size {patch_size}") + + self.soft_split0 = nn.Unfold(kernel_sizes=to_2tuple(kernel_size[0][0]), strides=to_2tuple(kernel_size[0][1]), + paddings=to_2tuple(kernel_size[0][2])) + self.soft_split1 = nn.Unfold(kernel_sizes=to_2tuple(kernel_size[1][0]), strides=to_2tuple(kernel_size[1][1]), + paddings=to_2tuple(kernel_size[1][2])) + self.soft_split2 = nn.Unfold(kernel_sizes=to_2tuple(kernel_size[2][0]), strides=to_2tuple(kernel_size[2][1]), + paddings=to_2tuple(kernel_size[2][2])) + + if tokens_type == 'transformer': + + self.attention1 = Token_transformer(dim=in_chans * (kernel_size[0][0] ** 2), in_dim=token_dim, num_heads=1, + mlp_ratio=1.0) + self.attention2 = Token_transformer(dim=token_dim * (kernel_size[1][0] ** 2), in_dim=token_dim, num_heads=1, + mlp_ratio=1.0) + w_attr_1, b_attr_1 = self._init_weights() + self.project = nn.Linear(token_dim * (kernel_size[2][0] ** 2), + embed_dim, + weight_attr=w_attr_1, + bias_attr=b_attr_1) + + elif tokens_type == 'performer': + self.attention1 = Token_performer(dim=in_chans * (kernel_size[0][0] ** 2), in_dim=token_dim, + kernel_ratio=0.5) + self.attention2 = Token_performer(dim=token_dim * (kernel_size[1][0] ** 2), in_dim=token_dim, + kernel_ratio=0.5) + w_attr_1, b_attr_1 = self._init_weights() + self.project = nn.Linear(token_dim * (kernel_size[2][0] ** 2), + embed_dim, + weight_attr=w_attr_1, + bias_attr=b_attr_1) + + self.num_patches = (img_size // (kernel_size[0][1] * kernel_size[1][1] * kernel_size[2][1])) * (img_size // ( + kernel_size[0][1] * kernel_size[1][1] * kernel_size[2][ + 1])) # there are 3 sfot split, stride are 4,2,2 seperately + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + + def forward(self, x): + # step0: soft split + x = self.soft_split0(x).transpose(1, 2) + + # iteration1: re-structurization/reconstruction + x = self.attention1(x) + B, new_HW, C = x.shape + x = x.transpose(1, 2).reshape(B, C, int(np.sqrt(new_HW)), int(np.sqrt(new_HW))) + # iteration1: soft split + x = self.soft_split1(x).transpose(1, 2) + + # iteration2: re-structurization/reconstruction + x = self.attention2(x) + B, new_HW, C = x.shape + x = x.transpose(1, 2).reshape(B, C, int(np.sqrt(new_HW)), int(np.sqrt(new_HW))) + # iteration2: soft split + x = self.soft_split2(x).transpose(1, 2) + + # final tokens + x = self.project(x) + + return x + + +class SharedT2T(nn.Layer): + """ + Tokens-to-Token encoding module + """ + + def __init__(self, img_size=224, 
patch_size=16, tokens_type='transformer', in_chans=3, embed_dim=768, token_dim=64): + super().__init__() + + if patch_size == 12: + kernel_size = ((7, 4, 2), (3, 3, 1), (3, 1, 1)) + elif patch_size == 16: + kernel_size = ((7, 4, 2), (3, 2, 1), (3, 2, 1)) + else: + raise ValueError(f"Unknown patch size {patch_size}") + + if tokens_type == 'transformer': + # print('adopt transformer encoder for tokens-to-token') + self.soft_split0 = nn.Unfold(kernel_sizes=to_2tuple(kernel_size[0][0]), + strides=to_2tuple(kernel_size[0][1]), paddings=to_2tuple(kernel_size[0][2])) + self.soft_split1 = nn.Unfold(kernel_sizes=to_2tuple(kernel_size[1][0]), + strides=to_2tuple(kernel_size[1][1]), paddings=to_2tuple(kernel_size[1][2])) + self.soft_split2 = nn.Unfold(kernel_sizes=to_2tuple(kernel_size[2][0]), + strides=to_2tuple(kernel_size[2][1]), paddings=to_2tuple(kernel_size[2][2])) + + self.attention1 = Token_transformer(dim=in_chans * (kernel_size[0][0] ** 2), in_dim=token_dim, num_heads=1, + mlp_ratio=1.0) + self.attention2 = Token_transformer(dim=token_dim * (kernel_size[1][0] ** 2), in_dim=token_dim, num_heads=1, + mlp_ratio=1.0) + w_attr_1, b_attr_1 = self._init_weights() + self.project = nn.Linear(token_dim * (kernel_size[2][0] ** 2), + embed_dim, + weight_attr=w_attr_1, + bias_attr=b_attr_1) + + self.num_patches = (img_size // (kernel_size[0][1] * kernel_size[1][1] * kernel_size[2][1])) * (img_size // ( + kernel_size[0][1] * kernel_size[1][1] * kernel_size[2][1])) + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + + def forward(self, x): + # step0: soft split + x = self.soft_split0(x).transpose(1, 2) + + # iteration1: re-structurization/reconstruction + x = self.attention1(x) + B, new_HW, C = x.shape + x = x.transpose(1, 2).reshape(B, C, int(np.sqrt(new_HW)), int(np.sqrt(new_HW))) + # iteration1: soft split + x = self.soft_split1(x).transpose(1, 2) + + # iteration2: re-structurization/reconstruction + x = self.attention2(x) + B, new_HW, C = x.shape + x = x.transpose(1, 2).reshape(B, C, int(np.sqrt(new_HW)), int(np.sqrt(new_HW))) + # iteration2: soft split + x = self.soft_split2(x).transpose(1, 2) + + # final tokens + x = self.project(x) + + return x diff --git a/image_classification/CrossViT/transforms.py b/image_classification/CrossViT/transforms.py new file mode 100644 index 00000000..5a046912 --- /dev/null +++ b/image_classification/CrossViT/transforms.py @@ -0,0 +1,14 @@ +import random +import paddle +import paddle.nn +import paddle.vision.transforms as T + + +class RandomHorizontalFlip(): + def __init__(self, p=0.5): + self.p = p + + def __call__(self, image): + if random.random() < self.p: + return T.hflip(image) + return image diff --git a/image_classification/CrossViT/utils.py b/image_classification/CrossViT/utils.py new file mode 100644 index 00000000..44800527 --- /dev/null +++ b/image_classification/CrossViT/utils.py @@ -0,0 +1,120 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""utils for ViT + +Contains AverageMeter for monitoring, get_exclude_from_decay_fn for training +and WarmupCosineScheduler for training + +""" + +import math +from paddle.optimizer.lr import LRScheduler + + +class AverageMeter(): + """ Meter for monitoring losses""" + def __init__(self): + self.avg = 0 + self.sum = 0 + self.cnt = 0 + self.reset() + + def reset(self): + """reset all values to zeros""" + self.avg = 0 + self.sum = 0 + self.cnt = 0 + + def update(self, val, n=1): + """update avg by val and n, where val is the avg of n values""" + self.sum += val * n + self.cnt += n + self.avg = self.sum / self.cnt + + + +def get_exclude_from_weight_decay_fn(exclude_list=[]): + """ Set params with no weight decay during the training + + For certain params, e.g., positional encoding in ViT, weight decay + may not needed during the learning, this method is used to find + these params. + + Args: + exclude_list: a list of params names which need to exclude + from weight decay. + Returns: + exclude_from_weight_decay_fn: a function returns True if param + will be excluded from weight decay + """ + if len(exclude_list) == 0: + exclude_from_weight_decay_fn = None + else: + def exclude_fn(param): + for name in exclude_list: + if param.endswith(name): + return False + return True + exclude_from_weight_decay_fn = exclude_fn + return exclude_from_weight_decay_fn + + +class WarmupCosineScheduler(LRScheduler): + """Warmup Cosine Scheduler + + First apply linear warmup, then apply cosine decay schedule. + Linearly increase learning rate from "warmup_start_lr" to "start_lr" over "warmup_epochs" + Cosinely decrease learning rate from "start_lr" to "end_lr" over remaining + "total_epochs - warmup_epochs" + + Attributes: + learning_rate: the starting learning rate (without warmup), not used here! + warmup_start_lr: warmup starting learning rate + start_lr: the starting learning rate (without warmup) + end_lr: the ending learning rate after whole loop + warmup_epochs: # of epochs for warmup + total_epochs: # of total epochs (include warmup) + """ + def __init__(self, + learning_rate, + warmup_start_lr, + start_lr, + end_lr, + warmup_epochs, + total_epochs, + cycles=0.5, + last_epoch=-1, + verbose=False): + """init WarmupCosineScheduler """ + self.warmup_epochs = warmup_epochs + self.total_epochs = total_epochs + self.warmup_start_lr = warmup_start_lr + self.start_lr = start_lr + self.end_lr = end_lr + self.cycles = cycles + super(WarmupCosineScheduler, self).__init__(learning_rate, last_epoch, verbose) + + def get_lr(self): + """ return lr value """ + if self.last_epoch < self.warmup_epochs: + val = (self.start_lr - self.warmup_start_lr) * float( + self.last_epoch)/float(self.warmup_epochs) + self.warmup_start_lr + return val + + progress = float(self.last_epoch - self.warmup_epochs) / float( + max(1, self.total_epochs - self.warmup_epochs)) + val = max(0.0, 0.5 * (1. 
+ math.cos(math.pi * float(self.cycles) * 2.0 * progress))) + val = max(0.0, val * (self.start_lr - self.end_lr) + self.end_lr) + return val diff --git a/image_classification/CrossViT/zerbra.jpeg b/image_classification/CrossViT/zerbra.jpeg new file mode 100644 index 00000000..31d95fb3 Binary files /dev/null and b/image_classification/CrossViT/zerbra.jpeg differ diff --git a/image_classification/CvT/CvT.png b/image_classification/CvT/CvT.png new file mode 100644 index 00000000..5a2cc625 Binary files /dev/null and b/image_classification/CvT/CvT.png differ diff --git a/image_classification/CvT/README.md b/image_classification/CvT/README.md new file mode 100644 index 00000000..8dcb8f7a --- /dev/null +++ b/image_classification/CvT/README.md @@ -0,0 +1,172 @@ +# CvT: Introducing Convolutions to Vision Transformers, [arxiv](https://arxiv.org/abs/2103.15808) + +PaddlePaddle training/validation code and pretrained models for **CvT**. + +The official pytorch implementation is [here](https://github.com/microsoft/CvT/). + + +This implementation is developed by [PaddleViT](https://github.com/BR-IDL/PaddleViT.git). + +

+<p align="center">
+<img src="./CvT.png" alt="drawing"/>
+<h4 align="center">CvT Model Overview</h4>
+</p>
+ + +### Update +- Update (2021-12-24): Code is released and ported weights are uploaded. + + +## Models Zoo +| Model | Acc@1 | Acc@5 | #Params | FLOPs | Image Size | Crop_pct | Interpolation | Link | +|-------------------------------|-------|-------|---------|--------|------------|----------|---------------|--------------| +| CvT-13-224 | 81.59 | 95.67 | 20M | 4.5G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1r0fnHn1bRPmN0mi8RwAPXmD4utDyOxEf/view?usp=sharing)/[baidu](https://pan.baidu.com/s/13xNwCGpdJ5MVUi369OGl5Q)(vev9) | +| CvT-21-224 | 82.46 | 96.00 | 32M | 7.1G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/18s7nRfvcmNdbRuEpTQe02AQE3Y9UWVQC/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1mOjbMNoQb7X3VJD3LV0Hhg)(t2rv) | +| CvT-13-384 | 83.00 | 96.36 | 20M | 16.3G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1J0YYPUsiXSqyExBPtOPrOLL9c16syllg/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1upITRr5lNHLjbBJtIr-jdg)(wswt) | +| CvT-21-384 | 83.27 | 96.16 | 32M | 24.9G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1tpXv_yYXtvyArlYi7AFcHUOqemhyMWHW/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1hXKi3Kb7mNxPFVmR6cdkMg)(hcem) | +| CvT-13-384-22k | 83.26 | 97.09 | 20M | 16.3G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/18djrvq422u1pGLPxNfWAp6d17F7C5lbP/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1YYv5rKPmroxKCnzkesUr0g)(c7m9) | +| CvT-21-384-22k | 84.91 | 97.62 | 32M | 24.9G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1NVXd7vxVoRpL-21GN7nGn0-Ut0L0Owp8/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1N3xNU6XFHb1CdEOrnjKuoA)(9jxe) | +| CvT-w24-384-22k | 87.58 | 98.47 | 277M | 193.2G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1M3bg46N4SGtupK8FcvAOE0jltOwP5yja/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1MNJurm8juHRGG9SAw3IOkg)(bbj2) | + + + +> *The results are evaluated on ImageNet2012 validation set. + + +## Notebooks +We provide a few notebooks in aistudio to help you get started: + +**\*(coming soon)\*** + + +## Requirements +- Python>=3.6 +- yaml>=0.2.5 +- [PaddlePaddle](https://www.paddlepaddle.org.cn/documentation/docs/en/install/index_en.html)>=2.1.0 +- [yacs](https://github.com/rbgirshick/yacs)>=0.1.8 + +## Data +ImageNet2012 dataset is used in the following folder structure: +``` +│imagenet/ +├──train/ +│ ├── n01440764 +│ │ ├── n01440764_10026.JPEG +│ │ ├── n01440764_10027.JPEG +│ │ ├── ...... +│ ├── ...... +├──val/ +│ ├── n01440764 +│ │ ├── ILSVRC2012_val_00000293.JPEG +│ │ ├── ILSVRC2012_val_00002138.JPEG +│ │ ├── ...... +│ ├── ...... +``` + +## Usage +To use the model with pretrained weights, download the `.pdparam` weight file and change related file paths in the following python scripts. The model config files are located in `./configs/`. 
+
+For example, assuming the downloaded weight file is stored in `./CvT-13-224x224-IN-1k.pdparams`, you can use the `CvT-13-224x224-IN-1k` model in Python as follows:
+```python
+import paddle
+from config import get_config
+from cvt import build_cvt as build_model
+# config files in ./configs/
+config = get_config('./configs/cvt-13-224x224.yaml')
+# build model
+model = build_model(config)
+# load pretrained weights from the .pdparams file
+model_state_dict = paddle.load('./CvT-13-224x224-IN-1k.pdparams')
+model.set_dict(model_state_dict)
+```
+
+## Evaluation
+To evaluate CvT model performance on ImageNet2012 with a single GPU, run the following script from the command line:
+```shell
+sh run_eval.sh
+```
+or
+```shell
+CUDA_VISIBLE_DEVICES=0 \
+python main_single_gpu.py \
+    -cfg='./configs/cvt-13-224x224.yaml' \
+    -dataset='imagenet2012' \
+    -batch_size=16 \
+    -data_path='/dataset/imagenet' \
+    -eval \
+    -pretrained='./CvT-13-224x224-IN-1k'
+```
+
+Run evaluation using multi-GPUs:
+
+```shell
+sh run_eval_multi.sh
+```
+or
+```shell
+CUDA_VISIBLE_DEVICES=0,1,2,3 \
+python main_multi_gpu.py \
+    -cfg='./configs/cvt-13-224x224.yaml' \
+    -dataset='imagenet2012' \
+    -batch_size=16 \
+    -data_path='/dataset/imagenet' \
+    -eval \
+    -pretrained='./CvT-13-224x224-IN-1k'
+```
+
+## Training
+To train the CvT model on ImageNet2012 with a single GPU, run the following script from the command line:
+```shell
+sh run_train.sh
+```
+or
+```shell
+CUDA_VISIBLE_DEVICES=0 \
+python main_single_gpu.py \
+    -cfg='./configs/cvt-13-224x224.yaml' \
+    -dataset='imagenet2012' \
+    -batch_size=32 \
+    -data_path='/dataset/imagenet'
+```
+
+Run training using multi-GPUs:
+
+```shell
+sh run_train_multi.sh
+```
+or
+```shell
+CUDA_VISIBLE_DEVICES=0,1,2,3 \
+python main_multi_gpu.py \
+    -cfg='./configs/cvt-13-224x224.yaml' \
+    -dataset='imagenet2012' \
+    -batch_size=16 \
+    -data_path='/dataset/imagenet'
+```
+
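+As a quick sanity check after training, or after loading the converted weights from the Usage section, the following minimal sketch runs single-image inference. It is only a sketch: the preprocessing mirrors `get_val_transforms()` in `datasets.py` for the 224x224 configs (crop_pct 0.875), and the image path `./demo.jpg` is a placeholder rather than a file shipped with this repo.
+```python
+import paddle
+import paddle.nn.functional as F
+from PIL import Image
+from paddle.vision import transforms
+from config import get_config
+from cvt import build_cvt as build_model
+
+# build the model and load the converted weights (see Usage above)
+config = get_config('./configs/cvt-13-224x224.yaml')
+model = build_model(config)
+model.set_dict(paddle.load('./CvT-13-224x224-IN-1k.pdparams'))
+model.eval()
+
+# validation-style preprocessing: resize to 224/0.875=256, center crop 224, normalize
+val_transforms = transforms.Compose([
+    transforms.Resize(256, 'bicubic'),
+    transforms.CenterCrop((224, 224)),
+    transforms.ToTensor(),
+    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
+])
+
+image = Image.open('./demo.jpg').convert('RGB')  # placeholder image path
+x = val_transforms(image).unsqueeze(0)           # shape: [1, 3, 224, 224]
+with paddle.no_grad():
+    logits = model(x)
+probs = F.softmax(logits, axis=-1)
+print('predicted ImageNet-1k class id:', int(paddle.argmax(probs, axis=-1)))
+```
+
+The printed id indexes the 1000 ImageNet-1k classes; mapping it to a human-readable label requires a separate class-index file, which is not included here.
+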
+ + +## Visualization Attention Map +**(coming soon)** + +## Reference +``` +@article{wu2021cvt, +title={CvT: Introducing Convolutions to Vision Transformers}, +author={Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, Lei Zhang}, +journal={arXiv preprint arXiv:2103.15808}, +year={2021} +} +``` diff --git a/image_classification/CvT/augment.py b/image_classification/CvT/augment.py new file mode 100644 index 00000000..7a7f081c --- /dev/null +++ b/image_classification/CvT/augment.py @@ -0,0 +1,285 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Augmentation""" +""" Rand Augmentation """ +# reference: RandAugment: Practical automated data augmentation with a reduced search space +# https://arxiv.org/abs/1909.13719 + +""" Auto Augmentation """ +# reference: AutoAugment: Learning Augmentation Policies from Data +# https://arxiv.org/abs/1805.09501 + +import random +import numpy as np +from PIL import Image, ImageEnhance, ImageOps + + +def auto_augment_policy_original(): + """25 types of augment policies in original paper""" + policy = [ + [('Posterize', 0.4, 8), ('Rotate', 0.6, 9)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + [('Posterize', 0.6, 7), ('Posterize', 0.6, 6)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Equalize', 0.4, 4), ('Rotate', 0.8, 8)], + [('Solarize', 0.6, 3), ('Equalize', 0.6, 7)], + [('Posterize', 0.8, 5), ('Equalize', 1.0, 2)], + [('Rotate', 0.2, 3), ('Solarize', 0.6, 8)], + [('Equalize', 0.6, 8), ('Posterize', 0.4, 6)], + [('Rotate', 0.8, 8), ('Color', 0.4, 0)], + [('Rotate', 0.4, 9), ('Equalize', 0.6, 2)], + [('Equalize', 0.0, 7), ('Equalize', 0.8, 8)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Rotate', 0.8, 8), ('Color', 1.0, 2)], + [('Color', 0.8, 8), ('Solarize', 0.8, 7)], + [('Sharpness', 0.4, 7), ('Invert', 0.6, 8)], + [('ShearX', 0.6, 5), ('Equalize', 1.0, 9)], + [('Color', 0.4, 0), ('Equalize', 0.6, 3)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + ] + policy = [[SubPolicy(*args) for args in subpolicy] for subpolicy in policy] + return policy + + +def rand_augment_policy_original(magnitude_idx=9): + """ + 14 types of augment policies in original paper + Args: + magnitude_idx: M + """ + policy = [ + ('Posterize', 1, magnitude_idx), ('Rotate', 1, magnitude_idx), + ('Solarize', 1, magnitude_idx), ('AutoContrast', 1, magnitude_idx), + ('Equalize', 1, magnitude_idx), ('Contrast', 1, magnitude_idx), + ('Color', 1, magnitude_idx), ('Invert', 1, magnitude_idx), + ('Sharpness', 1, magnitude_idx), ('Brightness', 1, magnitude_idx), + ('ShearX', 1, magnitude_idx), ('ShearY', 1, magnitude_idx), + ('TranslateX', 1, magnitude_idx), ('TranslateY', 1, magnitude_idx), + ] + policy = 
[SubPolicy(*args) for args in policy] + return policy + + +class AutoAugment(): + """Auto Augment + Randomly choose a tuple of augment ops from a list of policy + Then apply the tuple of augment ops to input image + + Examples: + policy = auto_augment_policy_original() + augment = AutoAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy): + self.policy = policy + + def __call__(self, image, policy_idx=None): + if policy_idx is None: + policy_idx = random.randint(0, len(self.policy) - 1) + + sub_policy = self.policy[policy_idx] + for op in sub_policy: + image = op(image) + return image + + +class RandAugment(): + """Rand Augment + Randomly choose N augment ops from a list of K policies + Then apply the N ops to input image + + Examples: + policy = rand_augment_policy_original(magnitude_idx) + augment = RandAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy, num_layers=2): + """ + Args: + policy: list of SubPolicy + num_layers: int + """ + self.policy = policy + self.num_layers = num_layers + + def __call__(self, image): + selected_idx = np.random.choice(len(self.policy), self.num_layers) + + for policy_idx in selected_idx: + sub_policy = self.policy[policy_idx] + image = sub_policy(image) + return image + + +class SubPolicy: + """Subpolicy + Read augment name and magnitude, apply augment with probability + Args: + op_name: str, augment operation name + prob: float, if prob > random prob, apply augment + magnitude_idx: int, index of magnitude in preset magnitude ranges + """ + + def __init__(self, op_name, prob, magnitude_idx): + # ranges of operations' magnitude + ranges = { + 'ShearX': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'ShearY': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'TranslateX': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'TranslateY': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'Rotate': np.linspace(0, 30, 10), # [-30, 30] (by random negative) + 'Color': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Posterize': np.round(np.linspace(8, 4, 10), 0).astype(np.int), # [0, 4] + 'Solarize': np.linspace(256, 0, 10), # [0, 256] + 'Contrast': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Sharpness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Brightness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'AutoContrast': [0] * 10, # no range + 'Equalize': [0] * 10, # no range + 'Invert': [0] * 10, # no range + } + + # augmentation operations + # Lambda is not pickleable for DDP + # image_ops = { + # 'ShearX': lambda image, magnitude: shear_x(image, magnitude), + # 'ShearY': lambda image, magnitude: shear_y(image, magnitude), + # 'TranslateX': lambda image, magnitude: translate_x(image, magnitude), + # 'TranslateY': lambda image, magnitude: translate_y(image, magnitude), + # 'Rotate': lambda image, magnitude: rotate(image, magnitude), + # 'AutoContrast': lambda image, magnitude: auto_contrast(image, magnitude), + # 'Invert': lambda image, magnitude: invert(image, magnitude), + # 'Equalize': lambda image, magnitude: equalize(image, magnitude), + # 'Solarize': lambda image, magnitude: solarize(image, magnitude), + # 'Posterize': lambda image, magnitude: posterize(image, magnitude), + # 'Contrast': lambda image, magnitude: contrast(image, magnitude), + # 'Color': lambda image, magnitude: color(image, magnitude), + # 'Brightness': lambda image, magnitude: 
brightness(image, magnitude), + # 'Sharpness': lambda image, magnitude: sharpness(image, magnitude), + # } + image_ops = { + 'ShearX': shear_x, + 'ShearY': shear_y, + 'TranslateX': translate_x_relative, + 'TranslateY': translate_y_relative, + 'Rotate': rotate, + 'AutoContrast': auto_contrast, + 'Invert': invert, + 'Equalize': equalize, + 'Solarize': solarize, + 'Posterize': posterize, + 'Contrast': contrast, + 'Color': color, + 'Brightness': brightness, + 'Sharpness': sharpness, + } + + self.prob = prob + self.magnitude = ranges[op_name][magnitude_idx] + self.op = image_ops[op_name] + + def __call__(self, image): + if self.prob > random.random(): + image = self.op(image, self.magnitude) + return image + + +# PIL Image transforms +# https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.transform +def shear_x(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, factor, 0, 0, 1, 0), fillcolor=fillcolor) + + +def shear_y(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, factor, 1, 0), fillcolor=fillcolor) + + +def translate_x_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, pixels, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, pixels), fillcolor=fillcolor) + + +def translate_x_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, magnitude, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, magnitude), fillcolor=fillcolor) + + +def rotate(image, magnitude): + rot = image.convert("RGBA").rotate(magnitude) + return Image.composite(rot, + Image.new('RGBA', rot.size, (128,) * 4), + rot).convert(image.mode) + + +def auto_contrast(image, magnitude=None): + return ImageOps.autocontrast(image) + + +def invert(image, magnitude=None): + return ImageOps.invert(image) + + +def equalize(image, magnitude=None): + return ImageOps.equalize(image) + + +def solarize(image, magnitude): + return ImageOps.solarize(image, magnitude) + + +def posterize(image, magnitude): + return ImageOps.posterize(image, magnitude) + + +def contrast(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Contrast(image).enhance(1 + magnitude) + + +def color(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Color(image).enhance(1 + magnitude) + + +def brightness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Brightness(image).enhance(1 + magnitude) + + +def sharpness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Sharpness(image).enhance(1 + magnitude) + diff 
--git a/image_classification/CvT/config.py b/image_classification/CvT/config.py new file mode 100644 index 00000000..a1f57199 --- /dev/null +++ b/image_classification/CvT/config.py @@ -0,0 +1,181 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Configuration + +Configuration for data, model archtecture, and training, etc. +Config can be set by .yaml file or by argparser(limited usage) + +""" + +import os +from yacs.config import CfgNode as CN +import yaml + +_C = CN() +_C.BASE = [''] + +# data settings +_C.DATA = CN() +_C.DATA.BATCH_SIZE = 8 #1024 batch_size for single GPU +_C.DATA.BATCH_SIZE_EVAL = 8 #1024 batch_size for single GPU +_C.DATA.DATA_PATH = '/dataset/imagenet/' # path to dataset +_C.DATA.DATASET = 'imagenet2012' # dataset name +_C.DATA.IMAGE_SIZE = 256 # input image size +_C.DATA.CROP_PCT = 0.94 # input image scale ratio, scale is applied before centercrop in eval mode +_C.DATA.NUM_WORKERS = 4 # number of data loading threads +_C.DATA.IMAGENET_MEAN = [0.485, 0.456, 0.406] # [0.5, 0.5, 0.5] +_C.DATA.IMAGENET_STD = [0.229, 0.224, 0.225] # [0.5, 0.5, 0.5] + +# model settings +_C.MODEL = CN() +_C.MODEL.TYPE = 'CvT' +_C.MODEL.NAME = 'CvT' +_C.MODEL.INIT_WEIGHTS = True +_C.MODEL.PRETRAINED = '' +_C.MODEL.RESUME = None +_C.MODEL.PRETRAINED_LAYERS = ['*'] +_C.MODEL.NUM_CLASSES = 1000 +_C.MODEL.NUM_STAGES=3 +_C.MODEL.PATCH_SIZE=[7, 3, 3] +_C.MODEL.PATCH_STRIDE=[4 ,2, 2] +_C.MODEL.PATCH_PADDING=[ 2, 1, 1] +_C.MODEL.DIM_EMBED=[64, 192, 384] +_C.MODEL.DEPTH=[1,2,10] +_C.MODEL.NUM_HEADS=[1, 3, 6] +_C.MODEL.DROP_RATE=[0.0, 0.0, 0.0] +_C.MODEL.ATTN_DROP_RATE=[0.0, 0.0, 0.0] +_C.MODEL.DROP_PATH_RATE=[0.0, 0.0, 0.1] +_C.MODEL.CLS_TOKEN=[False, False, True] + +# training settings +_C.TRAIN = CN() +_C.TRAIN.LAST_EPOCH = 0 +_C.TRAIN.NUM_EPOCHS = 300 +_C.TRAIN.WARMUP_EPOCHS = 5 +_C.TRAIN.WEIGHT_DECAY = 0.05 +_C.TRAIN.BASE_LR = 0.02 +_C.TRAIN.WARMUP_START_LR = 2e-6 +_C.TRAIN.END_LR = 2e-5 +_C.TRAIN.GRAD_CLIP = None +_C.TRAIN.ACCUM_ITER = 1 +_C.TRAIN.MODEL_EMA = False +_C.TRAIN.MODEL_EMA_DECAY = 0.99992 +_C.TRAIN.LINEAR_SCALED_LR = None + +_C.TRAIN.LR_SCHEDULER = CN() +_C.TRAIN.LR_SCHEDULER.NAME = 'warmupcosine' +_C.TRAIN.LR_SCHEDULER.MILESTONES = "30, 60, 90" # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_EPOCHS = 30 # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_RATE = 0.1 # only used in StepLRScheduler + +_C.TRAIN.OPTIMIZER = CN() +_C.TRAIN.OPTIMIZER.NAME = 'AdamW' +_C.TRAIN.OPTIMIZER.EPS = 1e-8 +_C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999) # for adamW +_C.TRAIN.OPTIMIZER.MOMENTUM = 0.9 + +# train augmentation +_C.TRAIN.MIXUP_ALPHA = 0.8 +_C.TRAIN.CUTMIX_ALPHA = 1.0 +_C.TRAIN.CUTMIX_MINMAX = None +_C.TRAIN.MIXUP_PROB = 1.0 +_C.TRAIN.MIXUP_SWITCH_PROB = 0.5 +_C.TRAIN.MIXUP_MODE = 'batch' + +_C.TRAIN.SMOOTHING = 0.1 +_C.TRAIN.COLOR_JITTER = 0.4 # color jitter factor +_C.TRAIN.AUTO_AUGMENT = False #'rand-m9-mstd0.5-inc1' +_C.TRAIN.RAND_AUGMENT = False + +_C.TRAIN.RANDOM_ERASE_PROB = 0.25 # random erase prob 
+_C.TRAIN.RANDOM_ERASE_MODE = 'pixel' # random erase mode +_C.TRAIN.RANDOM_ERASE_COUNT = 1 # random erase count +_C.TRAIN.RANDOM_ERASE_SPLIT = False + +# misc +_C.SAVE = "./output" +_C.TAG = "default" +_C.SAVE_FREQ = 1 # freq to save chpt +_C.REPORT_FREQ = 50 # freq to logging info +_C.VALIDATE_FREQ = 10 # freq to do validation +_C.SEED = 0 +_C.EVAL = False # run evaluation only +_C.AMP = False # mix precision training +_C.LOCAL_RANK = 0 +_C.NGPUS = -1 + + +def _update_config_from_file(config, cfg_file): + config.defrost() + with open(cfg_file, 'r') as infile: + yaml_cfg = yaml.load(infile, Loader=yaml.FullLoader) + for cfg in yaml_cfg.setdefault('BASE', ['']): + if cfg: + _update_config_from_file( + config, os.path.join(os.path.dirname(cfg_file), cfg) + ) + print('merging config from {}'.format(cfg_file)) + config.merge_from_file(cfg_file) + config.freeze() + +def update_config(config, args): + """Update config by ArgumentParser + Args: + args: ArgumentParser contains options + Return: + config: updated config + """ + if args.cfg: + _update_config_from_file(config, args.cfg) + config.defrost() + if args.dataset: + config.DATA.DATASET = args.dataset + if args.batch_size: + config.DATA.BATCH_SIZE = args.batch_size + if args.image_size: + config.DATA.IMAGE_SIZE = args.image_size + if args.num_classes: + config.MODEL.NUM_CLASSES = args.num_classes + if args.data_path: + config.DATA.DATA_PATH = args.data_path + if args.output is not None: + config.SAVE = args.output + if args.ngpus: + config.NGPUS = args.ngpus + if args.eval: + config.EVAL = True + config.DATA.BATCH_SIZE_EVAL = args.batch_size + if args.pretrained: + config.MODEL.PRETRAINED = args.pretrained + if args.resume: + config.MODEL.RESUME = args.resume + if args.last_epoch: + config.TRAIN.LAST_EPOCH = args.last_epoch + if args.amp: # only during training + if config.EVAL is True: + config.AMP = False + else: + config.AMP = True + + #config.freeze() + return config + + +def get_config(cfg_file=None): + """Return a clone of config or load from yaml file""" + config = _C.clone() + if cfg_file: + _update_config_from_file(config, cfg_file) + return config diff --git a/image_classification/CvT/configs/cvt-13-224x224.yaml b/image_classification/CvT/configs/cvt-13-224x224.yaml new file mode 100644 index 00000000..12536329 --- /dev/null +++ b/image_classification/CvT/configs/cvt-13-224x224.yaml @@ -0,0 +1,26 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: CvT + NAME: CvT-13-224 + NUM_CLASSES: 1000 + NUM_STAGES: 3 + PATCH_SIZE: [7, 3, 3] + PATCH_STRIDE: [4, 2, 2] + PATCH_PADDING: [2, 1, 1] + DIM_EMBED: [64, 192, 384] + NUM_HEADS: [1, 3, 6] + DEPTH: [1, 2, 10] + ATTN_DROP_RATE: [0.0, 0.0, 0.0] + DROP_RATE: [0.0, 0.0, 0.0] + DROP_PATH_RATE: [0.0, 0.0, 0.1] + CLS_TOKEN: [False, False, True] +TRAIN: + NUM_EPOCHS: 300 + WARMUP_EPOCHS: 5 + WEIGHT_DECAY: 0.05 + BASE_LR: 2.5e-4 + WARMUP_START_LR: 1e-6 + END_LR: 1e-5 + LINEAR_SCALED_LR: 256 \ No newline at end of file diff --git a/image_classification/CvT/configs/cvt-13-384x384.yaml b/image_classification/CvT/configs/cvt-13-384x384.yaml new file mode 100644 index 00000000..c839d537 --- /dev/null +++ b/image_classification/CvT/configs/cvt-13-384x384.yaml @@ -0,0 +1,26 @@ +DATA: + IMAGE_SIZE: 384 + CROP_PCT: 1.0 +MODEL: + TYPE: CvT + NAME: CvT-13-384 + NUM_CLASSES: 1000 + NUM_STAGES: 3 + PATCH_SIZE: [7, 3, 3] + PATCH_STRIDE: [4, 2, 2] + PATCH_PADDING: [2, 1, 1] + DIM_EMBED: [64, 192, 384] + NUM_HEADS: [1, 3, 6] + DEPTH: [1, 2, 10] + ATTN_DROP_RATE: [0.0, 0.0, 0.0] + DROP_RATE: [0.0, 0.0, 
0.0] + DROP_PATH_RATE: [0.0, 0.0, 0.1] + CLS_TOKEN: [False, False, True] +TRAIN: + NUM_EPOCHS: 300 + WARMUP_EPOCHS: 5 + WEIGHT_DECAY: 0.05 + BASE_LR: 2.5e-4 + WARMUP_START_LR: 1e-6 + END_LR: 1e-5 + LINEAR_SCALED_LR: 256 \ No newline at end of file diff --git a/image_classification/CvT/configs/cvt-21-224x224.yaml b/image_classification/CvT/configs/cvt-21-224x224.yaml new file mode 100644 index 00000000..5799b739 --- /dev/null +++ b/image_classification/CvT/configs/cvt-21-224x224.yaml @@ -0,0 +1,26 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: CvT + NAME: CvT-21-224 + NUM_CLASSES: 1000 + NUM_STAGES: 3 + PATCH_SIZE: [7, 3, 3] + PATCH_STRIDE: [4, 2, 2] + PATCH_PADDING: [2, 1, 1] + DIM_EMBED: [64, 192, 384] + NUM_HEADS: [1, 3, 6] + DEPTH: [1, 4, 16] + ATTN_DROP_RATE: [0.0, 0.0, 0.0] + DROP_RATE: [0.0, 0.0, 0.0] + DROP_PATH_RATE: [0.0, 0.0, 0.1] + CLS_TOKEN: [False, False, True] +TRAIN: + NUM_EPOCHS: 300 + WARMUP_EPOCHS: 5 + WEIGHT_DECAY: 0.1 + BASE_LR: 1.25e-4 + WARMUP_START_LR: 1e-6 + END_LR: 1e-5 + LINEAR_SCALED_LR: 128 \ No newline at end of file diff --git a/image_classification/CvT/configs/cvt-21-384x384.yaml b/image_classification/CvT/configs/cvt-21-384x384.yaml new file mode 100644 index 00000000..995b4aae --- /dev/null +++ b/image_classification/CvT/configs/cvt-21-384x384.yaml @@ -0,0 +1,26 @@ +DATA: + IMAGE_SIZE: 384 + CROP_PCT: 1.0 +MODEL: + TYPE: CvT + NAME: CvT-21-384 + NUM_CLASSES: 1000 + NUM_STAGES: 3 + PATCH_SIZE: [7, 3, 3] + PATCH_STRIDE: [4, 2, 2] + PATCH_PADDING: [2, 1, 1] + DIM_EMBED: [64, 192, 384] + NUM_HEADS: [1, 3, 6] + DEPTH: [1, 4, 16] + ATTN_DROP_RATE: [0.0, 0.0, 0.0] + DROP_RATE: [0.0, 0.0, 0.0] + DROP_PATH_RATE: [0.0, 0.0, 0.1] + CLS_TOKEN: [False, False, True] +TRAIN: + NUM_EPOCHS: 300 + WARMUP_EPOCHS: 5 + WEIGHT_DECAY: 0.1 + BASE_LR: 1.25e-4 + WARMUP_START_LR: 1e-6 + END_LR: 1e-5 + LINEAR_SCALED_LR: 128 diff --git a/image_classification/CvT/configs/cvt-w24-384x384.yaml b/image_classification/CvT/configs/cvt-w24-384x384.yaml new file mode 100644 index 00000000..be22f660 --- /dev/null +++ b/image_classification/CvT/configs/cvt-w24-384x384.yaml @@ -0,0 +1,26 @@ +DATA: + IMAGE_SIZE: 384 + CROP_PCT: 1.0 +MODEL: + TYPE: CvT + NAME: CvT-w24-384 + NUM_CLASSES: 1000 + NUM_STAGES: 3 + PATCH_SIZE: [7, 3, 3] + PATCH_STRIDE: [4, 2, 2] + PATCH_PADDING: [2, 1, 1] + DIM_EMBED: [192, 768, 1024] + NUM_HEADS: [3, 12, 16] + DEPTH: [2, 2, 20] + ATTN_DROP_RATE: [0.0, 0.0, 0.0] + DROP_RATE: [0.0, 0.0, 0.0] + DROP_PATH_RATE: [0.0, 0.0, 0.3] + CLS_TOKEN: [False, False, True] +TRAIN: + NUM_EPOCHS: 300 + WARMUP_EPOCHS: 5 + WEIGHT_DECAY: 0.1 + BASE_LR: 1.25e-4 + WARMUP_START_LR: 1e-6 + END_LR: 1e-5 + LINEAR_SCALED_LR: 128 diff --git a/image_classification/CvT/cvt.py b/image_classification/CvT/cvt.py new file mode 100644 index 00000000..5d4c0915 --- /dev/null +++ b/image_classification/CvT/cvt.py @@ -0,0 +1,623 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +""" +Implement Transformer Class for ViT +""" + +import paddle +import paddle.nn as nn + +from numpy import repeat +import os +from droppath import DropPath + +class QuickGELU(nn.Layer): + ''' + Rewrite GELU function to increase processing speed + ''' + + def forward(self, x: paddle.Tensor): + return x * nn.functional.sigmoid(1.702 * x) + + +class Mlp(nn.Layer): + """ MLP module + Impl using nn.Linear and activation is GELU, dropout is applied. + Ops: fc -> act -> dropout -> fc -> dropout + Attributes: + fc1: nn.Linear + fc2: nn.Linear + act: GELU + dropout1: dropout after fc1 + dropout2: dropout after fc2 + """ + + def __init__(self, + embed_dim, + mlp_ratio, + act_layer=nn.GELU, + dropout=0.): + super().__init__() + w_attr_1, b_attr_1 = self._init_weights() + self.fc1 = nn.Linear(embed_dim, + int(embed_dim * mlp_ratio), + weight_attr=w_attr_1, + bias_attr=b_attr_1) + + w_attr_2, b_attr_2 = self._init_weights() + self.fc2 = nn.Linear(int(embed_dim * mlp_ratio), + embed_dim, + weight_attr=w_attr_2, + bias_attr=b_attr_2) + self.act = act_layer() + self.dropout1 = nn.Dropout(dropout) + self.dropout2 = nn.Dropout(dropout) + + def _init_weights(self): + weight_attr = paddle.ParamAttr( + initializer=nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr( + initializer=nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + + def forward(self, x): + x = self.fc1(x) + x = self.act(x) + x = self.dropout1(x) + x = self.fc2(x) + x = self.dropout2(x) + return x + + +class ConvEmbed(nn.Layer): + """ Image to Conv Embedding + using nn.Conv2D and norm_layer to embedd the input. + Ops: conv -> norm. + Attributes: + conv: nn.Conv2D + norm: nn.LayerNorm + nn.LayerNorm handle thr input with one dim, so we should + stretch 2D input into 1D + + """ + + def __init__(self, + patch_size=7, + in_chans=3, + embed_dim=64, + stride=4, + padding=2, + norm_layer=None): + super().__init__() + # conv patch_size to a square,which shape is(patch_size,patch_size) + patch_size = tuple(repeat((patch_size), 2)) + + self.patch_size = patch_size + self.proj = nn.Conv2D( + in_chans, embed_dim, + kernel_size=patch_size, + stride=stride, + padding=padding + ) + self.norm = norm_layer(embed_dim) if norm_layer else None + + def forward(self, x): + x = self.proj(x) + B, C, H, W = x.shape + x = paddle.transpose(x, [0, 2, 3, 1]) + x = paddle.reshape(x, [B, H*W, C]) + if self.norm: + x = self.norm(x) + x = paddle.transpose(x, [0, 2, 1]) + x = paddle.reshape(x, [B, C, H, W]) + return x + + +class Attention(nn.Layer): + """ Attention module + Attention module for CvT. 
+ using conv to calculate q,k,v + Attributes: + num_heads: number of heads + qkv: a nn.Linear for q, k, v mapping + dw_bn: nn.Conv2D -> nn.BatchNorm + avg: nn.AvgPool2D + linear: None + scales: 1 / sqrt(single_head_feature_dim) + attn_drop: dropout for attention + proj_drop: final dropout before output + out: projection of multi-head attention + """ + + def __init__(self, + dim_in, + dim_out, + num_heads, + qkv_bias=False, + attn_drop=0., + proj_drop=0., + kernel_size=3, + stride_kv=2, + stride_q=1, + padding_kv=1, + padding_q=1, + with_cls_token=True, + **kwargs + ): + super().__init__() + # init to save the pararm + self.stride_kv = stride_kv + self.stride_q = stride_q + self.dim = dim_out + self.num_heads = num_heads + self.scale = dim_out ** -0.5 + self.with_cls_token = with_cls_token + + # calculate q,k,v with conv + self.conv_proj_q = self._build_projection( + dim_in, dim_out, kernel_size, padding_q, + stride_q, + ) + self.conv_proj_k = self._build_projection( + dim_in, dim_out, kernel_size, padding_kv, + stride_kv, + ) + self.conv_proj_v = self._build_projection( + dim_in, dim_out, kernel_size, padding_kv, + stride_kv, + ) + + # init parameters of q,k,v + w_attr_1, b_attr_1 = self._init_weights() + w_attr_2, b_attr_2 = self._init_weights() + w_attr_3, b_attr_3 = self._init_weights() + self.proj_q = nn.Linear(dim_in, dim_out, weight_attr=w_attr_1, bias_attr=b_attr_1 if qkv_bias else False) + self.proj_k = nn.Linear(dim_in, dim_out, weight_attr=w_attr_2, bias_attr=b_attr_2 if qkv_bias else False) + self.proj_v = nn.Linear(dim_in, dim_out, weight_attr=w_attr_3, bias_attr=b_attr_3 if qkv_bias else False) + + # init project other parameters + self.attn_drop = nn.Dropout(attn_drop) + w_attr_4, b_attr_4 = self._init_weights() + self.proj = nn.Linear(dim_out, dim_out, weight_attr=w_attr_4, bias_attr=b_attr_4) + self.proj_drop = nn.Dropout(proj_drop) + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + + def _build_projection(self, + dim_in, + dim_out, + kernel_size, + padding, + stride, + ): + + proj = nn.Sequential( + (nn.Conv2D( + dim_in, + dim_in, + kernel_size=kernel_size, + padding=padding, + stride=stride, + bias_attr=False, + groups=dim_in + )), + (nn.BatchNorm2D(dim_in)), + + ) + + return proj + + def forward_conv(self, x, h, w): + if self.with_cls_token: # spilt token from x + cls_token, x = paddle.split(x, [1, h*w], 1) + B, L, C = x.shape # L is length of tensor + x = paddle.transpose(x, [0, 2, 1]) + x = paddle.reshape(x, [B, C, h, w]) + if self.conv_proj_q is not None: + q = self.conv_proj_q(x) + B, C, H, W = q.shape + q = paddle.transpose(q, [0, 2, 3, 1]) + q = paddle.reshape(q, [B, H*W, C]) + else: + B, C, H, W = x.shape + q = paddle.transpose(x, [0, 2, 3, 1]) + q = paddle.reshape(q, [B, H*W, C]) + if self.conv_proj_k is not None: + k = self.conv_proj_k(x) + B, C, H, W = k.shape + k = paddle.transpose(k, [0, 2, 3, 1]) + k = paddle.reshape(k, [B, H*W, C]) + else: + B, C, H, W = x.shape + k = paddle.transpose(x, [0, 2, 3, 1]) + k = paddle.reshape(k, [B, H*W, C]) + if self.conv_proj_v is not None: + v = self.conv_proj_v(x) + B, C, H, W = v.shape + v = paddle.transpose(v, [0, 2, 3, 1]) + v = paddle.reshape(v, [B, H*W, C]) + else: + # v = graph2vector(x) + B, C, H, W = x.shape + v = paddle.transpose(x, [0, 2, 3, 1]) + v = paddle.reshape(v, [B, H*W, C]) + if self.with_cls_token: + q = 
paddle.concat([cls_token, q], axis=1) + k = paddle.concat([cls_token, k], axis=1) + v = paddle.concat([cls_token, v], axis=1) + + return q, k, v + + def forward(self, x, h, w): + if ( + self.conv_proj_q is not None + or self.conv_proj_k is not None + or self.conv_proj_v is not None + ): # if not generate q,k,v with Linear param + q, k, v = self.forward_conv(x, h, w) + + # now q,k,v is b (h w) c + h=self.num_heads + q=self.proj_q(q) + B, T, L = q.shape + q = paddle.reshape(q, [B, T, h, -1]) + q = paddle.transpose(q, [0, 2, 1, 3]) + k=self.proj_k(k) + B, T, L = k.shape + k = paddle.reshape(k, [B, T, h, -1]) + k = paddle.transpose(k, [0, 2, 1, 3]) + v=self.proj_v(v) + B, T, L = v.shape + v = paddle.reshape(v, [B, T, h, -1]) + v = paddle.transpose(v, [0, 2, 1, 3]) + + # multi tensor with axis=3,then * scale,achieve the result of q*k/sqort(d_k), + attn_score = paddle.matmul(q, k, transpose_y=True) * self.scale + attn = nn.functional.softmax(attn_score, axis=-1) + attn = self.attn_drop(attn) + + x = paddle.matmul(attn, v) + x = paddle.transpose(x, [0, 2, 1, 3]) + x = paddle.reshape(x, [0, 0, -1]) + + x = self.proj(x) + x = self.proj_drop(x) + return x # b,t,(h,d) + + +class Block(nn.Layer): + ''' Block moudule + Ops: token -> multihead attention (reshape token to a grap) ->Mlp->token + ''' + + def __init__(self, + dim_in, + dim_out, + num_heads, + mlp_ratio=4., + qkv_bias=False, + drop=0., + attn_drop=0., + drop_path=0., + act_layer=nn.GELU, + norm_layer=nn.LayerNorm, + **kwargs): + super().__init__() + + self.with_cls_token = kwargs['with_cls_token'] + + self.norm1 = norm_layer(dim_in) + self.attn = Attention( + dim_in, dim_out, num_heads, qkv_bias, attn_drop, drop, + **kwargs + ) + if drop_path > 0.: + self.drop_path = DropPath(drop_path) + else: + self.drop_path = nn.Identity() + + self.norm2 = norm_layer(dim_out) + self.mlp = Mlp( + dim_out, + mlp_ratio, + act_layer=act_layer, + dropout=drop + ) + + def forward(self, x, h, w): + res = x + x = self.norm1(x) + attn = self.attn(x, h, w) + x = res + self.drop_path(attn) + x = x + self.drop_path(self.mlp(self.norm2(x))) + return x + + +class VisionTransformer(nn.Layer): + """ VisionTransformer moudule + Vision Transformer with support for patch or hybrid CNN input stage + Ops:intput -> conv_embed -> depth*block -> out + Attribute: + input: raw picture + out: features,cls_token + + """ + + def __init__(self, + patch_size=16, + patch_stride=16, + patch_padding=0, + in_chans=3, + embed_dim=768, + depth=12, + num_heads=12, + mlp_ratio=4., + qkv_bias=False, + drop_rate=0., + attn_drop_rate=0., + drop_path_rate=0., + act_layer=QuickGELU, + norm_layer=nn.LayerNorm, + init='trunc_norm', + **kwargs): + super().__init__() + # num_features for consistency with other models + self.num_features = self.embed_dim = embed_dim + + self.patch_embed = ConvEmbed( + patch_size=patch_size, + in_chans=in_chans, + stride=patch_stride, + padding=patch_padding, + embed_dim=embed_dim, + norm_layer=norm_layer + ) + + with_cls_token = kwargs['with_cls_token'] + + if with_cls_token: + self.cls_token = paddle.create_parameter( + shape=[1, 1, embed_dim], + dtype='float32', + default_initializer=nn.initializer.TruncatedNormal(std=.02)) + else: + self.cls_token = None + + self.pos_drop = nn.Dropout(p=drop_rate) + dpr = [x.item() for x in paddle.linspace(0, drop_path_rate, depth)] + + blocks = [] + for j in range(depth): + blocks.append( + Block( + dim_in=embed_dim, + dim_out=embed_dim, + num_heads=num_heads, + mlp_ratio=mlp_ratio, + qkv_bias=qkv_bias, + drop=drop_rate, + 
attn_drop=attn_drop_rate, + drop_path=dpr[j], + act_layer=act_layer, + norm_layer=norm_layer, + **kwargs + ) + ) + self.blocks = nn.LayerList(blocks) + + if init == 'xavier': + self.apply(self._init_weights_xavier) + else: + self.apply(self._init_weights_trunc_normal) + + def _init_weights_trunc_normal(self, m): + if isinstance(m, nn.Linear): + trun_init = nn.initializer.TruncatedNormal(std=0.02) + trun_init(m.weight) + if m.bias is not None: + zeros = nn.initializer.Constant(0.) + zeros(m.bias) + elif isinstance(m, (nn.LayerNorm, nn.BatchNorm2D)): + zeros = nn.initializer.Constant(0.) + zeros(m.bias) + ones = nn.initializer.Constant(1.0) + ones(m.weight) + + def _init_weights_xavier(self, m): + if isinstance(m, nn.Linear): + xavier_init = nn.initializer.XavierNormal() + xavier_init(m.weight) + if m.bias is not None: + zeros = nn.initializer.Constant(0.) + zeros(m.bias) + elif isinstance(m, (nn.LayerNorm, nn.BatchNorm2D)): + zeros = nn.initializer.Constant(0.) + zeros(m.bias) + ones = nn.initializer.Constant(1) + ones(m.weight) + + def forward(self, x): + x = self.patch_embed(x) + B, C, H, W = x.shape + B, C, H, W = x.shape + x = paddle.transpose(x, [0, 2, 3, 1]) + x = paddle.reshape(x, [B, H*W, C]) + cls_tokens = None + if self.cls_token is not None: + cls_tokens = paddle.expand(self.cls_token, [B, -1, -1]) + x = paddle.concat([cls_tokens, x], axis=1) + x = self.pos_drop(x) + for i, blk in enumerate(self.blocks): + x = blk(x, H, W) + if self.cls_token is not None: + cls_tokens, x = paddle.split(x, [1, H*W], 1) + B, L, C = x.shape # L is length of tensor + x = paddle.transpose(x, [0, 2, 1]) + x = paddle.reshape(x, [B, C, H, W]) + return x, cls_tokens + + +class ConvolutionalVisionTransformer(nn.Layer): + '''CvT model + Introducing Convolutions to Vision Transformers + Args: + in_chans: int, input image channels, default: 3 + num_classes: int, number of classes for classification, default: 1000 + num_stage: int, numebr of stage, length of array of parameters should be given, default:3 + patch_size: int[], patch size, default: [7, 3, 3] + patch_stride: int[], patch_stride ,default: [4, 2, 2] + patch_padding: int[], patch padding,default: [2, 1, 1] + embed_dim: int[], embedding dimension (patch embed out dim), default: [64, 192, 384] + depth: int[], number ot transformer blocks, default: [1, 2, 10] + num_heads: int[], number of attention heads, default: [1, 3, 6] + drop_rate: float[], Mlp layer's droppath rate for droppath layers, default: [0.0, 0.0, 0.0] + attn_drop_rate: float[], attention layer's droppath rate for droppath layers, default: [0.0, 0.0, 0.0] + drop_path_rate: float[], each block's droppath rate for droppath layers, default: [0.0, 0.0, 0.1] + with_cls_token: bool[], if image have cls_token, default: [False, False, True] + ''' + + def __init__(self, + in_chans=3, + num_classes=1000, + num_stage=3, + patch_size=[7, 3, 3], + patch_stride=[4, 2, 2], + patch_padding=[2, 1, 1], + embed_dim=[64, 192, 384], + depth=[1, 2, 10], + num_heads=[1, 3, 6], + drop_rate=[0.0, 0.0, 0.0], + attn_drop_rate=[0.0, 0.0, 0.0], + drop_path_rate=[0.0, 0.0, 0.1], + with_cls_token=[False, False, True], + ): + super().__init__() + self.num_classes = num_classes + + self.num_stages = num_stage + self.stages=nn.LayerList() + for i in range(self.num_stages): + + stage = VisionTransformer( + in_chans=in_chans, + patch_size= patch_size[i], + patch_stride= patch_stride[i], + patch_padding= patch_padding[i], + embed_dim= embed_dim[i], + depth= depth[i], + num_heads= num_heads[i], + mlp_ratio= 4.0, + qkv_bias= 
True, + drop_rate= drop_rate[i], + attn_drop_rate= attn_drop_rate[i], + drop_path_rate= drop_path_rate[i], + with_cls_token= with_cls_token[i], + ) + self.stages.append(stage) + in_chans = embed_dim[i] + + dim_embed = embed_dim[-1] + self.norm = nn.LayerNorm(dim_embed) + self.cls_token = with_cls_token[-1] + + # Classifier head + self.head = nn.Linear( + dim_embed, num_classes) if num_classes > 0 else nn.Identity() + trunc_init = nn.initializer.TruncatedNormal(std=0.02) + trunc_init(self.head.weight) + + def init_weights(self, pretrained='', pretrained_layers=[], verbose=True): + if os.path.isfile(pretrained): + pretrained_dict = paddle.load(pretrained, map_location='cpu') + model_dict = self.state_dict() + pretrained_dict = { + k: v for k, v in pretrained_dict.items() + if k in model_dict.keys() + } + need_init_state_dict = {} + for k, v in pretrained_dict.items(): + need_init = ( + k.split('.')[0] in pretrained_layers + or pretrained_layers[0] is '*' + ) + if need_init: + if 'pos_embed' in k and v.size() != model_dict[k].size(): + size_pretrained = v.size() + size_new = model_dict[k].size() + + ntok_new = size_new[1] + ntok_new -= 1 + + posemb_tok, posemb_grid = v[:, :1], v[0, 1:] + + gs_old = int(paddle.sqrt(len(posemb_grid))) + gs_new = int(paddle.sqrt(ntok_new)) + + posemb_grid = posemb_grid.reshape(gs_old, gs_old, -1) + zoom = (gs_new / gs_old, gs_new / gs_old, 1) + posemb_grid = paddle.ndimage.zoom( + posemb_grid, zoom, order=1 + ) + posemb_grid = posemb_grid.reshape(1, gs_new ** 2, -1) + v = paddle.to_tensor( + paddle.concat([posemb_tok, posemb_grid], axis=1) + ) + + need_init_state_dict[k] = v + self.load_state_dict(need_init_state_dict, strict=False) + + def forward_features(self, x): + for i in range(self.num_stages): + x, cls_tokens = self.stages[i](x) + + if self.cls_token: + x = self.norm(cls_tokens) + x = paddle.squeeze(x) + else: + #'b c h w -> b (h w) c' + B, C, H, W = x.shape + x = paddle.transpose(x, [0, 2, 3, 1]) + x = paddle.reshape(x, [B, H*W, C]) + x = self.norm(x) + x = paddle.mean(x, axis=1) + + return x + + def forward(self, x): + x = self.forward_features(x) + x = self.head(x) + return x + + +def build_cvt(config): + model = ConvolutionalVisionTransformer( + in_chans=3, + num_classes=config.MODEL.NUM_CLASSES, + num_stage=config.MODEL.NUM_STAGES, + patch_size=config.MODEL.PATCH_SIZE, + patch_stride=config.MODEL.PATCH_STRIDE, + patch_padding=config.MODEL.PATCH_PADDING, + embed_dim=config.MODEL.DIM_EMBED, + depth=config.MODEL.DEPTH, + num_heads=config.MODEL.NUM_HEADS, + drop_rate=config.MODEL.DROP_RATE, + attn_drop_rate=config.MODEL.ATTN_DROP_RATE, + drop_path_rate=config.MODEL.DROP_PATH_RATE, + with_cls_token=config.MODEL.CLS_TOKEN + ) + return model diff --git a/image_classification/CvT/datasets.py b/image_classification/CvT/datasets.py new file mode 100644 index 00000000..ec2f82ed --- /dev/null +++ b/image_classification/CvT/datasets.py @@ -0,0 +1,223 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
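+
+# Usage sketch (the config path is an assumption; see get_dataset() and get_dataloader() below):
+#   from config import get_config
+#   config = get_config('./configs/cvt-13-224x224.yaml')  # DATA.DATA_PATH must contain train_list.txt / val_list.txt
+#   dataset_val = get_dataset(config, mode='val')
+#   dataloader_val = get_dataloader(config, dataset_val, mode='val', multi_process=False)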
+ +""" +Dataset related classes and methods for ViT training and validation +Cifar10, Cifar100 and ImageNet2012 are supported +""" + +import os +import math +from PIL import Image +from paddle.io import Dataset +from paddle.io import DataLoader +from paddle.io import DistributedBatchSampler +from paddle.vision import transforms +from paddle.vision import datasets +from paddle.vision import image_load +from augment import auto_augment_policy_original +from augment import AutoAugment +from augment import rand_augment_policy_original +from augment import RandAugment +from random_erasing import RandomErasing + + +class ImageNet2012Dataset(Dataset): + """Build ImageNet2012 dataset + + This class gets train/val imagenet datasets, which loads transfomed data and labels. + + Attributes: + file_folder: path where imagenet images are stored + transform: preprocessing ops to apply on image + img_path_list: list of full path of images in whole dataset + label_list: list of labels of whole dataset + """ + + def __init__(self, file_folder, mode="train", transform=None): + """Init ImageNet2012 Dataset with dataset file path, mode(train/val), and transform""" + super(ImageNet2012Dataset, self).__init__() + assert mode in ["train", "val"] + self.file_folder = file_folder + self.transform = transform + self.img_path_list = [] + self.label_list = [] + + if mode == "train": + self.list_file = os.path.join(self.file_folder, "train_list.txt") + else: + self.list_file = os.path.join(self.file_folder, "val_list.txt") + + with open(self.list_file, 'r') as infile: + for line in infile: + img_path = line.strip().split()[0] + img_label = int(line.strip().split()[1]) + self.img_path_list.append(os.path.join(self.file_folder, img_path)) + self.label_list.append(img_label) + print(f'----- Imagenet2012 image {mode} list len = {len(self.label_list)}') + + def __len__(self): + return len(self.label_list) + + def __getitem__(self, index): + data = Image.open(self.img_path_list[index]).convert('RGB') + data = self.transform(data) + label = self.label_list[index] + + return data, label + + +def get_train_transforms(config): + """ Get training transforms + + For training, a RandomResizedCrop is applied, then normalization is applied with + [0.5, 0.5, 0.5] mean and std. The input pixel values must be rescaled to [0, 1.] 
+ Outputs is converted to tensor + + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + + aug_op_list = [] + # STEP1: random crop and resize + aug_op_list.append( + transforms.RandomResizedCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE), + scale=(0.05, 1.0), interpolation='bicubic')) + # STEP2: auto_augment or color jitter + if config.TRAIN.AUTO_AUGMENT: + policy = auto_augment_policy_original() + auto_augment = AutoAugment(policy) + aug_op_list.append(auto_augment) + elif config.TRAIN.RAND_AUGMENT: + policy = rand_augment_policy_original() + rand_augment = RandAugment(policy) + aug_op_list.append(rand_augment) + else: + jitter = (float(config.TRAIN.COLOR_JITTER),) * 3 + aug_op_list.append(transforms.ColorJitter(*jitter)) + # other ops + aug_op_list.append(transforms.ToTensor()) + aug_op_list.append(transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, + std=config.DATA.IMAGENET_STD)) + # STEP4: random erasing + if config.TRAIN.RANDOM_ERASE_PROB > 0.: + random_erasing = RandomErasing(prob=config.TRAIN.RANDOM_ERASE_PROB, + mode=config.TRAIN.RANDOM_ERASE_MODE, + max_count=config.TRAIN.RANDOM_ERASE_COUNT, + num_splits=config.TRAIN.RANDOM_ERASE_SPLIT) + aug_op_list.append(random_erasing) + # Final: compose transforms and return + transforms_train = transforms.Compose(aug_op_list) + return transforms_train + + +def get_val_transforms(config): + """ Get training transforms + + For validation, image is first Resize then CenterCrop to image_size. + Then normalization is applied with [0.5, 0.5, 0.5] mean and std. + The input pixel values must be rescaled to [0, 1.] + Outputs is converted to tensor + + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + + scale_size = int(math.floor(config.DATA.IMAGE_SIZE / config.DATA.CROP_PCT)) + if config.DATA.IMAGE_SIZE == 384: # for CvT, use overall resize instead of shorter side + scale_size = [scale_size, scale_size] + + transforms_val = transforms.Compose([ + transforms.Resize(scale_size, 'bicubic'), # For CvT 384 int + transforms.CenterCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE)), + transforms.ToTensor(), + transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD), + ]) + return transforms_val + + +def get_dataset(config, mode='train'): + """ Get dataset from config and mode (train/val) + + Returns the related dataset object according to configs and mode(train/val) + + Args: + config: configs contains dataset related settings. 
see config.py for details + Returns: + dataset: dataset object + """ + assert mode in ['train', 'val'] + if config.DATA.DATASET == "cifar10": + if mode == 'train': + dataset = datasets.Cifar10(mode=mode, transform=get_train_transforms(config)) + else: + mode = 'test' + dataset = datasets.Cifar10(mode=mode, transform=get_val_transforms(config)) + elif config.DATA.DATASET == "cifar100": + if mode == 'train': + dataset = datasets.Cifar100(mode=mode, transform=get_train_transforms(config)) + else: + mode = 'test' + dataset = datasets.Cifar100(mode=mode, transform=get_val_transforms(config)) + elif config.DATA.DATASET == "imagenet2012": + if mode == 'train': + dataset = ImageNet2012Dataset(config.DATA.DATA_PATH, + mode=mode, + transform=get_train_transforms(config)) + else: + dataset = ImageNet2012Dataset(config.DATA.DATA_PATH, + mode=mode, + transform=get_val_transforms(config)) + else: + raise NotImplementedError( + "[{config.DATA.DATASET}] Only cifar10, cifar100, imagenet2012 are supported now") + return dataset + + +def get_dataloader(config, dataset, mode='train', multi_process=False): + """Get dataloader with config, dataset, mode as input, allows multiGPU settings. + + Multi-GPU loader is implements as distributedBatchSampler. + + Args: + config: see config.py for details + dataset: paddle.io.dataset object + mode: train/val + multi_process: if True, use DistributedBatchSampler to support multi-processing + Returns: + dataloader: paddle.io.DataLoader object. + """ + + if mode == 'train': + batch_size = config.DATA.BATCH_SIZE + else: + batch_size = config.DATA.BATCH_SIZE_EVAL + + if multi_process is True: + sampler = DistributedBatchSampler(dataset, + batch_size=batch_size, + shuffle=(mode == 'train')) + dataloader = DataLoader(dataset, + batch_sampler=sampler, + num_workers=config.DATA.NUM_WORKERS) + else: + dataloader = DataLoader(dataset, + batch_size=batch_size, + num_workers=config.DATA.NUM_WORKERS, + shuffle=(mode == 'train')) + return dataloader diff --git a/image_classification/CvT/droppath.py b/image_classification/CvT/droppath.py new file mode 100644 index 00000000..c8fe8048 --- /dev/null +++ b/image_classification/CvT/droppath.py @@ -0,0 +1,50 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Droppath, reimplement from https://github.com/yueatsprograms/Stochastic_Depth +""" + +import paddle +import paddle.nn as nn + +def drop_path(inputs, drop_prob=0., training=False): + """drop path op + Args: + input: tensor with arbitrary shape + drop_prob: float number of drop path probability, default: 0.0 + training: bool, if current mode is training, default: False + Returns: + output: output tensor after drop path + """ + # if prob is 0 or eval mode, return original input + if drop_prob == 0. 
or not training: + return inputs + keep_prob = 1 - drop_prob + keep_prob = paddle.to_tensor(keep_prob) + shape = (inputs.shape[0], ) + (1, ) * (inputs.ndim - 1) # shape=(N, 1, 1, 1) + random_tensor = keep_prob + paddle.rand(shape, dtype=inputs.dtype) + random_tensor = random_tensor.floor() # mask + output = inputs.divide(keep_prob) * random_tensor # divide is to keep same output expectation + return output + + +class DropPath(nn.Layer): + """DropPath class""" + def __init__(self, drop_prob=None): + super(DropPath, self).__init__() + self.drop_prob = drop_prob + + def forward(self, inputs): + return drop_path(inputs, self.drop_prob, self.training) diff --git a/image_classification/CvT/losses.py b/image_classification/CvT/losses.py new file mode 100644 index 00000000..ad5abcb3 --- /dev/null +++ b/image_classification/CvT/losses.py @@ -0,0 +1,121 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" Implement Loss functions """ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class LabelSmoothingCrossEntropyLoss(nn.Layer): + """ cross entropy loss for label smoothing + Args: + smoothing: float, smoothing rate + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, target label with shape [N] + Return: + loss: float, cross entropy loss value + """ + def __init__(self, smoothing=0.1): + super().__init__() + assert 0 <= smoothing < 1.0 + self.smoothing = smoothing + self.confidence = 1 - smoothing + + def forward(self, x, target): + log_probs = F.log_softmax(x) # [N, num_classes] + # target_index is used to get prob for each of the N samples + target_index = paddle.zeros([x.shape[0], 2], dtype='int64') # [N, 2] + target_index[:, 0] = paddle.arange(x.shape[0]) + target_index[:, 1] = target + + nll_loss = -log_probs.gather_nd(index=target_index) # index: [N] + smooth_loss = -log_probs.mean(axis=-1) + loss = self.confidence * nll_loss + self.smoothing * smooth_loss + return loss.mean() + + +class SoftTargetCrossEntropyLoss(nn.Layer): + """ cross entropy loss for soft target + Args: + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, soft target with shape [N, num_classes] + Returns: + loss: float, the mean loss value + """ + def __init__(self): + super().__init__() + + def forward(self, x, target): + loss = paddle.sum(-target * F.log_softmax(x, axis=-1), axis=-1) + return loss.mean() + + +class DistillationLoss(nn.Layer): + """Distillation loss function + This layer includes the orginal loss (criterion) and a extra + distillation loss (criterion), which computes the loss with + different type options, between current model and + a teacher model as its supervision. 
+ + Args: + base_criterion: nn.Layer, the original criterion + teacher_model: nn.Layer, the teacher model as supervision + distillation_type: str, one of ['none', 'soft', 'hard'] + alpha: float, ratio of base loss (* (1-alpha)) + and distillation loss( * alpha) + tao: float, temperature in distillation + """ + def __init__(self, + base_criterion, + teacher_model, + distillation_type, + alpha, + tau): + super().__init__() + assert distillation_type in ['none', 'soft', 'hard'] + self.base_criterion = base_criterion + self.teacher_model = teacher_model + self.type = distillation_type + self.alpha = alpha + self.tau = tau + + def forward(self, inputs, outputs, targets): + """ + Args: + inputs: tensor, the orginal model inputs + outputs: tensor, the outputs of the model + outputds_kd: tensor, the distillation outputs of the model, + this is usually obtained by a separate branch + in the last layer of the model + targets: tensor, the labels for the base criterion + """ + outputs, outputs_kd = outputs[0], outputs[1] + base_loss = self.base_criterion(outputs, targets) + if self.type == 'none': + return base_loss + + with paddle.no_grad(): + teacher_outputs = self.teacher_model(inputs) + + if self.type == 'soft': + distillation_loss = F.kl_div( + F.log_softmax(outputs_kd / self.tau, axis=1), + F.log_softmax(teacher_outputs / self.tau, axis=1), + reduction='sum') * (self.tau * self.tau) / outputs_kd.numel() + elif self.type == 'hard': + distillation_loss = F.cross_entropy(outputs_kd, teacher_outputs.argmax(axis=1)) + + loss = base_loss * (1 - self.alpha) + distillation_loss * self.alpha + return loss \ No newline at end of file diff --git a/image_classification/CvT/main_multi_gpu.py b/image_classification/CvT/main_multi_gpu.py new file mode 100644 index 00000000..dc1ed0d9 --- /dev/null +++ b/image_classification/CvT/main_multi_gpu.py @@ -0,0 +1,584 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
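+
+# NOTE: main() below launches one process per GPU with paddle.distributed.spawn;
+# each spawned process runs main_worker(), which calls dist.init_parallel_env(),
+# writes its own log_{rank}.txt, and rank 0 additionally writes the master log.txt.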
+ +"""CvT training/validation using multiple GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +import paddle.distributed as dist +from datasets import get_dataloader +from datasets import get_dataset +from utils import AverageMeter +from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn +from config import get_config +from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from cvt import build_cvt as build_model + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('CvT') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + mixup_fn=None, + amp=False, + local_logger=None, + master_logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + master_train_loss_meter.avg: float, average loss on all processes/gpus + master_train_acc_meter.avg: float, average top1 accuracy on all processes/gpus + train_time: float, training time + """ + model.train() + train_loss_meter = 
AverageMeter() + train_acc_meter = AverageMeter() + master_train_loss_meter = AverageMeter() + master_train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id + 1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + # NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + # loss = loss / accum_iter + loss.backward() + + if ((batch_id + 1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = paddle.to_tensor(image.shape[0]) + + # sync from other gpus for overall loss and acc + master_loss = loss.clone() + master_acc = acc.clone() + master_batch_size = batch_size.clone() + dist.all_reduce(master_loss) + dist.all_reduce(master_acc) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc = master_acc / dist.get_world_size() + master_train_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_train_acc_meter.update(master_acc.numpy()[0], master_batch_size.numpy()[0]) + + train_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + train_acc_meter.update(acc.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_train_loss_meter.avg:.4f}, " + + f"Avg Acc: {master_train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return (train_loss_meter.avg, + train_acc_meter.avg, + master_train_loss_meter.avg, + master_train_acc_meter.avg, + train_time) + + +def validate(dataloader, + model, + criterion, + total_batch, + debug_steps=100, + local_logger=None, + master_logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + total_epoch: int, total num of epoch, for logging + debug_steps: int, num of iters to log info, default: 100 + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + master_val_loss_meter.avg: float, average loss on all processes/gpus + 
master_val_acc1_meter.avg: float, average top1 accuracy on all processes/gpus + master_val_acc5_meter.avg: float, average top5 accuracy on all processes/gpus + val_time: float, validation time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = AverageMeter() + master_val_loss_meter = AverageMeter() + master_val_acc1_meter = AverageMeter() + master_val_acc5_meter = AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = paddle.to_tensor(image.shape[0]) + + master_loss = loss.clone() + master_acc1 = acc1.clone() + master_acc5 = acc5.clone() + master_batch_size = batch_size.clone() + + dist.all_reduce(master_loss) + dist.all_reduce(master_acc1) + dist.all_reduce(master_acc5) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc1 = master_acc1 / dist.get_world_size() + master_acc5 = master_acc5 / dist.get_world_size() + + master_val_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc1_meter.update(master_acc1.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc5_meter.update(master_acc5.numpy()[0], master_batch_size.numpy()[0]) + + val_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + val_acc1_meter.update(acc1.numpy()[0], batch_size.numpy()[0]) + val_acc5_meter.update(acc5.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {master_val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {master_val_acc5_meter.avg:.4f}") + val_time = time.time() - time_st + return (val_loss_meter.avg, + val_acc1_meter.avg, + val_acc5_meter.avg, + master_val_loss_meter.avg, + master_val_acc1_meter.avg, + master_val_acc5_meter.avg, + val_time) + + +def main_worker(*args): + # STEP 0: Preparation + config = args[0] + dist.init_parallel_env() + last_epoch = config.TRAIN.LAST_EPOCH + world_size = dist.get_world_size() + local_rank = dist.get_rank() + seed = config.SEED + local_rank + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + # logger for each process/gpu + local_logger = get_logger( + filename=os.path.join(config.SAVE, 'log_{}.txt'.format(local_rank)), + logger_name='local_logger') + # overall logger + if local_rank == 0: + master_logger = get_logger( + filename=os.path.join(config.SAVE, 'log.txt'), + logger_name='master_logger') + master_logger.info(f'\n{config}') + else: + master_logger = None + local_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + if local_rank == 0: + master_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + + # STEP 1: Create model + model = build_model(config) + model = paddle.DataParallel(model) + + # STEP 2: Create train and val dataloader + dataset_train, dataset_val = args[1], args[2] + # Create training dataloader + if not config.EVAL: + 
dataloader_train = get_dataloader(config, dataset_train, 'train', True) + total_batch_train = len(dataloader_train) + local_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + if local_rank == 0: + master_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + # Create validation dataloader + dataloader_val = get_dataloader(config, dataset_val, 'test', True) + total_batch_val = len(dataloader_val) + local_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + if local_rank == 0: + master_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH, + ) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + local_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise 
NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip, + apply_decay_param_fun=get_exclude_from_weight_decay_fn([ + 'absolute_pos_embed', 'relative_position_bias_table']), + ) + else: + local_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 6: Load pretrained model / load resumt model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED + '.pdparams') + model.set_dict(model_state) + local_logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + if local_rank == 0: + master_logger.info( + f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME + '.pdopt') + optimizer.set_state_dict(opt_state) + local_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + if local_rank == 0: + master_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + + # STEP 7: Validation (eval mode) + if config.EVAL: + local_logger.info('----- Start Validating') + if local_rank == 0: + master_logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 8: Start training and validation (train mode) + local_logger.info(f"Start training from epoch {last_epoch + 1}.") + if local_rank == 0: + 
master_logger.info(f"Start training from epoch {last_epoch + 1}.") + for epoch in range(last_epoch + 1, config.TRAIN.NUM_EPOCHS + 1): + # train + local_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + if local_rank == 0: + master_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + train_loss, train_acc, avg_loss, avg_acc, train_time = train( + dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=total_batch_train, + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + mixup_fn=mixup_fn, + amp=config.AMP, + local_logger=local_logger, + master_logger=master_logger) + + scheduler.step() + + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {avg_loss:.4f}, " + + f"Train Acc: {avg_acc:.4f}, " + + f"time: {train_time:.2f}") + + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + local_logger.info(f'----- Validation after Epoch: {epoch}') + if local_rank == 0: + master_logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if local_rank == 0: + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + master_logger.info(f"----- Save model: {model_path}.pdparams") + master_logger.info(f"----- Save optim: {model_path}.pdopt") + + +def main(): + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + + # get dataset and start DDP + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + else: + dataset_train = None + dataset_val = get_dataset(config, mode='val') + config.NGPUS = len(paddle.static.cuda_places()) if config.NGPUS == -1 else config.NGPUS + dist.spawn(main_worker, args=(config, dataset_train, dataset_val,), nprocs=config.NGPUS) + + +if __name__ == "__main__": + main() diff --git 
a/image_classification/CvT/main_single_gpu.py b/image_classification/CvT/main_single_gpu.py new file mode 100644 index 00000000..2a858a29 --- /dev/null +++ b/image_classification/CvT/main_single_gpu.py @@ -0,0 +1,427 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""CvT training/validation using single GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from datasets import get_dataloader +from datasets import get_dataset +from utils import AverageMeter +from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn +from config import get_config +from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from cvt import build_cvt as build_model + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('CvT') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + mixup_fn=None, + amp=False, + logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: int, total num of batches for 
one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + logger: logger for logging, default: None + Returns: + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + train_time: float, training time + """ + model.train() + train_loss_meter = AverageMeter() + train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id + 1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + # NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + # loss = loss / accum_iter + loss.backward() + + if ((batch_id + 1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = image.shape[0] + train_loss_meter.update(loss.numpy()[0], batch_size) + train_acc_meter.update(acc.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return train_loss_meter.avg, train_acc_meter.avg, train_time + + +def validate(dataloader, model, criterion, total_batch, debug_steps=100, logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + logger: logger for logging, default: None + Returns: + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + val_time: float, valitaion time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = image.shape[0] + val_loss_meter.update(loss.numpy()[0], batch_size) + val_acc1_meter.update(acc1.numpy()[0], batch_size) + 
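+            # top-5 accuracy (k=5 in paddle.metric.accuracy above) is tracked alongside top-1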
val_acc5_meter.update(acc5.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + + val_time = time.time() - time_st + return val_loss_meter.avg, val_acc1_meter.avg, val_acc5_meter.avg, val_time + + +def main(): + # STEP 0: Preparation + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + last_epoch = config.TRAIN.LAST_EPOCH + seed = config.SEED + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + logger = get_logger(filename=os.path.join(config.SAVE, 'log.txt')) + logger.info(f'\n{config}') + + # STEP 1: Create model + model = build_model(config) + + # STEP 2: Create train and val dataloader + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + dataloader_train = get_dataloader(config, dataset_train, 'train', False) + dataset_val = get_dataset(config, mode='val') + dataloader_val = get_dataloader(config, dataset_val, 'val', False) + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 6: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from official code) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + 
total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH, + ) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip, + apply_decay_param_fun=get_exclude_from_weight_decay_fn([ + 'absolute_pos_embed', 'relative_position_bias_table']), + ) + else: + logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 6: Load pretrained model or load resume model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED + '.pdparams') + model.set_dict(model_state) + logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME + '.pdopt') + optimizer.set_state_dict(opt_state) + logger.info( + f"----- Resume: Load model and optmizer from {config.MODEL.RESUME}") + + # STEP 7: Validation (eval mode) + if config.EVAL: + logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 8: Start training and validation (train mode) + logger.info(f"Start training from epoch {last_epoch + 1}.") + for epoch in range(last_epoch + 1, config.TRAIN.NUM_EPOCHS + 1): + # train + 
logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + train_loss, train_acc, train_time = train(dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=len(dataloader_train), + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + mixup_fn=mixup_fn, + amp=config.AMP, + logger=logger) + scheduler.step() + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + logger.info(f"----- Save model: {model_path}.pdparams") + logger.info(f"----- Save optim: {model_path}.pdopt") + + +if __name__ == "__main__": + main() diff --git a/image_classification/CvT/mixup.py b/image_classification/CvT/mixup.py new file mode 100644 index 00000000..1d2db493 --- /dev/null +++ b/image_classification/CvT/mixup.py @@ -0,0 +1,225 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""mixup and cutmix for batch data""" +import numpy as np +import paddle + + +def rand_bbox(image_shape, lam, count=None): + """ CutMix bbox by lam value + Generate 1 random bbox by value lam. lam is the cut size rate. + The cut_size is computed by sqrt(1-lam) * image_size. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + count: int, number of bbox to generate + """ + image_h, image_w = image_shape[-2:] + cut_rate = np.sqrt(1. 
- lam) + cut_h = int(cut_rate * image_h) + cut_w = int(cut_rate * image_w) + + # get random bbox center + cy = np.random.randint(0, image_h, size=count) + cx = np.random.randint(0, image_w, size=count) + + # get bbox coords + bbox_x1 = np.clip(cx - cut_w // 2, 0, image_w) + bbox_y1 = np.clip(cy - cut_h // 2, 0, image_h) + bbox_x2 = np.clip(cx + cut_w // 2, 0, image_w) + bbox_y2 = np.clip(cy + cut_h // 2, 0, image_h) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # while in pytorch, it will return [] tensor + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def rand_bbox_minmax(image_shape, minmax, count=None): + """ CutMix bbox by min and max value + Generate 1 random bbox by min and max percentage values. + Minmax is a tuple/list of min and max percentage vlaues + applied to the image width and height. + + Args: + image_shape: tuple/list, image height and width + minmax: tuple/list, min and max percentage values of image size + count: int, number of bbox to generate + """ + assert len(minmax) == 2 + image_h, image_w = image_shape[-2:] + min_ratio = minmax[0] + max_ratio = minmax[1] + cut_h = np.random.randint(int(image_h * min_ratio), int(image_h * max_ratio), size=count) + cut_w = np.random.randint(int(image_w * min_ratio), int(image_w * max_ratio), size=count) + + bbox_x1 = np.random.randint(0, image_w - cut_w, size=count) + bbox_y1 = np.random.randint(0, image_h - cut_h, size=count) + bbox_x2 = bbox_x1 + cut_w + bbox_y2 = bbox_y1 + cut_h + + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def cutmix_generate_bbox_adjust_lam(image_shape, lam, minmax=None, correct_lam=True, count=None): + """Generate bbox and apply correction for lambda + If the mimmax is None, apply the standard cutmix by lam value, + If the minmax is set, apply the cutmix by min and max percentage values. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + minmax: tuple/list, min and max percentage values of image size + correct_lam: bool, if True, correct the lam value by the generated bbox + count: int, number of bbox to generate + """ + if minmax is not None: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox_minmax(image_shape, minmax, count) + else: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox(image_shape, lam, count) + + if correct_lam or minmax is not None: + image_h, image_w = image_shape[-2:] + bbox_area = (bbox_y2 - bbox_y1) * (bbox_x2 - bbox_x1) + lam = 1. - bbox_area / float(image_h * image_w) + return (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam + + +def one_hot(x, num_classes, on_value=1., off_value=0.): + """ Generate one-hot vector for label smoothing + Args: + x: tensor, contains label/class indices + num_classes: int, num of classes (len of the one-hot vector) + on_value: float, the vector value at label index, default=1. + off_value: float, the vector value at non-label indices, default=0. + Returns: + one_hot: tensor, tensor with on value at label index and off value + at non-label indices. + """ + x = x.reshape_([-1, 1]) + x_smoothed = paddle.full((x.shape[0], num_classes), fill_value=off_value) + for i in range(x.shape[0]): + x_smoothed[i, x[i]] = on_value + return x_smoothed + + +def mixup_one_hot(label, num_classes, lam=1., smoothing=0.): + """ mixup and label smoothing in batch + label smoothing is firstly applied, then + mixup is applied by mixing the bacth and its flip, + with a mixup rate. 
+ + Args: + label: tensor, label tensor with shape [N], contains the class indices + num_classes: int, num of all classes + lam: float, mixup rate, default=1.0 + smoothing: float, label smoothing rate + """ + off_value = smoothing / num_classes + on_value = 1. - smoothing + off_value + y1 = one_hot(label, num_classes, on_value, off_value) + y2 = one_hot(label.flip(axis=[0]), num_classes, on_value, off_value) + return y2 * (1 - lam) + y1 * lam + + +class Mixup: + """Mixup class + Args: + mixup_alpha: float, mixup alpha for beta distribution, default=1.0, + cutmix_alpha: float, cutmix alpha for beta distribution, default=0.0, + cutmix_minmax: list/tuple, min and max value for cutmix ratio, default=None, + prob: float, if random prob < prob, do not use mixup, default=1.0, + switch_prob: float, prob of switching mixup and cutmix, default=0.5, + mode: string, mixup up, now only 'batch' is supported, default='batch', + correct_lam: bool, if True, apply correction of lam, default=True, + label_smoothing: float, label smoothing rate, default=0.1, + num_classes: int, num of classes, default=1000 + """ + def __init__(self, + mixup_alpha=1.0, + cutmix_alpha=0.0, + cutmix_minmax=None, + prob=1.0, + switch_prob=0.5, + mode='batch', + correct_lam=True, + label_smoothing=0.1, + num_classes=1000): + self.mixup_alpha = mixup_alpha + self.cutmix_alpha = cutmix_alpha + self.cutmix_minmax = cutmix_minmax + if cutmix_minmax is not None: + assert len(cutmix_minmax) == 2 + self.cutmix_alpha = 1.0 + self.mix_prob = prob + self.switch_prob = switch_prob + self.label_smoothing = label_smoothing + self.num_classes = num_classes + self.mode = mode + self.correct_lam = correct_lam + assert mode == 'batch', 'Now only batch mode is supported!' + + def __call__(self, x, target): + assert x.shape[0] % 2 == 0, "Batch size should be even" + lam = self._mix_batch(x) + target = mixup_one_hot(target, self.num_classes, lam, self.label_smoothing) + return x, target + + def get_params(self): + """Decide to use cutmix or regular mixup by sampling and + sample lambda for mixup + """ + lam = 1. + use_cutmix = False + use_mixup = np.random.rand() < self.mix_prob + if use_mixup: + if self.mixup_alpha > 0. and self.cutmix_alpha > 0.: + use_cutmix = np.random.rand() < self.switch_prob + alpha = self.cutmix_alpha if use_cutmix else self.mixup_alpha + lam_mix = np.random.beta(alpha, alpha) + elif self.mixup_alpha == 0. and self.cutmix_alpha > 0.: + use_cutmix=True + lam_mix = np.random.beta(self.cutmix_alpha, self.cutmix_alpha) + elif self.mixup_alpha > 0. 
and self.cutmix_alpha == 0.: + lam_mix = np.random.beta(self.mixup_alpha, self.mixup_alpha) + else: + raise ValueError('mixup_alpha and cutmix_alpha cannot be all 0') + lam = float(lam_mix) + return lam, use_cutmix + + def _mix_batch(self, x): + """mixup/cutmix by adding batch data and its flipped version""" + lam, use_cutmix = self.get_params() + if lam == 1.: + return lam + if use_cutmix: + (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam = cutmix_generate_bbox_adjust_lam( + x.shape, + lam, + minmax=self.cutmix_minmax, + correct_lam=self.correct_lam) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # but in pytorch, it will return [] tensor without errors + if int(bbox_x1) != int(bbox_x2) and int(bbox_y1) != int(bbox_y2): + x[:, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] = x.flip(axis=[0])[ + :, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] + else: + x_flipped = x.flip(axis=[0]) + x_flipped = x_flipped * (1 - lam) + x.set_value(x * (lam) + x_flipped) + return lam diff --git a/image_classification/CvT/random_erasing.py b/image_classification/CvT/random_erasing.py new file mode 100644 index 00000000..31eea465 --- /dev/null +++ b/image_classification/CvT/random_erasing.py @@ -0,0 +1,118 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Random Erasing for image tensor""" + +import random +import math +import paddle + + +def _get_pixels(per_pixel, rand_color, patch_size, dtype="float32"): + if per_pixel: + return paddle.normal(shape=patch_size).astype(dtype) + if rand_color: + return paddle.normal(shape=(patch_size[0], 1, 1)).astype(dtype) + return paddle.zeros((patch_size[0], 1, 1)).astype(dtype) + + +class RandomErasing(object): + """ + Args: + prob: probability of performing random erasing + min_area: Minimum percentage of erased area wrt input image area + max_area: Maximum percentage of erased area wrt input image area + min_aspect: Minimum aspect ratio of earsed area + max_aspect: Maximum aspect ratio of earsed area + mode: pixel color mode, in ['const', 'rand', 'pixel'] + 'const' - erase block is constant valued 0 for all channels + 'rand' - erase block is valued random color (same per-channel) + 'pixel' - erase block is vauled random color per pixel + min_count: Minimum # of ereasing blocks per image. + max_count: Maximum # of ereasing blocks per image. 
Area per box is scaled by count + per-image count is randomly chosen between min_count to max_count + """ + def __init__(self, prob=0.5, min_area=0.02, max_area=1/3, min_aspect=0.3, max_aspect=None, + mode='const', min_count=1, max_count=None, num_splits=0): + self.prob = prob + self.min_area = min_area + self.max_area = max_area + max_aspect = max_aspect or 1 / min_aspect + self.log_aspect_ratio = (math.log(min_aspect), math.log(max_aspect)) + self.min_count = min_count + self.max_count = max_count or min_count + self.num_splits = num_splits + mode = mode.lower() + self.rand_color = False + self.per_pixel = False + if mode == "rand": + self.rand_color = True + elif mode == "pixel": + self.per_pixel = True + else: + assert not mode or mode == "const" + + def _erase(self, img, chan, img_h, img_w, dtype): + if random.random() > self.prob: + return + area = img_h * img_w + count = self.min_count if self.min_count == self.max_count else \ + random.randint(self.min_count, self.max_count) + for _ in range(count): + for attempt in range(10): + target_area = random.uniform(self.min_area, self.max_area) * area / count + aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio)) + h = int(round(math.sqrt(target_area * aspect_ratio))) + w = int(round(math.sqrt(target_area / aspect_ratio))) + if w < img_w and h < img_h: + top = random.randint(0, img_h - h) + left = random.randint(0, img_w - w) + img[:, top:top+h, left:left+w] = _get_pixels( + self.per_pixel, self.rand_color, (chan, h, w), + dtype=dtype) + break + + def __call__(self, input): + if len(input.shape) == 3: + self._erase(input, *input.shape, input.dtype) + else: + batch_size, chan, img_h, img_w = input.shape + batch_start = batch_size // self.num_splits if self.num_splits > 1 else 0 + for i in range(batch_start, batch_size): + self._erase(input[i], chan, img_h, img_w, input.dtype) + return input + + + +#def main(): +# re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='rand') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='const') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='pixel') +# import PIL.Image as Image +# import numpy as np +# paddle.set_device('cpu') +# img = paddle.to_tensor(np.asarray(Image.open('./lenna.png'))).astype('float32') +# img = img / 255.0 +# img = paddle.transpose(img, [2, 0, 1]) +# new_img = re(img) +# new_img = new_img * 255.0 +# new_img = paddle.transpose(new_img, [1, 2, 0]) +# new_img = new_img.cpu().numpy() +# new_img = Image.fromarray(new_img.astype('uint8')) +# new_img.save('./res.png') +# +# +# +#if __name__ == "__main__": +# main() diff --git a/image_classification/CvT/run_eval.sh b/image_classification/CvT/run_eval.sh new file mode 100644 index 00000000..11475fb7 --- /dev/null +++ b/image_classification/CvT/run_eval.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg='./configs/cvt-13-224x224.yaml' \ + -dataset='imagenet2012' \ + -batch_size=8 \ + -data_path='/dataset/imagenet' \ + -eval \ + -pretrained='./cvt_13_new' diff --git a/image_classification/CvT/run_eval_multi.sh b/image_classification/CvT/run_eval_multi.sh new file mode 100644 index 00000000..a340968c --- /dev/null +++ b/image_classification/CvT/run_eval_multi.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ +python main_multi_gpu.py \ + -cfg='./configs/cvt-w24-384x384.yaml' \ + -dataset='imagenet2012' \ + -batch_size=16 \ + -data_path='/dataset/imagenet' \ + -eval \ + -pretrained='./CvT-w24-384x384-IN-22k' diff --git 
a/image_classification/CvT/run_train.sh b/image_classification/CvT/run_train.sh new file mode 100644 index 00000000..d48ac4ed --- /dev/null +++ b/image_classification/CvT/run_train.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg='./configs/cvt-13-224x224.yaml' \ + -dataset='imagenet2012' \ + -batch_size=256 \ + -data_path='/dataset/imagenet' \ + -eval \ + -pretrained='./CvT-13-224x224-IN-1k' \ No newline at end of file diff --git a/image_classification/CvT/run_train_multi.sh b/image_classification/CvT/run_train_multi.sh new file mode 100644 index 00000000..3f3aef29 --- /dev/null +++ b/image_classification/CvT/run_train_multi.sh @@ -0,0 +1,7 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg='./configs/cvt-13-224x224.yaml' \ + -dataset='imagenet2012' \ + -batch_size=16 \ + -data_path='/dataset/imagenet' \ + #-amp diff --git a/image_classification/CvT/utils.py b/image_classification/CvT/utils.py new file mode 100644 index 00000000..1893f9ee --- /dev/null +++ b/image_classification/CvT/utils.py @@ -0,0 +1,120 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""utils for ViT + +Contains AverageMeter for monitoring, get_exclude_from_decay_fn for training +and WarmupCosineScheduler for training + +""" + +import math +from paddle.optimizer.lr import LRScheduler + + +class AverageMeter(): + """ Meter for monitoring losses""" + def __init__(self): + self.avg = 0 + self.sum = 0 + self.cnt = 0 + self.reset() + + def reset(self): + """reset all values to zeros""" + self.avg = 0 + self.sum = 0 + self.cnt = 0 + + def update(self, val, n=1): + """update avg by val and n, where val is the avg of n values""" + self.sum += val * n + self.cnt += n + self.avg = self.sum / self.cnt + + + +def get_exclude_from_weight_decay_fn(exclude_list=[]): + """ Set params with no weight decay during the training + + For certain params, e.g., positional encoding in ViT, weight decay + may not needed during the learning, this method is used to find + these params. + + Args: + exclude_list: a list of params names which need to exclude + from weight decay. + Returns: + exclude_from_weight_decay_fn: a function returns True if param + will be excluded from weight decay + """ + if len(exclude_list) == 0: + exclude_from_weight_decay_fn = None + else: + def exclude_fn(param): + for name in exclude_list: + if param.endswith(name): + return False + return True + exclude_from_weight_decay_fn = exclude_fn + return exclude_from_weight_decay_fn + + +class WarmupCosineScheduler(LRScheduler): + """Warmup Cosine Scheduler + + First apply linear warmup, then apply cosine decay schedule. + Linearly increase learning rate from "warmup_start_lr" to "start_lr" over "warmup_epochs" + Cosinely decrease learning rate from "start_lr" to "end_lr" over remaining + "total_epochs - warmup_epochs" + + Attributes: + learning_rate: the starting learning rate (without warmup), not used here! 
+ warmup_start_lr: warmup starting learning rate + start_lr: the starting learning rate (without warmup) + end_lr: the ending learning rate after whole loop + warmup_epochs: # of epochs for warmup + total_epochs: # of total epochs (include warmup) + """ + def __init__(self, + learning_rate, + warmup_start_lr, + start_lr, + end_lr, + warmup_epochs, + total_epochs, + cycles=0.5, + last_epoch=-1, + verbose=False): + """init WarmupCosineScheduler """ + self.warmup_epochs = warmup_epochs + self.total_epochs = total_epochs + self.warmup_start_lr = warmup_start_lr + self.start_lr = start_lr + self.end_lr = end_lr + self.cycles = cycles + super(WarmupCosineScheduler, self).__init__(learning_rate, last_epoch, verbose) + + def get_lr(self): + """ return lr value """ + if self.last_epoch < self.warmup_epochs: + val = (self.start_lr - self.warmup_start_lr) * float( + self.last_epoch)/float(self.warmup_epochs) + self.warmup_start_lr + return val + + progress = float(self.last_epoch - self.warmup_epochs) / float( + max(1, self.total_epochs - self.warmup_epochs)) + val = max(0.0, 0.5 * (1. + math.cos(math.pi * float(self.cycles) * 2.0 * progress))) + val = max(0.0, val * (self.start_lr - self.end_lr) + self.end_lr) + return val \ No newline at end of file diff --git a/image_classification/CycleMLP/README.md b/image_classification/CycleMLP/README.md new file mode 100644 index 00000000..460fa738 --- /dev/null +++ b/image_classification/CycleMLP/README.md @@ -0,0 +1,177 @@ +# CycleMLP: A MLP-like Architecture for Dense Prediction, [arXiv](https://arxiv.org/abs/2107.10224) + +PaddlePaddle training/validation code and pretrained models for **CycleMLP**. + +The official and 3rd party pytorch implementation are [here](https://github.com/ShoufaChen/CycleMLP). + + +This implementation is developed by [PPViT](https://github.com/xperzy/PPViT/tree/master). + +

+<p align="center">
+<img src="./cyclemlp.png" alt="drawing"/>
+<h4 align="center">CycleMLP Model Overview</h4>
+</p>

+ + + +### Update +Update (2021-09-24): Code is released and ported weights are uploaded. + +## Models Zoo +| Model | Acc@1 | Acc@5 | #Params | Image Size | Crop_pct | Interpolation | Link | +| ----------- | ----- | ----- | ------- | ---------- | -------- | ------------- | ------------------------------------------------------------ | +| cyclemlp_b1 | 78.85 | 94.60 | 15.1M | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/10WQenRy9lfOJF4xEHc9Mekp4zHRh0mJ_/view?usp=sharing)/[baidu](https://pan.baidu.com/s/11UQp1RkWBsZFOqit_uU80w)(mnbr) | +| cyclemlp_b2 | 81.58 | 95.81 | 26.8M | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1dtQHCwtxNh9jgiHivN5iYpHe7uKRUjhk/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1Js-Oq5vyiB7oPagn43cn3Q)(jwj9) | +| cyclemlp_b3 | 82.42 | 96.07 | 38.3M | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/11kMq112tAwVE5llJIepIIixz74AjaJhU/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1b7cau1yPxqATA8X7t2DXkw)(v2fy) | +| cyclemlp_b4 | 82.96 | 96.33 | 51.8M | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1vwJ0eD9Ic-NvLvCz1zEAmn7RxBMtd_v2/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1P3TlnXRFGWj9nVP5xBGGWQ)(fnqd) | +| cyclemlp_b5 | 83.25 | 96.44 | 75.7M | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/12_I4cfOBfp7kC0RvmnMXFqrSxww6plRW/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1-Cka1tNqGUQutkAP3VZXzQ)(s55c) | + + + + + + + +> *The results are evaluated on ImageNet2012 validation set. +> +> Note: CycleMLP weights are ported from [here](https://github.com/ShoufaChen/CycleMLP) + + + +## Notebooks +We provide a few notebooks in aistudio to help you get started: + +**\*(coming soon)\*** + + +## Requirements +- Python>=3.6 +- yaml>=0.2.5 +- [PaddlePaddle](https://www.paddlepaddle.org.cn/documentation/docs/en/install/index_en.html)>=2.1.0 +- [yacs](https://github.com/rbgirshick/yacs)>=0.1.8 + +## Data +ImageNet2012 dataset is used in the following folder structure: +``` +│imagenet/ +├──train/ +│ ├── n01440764 +│ │ ├── n01440764_10026.JPEG +│ │ ├── n01440764_10027.JPEG +│ │ ├── ...... +│ ├── ...... +├──val/ +│ ├── n01440764 +│ │ ├── ILSVRC2012_val_00000293.JPEG +│ │ ├── ILSVRC2012_val_00002138.JPEG +│ │ ├── ...... +│ ├── ...... +``` + +## Usage +To use the model with pretrained weights, download the `.pdparam` weight file and change related file paths in the following python scripts. The model config files are located in `./configs/`. + +For example, assume the downloaded weight file is stored in `./cyclemlp_b1.pdparams`, to use the `cyclemlp_b1` model in python: +```python +from config import get_config +from cyclemlp import build_cyclemlp as build_model +# config files in ./configs/ +config = get_config('./configs/cyclemlp_b1.yaml') +# build model +model = build_model(config) +# load pretrained weights +model_state_dict = paddle.load('./cyclemlp_b1.pdparams') +model.set_dict(model_state_dict) +``` + +## Evaluation +To evaluate CycleMLP model performance on ImageNet2012 with a single GPU, run the following script using command line: +```shell +sh run_eval.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg=./configs/cyclemlp_b1.yaml \ + -dataset=imagenet2012 \ + -batch_size=16 \ + -data_path=/path/to/dataset/imagenet/val \ + -eval \ + -pretrained=/path/to/pretrained/model/cyclemlp_b1 # .pdparams is NOT needed +``` + +
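+Before launching the evaluation scripts, it can help to sanity-check that the config and the downloaded weights load correctly with a quick forward pass on a random input. This is only an illustrative sketch (the random tensor and the printout are not part of the repo's scripts):
+```python
+import paddle
+from config import get_config
+from cyclemlp import build_cyclemlp as build_model
+
+config = get_config('./configs/cyclemlp_b1.yaml')
+model = build_model(config)
+model.set_dict(paddle.load('./cyclemlp_b1.pdparams'))
+model.eval()
+
+# dummy batch: 1 image, 3 channels, 224 x 224
+x = paddle.randn([1, 3, config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE])
+with paddle.no_grad():
+    logits = model(x)
+print(logits.shape)  # expected: [1, 1000]
+```
+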
+ + +Run evaluation using multi-GPUs: + + + +```shell +sh run_eval_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg=./configs/cyclemlp_b1.yaml \ + -dataset=imagenet2012 \ + -batch_size=16 \ + -data_path=/path/to/dataset/imagenet/val \ + -eval \ + -pretrained=/path/to/pretrained/model/cyclemlp_b1 # .pdparams is NOT needed +``` + +
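+Note that for `imagenet2012` the dataset class in `datasets.py` reads a `train_list.txt` and a `val_list.txt` from `-data_path`; each line holds a relative image path and an integer label separated by whitespace. The following is a minimal sketch of building the validation dataloader programmatically (the dataset path is a placeholder):
+```python
+from config import get_config
+from datasets import get_dataset, get_dataloader
+
+config = get_config('./configs/cyclemlp_b1.yaml')
+config.defrost()
+config.DATA.DATA_PATH = '/path/to/dataset/imagenet'  # folder must contain val_list.txt
+config.freeze()
+
+dataset_val = get_dataset(config, mode='val')
+dataloader_val = get_dataloader(config, dataset_val, mode='val', multi_process=False)
+for images, labels in dataloader_val:
+    print(images.shape, labels.shape)  # e.g. [8, 3, 224, 224] and [8]
+    break
+```
+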
+ +## Training +To train the CycleMLP model on ImageNet2012 with single GPUs, run the following script using command line: +```shell +sh run_train.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg=./configs/cyclemlp_b1.yaml \ + -dataset=imagenet2012 \ + -batch_size=32 \ + -data_path=/path/to/dataset/imagenet/train \ +``` + +
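+The optimizer and learning-rate schedule follow the defaults in `config.py` (AdamW with a warmup-cosine schedule). The sketch below shows how they are typically wired together; it assumes `utils.py` in this folder provides the same `WarmupCosineScheduler` used across PaddleViT models and is not a verbatim copy of the training script:
+```python
+import paddle
+from config import get_config
+from cyclemlp import build_cyclemlp as build_model
+from utils import WarmupCosineScheduler
+
+config = get_config('./configs/cyclemlp_b1.yaml')
+model = build_model(config)
+
+scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR,
+                                  warmup_start_lr=config.TRAIN.WARMUP_START_LR,
+                                  start_lr=config.TRAIN.BASE_LR,
+                                  end_lr=config.TRAIN.END_LR,
+                                  warmup_epochs=config.TRAIN.WARMUP_EPOCHS,
+                                  total_epochs=config.TRAIN.NUM_EPOCHS)
+optimizer = paddle.optimizer.AdamW(parameters=model.parameters(),
+                                   learning_rate=scheduler,
+                                   weight_decay=config.TRAIN.WEIGHT_DECAY,
+                                   beta1=config.TRAIN.OPTIMIZER.BETAS[0],
+                                   beta2=config.TRAIN.OPTIMIZER.BETAS[1],
+                                   epsilon=config.TRAIN.OPTIMIZER.EPS)
+# in the training loop: optimizer.step() per iteration, scheduler.step() per epoch
+```
+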
+ + +Run training using multi-GPUs: + + + +```shell +sh run_train_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg=./configs/cyclemlp_b1.yaml \ + -dataset=imagenet2012 \ + -batch_size=16 \ + -data_path=/path/to/dataset/imagenet/train \ +``` + +
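+During training the loss depends on the augmentation settings: mixup/cutmix produce soft (probability) targets, which are handled by `SoftTargetCrossEntropyLoss` from `losses.py`, while label smoothing (`TRAIN.SMOOTHING`) uses `LabelSmoothingCrossEntropyLoss`. The sketch below illustrates this selection; it is a simplified assumption about the logic in the training scripts, not a verbatim excerpt:
+```python
+import paddle.nn as nn
+from config import get_config
+from losses import LabelSmoothingCrossEntropyLoss, SoftTargetCrossEntropyLoss
+
+config = get_config('./configs/cyclemlp_b1.yaml')
+
+if config.TRAIN.MIXUP_PROB > 0.:
+    # mixup/cutmix turn hard labels into soft targets
+    criterion = SoftTargetCrossEntropyLoss()
+elif config.TRAIN.SMOOTHING > 0.:
+    criterion = LabelSmoothingCrossEntropyLoss(smoothing=config.TRAIN.SMOOTHING)
+else:
+    criterion = nn.CrossEntropyLoss()
+criterion_val = nn.CrossEntropyLoss()  # validation always uses hard labels
+```
+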
+ + +## Visualization Attention Map +**(coming soon)** + +## Reference +``` +@article{chen2021cyclemlp, + title={CycleMLP: A MLP-like Architecture for Dense Prediction}, + author={Chen, Shoufa and Xie, Enze and Ge, Chongjian and Liang, Ding and Luo, Ping}, + journal={arXiv preprint arXiv:2107.10224}, + year={2021} +} +``` diff --git a/image_classification/CycleMLP/__init__.py b/image_classification/CycleMLP/__init__.py new file mode 100644 index 00000000..e2cbd538 --- /dev/null +++ b/image_classification/CycleMLP/__init__.py @@ -0,0 +1 @@ +#init diff --git a/image_classification/CycleMLP/augment.py b/image_classification/CycleMLP/augment.py new file mode 100644 index 00000000..7a7f081c --- /dev/null +++ b/image_classification/CycleMLP/augment.py @@ -0,0 +1,285 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Augmentation""" +""" Rand Augmentation """ +# reference: RandAugment: Practical automated data augmentation with a reduced search space +# https://arxiv.org/abs/1909.13719 + +""" Auto Augmentation """ +# reference: AutoAugment: Learning Augmentation Policies from Data +# https://arxiv.org/abs/1805.09501 + +import random +import numpy as np +from PIL import Image, ImageEnhance, ImageOps + + +def auto_augment_policy_original(): + """25 types of augment policies in original paper""" + policy = [ + [('Posterize', 0.4, 8), ('Rotate', 0.6, 9)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + [('Posterize', 0.6, 7), ('Posterize', 0.6, 6)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Equalize', 0.4, 4), ('Rotate', 0.8, 8)], + [('Solarize', 0.6, 3), ('Equalize', 0.6, 7)], + [('Posterize', 0.8, 5), ('Equalize', 1.0, 2)], + [('Rotate', 0.2, 3), ('Solarize', 0.6, 8)], + [('Equalize', 0.6, 8), ('Posterize', 0.4, 6)], + [('Rotate', 0.8, 8), ('Color', 0.4, 0)], + [('Rotate', 0.4, 9), ('Equalize', 0.6, 2)], + [('Equalize', 0.0, 7), ('Equalize', 0.8, 8)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Rotate', 0.8, 8), ('Color', 1.0, 2)], + [('Color', 0.8, 8), ('Solarize', 0.8, 7)], + [('Sharpness', 0.4, 7), ('Invert', 0.6, 8)], + [('ShearX', 0.6, 5), ('Equalize', 1.0, 9)], + [('Color', 0.4, 0), ('Equalize', 0.6, 3)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + ] + policy = [[SubPolicy(*args) for args in subpolicy] for subpolicy in policy] + return policy + + +def rand_augment_policy_original(magnitude_idx=9): + """ + 14 types of augment policies in original paper + Args: + magnitude_idx: M + """ + policy = [ + ('Posterize', 1, magnitude_idx), ('Rotate', 1, magnitude_idx), + ('Solarize', 1, magnitude_idx), ('AutoContrast', 1, magnitude_idx), + ('Equalize', 1, magnitude_idx), ('Contrast', 1, magnitude_idx), + ('Color', 1, magnitude_idx), 
('Invert', 1, magnitude_idx), + ('Sharpness', 1, magnitude_idx), ('Brightness', 1, magnitude_idx), + ('ShearX', 1, magnitude_idx), ('ShearY', 1, magnitude_idx), + ('TranslateX', 1, magnitude_idx), ('TranslateY', 1, magnitude_idx), + ] + policy = [SubPolicy(*args) for args in policy] + return policy + + +class AutoAugment(): + """Auto Augment + Randomly choose a tuple of augment ops from a list of policy + Then apply the tuple of augment ops to input image + + Examples: + policy = auto_augment_policy_original() + augment = AutoAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy): + self.policy = policy + + def __call__(self, image, policy_idx=None): + if policy_idx is None: + policy_idx = random.randint(0, len(self.policy) - 1) + + sub_policy = self.policy[policy_idx] + for op in sub_policy: + image = op(image) + return image + + +class RandAugment(): + """Rand Augment + Randomly choose N augment ops from a list of K policies + Then apply the N ops to input image + + Examples: + policy = rand_augment_policy_original(magnitude_idx) + augment = RandAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy, num_layers=2): + """ + Args: + policy: list of SubPolicy + num_layers: int + """ + self.policy = policy + self.num_layers = num_layers + + def __call__(self, image): + selected_idx = np.random.choice(len(self.policy), self.num_layers) + + for policy_idx in selected_idx: + sub_policy = self.policy[policy_idx] + image = sub_policy(image) + return image + + +class SubPolicy: + """Subpolicy + Read augment name and magnitude, apply augment with probability + Args: + op_name: str, augment operation name + prob: float, if prob > random prob, apply augment + magnitude_idx: int, index of magnitude in preset magnitude ranges + """ + + def __init__(self, op_name, prob, magnitude_idx): + # ranges of operations' magnitude + ranges = { + 'ShearX': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'ShearY': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'TranslateX': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'TranslateY': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'Rotate': np.linspace(0, 30, 10), # [-30, 30] (by random negative) + 'Color': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Posterize': np.round(np.linspace(8, 4, 10), 0).astype(np.int), # [0, 4] + 'Solarize': np.linspace(256, 0, 10), # [0, 256] + 'Contrast': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Sharpness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Brightness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'AutoContrast': [0] * 10, # no range + 'Equalize': [0] * 10, # no range + 'Invert': [0] * 10, # no range + } + + # augmentation operations + # Lambda is not pickleable for DDP + # image_ops = { + # 'ShearX': lambda image, magnitude: shear_x(image, magnitude), + # 'ShearY': lambda image, magnitude: shear_y(image, magnitude), + # 'TranslateX': lambda image, magnitude: translate_x(image, magnitude), + # 'TranslateY': lambda image, magnitude: translate_y(image, magnitude), + # 'Rotate': lambda image, magnitude: rotate(image, magnitude), + # 'AutoContrast': lambda image, magnitude: auto_contrast(image, magnitude), + # 'Invert': lambda image, magnitude: invert(image, magnitude), + # 'Equalize': lambda image, magnitude: equalize(image, magnitude), + # 'Solarize': lambda image, magnitude: solarize(image, magnitude), + # 
'Posterize': lambda image, magnitude: posterize(image, magnitude), + # 'Contrast': lambda image, magnitude: contrast(image, magnitude), + # 'Color': lambda image, magnitude: color(image, magnitude), + # 'Brightness': lambda image, magnitude: brightness(image, magnitude), + # 'Sharpness': lambda image, magnitude: sharpness(image, magnitude), + # } + image_ops = { + 'ShearX': shear_x, + 'ShearY': shear_y, + 'TranslateX': translate_x_relative, + 'TranslateY': translate_y_relative, + 'Rotate': rotate, + 'AutoContrast': auto_contrast, + 'Invert': invert, + 'Equalize': equalize, + 'Solarize': solarize, + 'Posterize': posterize, + 'Contrast': contrast, + 'Color': color, + 'Brightness': brightness, + 'Sharpness': sharpness, + } + + self.prob = prob + self.magnitude = ranges[op_name][magnitude_idx] + self.op = image_ops[op_name] + + def __call__(self, image): + if self.prob > random.random(): + image = self.op(image, self.magnitude) + return image + + +# PIL Image transforms +# https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.transform +def shear_x(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, factor, 0, 0, 1, 0), fillcolor=fillcolor) + + +def shear_y(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, factor, 1, 0), fillcolor=fillcolor) + + +def translate_x_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, pixels, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, pixels), fillcolor=fillcolor) + + +def translate_x_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, magnitude, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, magnitude), fillcolor=fillcolor) + + +def rotate(image, magnitude): + rot = image.convert("RGBA").rotate(magnitude) + return Image.composite(rot, + Image.new('RGBA', rot.size, (128,) * 4), + rot).convert(image.mode) + + +def auto_contrast(image, magnitude=None): + return ImageOps.autocontrast(image) + + +def invert(image, magnitude=None): + return ImageOps.invert(image) + + +def equalize(image, magnitude=None): + return ImageOps.equalize(image) + + +def solarize(image, magnitude): + return ImageOps.solarize(image, magnitude) + + +def posterize(image, magnitude): + return ImageOps.posterize(image, magnitude) + + +def contrast(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Contrast(image).enhance(1 + magnitude) + + +def color(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Color(image).enhance(1 + magnitude) + + +def brightness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative 
+ return ImageEnhance.Brightness(image).enhance(1 + magnitude) + + +def sharpness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Sharpness(image).enhance(1 + magnitude) + diff --git a/image_classification/CycleMLP/config.py b/image_classification/CycleMLP/config.py new file mode 100644 index 00000000..a0b94cb2 --- /dev/null +++ b/image_classification/CycleMLP/config.py @@ -0,0 +1,178 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Configuration + +Configuration for data, model archtecture, and training, etc. +Config can be set by .yaml file or by argparser(limited usage) + + +""" + +import os +from yacs.config import CfgNode as CN +import yaml + +_C = CN() +_C.BASE = [''] + +# data settings +_C.DATA = CN() +_C.DATA.BATCH_SIZE = 8 #1024 batch_size for single GPU +_C.DATA.BATCH_SIZE_EVAL = 8 #1024 batch_size for single GPU +_C.DATA.DATA_PATH = '/dataset/imagenet/' # path to dataset +_C.DATA.DATASET = 'imagenet2012' # dataset name +_C.DATA.IMAGE_SIZE = 224 # input image size +_C.DATA.CROP_PCT = 0.9 # input image scale ratio, scale is applied before centercrop in eval mode +_C.DATA.NUM_WORKERS = 4 # number of data loading threads +_C.DATA.IMAGENET_MEAN = [0.485, 0.456, 0.406] # [0.5, 0.5, 0.5] +_C.DATA.IMAGENET_STD = [0.229, 0.224, 0.225] # [0.5, 0.5, 0.5] + +# model settings +_C.MODEL = CN() +_C.MODEL.TYPE = 'CycleMLP' +_C.MODEL.NAME = 'CycleMLP' +_C.MODEL.RESUME = None +_C.MODEL.PRETRAINED = None +_C.MODEL.NUM_CLASSES = 1000 +_C.MODEL.DROPOUT = 0.0 +_C.MODEL.ATTENTION_DROPOUT = 0.0 +_C.MODEL.DROP_PATH = 0.1 + +# transformer settings +_C.MODEL.MIXER = CN() +_C.MODEL.MIXER.TRANSITIONS = [True, True, True, True] +_C.MODEL.MIXER.LAYERS = [2, 2, 4, 2] +_C.MODEL.MIXER.MLP_RATIOS = [4, 4, 4, 4] +_C.MODEL.MIXER.EMBED_DIMS = [64, 128, 320, 512] + +# training settings +_C.TRAIN = CN() +_C.TRAIN.LAST_EPOCH = 0 +_C.TRAIN.NUM_EPOCHS = 300 +_C.TRAIN.WARMUP_EPOCHS = 20 +_C.TRAIN.WEIGHT_DECAY = 0.05 +_C.TRAIN.BASE_LR = 0.001 +_C.TRAIN.WARMUP_START_LR = 5e-7 +_C.TRAIN.END_LR = 5e-6 +_C.TRAIN.GRAD_CLIP = 5.0 +_C.TRAIN.ACCUM_ITER = 1 +_C.TRAIN.LINEAR_SCALED_LR = None + +_C.TRAIN.LR_SCHEDULER = CN() +_C.TRAIN.LR_SCHEDULER.NAME = 'warmupcosine' +_C.TRAIN.LR_SCHEDULER.MILESTONES = "30, 60, 90" # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_EPOCHS = 30 # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_RATE = 0.1 # only used in StepLRScheduler + +_C.TRAIN.OPTIMIZER = CN() +_C.TRAIN.OPTIMIZER.NAME = 'AdamW' +_C.TRAIN.OPTIMIZER.EPS = 1e-8 +_C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999) # for adamW +_C.TRAIN.OPTIMIZER.MOMENTUM = 0.9 + +# train augmentation +_C.TRAIN.MIXUP_ALPHA = 0.8 +_C.TRAIN.CUTMIX_ALPHA = 1.0 +_C.TRAIN.CUTMIX_MINMAX = None +_C.TRAIN.MIXUP_PROB = 1.0 +_C.TRAIN.MIXUP_SWITCH_PROB = 0.5 +_C.TRAIN.MIXUP_MODE = 'batch' + +_C.TRAIN.SMOOTHING = 0.1 +_C.TRAIN.COLOR_JITTER = 0.4 +_C.TRAIN.AUTO_AUGMENT = False #'rand-m9-mstd0.5-inc1' +_C.TRAIN.RAND_AUGMENT = False + 
+_C.TRAIN.RANDOM_ERASE_PROB = 0.25 +_C.TRAIN.RANDOM_ERASE_MODE = 'pixel' +_C.TRAIN.RANDOM_ERASE_COUNT = 1 +_C.TRAIN.RANDOM_ERASE_SPLIT = False + +# misc +_C.SAVE = "./output" +_C.TAG = "default" +_C.SAVE_FREQ = 20 # freq to save chpt +_C.REPORT_FREQ = 50 # freq to logging info +_C.VALIDATE_FREQ = 20 # freq to do validation +_C.SEED = 0 +_C.EVAL = False # run evaluation only +_C.AMP = False # mix precision training +_C.LOCAL_RANK = 0 +_C.NGPUS = -1 + + +def _update_config_from_file(config, cfg_file): + config.defrost() + with open(cfg_file, 'r') as infile: + yaml_cfg = yaml.load(infile, Loader=yaml.FullLoader) + for cfg in yaml_cfg.setdefault('BASE', ['']): + if cfg: + _update_config_from_file( + config, os.path.join(os.path.dirname(cfg_file), cfg) + ) + print('merging config from {}'.format(cfg_file)) + config.merge_from_file(cfg_file) + config.freeze() + + +def update_config(config, args): + """Update config by ArgumentParser + Args: + args: ArgumentParser contains options + Return: + config: updated config + """ + if args.cfg: + _update_config_from_file(config, args.cfg) + config.defrost() + if args.dataset: + config.DATA.DATASET = args.dataset + if args.batch_size: + config.DATA.BATCH_SIZE = args.batch_size + if args.image_size: + config.DATA.IMAGE_SIZE = args.image_size + if args.num_classes: + config.MODEL.NUM_CLASSES = args.num_classes + if args.data_path: + config.DATA.DATA_PATH = args.data_path + if args.output is not None: + config.SAVE = args.output + if args.ngpus: + config.NGPUS = args.ngpus + if args.eval: + config.EVAL = True + config.DATA.BATCH_SIZE_EVAL = args.batch_size + if args.pretrained: + config.MODEL.PRETRAINED = args.pretrained + if args.resume: + config.MODEL.RESUME = args.resume + if args.last_epoch: + config.TRAIN.LAST_EPOCH = args.last_epoch + if args.amp: # only during training + if config.EVAL is True: + config.AMP = False + else: + config.AMP = True + + #config.freeze() + return config + + +def get_config(cfg_file=None): + """Return a clone of config or load from yaml file""" + config = _C.clone() + if cfg_file: + _update_config_from_file(config, cfg_file) + return config diff --git a/image_classification/CycleMLP/configs/cyclemlp_b1.yaml b/image_classification/CycleMLP/configs/cyclemlp_b1.yaml new file mode 100644 index 00000000..d06d646b --- /dev/null +++ b/image_classification/CycleMLP/configs/cyclemlp_b1.yaml @@ -0,0 +1,11 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.9 +MODEL: + TYPE: CycleMLP + NAME: cyclemlp_b1 + MIXER: + TRANSITIONS: [True, True, True, True] + LAYERS: [2, 2, 4, 2] + MLP_RATIOS: [4, 4, 4, 4] + EMBED_DIMS: [64, 128, 320, 512] diff --git a/image_classification/CycleMLP/configs/cyclemlp_b2.yaml b/image_classification/CycleMLP/configs/cyclemlp_b2.yaml new file mode 100644 index 00000000..3ac3859f --- /dev/null +++ b/image_classification/CycleMLP/configs/cyclemlp_b2.yaml @@ -0,0 +1,11 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.9 +MODEL: + TYPE: CycleMLP + NAME: cyclemlp_b2 + MIXER: + TRANSITIONS: [True, True, True, True] + LAYERS: [2, 3, 10, 3] + MLP_RATIOS: [4, 4, 4, 4] + EMBED_DIMS: [64, 128, 320, 512] diff --git a/image_classification/CycleMLP/configs/cyclemlp_b3.yaml b/image_classification/CycleMLP/configs/cyclemlp_b3.yaml new file mode 100644 index 00000000..76a0c15e --- /dev/null +++ b/image_classification/CycleMLP/configs/cyclemlp_b3.yaml @@ -0,0 +1,11 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.9 +MODEL: + TYPE: CycleMLP + NAME: cyclemlp_b3 + MIXER: + TRANSITIONS: [True, True, True, True] + LAYERS: [3, 4, 18, 3] + MLP_RATIOS: [8, 8, 4, 4] 
+ EMBED_DIMS: [64, 128, 320, 512] diff --git a/image_classification/CycleMLP/configs/cyclemlp_b4.yaml b/image_classification/CycleMLP/configs/cyclemlp_b4.yaml new file mode 100644 index 00000000..327b5982 --- /dev/null +++ b/image_classification/CycleMLP/configs/cyclemlp_b4.yaml @@ -0,0 +1,11 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: CycleMLP + NAME: cyclemlp_b4 + MIXER: + TRANSITIONS: [True, True, True, True] + LAYERS: [3, 8, 27, 3] + MLP_RATIOS: [8, 8, 4, 4] + EMBED_DIMS: [64, 128, 320, 512] diff --git a/image_classification/CycleMLP/configs/cyclemlp_b5.yaml b/image_classification/CycleMLP/configs/cyclemlp_b5.yaml new file mode 100644 index 00000000..d497be49 --- /dev/null +++ b/image_classification/CycleMLP/configs/cyclemlp_b5.yaml @@ -0,0 +1,11 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: CycleMLP + NAME: cyclemlp_b5 + MIXER: + TRANSITIONS: [True, True, True, True] + LAYERS: [3, 4, 24, 3] + MLP_RATIOS: [4, 4, 4, 4] + EMBED_DIMS: [96, 192, 384, 768] diff --git a/image_classification/CycleMLP/cyclemlp.png b/image_classification/CycleMLP/cyclemlp.png new file mode 100644 index 00000000..9aa2054f Binary files /dev/null and b/image_classification/CycleMLP/cyclemlp.png differ diff --git a/image_classification/CycleMLP/cyclemlp.py b/image_classification/CycleMLP/cyclemlp.py new file mode 100644 index 00000000..ff857904 --- /dev/null +++ b/image_classification/CycleMLP/cyclemlp.py @@ -0,0 +1,462 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Implement MLP Class for CycleMLP +""" + +import os +import math +import paddle +import paddle.nn as nn +from paddle import Tensor +from paddle.vision.ops import deform_conv2d +import paddle.nn.functional as F +from droppath import DropPath + + +zeros_ = nn.initializer.Constant(value=0.0) +ones_ = nn.initializer.Constant(value=1.0) +trunc_normal_ = nn.initializer.TruncatedNormal(std=0.02) +kaiming_uniform_ = nn.initializer.KaimingUniform() + + +class Identity(nn.Layer): + """Identity layer + This is does nothing but passing the input as output + """ + def __init__(self): + super().__init__() + + def forward(self, inputs): + return inputs + + +class Mlp(nn.Layer): + """ MLP module + Impl using nn.Linear and activation is GELU, dropout is applied. 
+ Ops: fc -> act -> dropout -> fc -> dropout + + Attributes: + fc1: nn.Linear + fc2: nn.Linear + act: GELU + dropout1: dropout after fc1 + dropout2: dropout after fc2 + """ + def __init__(self, + in_features, + hidden_features=None, + out_features=None, + act_layer=nn.GELU, + drop=0.0): + super().__init__() + out_features = out_features or in_features + hidden_features = hidden_features or in_features + self.fc1 = nn.Linear(in_features, hidden_features) + self.act = act_layer() + self.fc2 = nn.Linear(hidden_features, out_features) + self.drop = nn.Dropout(drop) + + def forward(self, x): + x = self.fc1(x) + x = self.act(x) + x = self.drop(x) + x = self.fc2(x) + x = self.drop(x) + return x + + +class CycleFC(nn.Layer): + + def __init__(self, + in_channels: int, + out_channels: int, + kernel_size, # re-defined kernel_size, represent the spatial area of staircase FC + stride: int = 1, + padding: int = 0, + dilation: int = 1, + groups: int = 1, + bias: bool = True): + super(CycleFC, self).__init__() + + if in_channels % groups != 0: + raise ValueError("in_channels must be divisible by groups") + if out_channels % groups != 0: + raise ValueError("out_channels must be divisible by groups") + if stride != 1: + raise ValueError("stride must be 1") + if padding != 0: + raise ValueError("padding must be 0") + + self.in_channels = in_channels + self.out_channels = out_channels + self.kernel_size = kernel_size + self.stride = (stride, stride) + self.padding = (padding, padding) + self.dilation = (dilation, dilation) + self.groups = groups + + self.weight = self.create_parameter( + shape=[out_channels, in_channels // groups, 1, 1], + default_initializer=kaiming_uniform_, + ) # kernel size == 1 + + if bias: + bound = 1 / math.sqrt(self.weight.shape[1]) + self.bias = self.create_parameter( + shape=[out_channels], + default_initializer=nn.initializer.Uniform(-bound, bound), + ) + else: + self.bias = None + self.register_buffer("offset", self.gen_offset()) + + def gen_offset(self): + """ + offset (Tensor[batch_size, 2 * offset_groups * kernel_height * kernel_width, + out_height, out_width]): offsets to be applied for each position in the + convolution kernel. 
+ """ + offset = paddle.empty([1, self.in_channels * 2, 1, 1]) + start_idx = (self.kernel_size[0] * self.kernel_size[1]) // 2 + assert self.kernel_size[0] == 1 or self.kernel_size[1] == 1, self.kernel_size + for i in range(self.in_channels): + if self.kernel_size[0] == 1: + offset[0, 2 * i + 0, 0, 0] = 0 + offset[0, 2 * i + 1, 0, 0] = (i + start_idx) % self.kernel_size[1] - ( + self.kernel_size[1] // 2 + ) + else: + offset[0, 2 * i + 0, 0, 0] = (i + start_idx) % self.kernel_size[0] - ( + self.kernel_size[0] // 2 + ) + offset[0, 2 * i + 1, 0, 0] = 0 + return offset + + def forward(self, inputs: Tensor) -> Tensor: + """ + Args: + input (Tensor[batch_size, in_channels, in_height, in_width]): input tensor + """ + B, C, H, W = inputs.shape + deformable_groups = self.offset.shape[1] // ( + 2 * self.weight.shape[2] * self.weight.shape[3]) + + return deform_conv2d(inputs, + self.offset.expand([B, -1, H, W]), + self.weight, + self.bias, + stride=self.stride, + padding=self.padding, + dilation=self.dilation, + deformable_groups=deformable_groups) + + def extra_repr(self) -> str: + s = self.__class__.__name__ + "(" + s += "{in_channels}" + s += ", {out_channels}" + s += ", kernel_size={kernel_size}" + s += ", stride={stride}" + s += ", padding={padding}" if self.padding != (0, 0) else "" + s += ", dilation={dilation}" if self.dilation != (1, 1) else "" + s += ", groups={groups}" if self.groups != 1 else "" + s += ", bias=False" if self.bias is None else "" + s += ")" + return s.format(**self.__dict__) + + +class CycleMLP(nn.Layer): + def __init__(self, + dim, + qkv_bias=False, + qk_scale=None, + attn_drop=0.0, + proj_drop=0.0): + super().__init__() + self.mlp_c = nn.Linear(dim, dim, bias_attr=qkv_bias) + + self.sfc_h = CycleFC(dim, dim, (1, 3), 1, 0) + self.sfc_w = CycleFC(dim, dim, (3, 1), 1, 0) + + self.reweight = Mlp(dim, dim // 4, dim * 3) + + self.proj = nn.Linear(dim, dim) + self.proj_drop = nn.Dropout(proj_drop) + + def forward(self, x): + B, H, W, C = x.shape + h = self.sfc_h(x.transpose([0, 3, 1, 2])).transpose([0, 2, 3, 1]) + w = self.sfc_w(x.transpose([0, 3, 1, 2])).transpose([0, 2, 3, 1]) + c = self.mlp_c(x) + + a = (h + w + c).transpose([0, 3, 1, 2]).flatten(2).mean(2) + a = F.softmax(self.reweight(a).reshape((B, C, 3)).transpose([2, 0, 1]), axis=0) + a = a.unsqueeze(2) + a = a.unsqueeze(2) + + x = h * a[0] + w * a[1] + c * a[2] + + x = self.proj(x) + x = self.proj_drop(x) + + return x + + +class CycleBlock(nn.Layer): + def __init__(self, + dim, + mlp_ratio=4.0, + qkv_bias=False, + qk_scale=None, + drop=0.0, + attn_drop=0.0, + drop_path=0.0, + act_layer=nn.GELU, + norm_layer=nn.LayerNorm, + skip_lam=1.0, + mlp_fn=CycleMLP): + super().__init__() + self.norm1 = norm_layer(dim) + self.attn = mlp_fn(dim, qkv_bias=qkv_bias, qk_scale=None, attn_drop=attn_drop) + + # NOTE: drop path for stochastic depth, we shall see if this is better than dropout here + self.drop_path = DropPath(drop_path) if drop_path > 0.0 else Identity() + + self.norm2 = norm_layer(dim) + mlp_hidden_dim = int(dim * mlp_ratio) + self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer) + self.skip_lam = skip_lam + + def forward(self, x): + x = x + self.drop_path(self.attn(self.norm1(x))) / self.skip_lam + x = x + self.drop_path(self.mlp(self.norm2(x))) / self.skip_lam + return x + + +class PatchEmbedOverlapping(nn.Layer): + """2D Image to Patch Embedding with overlapping""" + + def __init__(self, + patch_size=16, + stride=16, + padding=0, + in_chans=3, + embed_dim=768, + norm_layer=None, + 
groups=1): + super().__init__() + patch_size = (patch_size, patch_size) + stride = (stride, stride) + padding = (padding, padding) + self.patch_size = patch_size + # remove image_size in model init to support dynamic image size + + self.proj = nn.Conv2D(in_chans, + embed_dim, + kernel_size=patch_size, + stride=stride, + padding=padding, + groups=groups) + self.norm = norm_layer(embed_dim) if norm_layer else Identity() + + def forward(self, x): + x = self.proj(x) + return x + + +class Downsample(nn.Layer): + """Downsample transition stage""" + def __init__(self, in_embed_dim, out_embed_dim, patch_size): + super().__init__() + assert patch_size == 2, patch_size + self.proj = nn.Conv2D(in_embed_dim, + out_embed_dim, + kernel_size=(3, 3), + stride=(2, 2), + padding=1) + + def forward(self, x): + x = x.transpose([0, 3, 1, 2]) + x = self.proj(x) # B, C, H, W + x = x.transpose([0, 2, 3, 1]) + return x + + + + +def basic_blocks(dim, + index, + layers, + mlp_ratio=3.0, + qkv_bias=False, + qk_scale=None, + attn_drop=0.0, + drop_path_rate=0.0, + skip_lam=1.0, + mlp_fn=CycleMLP, + **kwargs): + blocks = [] + + for block_idx in range(layers[index]): + block_dpr = ( + drop_path_rate * (block_idx + sum(layers[:index])) / (sum(layers) - 1) + ) + blocks.append( + CycleBlock( + dim, + mlp_ratio=mlp_ratio, + qkv_bias=qkv_bias, + qk_scale=qk_scale, + attn_drop=attn_drop, + drop_path=block_dpr, + skip_lam=skip_lam, + mlp_fn=mlp_fn, + ) + ) + blocks = nn.Sequential(*blocks) + + return blocks + + +class CycleNet(nn.Layer): + """CycleMLP Network""" + def __init__(self, + layers, + img_size=224, + patch_size=4, + in_chans=3, + num_classes=1000, + embed_dims=None, + transitions=None, + segment_dim=None, + mlp_ratios=None, + skip_lam=1.0, + qkv_bias=False, + qk_scale=None, + drop_rate=0.0, + attn_drop_rate=0.0, + drop_path_rate=0.0, + norm_layer=nn.LayerNorm, + mlp_fn=CycleMLP, + fork_feat=False): + super().__init__() + if not fork_feat: + self.num_classes = num_classes + self.fork_feat = fork_feat + self.patch_embed = PatchEmbedOverlapping(patch_size=7, + stride=4, + padding=2, + in_chans=3, + embed_dim=embed_dims[0]) + network = [] + for i in range(len(layers)): + stage = basic_blocks(dim=embed_dims[i], + index=i, + layers=layers, + mlp_ratio=mlp_ratios[i], + qkv_bias=qkv_bias, + qk_scale=qk_scale, + attn_drop=attn_drop_rate, + drop_path_rate=drop_path_rate, + norm_layer=norm_layer, + skip_lam=skip_lam, + mlp_fn=mlp_fn) + network.append(stage) + if i >= len(layers) - 1: + break + if transitions[i] or embed_dims[i] != embed_dims[i + 1]: + patch_size = 2 if transitions[i] else 1 + network.append(Downsample(embed_dims[i], embed_dims[i + 1], patch_size)) + + self.network = nn.LayerList(network) + + if self.fork_feat: + # add a norm layer for each output + self.out_indices = [0, 2, 4, 6] + for i_emb, i_layer in enumerate(self.out_indices): + if i_emb == 0 and os.environ.get("FORK_LAST3", None): + layer = Identity() + else: + layer = norm_layer(embed_dims[i_emb]) + layer_name = f"norm{i_layer}" + self.add_layer(layer_name, layer) + else: + # Classifier head + self.norm = norm_layer(embed_dims[-1]) + self.head = nn.Linear(embed_dims[-1], num_classes) if num_classes > 0 else Identity() + self.apply(self.cls_init_weights) + + def cls_init_weights(self, m): + if isinstance(m, nn.Linear): + trunc_normal_(m.weight) + if isinstance(m, nn.Linear) and m.bias is not None: + zeros_(m.bias) + elif isinstance(m, nn.LayerNorm): + zeros_(m.bias) + ones_(m.weight) + elif isinstance(m, CycleFC): + trunc_normal_(m.weight) + zeros_(m.bias) 
+ + def get_classifier(self): + return self.head + + def reset_classifier(self, num_classes, global_pool=""): + self.num_classes = num_classes + self.head = nn.Linear(self.embed_dim, num_classes) if num_classes > 0 else Identity() + + def forward_embeddings(self, x): + x = self.patch_embed(x) + # B,C,H,W-> B,H,W,C + x = x.transpose([0, 2, 3, 1]) + return x + + def forward_tokens(self, x): + outs = [] + for idx, block in enumerate(self.network): + x = block(x) + if self.fork_feat and idx in self.out_indices: + norm_layer = getattr(self, f"norm{idx}") + x_out = norm_layer(x) + outs.append(x_out.transpose([0, 3, 1, 2])) + if self.fork_feat: + return outs + + B, H, W, C = x.shape + x = x.reshape([B, -1, C]) + return x + + def forward(self, x): + x = self.forward_embeddings(x) + # B, H, W, C -> B, N, C + x = self.forward_tokens(x) + if self.fork_feat: + return x + + x = self.norm(x) + cls_out = self.head(x.mean(1)) + return cls_out + + +def build_cyclemlp(config): + '''build cyclemlp model''' + model = CycleNet(num_classes=config.MODEL.NUM_CLASSES, + layers=config.MODEL.MIXER.LAYERS, + embed_dims=config.MODEL.MIXER.EMBED_DIMS, + patch_size=7, + transitions=config.MODEL.MIXER.TRANSITIONS, + mlp_ratios=config.MODEL.MIXER.MLP_RATIOS, + mlp_fn=CycleMLP) + return model diff --git a/image_classification/CycleMLP/datasets.py b/image_classification/CycleMLP/datasets.py new file mode 100644 index 00000000..304df9a3 --- /dev/null +++ b/image_classification/CycleMLP/datasets.py @@ -0,0 +1,222 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Dataset related classes and methods for ViT training and validation +Cifar10, Cifar100 and ImageNet2012 are supported +""" + +import os +import math +from PIL import Image +from paddle.io import Dataset +from paddle.io import DataLoader +from paddle.io import DistributedBatchSampler +from paddle.vision import transforms +from paddle.vision import datasets +from paddle.vision import image_load +from augment import auto_augment_policy_original +from augment import AutoAugment +from augment import rand_augment_policy_original +from augment import RandAugment +from transforms import RandomHorizontalFlip +from random_erasing import RandomErasing + + +class ImageNet2012Dataset(Dataset): + """Build ImageNet2012 dataset + + This class gets train/val imagenet datasets, which loads transfomed data and labels. 
+ + Attributes: + file_folder: path where imagenet images are stored + transform: preprocessing ops to apply on image + img_path_list: list of full path of images in whole dataset + label_list: list of labels of whole dataset + """ + + def __init__(self, file_folder, mode="train", transform=None): + """Init ImageNet2012 Dataset with dataset file path, mode(train/val), and transform""" + super(ImageNet2012Dataset, self).__init__() + assert mode in ["train", "val"] + self.file_folder = file_folder + self.transform = transform + self.img_path_list = [] + self.label_list = [] + + if mode == "train": + self.list_file = os.path.join(self.file_folder, "train_list.txt") + else: + self.list_file = os.path.join(self.file_folder, "val_list.txt") + + with open(self.list_file, 'r') as infile: + for line in infile: + img_path = line.strip().split()[0] + img_label = int(line.strip().split()[1]) + self.img_path_list.append(os.path.join(self.file_folder, img_path)) + self.label_list.append(img_label) + print(f'----- Imagenet2012 image {mode} list len = {len(self.label_list)}') + + def __len__(self): + return len(self.label_list) + + def __getitem__(self, index): + data = image_load(self.img_path_list[index]).convert('RGB') + data = self.transform(data) + label = self.label_list[index] + + return data, label + + +def get_train_transforms(config): + """ Get training transforms + + For training, a RandomResizedCrop is applied, then normalization is applied with + [0.5, 0.5, 0.5] mean and std. The input pixel values must be rescaled to [0, 1.] + Outputs is converted to tensor + + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + + aug_op_list = [] + # STEP1: random crop and resize + aug_op_list.append( + transforms.RandomResizedCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE), + scale=(0.05, 1.0), interpolation='bicubic')) + # STEP2: auto_augment or color jitter + if config.TRAIN.AUTO_AUGMENT: + policy = auto_augment_policy_original() + auto_augment = AutoAugment(policy) + aug_op_list.append(auto_augment) + elif config.TRAIN.RAND_AUGMENT: + policy = rand_augment_policy_original() + rand_augment = RandAugment(policy) + aug_op_list.append(rand_augment) + else: + jitter = (float(config.TRAIN.COLOR_JITTER), ) * 3 + aug_op_list.append(transforms.ColorJitter(*jitter)) + # STEP3: other ops + aug_op_list.append(transforms.ToTensor()) + aug_op_list.append(transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, + std=config.DATA.IMAGENET_STD)) + # STEP4: random erasing + if config.TRAIN.RANDOM_ERASE_PROB > 0.: + random_erasing = RandomErasing(prob=config.TRAIN.RANDOM_ERASE_PROB, + mode=config.TRAIN.RANDOM_ERASE_MODE, + max_count=config.TRAIN.RANDOM_ERASE_COUNT, + num_splits=config.TRAIN.RANDOM_ERASE_SPLIT) + aug_op_list.append(random_erasing) + # Final: compose transforms and return + transforms_train = transforms.Compose(aug_op_list) + return transforms_train + + +def get_val_transforms(config): + """ Get training transforms + + For validation, image is first Resize then CenterCrop to image_size. + Then normalization is applied with [0.5, 0.5, 0.5] mean and std. + The input pixel values must be rescaled to [0, 1.] 
+ Outputs is converted to tensor + + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + + scale_size = int(math.floor(config.DATA.IMAGE_SIZE / config.DATA.CROP_PCT)) + transforms_val = transforms.Compose([ + transforms.Resize(scale_size, interpolation='bicubic'), + transforms.CenterCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE)), + transforms.ToTensor(), + transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD), + ]) + return transforms_val + + +def get_dataset(config, mode='train'): + """ Get dataset from config and mode (train/val) + + Returns the related dataset object according to configs and mode(train/val) + + Args: + config: configs contains dataset related settings. see config.py for details + Returns: + dataset: dataset object + """ + + assert mode in ['train', 'val'] + if config.DATA.DATASET == "cifar10": + if mode == 'train': + dataset = datasets.Cifar10(mode=mode, transform=get_train_transforms(config)) + else: + mode = 'test' + dataset = datasets.Cifar10(mode=mode, transform=get_val_transforms(config)) + elif config.DATA.DATASET == "cifar100": + if mode == 'train': + dataset = datasets.Cifar100(mode=mode, transform=get_train_transforms(config)) + else: + mode = 'test' + dataset = datasets.Cifar100(mode=mode, transform=get_val_transforms(config)) + elif config.DATA.DATASET == "imagenet2012": + if mode == 'train': + dataset = ImageNet2012Dataset(config.DATA.DATA_PATH, + mode=mode, + transform=get_train_transforms(config)) + else: + dataset = ImageNet2012Dataset(config.DATA.DATA_PATH, + mode=mode, + transform=get_val_transforms(config)) + else: + raise NotImplementedError( + "[{config.DATA.DATASET}] Only cifar10, cifar100, imagenet2012 are supported now") + return dataset + + +def get_dataloader(config, dataset, mode='train', multi_process=False): + """Get dataloader with config, dataset, mode as input, allows multiGPU settings. + + Multi-GPU loader is implements as distributedBatchSampler. + + Args: + config: see config.py for details + dataset: paddle.io.dataset object + mode: train/val + multi_process: if True, use DistributedBatchSampler to support multi-processing + Returns: + dataloader: paddle.io.DataLoader object. + """ + + if mode == 'train': + batch_size = config.DATA.BATCH_SIZE + else: + batch_size = config.DATA.BATCH_SIZE_EVAL + + if multi_process is True: + sampler = DistributedBatchSampler(dataset, + batch_size=batch_size, + shuffle=(mode == 'train')) + dataloader = DataLoader(dataset, + batch_sampler=sampler, + num_workers=config.DATA.NUM_WORKERS) + else: + dataloader = DataLoader(dataset, + batch_size=batch_size, + num_workers=config.DATA.NUM_WORKERS, + shuffle=(mode == 'train')) + return dataloader diff --git a/image_classification/CycleMLP/droppath.py b/image_classification/CycleMLP/droppath.py new file mode 100644 index 00000000..c8fe8048 --- /dev/null +++ b/image_classification/CycleMLP/droppath.py @@ -0,0 +1,50 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Droppath, reimplement from https://github.com/yueatsprograms/Stochastic_Depth +""" + +import paddle +import paddle.nn as nn + +def drop_path(inputs, drop_prob=0., training=False): + """drop path op + Args: + input: tensor with arbitrary shape + drop_prob: float number of drop path probability, default: 0.0 + training: bool, if current mode is training, default: False + Returns: + output: output tensor after drop path + """ + # if prob is 0 or eval mode, return original input + if drop_prob == 0. or not training: + return inputs + keep_prob = 1 - drop_prob + keep_prob = paddle.to_tensor(keep_prob) + shape = (inputs.shape[0], ) + (1, ) * (inputs.ndim - 1) # shape=(N, 1, 1, 1) + random_tensor = keep_prob + paddle.rand(shape, dtype=inputs.dtype) + random_tensor = random_tensor.floor() # mask + output = inputs.divide(keep_prob) * random_tensor # divide is to keep same output expectation + return output + + +class DropPath(nn.Layer): + """DropPath class""" + def __init__(self, drop_prob=None): + super(DropPath, self).__init__() + self.drop_prob = drop_prob + + def forward(self, inputs): + return drop_path(inputs, self.drop_prob, self.training) diff --git a/image_classification/CycleMLP/losses.py b/image_classification/CycleMLP/losses.py new file mode 100644 index 00000000..082467a3 --- /dev/null +++ b/image_classification/CycleMLP/losses.py @@ -0,0 +1,123 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +""" Implement Loss functions """ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class LabelSmoothingCrossEntropyLoss(nn.Layer): + """ cross entropy loss for label smoothing + Args: + smoothing: float, smoothing rate + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, target label with shape [N] + Return: + loss: float, cross entropy loss value + """ + def __init__(self, smoothing=0.1): + super().__init__() + assert 0 <= smoothing < 1.0 + self.smoothing = smoothing + self.confidence = 1 - smoothing + + def forward(self, x, target): + log_probs = F.log_softmax(x) # [N, num_classes] + # target_index is used to get prob for each of the N samples + target_index = paddle.zeros([x.shape[0], 2], dtype='int64') # [N, 2] + target_index[:, 0] = paddle.arange(x.shape[0]) + target_index[:, 1] = target + + nll_loss = -log_probs.gather_nd(index=target_index) # index: [N] + smooth_loss = -log_probs.mean(axis=-1) + loss = self.confidence * nll_loss + self.smoothing * smooth_loss + return loss.mean() + + +class SoftTargetCrossEntropyLoss(nn.Layer): + """ cross entropy loss for soft target + Args: + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, soft target with shape [N, num_classes] + Returns: + loss: float, the mean loss value + """ + def __init__(self): + super().__init__() + + def forward(self, x, target): + loss = paddle.sum(-target * F.log_softmax(x, axis=-1), axis=-1) + return loss.mean() + + +class DistillationLoss(nn.Layer): + """Distillation loss function + This layer includes the orginal loss (criterion) and a extra + distillation loss (criterion), which computes the loss with + different type options, between current model and + a teacher model as its supervision. 
+ + Args: + base_criterion: nn.Layer, the original criterion + teacher_model: nn.Layer, the teacher model as supervision + distillation_type: str, one of ['none', 'soft', 'hard'] + alpha: float, ratio of base loss (* (1-alpha)) + and distillation loss( * alpha) + tao: float, temperature in distillation + """ + def __init__(self, + base_criterion, + teacher_model, + distillation_type, + alpha, + tau): + super().__init__() + assert distillation_type in ['none', 'soft', 'hard'] + self.base_criterion = base_criterion + self.teacher_model = teacher_model + self.type = distillation_type + self.alpha = alpha + self.tau = tau + + def forward(self, inputs, outputs, targets): + """ + Args: + inputs: tensor, the orginal model inputs + outputs: tensor, the outputs of the model + outputds_kd: tensor, the distillation outputs of the model, + this is usually obtained by a separate branch + in the last layer of the model + targets: tensor, the labels for the base criterion + """ + outputs, outputs_kd = outputs[0], outputs[1] + base_loss = self.base_criterion(outputs, targets) + if self.type == 'none': + return base_loss + + with paddle.no_grad(): + teacher_outputs = self.teacher_model(inputs) + + if self.type == 'soft': + distillation_loss = F.kl_div( + F.log_softmax(outputs_kd / self.tau, axis=1), + F.log_softmax(teacher_outputs / self.tau, axis=1), + reduction='sum') * (self.tau * self.tau) / outputs_kd.numel() + elif self.type == 'hard': + distillation_loss = F.cross_entropy(outputs_kd, teacher_outputs.argmax(axis=1)) + + loss = base_loss * (1 - self.alpha) + distillation_loss * self.alpha + return loss + + diff --git a/image_classification/CycleMLP/main_multi_gpu.py b/image_classification/CycleMLP/main_multi_gpu.py new file mode 100644 index 00000000..6e9cb2a3 --- /dev/null +++ b/image_classification/CycleMLP/main_multi_gpu.py @@ -0,0 +1,581 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +"""CycleMLP training/validation using multiple GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +import paddle.distributed as dist +from datasets import get_dataloader +from datasets import get_dataset +from utils import AverageMeter +from utils import WarmupCosineScheduler +from config import get_config +from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from cyclemlp import build_cyclemlp as build_model + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('CycleMLP') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + mixup_fn=None, + amp=False, + local_logger=None, + master_logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + train_loss_meter.avg + train_acc_meter.avg + train_time + """ + model.train() + train_loss_meter = AverageMeter() + train_acc_meter = AverageMeter() + master_train_loss_meter = AverageMeter() + master_train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + for batch_id, data in enumerate(dataloader): + image = data[0] + label = 
data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() + + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = paddle.to_tensor(image.shape[0]) + + # sync from other gpus for overall loss and acc + master_loss = loss.clone() + master_acc = acc.clone() + master_batch_size = batch_size.clone() + dist.all_reduce(master_loss) + dist.all_reduce(master_acc) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc = master_acc / dist.get_world_size() + master_train_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_train_acc_meter.update(master_acc.numpy()[0], master_batch_size.numpy()[0]) + + train_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + train_acc_meter.update(acc.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_train_loss_meter.avg:.4f}, " + + f"Avg Acc: {master_train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return (train_loss_meter.avg, + train_acc_meter.avg, + master_train_loss_meter.avg, + master_train_acc_meter.avg, + train_time) + + +def validate(dataloader, + model, + criterion, + total_batch, + debug_steps=100, + local_logger=None, + master_logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + total_epoch: int, total num of epoch, for logging + debug_steps: int, num of iters to log info, default: 100 + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + master_val_loss_meter.avg: float, average loss on all processes/gpus + master_val_acc1_meter.avg: float, average top1 accuracy on all processes/gpus + master_val_acc5_meter.avg: float, average top5 accuracy on all processes/gpus + val_time: float, validation time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = AverageMeter() + 
master_val_loss_meter = AverageMeter() + master_val_acc1_meter = AverageMeter() + master_val_acc5_meter = AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = paddle.to_tensor(image.shape[0]) + + master_loss = loss.clone() + master_acc1 = acc1.clone() + master_acc5 = acc5.clone() + master_batch_size = batch_size.clone() + + dist.all_reduce(master_loss) + dist.all_reduce(master_acc1) + dist.all_reduce(master_acc5) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc1 = master_acc1 / dist.get_world_size() + master_acc5 = master_acc5 / dist.get_world_size() + + master_val_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc1_meter.update(master_acc1.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc5_meter.update(master_acc5.numpy()[0], master_batch_size.numpy()[0]) + + val_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + val_acc1_meter.update(acc1.numpy()[0], batch_size.numpy()[0]) + val_acc5_meter.update(acc5.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {master_val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {master_val_acc5_meter.avg:.4f}") + val_time = time.time() - time_st + return (val_loss_meter.avg, + val_acc1_meter.avg, + val_acc5_meter.avg, + master_val_loss_meter.avg, + master_val_acc1_meter.avg, + master_val_acc5_meter.avg, + val_time) + + +def main_worker(*args): + # STEP 0: Preparation + config = args[0] + dist.init_parallel_env() + last_epoch = config.TRAIN.LAST_EPOCH + world_size = dist.get_world_size() + local_rank = dist.get_rank() + seed = config.SEED + local_rank + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + # logger for each process/gpu + local_logger = get_logger( + filename=os.path.join(config.SAVE, 'log_{}.txt'.format(local_rank)), + logger_name='local_logger') + # overall logger + if local_rank == 0: + master_logger = get_logger( + filename=os.path.join(config.SAVE, 'log.txt'), + logger_name='master_logger') + master_logger.info(f'\n{config}') + else: + master_logger = None + local_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + if local_rank == 0: + master_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + + # STEP 1: Create model + model = build_model(config) + model = paddle.DataParallel(model) + + # STEP 2: Create train and val dataloader + dataset_train, dataset_val = args[1], args[2] + # Create training dataloader + if not config.EVAL: + dataloader_train = get_dataloader(config, dataset_train, 'train', True) + total_batch_train = len(dataloader_train) + local_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + if local_rank == 0: + master_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + # 
Create validation dataloader + dataloader_val = get_dataloader(config, dataset_val, 'test', True) + total_batch_val = len(dataloader_val) + local_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + if local_rank == 0: + master_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH, + ) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + local_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if 
scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip, + #apply_decay_param_fun=get_exclude_from_weight_decay_fn([ + # 'absolute_pos_embed', 'relative_position_bias_table']), + ) + else: + local_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 6: Load pretrained model / load resumt model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + model.set_dict(model_state) + local_logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + if local_rank == 0: + master_logger.info( + f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') + optimizer.set_state_dict(opt_state) + local_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + if local_rank == 0: + master_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + + # STEP 7: Validation (eval mode) + if config.EVAL: + local_logger.info('----- Start Validating') + if local_rank == 0: + master_logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 8: Start training and validation (train mode) + local_logger.info(f"Start training from epoch {last_epoch+1}.") + if local_rank == 0: + master_logger.info(f"Start training from epoch {last_epoch+1}.") + for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + local_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + if local_rank == 0: + master_logger.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") + train_loss, train_acc, avg_loss, avg_acc, train_time = train( + dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=total_batch_train, + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + mixup_fn=mixup_fn, + amp=config.AMP, + local_logger=local_logger, + master_logger=master_logger) + + scheduler.step() + + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {avg_loss:.4f}, " + + f"Train Acc: {avg_acc:.4f}, " + + f"time: {train_time:.2f}") + + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + local_logger.info(f'----- Validation after Epoch: {epoch}') + if local_rank == 0: + master_logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if local_rank == 0: + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + master_logger.info(f"----- Save model: {model_path}.pdparams") + master_logger.info(f"----- Save optim: {model_path}.pdopt") + + +def main(): + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + + # get dataset and start DDP + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + else: + dataset_train = None + dataset_val = get_dataset(config, mode='val') + config.NGPUS = len(paddle.static.cuda_places()) if config.NGPUS == -1 else config.NGPUS + dist.spawn(main_worker, args=(config, dataset_train, dataset_val, ), nprocs=config.NGPUS) + + +if __name__ == "__main__": + main() diff --git a/image_classification/CycleMLP/main_single_gpu.py b/image_classification/CycleMLP/main_single_gpu.py new file mode 100644 index 00000000..36b55c1c --- /dev/null +++ b/image_classification/CycleMLP/main_single_gpu.py @@ -0,0 +1,423 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""CycleMLP training/validation using single GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from datasets import get_dataloader +from datasets import get_dataset +from utils import AverageMeter +from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn +from config import get_config +from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from cyclemlp import build_cyclemlp as build_model + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('CycleMLP') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + mixup_fn=None, + amp=False, + logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + 
logger: logger for logging, default: None + Returns: + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + train_time: float, training time + """ + model.train() + train_loss_meter = AverageMeter() + train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() + + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = image.shape[0] + train_loss_meter.update(loss.numpy()[0], batch_size) + train_acc_meter.update(acc.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return train_loss_meter.avg, train_acc_meter.avg, train_time + + +def validate(dataloader, model, criterion, total_batch, debug_steps=100, logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + logger: logger for logging, default: None + Returns: + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + val_time: float, valitaion time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = image.shape[0] + val_loss_meter.update(loss.numpy()[0], batch_size) + val_acc1_meter.update(acc1.numpy()[0], batch_size) + val_acc5_meter.update(acc5.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: 
{val_acc5_meter.avg:.4f}") + + val_time = time.time() - time_st + return val_loss_meter.avg, val_acc1_meter.avg, val_acc5_meter.avg, val_time + + +def main(): + # STEP 0: Preparation + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + last_epoch = config.TRAIN.LAST_EPOCH + seed = config.SEED + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + logger = get_logger(filename=os.path.join(config.SAVE, 'log.txt')) + logger.info(f'\n{config}') + + # STEP 1: Create model + model = build_model(config) + + # STEP 2: Create train and val dataloader + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + dataloader_train = get_dataloader(config, dataset_train, 'train', False) + dataset_val = get_dataset(config, mode='val') + dataloader_val = get_dataloader(config, dataset_val, 'val', False) + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + 
last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip) + else: + logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 6: Load pretrained model or load resume model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + model.set_dict(model_state) + logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME + '.pdopt') + optimizer.set_state_dict(opt_state) + logger.info( + f"----- Resume: Load model and optmizer from {config.MODEL.RESUME}") + + # STEP 7: Validation (eval mode) + if config.EVAL: + logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 8: Start training and validation (train mode) + logger.info(f"Start training from epoch {last_epoch+1}.") + for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + logger.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") + train_loss, train_acc, train_time = train(dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=len(dataloader_train), + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + mixup_fn=mixup_fn, + amp=config.AMP, + logger=logger) + scheduler.step() + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + logger.info(f"----- Save model: {model_path}.pdparams") + logger.info(f"----- Save optim: {model_path}.pdopt") + + +if __name__ == "__main__": + main() diff --git a/image_classification/CycleMLP/mixup.py b/image_classification/CycleMLP/mixup.py new file mode 100644 index 00000000..1d2db493 --- /dev/null +++ b/image_classification/CycleMLP/mixup.py @@ -0,0 +1,225 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""mixup and cutmix for batch data""" +import numpy as np +import paddle + + +def rand_bbox(image_shape, lam, count=None): + """ CutMix bbox by lam value + Generate 1 random bbox by value lam. lam is the cut size rate. + The cut_size is computed by sqrt(1-lam) * image_size. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + count: int, number of bbox to generate + """ + image_h, image_w = image_shape[-2:] + cut_rate = np.sqrt(1. 
- lam) + cut_h = int(cut_rate * image_h) + cut_w = int(cut_rate * image_w) + + # get random bbox center + cy = np.random.randint(0, image_h, size=count) + cx = np.random.randint(0, image_w, size=count) + + # get bbox coords + bbox_x1 = np.clip(cx - cut_w // 2, 0, image_w) + bbox_y1 = np.clip(cy - cut_h // 2, 0, image_h) + bbox_x2 = np.clip(cx + cut_w // 2, 0, image_w) + bbox_y2 = np.clip(cy + cut_h // 2, 0, image_h) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # while in pytorch, it will return [] tensor + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def rand_bbox_minmax(image_shape, minmax, count=None): + """ CutMix bbox by min and max value + Generate 1 random bbox by min and max percentage values. + Minmax is a tuple/list of min and max percentage vlaues + applied to the image width and height. + + Args: + image_shape: tuple/list, image height and width + minmax: tuple/list, min and max percentage values of image size + count: int, number of bbox to generate + """ + assert len(minmax) == 2 + image_h, image_w = image_shape[-2:] + min_ratio = minmax[0] + max_ratio = minmax[1] + cut_h = np.random.randint(int(image_h * min_ratio), int(image_h * max_ratio), size=count) + cut_w = np.random.randint(int(image_w * min_ratio), int(image_w * max_ratio), size=count) + + bbox_x1 = np.random.randint(0, image_w - cut_w, size=count) + bbox_y1 = np.random.randint(0, image_h - cut_h, size=count) + bbox_x2 = bbox_x1 + cut_w + bbox_y2 = bbox_y1 + cut_h + + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def cutmix_generate_bbox_adjust_lam(image_shape, lam, minmax=None, correct_lam=True, count=None): + """Generate bbox and apply correction for lambda + If the mimmax is None, apply the standard cutmix by lam value, + If the minmax is set, apply the cutmix by min and max percentage values. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + minmax: tuple/list, min and max percentage values of image size + correct_lam: bool, if True, correct the lam value by the generated bbox + count: int, number of bbox to generate + """ + if minmax is not None: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox_minmax(image_shape, minmax, count) + else: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox(image_shape, lam, count) + + if correct_lam or minmax is not None: + image_h, image_w = image_shape[-2:] + bbox_area = (bbox_y2 - bbox_y1) * (bbox_x2 - bbox_x1) + lam = 1. - bbox_area / float(image_h * image_w) + return (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam + + +def one_hot(x, num_classes, on_value=1., off_value=0.): + """ Generate one-hot vector for label smoothing + Args: + x: tensor, contains label/class indices + num_classes: int, num of classes (len of the one-hot vector) + on_value: float, the vector value at label index, default=1. + off_value: float, the vector value at non-label indices, default=0. + Returns: + one_hot: tensor, tensor with on value at label index and off value + at non-label indices. + """ + x = x.reshape_([-1, 1]) + x_smoothed = paddle.full((x.shape[0], num_classes), fill_value=off_value) + for i in range(x.shape[0]): + x_smoothed[i, x[i]] = on_value + return x_smoothed + + +def mixup_one_hot(label, num_classes, lam=1., smoothing=0.): + """ mixup and label smoothing in batch + label smoothing is firstly applied, then + mixup is applied by mixing the bacth and its flip, + with a mixup rate. 
+ + Args: + label: tensor, label tensor with shape [N], contains the class indices + num_classes: int, num of all classes + lam: float, mixup rate, default=1.0 + smoothing: float, label smoothing rate + """ + off_value = smoothing / num_classes + on_value = 1. - smoothing + off_value + y1 = one_hot(label, num_classes, on_value, off_value) + y2 = one_hot(label.flip(axis=[0]), num_classes, on_value, off_value) + return y2 * (1 - lam) + y1 * lam + + +class Mixup: + """Mixup class + Args: + mixup_alpha: float, mixup alpha for beta distribution, default=1.0, + cutmix_alpha: float, cutmix alpha for beta distribution, default=0.0, + cutmix_minmax: list/tuple, min and max value for cutmix ratio, default=None, + prob: float, if random prob < prob, do not use mixup, default=1.0, + switch_prob: float, prob of switching mixup and cutmix, default=0.5, + mode: string, mixup up, now only 'batch' is supported, default='batch', + correct_lam: bool, if True, apply correction of lam, default=True, + label_smoothing: float, label smoothing rate, default=0.1, + num_classes: int, num of classes, default=1000 + """ + def __init__(self, + mixup_alpha=1.0, + cutmix_alpha=0.0, + cutmix_minmax=None, + prob=1.0, + switch_prob=0.5, + mode='batch', + correct_lam=True, + label_smoothing=0.1, + num_classes=1000): + self.mixup_alpha = mixup_alpha + self.cutmix_alpha = cutmix_alpha + self.cutmix_minmax = cutmix_minmax + if cutmix_minmax is not None: + assert len(cutmix_minmax) == 2 + self.cutmix_alpha = 1.0 + self.mix_prob = prob + self.switch_prob = switch_prob + self.label_smoothing = label_smoothing + self.num_classes = num_classes + self.mode = mode + self.correct_lam = correct_lam + assert mode == 'batch', 'Now only batch mode is supported!' + + def __call__(self, x, target): + assert x.shape[0] % 2 == 0, "Batch size should be even" + lam = self._mix_batch(x) + target = mixup_one_hot(target, self.num_classes, lam, self.label_smoothing) + return x, target + + def get_params(self): + """Decide to use cutmix or regular mixup by sampling and + sample lambda for mixup + """ + lam = 1. + use_cutmix = False + use_mixup = np.random.rand() < self.mix_prob + if use_mixup: + if self.mixup_alpha > 0. and self.cutmix_alpha > 0.: + use_cutmix = np.random.rand() < self.switch_prob + alpha = self.cutmix_alpha if use_cutmix else self.mixup_alpha + lam_mix = np.random.beta(alpha, alpha) + elif self.mixup_alpha == 0. and self.cutmix_alpha > 0.: + use_cutmix=True + lam_mix = np.random.beta(self.cutmix_alpha, self.cutmix_alpha) + elif self.mixup_alpha > 0. 
and self.cutmix_alpha == 0.: + lam_mix = np.random.beta(self.mixup_alpha, self.mixup_alpha) + else: + raise ValueError('mixup_alpha and cutmix_alpha cannot be all 0') + lam = float(lam_mix) + return lam, use_cutmix + + def _mix_batch(self, x): + """mixup/cutmix by adding batch data and its flipped version""" + lam, use_cutmix = self.get_params() + if lam == 1.: + return lam + if use_cutmix: + (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam = cutmix_generate_bbox_adjust_lam( + x.shape, + lam, + minmax=self.cutmix_minmax, + correct_lam=self.correct_lam) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # but in pytorch, it will return [] tensor without errors + if int(bbox_x1) != int(bbox_x2) and int(bbox_y1) != int(bbox_y2): + x[:, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] = x.flip(axis=[0])[ + :, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] + else: + x_flipped = x.flip(axis=[0]) + x_flipped = x_flipped * (1 - lam) + x.set_value(x * (lam) + x_flipped) + return lam diff --git a/image_classification/CycleMLP/random_erasing.py b/image_classification/CycleMLP/random_erasing.py new file mode 100644 index 00000000..31eea465 --- /dev/null +++ b/image_classification/CycleMLP/random_erasing.py @@ -0,0 +1,118 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Random Erasing for image tensor""" + +import random +import math +import paddle + + +def _get_pixels(per_pixel, rand_color, patch_size, dtype="float32"): + if per_pixel: + return paddle.normal(shape=patch_size).astype(dtype) + if rand_color: + return paddle.normal(shape=(patch_size[0], 1, 1)).astype(dtype) + return paddle.zeros((patch_size[0], 1, 1)).astype(dtype) + + +class RandomErasing(object): + """ + Args: + prob: probability of performing random erasing + min_area: Minimum percentage of erased area wrt input image area + max_area: Maximum percentage of erased area wrt input image area + min_aspect: Minimum aspect ratio of earsed area + max_aspect: Maximum aspect ratio of earsed area + mode: pixel color mode, in ['const', 'rand', 'pixel'] + 'const' - erase block is constant valued 0 for all channels + 'rand' - erase block is valued random color (same per-channel) + 'pixel' - erase block is vauled random color per pixel + min_count: Minimum # of ereasing blocks per image. + max_count: Maximum # of ereasing blocks per image. 
Area per box is scaled by count + per-image count is randomly chosen between min_count to max_count + """ + def __init__(self, prob=0.5, min_area=0.02, max_area=1/3, min_aspect=0.3, max_aspect=None, + mode='const', min_count=1, max_count=None, num_splits=0): + self.prob = prob + self.min_area = min_area + self.max_area = max_area + max_aspect = max_aspect or 1 / min_aspect + self.log_aspect_ratio = (math.log(min_aspect), math.log(max_aspect)) + self.min_count = min_count + self.max_count = max_count or min_count + self.num_splits = num_splits + mode = mode.lower() + self.rand_color = False + self.per_pixel = False + if mode == "rand": + self.rand_color = True + elif mode == "pixel": + self.per_pixel = True + else: + assert not mode or mode == "const" + + def _erase(self, img, chan, img_h, img_w, dtype): + if random.random() > self.prob: + return + area = img_h * img_w + count = self.min_count if self.min_count == self.max_count else \ + random.randint(self.min_count, self.max_count) + for _ in range(count): + for attempt in range(10): + target_area = random.uniform(self.min_area, self.max_area) * area / count + aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio)) + h = int(round(math.sqrt(target_area * aspect_ratio))) + w = int(round(math.sqrt(target_area / aspect_ratio))) + if w < img_w and h < img_h: + top = random.randint(0, img_h - h) + left = random.randint(0, img_w - w) + img[:, top:top+h, left:left+w] = _get_pixels( + self.per_pixel, self.rand_color, (chan, h, w), + dtype=dtype) + break + + def __call__(self, input): + if len(input.shape) == 3: + self._erase(input, *input.shape, input.dtype) + else: + batch_size, chan, img_h, img_w = input.shape + batch_start = batch_size // self.num_splits if self.num_splits > 1 else 0 + for i in range(batch_start, batch_size): + self._erase(input[i], chan, img_h, img_w, input.dtype) + return input + + + +#def main(): +# re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='rand') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='const') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='pixel') +# import PIL.Image as Image +# import numpy as np +# paddle.set_device('cpu') +# img = paddle.to_tensor(np.asarray(Image.open('./lenna.png'))).astype('float32') +# img = img / 255.0 +# img = paddle.transpose(img, [2, 0, 1]) +# new_img = re(img) +# new_img = new_img * 255.0 +# new_img = paddle.transpose(new_img, [1, 2, 0]) +# new_img = new_img.cpu().numpy() +# new_img = Image.fromarray(new_img.astype('uint8')) +# new_img.save('./res.png') +# +# +# +#if __name__ == "__main__": +# main() diff --git a/image_classification/CycleMLP/run_eval.sh b/image_classification/CycleMLP/run_eval.sh new file mode 100644 index 00000000..e2f3d049 --- /dev/null +++ b/image_classification/CycleMLP/run_eval.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ +-cfg='./configs/cyclemlp_b5.yaml' \ +-dataset='imagenet2012' \ +-batch_size=32 \ +-data_path='/dataset/imagenet' \ +-eval \ +-pretrained='./cyclemlp_b5' diff --git a/image_classification/CycleMLP/run_eval_multi.sh b/image_classification/CycleMLP/run_eval_multi.sh new file mode 100644 index 00000000..fcc53bc7 --- /dev/null +++ b/image_classification/CycleMLP/run_eval_multi.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ +-cfg="./configs/cyclemlp_b4.yaml" \ +-dataset="imagenet2012" \ +-batch_size=128 \ +-data_path="/dataset/imagenet" \ +-eval \ +-pretrained="./cyclemlp_b4" diff --git 
a/image_classification/CycleMLP/run_train.sh b/image_classification/CycleMLP/run_train.sh new file mode 100644 index 00000000..9ad9560f --- /dev/null +++ b/image_classification/CycleMLP/run_train.sh @@ -0,0 +1,7 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ +-cfg='./configs/cyclemlp_b1.yaml' \ +-dataset='imagenet2012' \ +-batch_size=8 \ +-data_path='/dataset/imagenet' \ +#-amp diff --git a/image_classification/CycleMLP/run_train_multi.sh b/image_classification/CycleMLP/run_train_multi.sh new file mode 100644 index 00000000..d7cab80d --- /dev/null +++ b/image_classification/CycleMLP/run_train_multi.sh @@ -0,0 +1,7 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ +-cfg='./configs/cyclemlp_b1.yaml' \ +-dataset='imagenet2012' \ +-batch_size=8 \ +-data_path='/dataset/imagenet' \ +#-amp diff --git a/image_classification/CycleMLP/transforms.py b/image_classification/CycleMLP/transforms.py new file mode 100644 index 00000000..5a046912 --- /dev/null +++ b/image_classification/CycleMLP/transforms.py @@ -0,0 +1,14 @@ +import random +import paddle +import paddle.nn +import paddle.vision.transforms as T + + +class RandomHorizontalFlip(): + def __init__(self, p=0.5): + self.p = p + + def __call__(self, image): + if random.random() < self.p: + return T.hflip(image) + return image diff --git a/image_classification/CycleMLP/utils.py b/image_classification/CycleMLP/utils.py new file mode 100644 index 00000000..44800527 --- /dev/null +++ b/image_classification/CycleMLP/utils.py @@ -0,0 +1,120 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""utils for ViT + +Contains AverageMeter for monitoring, get_exclude_from_decay_fn for training +and WarmupCosineScheduler for training + +""" + +import math +from paddle.optimizer.lr import LRScheduler + + +class AverageMeter(): + """ Meter for monitoring losses""" + def __init__(self): + self.avg = 0 + self.sum = 0 + self.cnt = 0 + self.reset() + + def reset(self): + """reset all values to zeros""" + self.avg = 0 + self.sum = 0 + self.cnt = 0 + + def update(self, val, n=1): + """update avg by val and n, where val is the avg of n values""" + self.sum += val * n + self.cnt += n + self.avg = self.sum / self.cnt + + + +def get_exclude_from_weight_decay_fn(exclude_list=[]): + """ Set params with no weight decay during the training + + For certain params, e.g., positional encoding in ViT, weight decay + may not needed during the learning, this method is used to find + these params. + + Args: + exclude_list: a list of params names which need to exclude + from weight decay. 
+ Returns: + exclude_from_weight_decay_fn: a function returns True if param + will be excluded from weight decay + """ + if len(exclude_list) == 0: + exclude_from_weight_decay_fn = None + else: + def exclude_fn(param): + for name in exclude_list: + if param.endswith(name): + return False + return True + exclude_from_weight_decay_fn = exclude_fn + return exclude_from_weight_decay_fn + + +class WarmupCosineScheduler(LRScheduler): + """Warmup Cosine Scheduler + + First apply linear warmup, then apply cosine decay schedule. + Linearly increase learning rate from "warmup_start_lr" to "start_lr" over "warmup_epochs" + Cosinely decrease learning rate from "start_lr" to "end_lr" over remaining + "total_epochs - warmup_epochs" + + Attributes: + learning_rate: the starting learning rate (without warmup), not used here! + warmup_start_lr: warmup starting learning rate + start_lr: the starting learning rate (without warmup) + end_lr: the ending learning rate after whole loop + warmup_epochs: # of epochs for warmup + total_epochs: # of total epochs (include warmup) + """ + def __init__(self, + learning_rate, + warmup_start_lr, + start_lr, + end_lr, + warmup_epochs, + total_epochs, + cycles=0.5, + last_epoch=-1, + verbose=False): + """init WarmupCosineScheduler """ + self.warmup_epochs = warmup_epochs + self.total_epochs = total_epochs + self.warmup_start_lr = warmup_start_lr + self.start_lr = start_lr + self.end_lr = end_lr + self.cycles = cycles + super(WarmupCosineScheduler, self).__init__(learning_rate, last_epoch, verbose) + + def get_lr(self): + """ return lr value """ + if self.last_epoch < self.warmup_epochs: + val = (self.start_lr - self.warmup_start_lr) * float( + self.last_epoch)/float(self.warmup_epochs) + self.warmup_start_lr + return val + + progress = float(self.last_epoch - self.warmup_epochs) / float( + max(1, self.total_epochs - self.warmup_epochs)) + val = max(0.0, 0.5 * (1. + math.cos(math.pi * float(self.cycles) * 2.0 * progress))) + val = max(0.0, val * (self.start_lr - self.end_lr) + self.end_lr) + return val diff --git a/image_classification/DeiT/README.md b/image_classification/DeiT/README.md index 208667aa..d844a538 100644 --- a/image_classification/DeiT/README.md +++ b/image_classification/DeiT/README.md @@ -13,13 +13,17 @@ This implementation is developed by [PaddleViT](https://github.com/BR-IDL/Paddle

### Update -Update (2021-08-11): Code is released and ported weights are uploaded. +- Update (2021-09-27): More weights are uploaded. +- Update (2021-08-11): Code is released and ported weights are uploaded. ## Models Zoo -| Model | Acc@1 | Acc@5 | Image Size | Crop_pct | Interpolation | Link | -|--------------------------------|-------|-------|------------|----------|---------------|--------------| -| deit_base_distilled_patch16_224| 83.32 | 96.49 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/12_x6-NN3Jde2BFUih4OM9NlTwe9-Xlkw/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1ZnmAWgT6ewe7Vl3Xw_csuA)(5f2g) | -| deit_base_distilled_patch16_384| 85.43 | 97.33 | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1i5H_zjSdHfM-Znv89DHTv9ChykWrIt8I/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1PQsQIci4VCHY7l2tCzMklg)(qgj2) | +| Model | Acc@1 | Acc@5 | #Params | FLOPs | Image Size | Crop_pct | Interpolation | Link | +|-------------------------------|-------|-------|---------|--------|------------|----------|---------------|--------------| +| deit_tiny_distilled_224 | 74.52 | 91.90 | 5.9M | 1.1G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1fku9-11O_gQI7UpZTjagVeND-pcHbV0C/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1hAQ_85wWkqQ7sIGO1CmO9g)(rhda) | +| deit_small_distilled_224 | 81.17 | 95.41 | 22.4M | 4.3G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1RIeWTdf5o6pwkjqN4NbW91GZSOCalI5t/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1wCVrukvwxISAGGjorPw3iw)(pv28) | +| deit_base_distilled_224 | 83.32 | 96.49 | 87.2M | 17.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/12_x6-NN3Jde2BFUih4OM9NlTwe9-Xlkw/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1ZnmAWgT6ewe7Vl3Xw_csuA)(5f2g) | +| deit_base_distilled_384 | 85.43 | 97.33 | 87.2M | 49.9G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1i5H_zjSdHfM-Znv89DHTv9ChykWrIt8I/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1PQsQIci4VCHY7l2tCzMklg)(qgj2) | + | Teacher Model | Link | | -- | -- | @@ -69,8 +73,8 @@ from deit import build_deit as build_model config = get_config('./configs/deit_base_patch16_224.yaml') # build model model = build_model(config) -# load pretrained weights, .pdparams is NOT needed -model_state_dict = paddle.load('./deit_base_patch16_224') +# load pretrained weights +model_state_dict = paddle.load('./deit_base_patch16_224.pdparams') model.set_dict(model_state_dict) ``` @@ -83,12 +87,12 @@ or ```shell CUDA_VISIBLE_DEVICES=0 \ python main_single_gpu.py \ - -cfg='./configs/deit_base_patch16_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/deit_base_patch16_224.yaml \ + -dataset=imagenet2012 \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/imagenet/val/dataset/val \ -eval \ - -pretrained='./deit_base_patch16_224' + -pretrained=/path/to/pretrained/model/deit_base_patch16_224 # .pdparams is NOT needed ```
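The `-pretrained` flag takes the checkpoint path without the `.pdparams` suffix because the training scripts append it themselves before loading. A minimal sketch of that convention, mirroring STEP 6 of the CycleMLP `main_single_gpu.py` above (the `load_pretrained` helper name is illustrative, not part of the repo):

```python
import os
import paddle

def load_pretrained(model, pretrained_path):
    # Sketch of the loading convention (assumption: DeiT follows the same
    # pattern as the CycleMLP main_*_gpu.py scripts shown above).
    if pretrained_path.endswith('.pdparams'):
        raise ValueError(f'{pretrained_path} should not contain .pdparams')
    assert os.path.isfile(pretrained_path + '.pdparams')
    model_state = paddle.load(pretrained_path + '.pdparams')  # suffix added here
    model.set_dict(model_state)
```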
@@ -105,12 +109,12 @@ or
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python main_multi_gpu.py \
- -cfg='./configs/deit_base_patch16_224.yaml' \
- -dataset='imagenet2012' \
+ -cfg=./configs/deit_base_patch16_224.yaml \
+ -dataset=imagenet2012 \
 -batch_size=16 \
- -data_path='/dataset/imagenet' \
+ -data_path=/path/to/dataset/imagenet/val \
 -eval \
- -pretrained='./deit_base_patch16_224'
+ -pretrained=/path/to/pretrained/model/deit_base_patch16_224 # .pdparams is NOT needed
```
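During multi-GPU evaluation each process scores only its own shard of the data; the overall numbers reported by the master logger are obtained by all-reducing the per-GPU metrics, as in the `validate` function of the CycleMLP `main_multi_gpu.py` shown earlier. A simplified sketch of that synchronization (the `sync_metric` name is illustrative):

```python
import paddle
import paddle.distributed as dist

def sync_metric(value, batch_size):
    # Average a per-GPU metric tensor over all processes (simplified from the
    # all_reduce calls in main_multi_gpu.py; assumes init_parallel_env() ran).
    world_size = dist.get_world_size()
    total = value.clone()
    count = paddle.to_tensor(batch_size)
    dist.all_reduce(total)   # default reduce op is sum across GPUs
    dist.all_reduce(count)   # total number of samples in this step
    return total / world_size, count
```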
@@ -127,11 +131,11 @@ or
```shell
CUDA_VISIBLE_DEVICES=0 \
python main_single_gpu.py \
- -cfg='./configs/deit_base_patch16_224.yaml' \
- -dataset='imagenet2012' \
+ -cfg=./configs/deit_base_patch16_224.yaml \
+ -dataset=imagenet2012 \
 -batch_size=32 \
- -data_path='/dataset/imagenet' \
- -teacher_model='./regnety_160'
+ -data_path=/path/to/dataset/imagenet/train \
+ -teacher_model=/path/to/pretrained/model/regnety_160 # .pdparams is NOT needed
```
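When `TRAIN.LINEAR_SCALED_LR` is set, the base, warmup-start, and end learning rates are rescaled by the global batch size before training, exactly as in STEP 5 of the CycleMLP scripts above. A rough standalone sketch of that rule (the example numbers below are purely illustrative):

```python
def scale_lr(base_lr, batch_size, world_size, linear_scaled_lr, accum_iter=1):
    # Linear LR scaling rule from STEP 5 of main_multi_gpu.py:
    # lr = base_lr * global_batch_size / LINEAR_SCALED_LR (times accum_iter).
    lr = base_lr * batch_size * world_size / linear_scaled_lr
    if accum_iter > 1:
        lr *= accum_iter
    return lr

# Illustrative numbers only: base_lr=5e-4, 16 images/GPU, 4 GPUs, reference 512
# gives 5e-4 * 16 * 4 / 512 = 6.25e-05.
```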
@@ -148,11 +152,11 @@ or
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python main_multi_gpu.py \
- -cfg='./configs/deit_base_patch16_224.yaml' \
- -dataset='imagenet2012' \
+ -cfg=./configs/deit_base_patch16_224.yaml \
+ -dataset=imagenet2012 \
 -batch_size=16 \
- -data_path='/dataset/imagenet' \
- -teacher_model='./regnety_160'
+ -data_path=/path/to/dataset/imagenet/train \
+ -teacher_model=/path/to/pretrained/model/regnety_160 # .pdparams is NOT needed
```
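The default `warmupcosine` schedule used by these training scripts linearly warms the learning rate up and then decays it along a cosine curve, following `WarmupCosineScheduler` in the CycleMLP `utils.py` above. A self-contained sketch of the per-epoch value it produces (a paraphrase of that class, not a drop-in replacement):

```python
import math

def warmup_cosine_lr(epoch, warmup_start_lr, start_lr, end_lr,
                     warmup_epochs, total_epochs, cycles=0.5):
    # Per-epoch LR of WarmupCosineScheduler (utils.py): linear warmup from
    # warmup_start_lr to start_lr, then cosine decay from start_lr to end_lr.
    if epoch < warmup_epochs:
        return (start_lr - warmup_start_lr) * epoch / warmup_epochs + warmup_start_lr
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    val = max(0.0, 0.5 * (1.0 + math.cos(math.pi * cycles * 2.0 * progress)))
    return max(0.0, val * (start_lr - end_lr) + end_lr)
```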
diff --git a/image_classification/DeiT/__init__.py b/image_classification/DeiT/__init__.py new file mode 100644 index 00000000..e2cbd538 --- /dev/null +++ b/image_classification/DeiT/__init__.py @@ -0,0 +1 @@ +#init diff --git a/image_classification/DeiT/augment.py b/image_classification/DeiT/augment.py new file mode 100644 index 00000000..7a7f081c --- /dev/null +++ b/image_classification/DeiT/augment.py @@ -0,0 +1,285 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Augmentation""" +""" Rand Augmentation """ +# reference: RandAugment: Practical automated data augmentation with a reduced search space +# https://arxiv.org/abs/1909.13719 + +""" Auto Augmentation """ +# reference: AutoAugment: Learning Augmentation Policies from Data +# https://arxiv.org/abs/1805.09501 + +import random +import numpy as np +from PIL import Image, ImageEnhance, ImageOps + + +def auto_augment_policy_original(): + """25 types of augment policies in original paper""" + policy = [ + [('Posterize', 0.4, 8), ('Rotate', 0.6, 9)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + [('Posterize', 0.6, 7), ('Posterize', 0.6, 6)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Equalize', 0.4, 4), ('Rotate', 0.8, 8)], + [('Solarize', 0.6, 3), ('Equalize', 0.6, 7)], + [('Posterize', 0.8, 5), ('Equalize', 1.0, 2)], + [('Rotate', 0.2, 3), ('Solarize', 0.6, 8)], + [('Equalize', 0.6, 8), ('Posterize', 0.4, 6)], + [('Rotate', 0.8, 8), ('Color', 0.4, 0)], + [('Rotate', 0.4, 9), ('Equalize', 0.6, 2)], + [('Equalize', 0.0, 7), ('Equalize', 0.8, 8)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Rotate', 0.8, 8), ('Color', 1.0, 2)], + [('Color', 0.8, 8), ('Solarize', 0.8, 7)], + [('Sharpness', 0.4, 7), ('Invert', 0.6, 8)], + [('ShearX', 0.6, 5), ('Equalize', 1.0, 9)], + [('Color', 0.4, 0), ('Equalize', 0.6, 3)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + ] + policy = [[SubPolicy(*args) for args in subpolicy] for subpolicy in policy] + return policy + + +def rand_augment_policy_original(magnitude_idx=9): + """ + 14 types of augment policies in original paper + Args: + magnitude_idx: M + """ + policy = [ + ('Posterize', 1, magnitude_idx), ('Rotate', 1, magnitude_idx), + ('Solarize', 1, magnitude_idx), ('AutoContrast', 1, magnitude_idx), + ('Equalize', 1, magnitude_idx), ('Contrast', 1, magnitude_idx), + ('Color', 1, magnitude_idx), ('Invert', 1, magnitude_idx), + ('Sharpness', 1, magnitude_idx), ('Brightness', 1, magnitude_idx), + ('ShearX', 1, magnitude_idx), ('ShearY', 1, magnitude_idx), + ('TranslateX', 1, magnitude_idx), ('TranslateY', 1, magnitude_idx), + ] + policy = [SubPolicy(*args) for args in policy] + return policy + + +class AutoAugment(): + """Auto Augment + 
Randomly choose a tuple of augment ops from a list of policy + Then apply the tuple of augment ops to input image + + Examples: + policy = auto_augment_policy_original() + augment = AutoAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy): + self.policy = policy + + def __call__(self, image, policy_idx=None): + if policy_idx is None: + policy_idx = random.randint(0, len(self.policy) - 1) + + sub_policy = self.policy[policy_idx] + for op in sub_policy: + image = op(image) + return image + + +class RandAugment(): + """Rand Augment + Randomly choose N augment ops from a list of K policies + Then apply the N ops to input image + + Examples: + policy = rand_augment_policy_original(magnitude_idx) + augment = RandAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy, num_layers=2): + """ + Args: + policy: list of SubPolicy + num_layers: int + """ + self.policy = policy + self.num_layers = num_layers + + def __call__(self, image): + selected_idx = np.random.choice(len(self.policy), self.num_layers) + + for policy_idx in selected_idx: + sub_policy = self.policy[policy_idx] + image = sub_policy(image) + return image + + +class SubPolicy: + """Subpolicy + Read augment name and magnitude, apply augment with probability + Args: + op_name: str, augment operation name + prob: float, if prob > random prob, apply augment + magnitude_idx: int, index of magnitude in preset magnitude ranges + """ + + def __init__(self, op_name, prob, magnitude_idx): + # ranges of operations' magnitude + ranges = { + 'ShearX': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'ShearY': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'TranslateX': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'TranslateY': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'Rotate': np.linspace(0, 30, 10), # [-30, 30] (by random negative) + 'Color': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Posterize': np.round(np.linspace(8, 4, 10), 0).astype(np.int), # [0, 4] + 'Solarize': np.linspace(256, 0, 10), # [0, 256] + 'Contrast': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Sharpness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Brightness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'AutoContrast': [0] * 10, # no range + 'Equalize': [0] * 10, # no range + 'Invert': [0] * 10, # no range + } + + # augmentation operations + # Lambda is not pickleable for DDP + # image_ops = { + # 'ShearX': lambda image, magnitude: shear_x(image, magnitude), + # 'ShearY': lambda image, magnitude: shear_y(image, magnitude), + # 'TranslateX': lambda image, magnitude: translate_x(image, magnitude), + # 'TranslateY': lambda image, magnitude: translate_y(image, magnitude), + # 'Rotate': lambda image, magnitude: rotate(image, magnitude), + # 'AutoContrast': lambda image, magnitude: auto_contrast(image, magnitude), + # 'Invert': lambda image, magnitude: invert(image, magnitude), + # 'Equalize': lambda image, magnitude: equalize(image, magnitude), + # 'Solarize': lambda image, magnitude: solarize(image, magnitude), + # 'Posterize': lambda image, magnitude: posterize(image, magnitude), + # 'Contrast': lambda image, magnitude: contrast(image, magnitude), + # 'Color': lambda image, magnitude: color(image, magnitude), + # 'Brightness': lambda image, magnitude: brightness(image, magnitude), + # 'Sharpness': lambda image, magnitude: sharpness(image, magnitude), + # } + 
image_ops = { + 'ShearX': shear_x, + 'ShearY': shear_y, + 'TranslateX': translate_x_relative, + 'TranslateY': translate_y_relative, + 'Rotate': rotate, + 'AutoContrast': auto_contrast, + 'Invert': invert, + 'Equalize': equalize, + 'Solarize': solarize, + 'Posterize': posterize, + 'Contrast': contrast, + 'Color': color, + 'Brightness': brightness, + 'Sharpness': sharpness, + } + + self.prob = prob + self.magnitude = ranges[op_name][magnitude_idx] + self.op = image_ops[op_name] + + def __call__(self, image): + if self.prob > random.random(): + image = self.op(image, self.magnitude) + return image + + +# PIL Image transforms +# https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.transform +def shear_x(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, factor, 0, 0, 1, 0), fillcolor=fillcolor) + + +def shear_y(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, factor, 1, 0), fillcolor=fillcolor) + + +def translate_x_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, pixels, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, pixels), fillcolor=fillcolor) + + +def translate_x_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, magnitude, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, magnitude), fillcolor=fillcolor) + + +def rotate(image, magnitude): + rot = image.convert("RGBA").rotate(magnitude) + return Image.composite(rot, + Image.new('RGBA', rot.size, (128,) * 4), + rot).convert(image.mode) + + +def auto_contrast(image, magnitude=None): + return ImageOps.autocontrast(image) + + +def invert(image, magnitude=None): + return ImageOps.invert(image) + + +def equalize(image, magnitude=None): + return ImageOps.equalize(image) + + +def solarize(image, magnitude): + return ImageOps.solarize(image, magnitude) + + +def posterize(image, magnitude): + return ImageOps.posterize(image, magnitude) + + +def contrast(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Contrast(image).enhance(1 + magnitude) + + +def color(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Color(image).enhance(1 + magnitude) + + +def brightness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Brightness(image).enhance(1 + magnitude) + + +def sharpness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Sharpness(image).enhance(1 + magnitude) + diff --git a/image_classification/DeiT/auto_augment.py b/image_classification/DeiT/auto_augment.py deleted file mode 
100644 index a8daf02b..00000000 --- a/image_classification/DeiT/auto_augment.py +++ /dev/null @@ -1,223 +0,0 @@ -# Copyright (c) 2021 PPViT Authors. All Rights Reserved. -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -"""Auto Augmentation""" - -import random -import numpy as np -from PIL import Image, ImageEnhance, ImageOps - - -def auto_augment_policy_original(): - """ImageNet auto augment policy""" - policy = [ - [('Posterize', 0.4, 8), ('Rotate', 0.6, 9)], - [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], - [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], - [('Posterize', 0.6, 7), ('Posterize', 0.6, 6)], - [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], - [('Equalize', 0.4, 4), ('Rotate', 0.8, 8)], - [('Solarize', 0.6, 3), ('Equalize', 0.6, 7)], - [('Posterize', 0.8, 5), ('Equalize', 1.0, 2)], - [('Rotate', 0.2, 3), ('Solarize', 0.6, 8)], - [('Equalize', 0.6, 8), ('Posterize', 0.4, 6)], - [('Rotate', 0.8, 8), ('Color', 0.4, 0)], - [('Rotate', 0.4, 9), ('Equalize', 0.6, 2)], - [('Equalize', 0.0, 7), ('Equalize', 0.8, 8)], - [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], - [('Color', 0.6, 4), ('Contrast', 1.0, 8)], - [('Rotate', 0.8, 8), ('Color', 1.0, 2)], - [('Color', 0.8, 8), ('Solarize', 0.8, 7)], - [('Sharpness', 0.4, 7), ('Invert', 0.6, 8)], - [('ShearX', 0.6, 5), ('Equalize', 1.0, 9)], - [('Color', 0.4, 0), ('Equalize', 0.6, 3)], - [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], - [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], - [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], - [('Color', 0.6, 4), ('Contrast', 1.0, 8)], - [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], - ] - policy = [[SubPolicy(*args) for args in subpolicy] for subpolicy in policy] - return policy - - -class AutoAugment(): - """Auto Augment - Randomly choose a tuple of augment ops from a list of policy - Then apply the tuple of augment ops to input image - """ - def __init__(self, policy): - self.policy = policy - - def __call__(self, image, policy_idx=None): - if policy_idx is None: - policy_idx = random.randint(0, len(self.policy)-1) - - sub_policy = self.policy[policy_idx] - for op in sub_policy: - image = op(image) - return image - - -class SubPolicy: - """Subpolicy - Read augment name and magnitude, apply augment with probability - Args: - op_name: str, augment operation name - prob: float, if prob > random prob, apply augment - magnitude_idx: int, index of magnitude in preset magnitude ranges - """ - def __init__(self, op_name, prob, magnitude_idx): - # ranges of operations' magnitude - ranges = { - 'ShearX': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) - 'ShearY': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) - 'TranslateX': np.linspace(0, 150 / 331, 10), #[-0.45, 0.45] (by random negative) - 'TranslateY': np.linspace(0, 150 / 331, 10), #[-0.45, 0.45] (by random negative) - 'Rotate': np.linspace(0, 30, 10), #[-30, 30] (by random negative) - 'Color': np.linspace(0, 0.9, 10), #[-0.9, 0.9] (by random negative) - 'Posterize': np.round(np.linspace(8, 4, 10), 0).astype(np.int), #[0, 4] - 'Solarize': 
np.linspace(256, 0, 10), #[0, 256] - 'Contrast': np.linspace(0, 0.9, 10), #[-0.9, 0.9] (by random negative) - 'Sharpness': np.linspace(0, 0.9, 10), #[-0.9, 0.9] (by random negative) - 'Brightness': np.linspace(0, 0.9, 10), #[-0.9, 0.9] (by random negative) - 'AutoContrast': [0] * 10, # no range - 'Equalize': [0] * 10, # no range - 'Invert': [0] * 10, # no range - } - - # augmentation operations - # Lambda is not pickleable for DDP - #image_ops = { - # 'ShearX': lambda image, magnitude: shear_x(image, magnitude), - # 'ShearY': lambda image, magnitude: shear_y(image, magnitude), - # 'TranslateX': lambda image, magnitude: translate_x(image, magnitude), - # 'TranslateY': lambda image, magnitude: translate_y(image, magnitude), - # 'Rotate': lambda image, magnitude: rotate(image, magnitude), - # 'AutoContrast': lambda image, magnitude: auto_contrast(image, magnitude), - # 'Invert': lambda image, magnitude: invert(image, magnitude), - # 'Equalize': lambda image, magnitude: equalize(image, magnitude), - # 'Solarize': lambda image, magnitude: solarize(image, magnitude), - # 'Posterize': lambda image, magnitude: posterize(image, magnitude), - # 'Contrast': lambda image, magnitude: contrast(image, magnitude), - # 'Color': lambda image, magnitude: color(image, magnitude), - # 'Brightness': lambda image, magnitude: brightness(image, magnitude), - # 'Sharpness': lambda image, magnitude: sharpness(image, magnitude), - #} - image_ops = { - 'ShearX': shear_x, - 'ShearY': shear_y, - 'TranslateX': translate_x_relative, - 'TranslateY': translate_y_relative, - 'Rotate': rotate, - 'AutoContrast': auto_contrast, - 'Invert': invert, - 'Equalize': equalize, - 'Solarize': solarize, - 'Posterize': posterize, - 'Contrast': contrast, - 'Color': color, - 'Brightness': brightness, - 'Sharpness': sharpness, - } - - self.prob = prob - self.magnitude = ranges[op_name][magnitude_idx] - self.op = image_ops[op_name] - - def __call__(self, image): - if self.prob > random.random(): - image = self.op(image, self.magnitude) - return image - - -# PIL Image transforms -# https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.transform -def shear_x(image, magnitude, fillcolor=(128, 128, 128)): - factor = magnitude * random.choice([-1, 1]) # random negative - return image.transform(image.size, Image.AFFINE, (1, factor, 0, 0, 1, 0), fillcolor=fillcolor) - - -def shear_y(image, magnitude, fillcolor=(128, 128, 128)): - factor = magnitude * random.choice([-1, 1]) # random negative - return image.transform(image.size, Image.AFFINE, (1, 0, 0, factor, 1, 0), fillcolor=fillcolor) - - -def translate_x_relative(image, magnitude, fillcolor=(128, 128, 128)): - pixels = magnitude * image.size[0] - pixels = pixels * random.choice([-1, 1]) # random negative - return image.transform(image.size, Image.AFFINE, (1, 0, pixels, 0, 1, 0), fillcolor=fillcolor) - - -def translate_y_relative(image, magnitude, fillcolor=(128, 128, 128)): - pixels = magnitude * image.size[0] - pixels = pixels * random.choice([-1, 1]) # random negative - return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, pixels), fillcolor=fillcolor) - - -def translate_x_absolute(image, magnitude, fillcolor=(128, 128, 128)): - magnitude = magnitude * random.choice([-1, 1]) # random negative - return image.transform(image.size, Image.AFFINE, (1, 0, magnitude, 0, 1, 0), fillcolor=fillcolor) - - -def translate_y_absolute(image, magnitude, fillcolor=(128, 128, 128)): - magnitude = magnitude * random.choice([-1, 1]) # random negative - return 
image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, magnitude), fillcolor=fillcolor) - - -def rotate(image, magnitude): - rot = image.convert("RGBA").rotate(magnitude) - return Image.composite(rot, - Image.new('RGBA', rot.size, (128, ) * 4), - rot).convert(image.mode) - - -def auto_contrast(image, magnitude=None): - return ImageOps.autocontrast(image) - - -def invert(image, magnitude=None): - return ImageOps.invert(image) - - -def equalize(image, magnitude=None): - return ImageOps.equalize(image) - - -def solarize(image, magnitude): - return ImageOps.solarize(image, magnitude) - - -def posterize(image, magnitude): - return ImageOps.posterize(image, magnitude) - - -def contrast(image, magnitude): - magnitude = magnitude * random.choice([-1, 1]) # random negative - return ImageEnhance.Contrast(image).enhance(1 + magnitude) - - -def color(image, magnitude): - magnitude = magnitude * random.choice([-1, 1]) # random negative - return ImageEnhance.Color(image).enhance(1 + magnitude) - - -def brightness(image, magnitude): - magnitude = magnitude * random.choice([-1, 1]) # random negative - return ImageEnhance.Brightness(image).enhance(1 + magnitude) - - -def sharpness(image, magnitude): - magnitude = magnitude * random.choice([-1, 1]) # random negative - return ImageEnhance.Sharpness(image).enhance(1 + magnitude) - diff --git a/image_classification/DeiT/config.py b/image_classification/DeiT/config.py index 5bdcf9ea..799a614b 100644 --- a/image_classification/DeiT/config.py +++ b/image_classification/DeiT/config.py @@ -31,18 +31,21 @@ _C.DATA.DATA_PATH = '/dataset/imagenet/' # path to dataset _C.DATA.DATASET = 'imagenet2012' # dataset name _C.DATA.IMAGE_SIZE = 224 # input image size: 224 for pretrain, 384 for finetune +_C.DATA.IMAGE_CHANNELS = 3 # input image channels _C.DATA.CROP_PCT = 0.875 # input image scale ratio, scale is applied before centercrop in eval mode -_C.DATA.NUM_WORKERS = 2 # number of data loading threads +_C.DATA.NUM_WORKERS = 1 # number of data loading threads +_C.DATA.IMAGENET_MEAN = [0.485, 0.456, 0.406] # [0.5, 0.5, 0.5] +_C.DATA.IMAGENET_STD = [0.229, 0.224, 0.225] # [0.5, 0.5, 0.5] # model settings _C.MODEL = CN() _C.MODEL.TYPE = 'DeiT' _C.MODEL.NAME = 'DeiT' _C.MODEL.RESUME = None -_C.MODEL.RESUME_EMA = None _C.MODEL.PRETRAINED = None _C.MODEL.NUM_CLASSES = 1000 -_C.MODEL.DROPOUT = 0.1 +_C.MODEL.DROPOUT = 0.0 +_C.MODEL.DROPPATH = 0.1 _C.MODEL.ATTENTION_DROPOUT = 0.0 # transformer settings @@ -61,13 +64,16 @@ _C.TRAIN = CN() _C.TRAIN.LAST_EPOCH = 0 _C.TRAIN.NUM_EPOCHS = 300 -_C.TRAIN.WARMUP_EPOCHS = 3 #34 # ~ 10k steps for 4096 batch size -_C.TRAIN.WEIGHT_DECAY = 0.05 #0.3 # 0.0 for finetune -_C.TRAIN.BASE_LR = 0.001 #0.003 for pretrain # 0.03 for finetune -_C.TRAIN.WARMUP_START_LR = 1e-6 #0.0 -_C.TRAIN.END_LR = 5e-4 -_C.TRAIN.GRAD_CLIP = 1.0 -_C.TRAIN.ACCUM_ITER = 2 #1 +_C.TRAIN.WARMUP_EPOCHS = 5 +_C.TRAIN.WEIGHT_DECAY = 0.05 +_C.TRAIN.BASE_LR = 0.0005 +_C.TRAIN.WARMUP_START_LR = 1e-6 +_C.TRAIN.END_LR = 1e-5 +_C.TRAIN.GRAD_CLIP = None +_C.TRAIN.ACCUM_ITER = 1 +_C.TRAIN.MODEL_EMA = True +_C.TRAIN.MODEL_EMA_DECAY = 0.99992 +_C.TRAIN.LINEAR_SCALED_LR = None _C.TRAIN.LR_SCHEDULER = CN() _C.TRAIN.LR_SCHEDULER.NAME = 'warmupcosine' @@ -82,20 +88,21 @@ _C.TRAIN.OPTIMIZER.MOMENTUM = 0.9 # train augmentation -_C.TRAIN.MIXUP_ALPHA = 0.8 -_C.TRAIN.CUTMIX_ALPHA = 1.0 -_C.TRAIN.CUTMIX_MINMAX = None -_C.TRAIN.MIXUP_PROB = 1.0 -_C.TRAIN.MIXUP_SWITCH_PROB = 0.5 -_C.TRAIN.MIXUP_MODE = 'batch' +_C.TRAIN.MIXUP_ALPHA = 0.8 # mixup alpha, enabled if >0 +_C.TRAIN.CUTMIX_ALPHA = 1.0 # 
cutmix alpha, enabled if >0 +_C.TRAIN.CUTMIX_MINMAX = None # cutmix min/max ratio, overrides alpha +_C.TRAIN.MIXUP_PROB = 1.0 # prob of mixup or cutmix when either/both is enabled +_C.TRAIN.MIXUP_SWITCH_PROB = 0.5 # prob of switching cutmix when both mixup and cutmix enabled +_C.TRAIN.MIXUP_MODE = 'batch' # how to apply mixup/cutmix params, per 'batch', 'pair' or 'elem' _C.TRAIN.SMOOTHING = 0.1 -_C.TRAIN.COLOR_JITTER = 0.4 -_C.TRAIN.AUTO_AUGMENT = True #'rand-m9-mstd0.5-inc1' +_C.TRAIN.COLOR_JITTER = 0.4 # color jitter factor +_C.TRAIN.AUTO_AUGMENT = False #'rand-m9-mstd0.5-inc1' +_C.TRAIN.RAND_AUGMENT = True -_C.TRAIN.RANDOM_ERASE_PROB = 0.25 -_C.TRAIN.RANDOM_ERASE_MODE = 'pixel' -_C.TRAIN.RANDOM_ERASE_COUNT = 1 +_C.TRAIN.RANDOM_ERASE_PROB = 0.25 # random erase prob +_C.TRAIN.RANDOM_ERASE_MODE = 'pixel' # random erase mode +_C.TRAIN.RANDOM_ERASE_COUNT = 1 # random erase count _C.TRAIN.RANDOM_ERASE_SPLIT = False _C.TRAIN.DISTILLATION_TYPE = 'hard' # hard, soft, none @@ -103,17 +110,15 @@ _C.TRAIN.DISTILLATION_TAU = 1.0 _C.TRAIN.TEACHER_MODEL = './regnety_160' # no ext is needed -_C.TRAIN.MODEL_EMA = True -_C.TRAIN.MODEL_EMA_DECAY = 0.99996 - # misc _C.SAVE = "./output" _C.TAG = "default" -_C.SAVE_FREQ = 5 # freq to save chpt -_C.REPORT_FREQ = 100 # freq to logging info -_C.VALIDATE_FREQ = 100 # freq to do validation -_C.SEED = 0 +_C.SAVE_FREQ = 1 # freq to save chpt +_C.REPORT_FREQ = 50 # freq to logging info +_C.VALIDATE_FREQ = 10 # freq to do validation +_C.SEED = 42 _C.EVAL = False # run evaluation only +_C.AMP = False _C.LOCAL_RANK = 0 _C.NGPUS = -1 @@ -147,8 +152,12 @@ def update_config(config, args): config.DATA.BATCH_SIZE = args.batch_size if args.image_size: config.DATA.IMAGE_SIZE = args.image_size + if args.num_classes: + config.MODEL.NUM_CLASSES = args.num_classes if args.data_path: config.DATA.DATA_PATH = args.data_path + if args.output is not None: + config.SAVE = args.output if args.ngpus: config.NGPUS = args.ngpus if args.eval: @@ -160,9 +169,11 @@ def update_config(config, args): config.MODEL.RESUME = args.resume if args.last_epoch: config.TRAIN.LAST_EPOCH = args.last_epoch - if args.teacher_model: - config.TRAIN.TEACHER_MODEL = args.teacher_model - + if args.amp: # only during training + if config.EVAL is True: + config.AMP = False + else: + config.AMP = True #config.freeze() return config diff --git a/image_classification/DeiT/configs/deit_base_patch16_224.yaml b/image_classification/DeiT/configs/deit_base_patch16_224.yaml index dd0f608d..28220114 100644 --- a/image_classification/DeiT/configs/deit_base_patch16_224.yaml +++ b/image_classification/DeiT/configs/deit_base_patch16_224.yaml @@ -14,10 +14,11 @@ MODEL: TRAIN: NUM_EPOCHS: 300 WARMUP_EPOCHS: 5 - WEIGHT_DECAY: 0.3 - BASE_LR: 0.003 + WEIGHT_DECAY: 0.05 + BASE_LR: 0.0005 WARMUP_START_LR: 1e-6 END_LR: 5e-4 ACCUM_ITER: 2 + LINEAR_SCALED_LR: 512 diff --git a/image_classification/DeiT/configs/deit_small_patch16_224.yaml b/image_classification/DeiT/configs/deit_small_patch16_224.yaml new file mode 100644 index 00000000..8aa973c8 --- /dev/null +++ b/image_classification/DeiT/configs/deit_small_patch16_224.yaml @@ -0,0 +1,24 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: DeiT + NAME: deit_small_patch16_224 + TRANS: + PATCH_SIZE: 16 + EMBED_DIM: 384 + MLP_RATIO: 4.0 + DEPTH: 12 + NUM_HEADS: 6 + QKV_BIAS: True +TRAIN: + NUM_EPOCHS: 300 + WARMUP_EPOCHS: 5 + WEIGHT_DECAY: 0.05 + BASE_LR: 0.0005 + WARMUP_START_LR: 1e-6 + END_LR: 5e-4 + ACCUM_ITER: 2 + LINEAR_SCALED_LR: 512 + + diff --git 
a/image_classification/DeiT/configs/deit_tiny_patch16_224.yaml b/image_classification/DeiT/configs/deit_tiny_patch16_224.yaml new file mode 100644 index 00000000..272d33b6 --- /dev/null +++ b/image_classification/DeiT/configs/deit_tiny_patch16_224.yaml @@ -0,0 +1,24 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: DeiT + NAME: deit_tiny_patch16_224 + TRANS: + PATCH_SIZE: 16 + EMBED_DIM: 192 + MLP_RATIO: 4.0 + DEPTH: 12 + NUM_HEADS: 3 + QKV_BIAS: True +TRAIN: + NUM_EPOCHS: 300 + WARMUP_EPOCHS: 5 + WEIGHT_DECAY: 0.05 + BASE_LR: 5e-4 + WARMUP_START_LR: 1e-6 + END_LR: 1e-5 + ACCUM_ITER: 1 + LINEAR_SCALED_LR: 512 + + diff --git a/image_classification/DeiT/datasets.py b/image_classification/DeiT/datasets.py index 067bbe39..984e1fcf 100644 --- a/image_classification/DeiT/datasets.py +++ b/image_classification/DeiT/datasets.py @@ -20,12 +20,20 @@ import os import math from PIL import Image -from paddle.io import Dataset, DataLoader, DistributedBatchSampler -from paddle.vision import transforms, datasets, image_load -from auto_augment import auto_augment_policy_original -from auto_augment import AutoAugment +from paddle.io import Dataset +from paddle.io import DataLoader +from paddle.io import DistributedBatchSampler +from paddle.vision import transforms +from paddle.vision import datasets +from paddle.vision import image_load +from augment import auto_augment_policy_original +from augment import AutoAugment +from augment import rand_augment_policy_original +from augment import RandAugment +from transforms import RandomHorizontalFlip from random_erasing import RandomErasing + class ImageNet2012Dataset(Dataset): """Build ImageNet2012 dataset @@ -93,13 +101,17 @@ def get_train_transforms(config): policy = auto_augment_policy_original() auto_augment = AutoAugment(policy) aug_op_list.append(auto_augment) + elif config.TRAIN.RAND_AUGMENT: + policy = rand_augment_policy_original() + rand_augment = RandAugment(policy) + aug_op_list.append(rand_augment) else: jitter = (float(config.TRAIN.COLOR_JITTER),) * 3 - aug_op_list.append(transforms.ColorJitter(jitter)) + aug_op_list.append(transforms.ColorJitter(*jitter)) # other ops aug_op_list.append(transforms.ToTensor()) - aug_op_list.append(transforms.Normalize(mean=[0.485, 0.456, 0.406], - std=[0.229, 0.224, 0.225])) + aug_op_list.append(transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, + std=config.DATA.IMAGENET_STD)) # random erasing if config.TRAIN.RANDOM_ERASE_PROB > 0.: random_erasing = RandomErasing(prob=config.TRAIN.RANDOM_ERASE_PROB, @@ -131,8 +143,7 @@ def get_val_transforms(config): transforms.Resize(scale_size, 'bicubic'), transforms.CenterCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE)), transforms.ToTensor(), - #transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]), - transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]), + transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD), ]) return transforms_val diff --git a/image_classification/DeiT/deit.py b/image_classification/DeiT/deit.py index 995440c9..1508f62a 100644 --- a/image_classification/DeiT/deit.py +++ b/image_classification/DeiT/deit.py @@ -21,6 +21,7 @@ import numpy as np import paddle import paddle.nn as nn +from droppath import DropPath class Identity(nn.Layer): @@ -100,8 +101,8 @@ def __init__(self, in_features, hidden_features, dropout=0.): self.dropout = nn.Dropout(dropout) def _init_weights(self): - weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.XavierUniform()) - bias_attr = 
paddle.ParamAttr(initializer=paddle.nn.initializer.Normal(std=1e-6)) + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) return weight_attr, bias_attr def forward(self, x): @@ -139,12 +140,25 @@ def __init__(self, self.dim_head = dim // num_heads self.scale = qk_scale or self.dim_head ** -0.5 - self.qkv = nn.Linear(dim, dim * 3, bias_attr=qkv_bias) + w_attr_1, b_attr_1 = self._init_weights() + self.qkv = nn.Linear(dim, + dim * 3, + weight_attr=w_attr_1, + bias_attr=b_attr_1 if qkv_bias else False) self.attn_dropout = nn.Dropout(attention_dropout) self.softmax = nn.Softmax(axis=-1) - self.proj = nn.Linear(dim, dim) + w_attr_2, b_attr_2 = self._init_weights() + self.proj = nn.Linear(dim, + dim, + weight_attr=w_attr_2, + bias_attr=b_attr_2) self.proj_dropout = nn.Dropout(dropout) + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + def transpose_multihead(self, x): new_shape = x.shape[:-1] + [self.num_heads, self.dim_head] x = x.reshape(new_shape) @@ -195,17 +209,30 @@ def __init__(self, attention_dropout=0, droppath=0.): super().__init__() - self.norm1 = nn.LayerNorm(dim, epsilon=1e-6) + w_attr_1, b_attr_1 = self._init_weights() + self.norm1 = nn.LayerNorm(dim, + weight_attr=w_attr_1, + bias_attr=b_attr_1, + epsilon=1e-6) self.attn = Attention(dim, num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale, attention_dropout=attention_dropout) self.drop_path = DropPath(droppath) if droppath > 0. else Identity() - self.norm2 = nn.LayerNorm(dim, epsilon=1e-6) + w_attr_2, b_attr_2 = self._init_weights() + self.norm2 = nn.LayerNorm(dim, + weight_attr=w_attr_2, + bias_attr=b_attr_2, + epsilon=1e-6) self.mlp = Mlp(in_features=dim, hidden_features=int(dim * mlp_ratio)) + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(1.0)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + def forward(self, x): h = x x = self.norm1(x) @@ -267,10 +294,32 @@ def __init__(self, qkv_bias=qkv_bias, attention_dropout=attention_dropout, droppath=droppath)) for _ in range(depth)]) - self.norm = nn.LayerNorm(embed_dim, epsilon=1e-6) + w_attr_1, b_attr_1 = self._init_weights_norm() + self.norm = nn.LayerNorm(embed_dim, + weight_attr=w_attr_1, + bias_attr=b_attr_1, + epsilon=1e-6) + + w_attr_2, b_attr_2 = self._init_weights_linear() + self.head = nn.Linear(embed_dim, + num_classes, + weight_attr=w_attr_2, + bias_attr=b_attr_2) + w_attr_3, b_attr_3 = self._init_weights_linear() + self.head_distill = nn.Linear(embed_dim, + num_classes, + weight_attr=w_attr_3, + bias_attr=b_attr_3) + + def _init_weights_linear(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr - self.head = nn.Linear(embed_dim, num_classes) - self.head_distill = nn.Linear(embed_dim, num_classes) + def _init_weights_norm(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(1.0)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr def forward_features(self, x): x = self.patch_embed(x) @@ -300,9 +349,15 @@ 
def forward(self, x): def build_deit(config): """build deit model using config""" model = Deit(image_size=config.DATA.IMAGE_SIZE, - depth=config.MODEL.TRANS.DEPTH, + in_channels=config.MODEL.TRANS.IN_CHANNELS, + num_classes=config.MODEL.NUM_CLASSES, + patch_size=config.MODEL.TRANS.PATCH_SIZE, embed_dim=config.MODEL.TRANS.EMBED_DIM, - mlp_ratio=config.MODEL.TRANS.MLP_RATIO, num_heads=config.MODEL.TRANS.NUM_HEADS, - qkv_bias=config.MODEL.TRANS.QKV_BIAS) + depth=config.MODEL.TRANS.DEPTH, + mlp_ratio=config.MODEL.TRANS.MLP_RATIO, + qkv_bias=config.MODEL.TRANS.QKV_BIAS, + dropout=config.MODEL.DROPOUT, + attention_dropout=config.MODEL.ATTENTION_DROPOUT, + droppath=config.MODEL.DROPPATH) return model diff --git a/image_classification/DeiT/droppath.py b/image_classification/DeiT/droppath.py new file mode 100644 index 00000000..d7ecf00c --- /dev/null +++ b/image_classification/DeiT/droppath.py @@ -0,0 +1,61 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Droppath, reimplement from https://github.com/yueatsprograms/Stochastic_Depth +""" + +import numpy as np +import paddle +import paddle.nn as nn + + +class DropPath(nn.Layer): + """DropPath class""" + def __init__(self, drop_prob=None): + super(DropPath, self).__init__() + self.drop_prob = drop_prob + + def drop_path(self, inputs): + """drop path op + Args: + input: tensor with arbitrary shape + drop_prob: float number of drop path probability, default: 0.0 + training: bool, if current mode is training, default: False + Returns: + output: output tensor after drop path + """ + # if prob is 0 or eval mode, return original input + if self.drop_prob == 0. 
or not self.training: + return inputs + keep_prob = 1 - self.drop_prob + keep_prob = paddle.to_tensor(keep_prob, dtype='float32') + shape = (inputs.shape[0], ) + (1, ) * (inputs.ndim - 1) # shape=(N, 1, 1, 1) + random_tensor = keep_prob + paddle.rand(shape, dtype=inputs.dtype) + random_tensor = random_tensor.floor() # mask + output = inputs.divide(keep_prob) * random_tensor # divide is to keep same output expectation + return output + + def forward(self, inputs): + return self.drop_path(inputs) + + +#def main(): +# tmp = paddle.to_tensor(np.random.rand(8, 16, 8, 8), dtype='float32') +# dp = DropPath(0.5) +# out = dp(tmp) +# print(out) +# +#if __name__ == "__main__": +# main() diff --git a/image_classification/DeiT/losses.py b/image_classification/DeiT/losses.py index 082467a3..04377eac 100644 --- a/image_classification/DeiT/losses.py +++ b/image_classification/DeiT/losses.py @@ -21,29 +21,50 @@ class LabelSmoothingCrossEntropyLoss(nn.Layer): """ cross entropy loss for label smoothing Args: - smoothing: float, smoothing rate - x: tensor, predictions (before softmax) with shape [N, num_classes] - target: tensor, target label with shape [N] + smoothing: float, label smoothing rate + x: tensor, predictions (default is before softmax) with shape [N, num_classes] as default + target: tensor, target label with shape [N] as default + weight: tensor, optional, a manual rescaling weight given to each class + reduction: str, optional, indicate how to average the loss by batch_size, + default is ``'mean'``, the candicates are ``'none'`` | ``'mean'`` | ``'sum'`` + axis: int, optional, the index of dimension to perform softmax calculations, + default is ``-1``, if `axis` is not -1 -> the shape of x and target may not be default + use_softmax: bool, optional, if `use_softmax` is ``False``, ``x`` should be after softmax, + default is ``True``, the candicates are ``True`` | ``False`` + name: str, optional, the name of the operator, default is ``None``, + for more information, please refer to :ref:`api_guide_Name`. 
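        Example (illustrative usage; `logits` and `labels` are placeholder names):
            criterion = LabelSmoothingCrossEntropyLoss(smoothing=0.1)
            loss = criterion(logits, labels)  # logits: [N, num_classes], labels: [N] int64
        Note:
            smoothing the one-hot target to (1 - smoothing) * onehot + smoothing / num_classes
            and taking soft-label cross entropy is equivalent to the usual
            (1 - smoothing) * nll_loss + smoothing * mean(-log_softmax(x), axis=-1) formulation.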
Return: loss: float, cross entropy loss value """ - def __init__(self, smoothing=0.1): + def __init__(self, + smoothing=0.1, + weight=None, + reduction='mean', + axis=-1, + use_softmax=True, + name=None): super().__init__() assert 0 <= smoothing < 1.0 self.smoothing = smoothing - self.confidence = 1 - smoothing + self.weight = weight + self.reduction = reduction + self.axis = axis + self.use_softmax = use_softmax + self.name = name def forward(self, x, target): - log_probs = F.log_softmax(x) # [N, num_classes] - # target_index is used to get prob for each of the N samples - target_index = paddle.zeros([x.shape[0], 2], dtype='int64') # [N, 2] - target_index[:, 0] = paddle.arange(x.shape[0]) - target_index[:, 1] = target - - nll_loss = -log_probs.gather_nd(index=target_index) # index: [N] - smooth_loss = -log_probs.mean(axis=-1) - loss = self.confidence * nll_loss + self.smoothing * smooth_loss - return loss.mean() + target = paddle.nn.functional.one_hot(target, num_classes=x.shape[1]) + target = paddle.nn.functional.label_smooth(target, epsilon=self.smoothing) + loss = paddle.nn.functional.cross_entropy( + x, + target, + weight=self.weight, + reduction=self.reduction, + soft_label=True, + axis=self.axis, + use_softmax=self.use_softmax, + name=self.name) + return loss class SoftTargetCrossEntropyLoss(nn.Layer): diff --git a/image_classification/DeiT/main_eval_regnet_multi_gpu.py b/image_classification/DeiT/main_eval_regnet_multi_gpu.py index 5de3dc52..ca615fd4 100644 --- a/image_classification/DeiT/main_eval_regnet_multi_gpu.py +++ b/image_classification/DeiT/main_eval_regnet_multi_gpu.py @@ -38,6 +38,7 @@ parser.add_argument('-batch_size', type=int, default=None) parser.add_argument('-image_size', type=int, default=None) parser.add_argument('-data_path', type=str, default=None) +parser.add_argument('-output', type=str, default=None) parser.add_argument('-ngpus', type=int, default=None) parser.add_argument('-pretrained', type=str, default=None) parser.add_argument('-resume', type=str, default=None) diff --git a/image_classification/DeiT/main_multi_gpu.py b/image_classification/DeiT/main_multi_gpu.py index 4e59321b..1dab0690 100644 --- a/image_classification/DeiT/main_multi_gpu.py +++ b/image_classification/DeiT/main_multi_gpu.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
@@ -27,10 +27,9 @@ import paddle.distributed as dist from datasets import get_dataloader from datasets import get_dataset -from deit import build_deit as build_model -from regnet import build_regnet as build_teacher_model from utils import AverageMeter from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn from config import get_config from config import update_config from mixup import Mixup @@ -38,47 +37,48 @@ from losses import SoftTargetCrossEntropyLoss from losses import DistillationLoss from model_ema import ModelEma +from deit import build_deit as build_model +from regnet import build_regnet as build_teacher_model -parser = argparse.ArgumentParser('DeiT') -parser.add_argument('-cfg', type=str, default=None) -parser.add_argument('-dataset', type=str, default=None) -parser.add_argument('-batch_size', type=int, default=None) -parser.add_argument('-image_size', type=int, default=None) -parser.add_argument('-data_path', type=str, default=None) -parser.add_argument('-ngpus', type=int, default=None) -parser.add_argument('-pretrained', type=str, default=None) -parser.add_argument('-resume', type=str, default=None) -parser.add_argument('-teacher_model', type=str, default=None) -parser.add_argument('-last_epoch', type=int, default=None) -parser.add_argument('-eval', action='store_true') -arguments = parser.parse_args() - - -log_format = "%(asctime)s %(message)s" -logging.basicConfig(stream=sys.stdout, level=logging.INFO, - format=log_format, datefmt="%m%d %I:%M:%S %p") - -# get default config -config = get_config() -# update config by arguments -config = update_config(config, arguments) - -# set output folder -if not config.EVAL: - config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) -else: - config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) - -if not os.path.exists(config.SAVE): - os.makedirs(config.SAVE, exist_ok=True) - -# set logging format -logger = logging.getLogger() -fh = logging.FileHandler(os.path.join(config.SAVE, 'log.txt')) -fh.setFormatter(logging.Formatter(log_format)) -logger.addHandler(fh) -logger.info(f'config= {config}') +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('DeiT') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-teacher_model', type=str, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + 
format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger def train(dataloader, @@ -86,30 +86,45 @@ def train(dataloader, criterion, optimizer, epoch, + total_epochs, total_batch, debug_steps=100, accum_iter=1, model_ema=None, - mixup_fn=None): + mixup_fn=None, + amp=False, + local_logger=None, + master_logger=None): """Training for one epoch Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion epoch: int, current epoch - total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info - accum_iter: int, num of iters for accumulating gradients + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 model_ema: ModelEma, model moving average instance - mixup_fn: Mixup, mixup instance + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None Returns: - train_loss_meter.avg - train_acc_meter.avg - train_time + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + master_train_loss_meter.avg: float, average loss on all processes/gpus + master_train_acc_meter.avg: float, average top1 accuracy on all processes/gpus + train_time: float, training time """ model.train() train_loss_meter = AverageMeter() train_acc_meter = AverageMeter() + master_train_loss_meter = AverageMeter() + master_train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) time_st = time.time() for batch_id, data in enumerate(dataloader): @@ -119,23 +134,30 @@ def train(dataloader, if mixup_fn is not None: image, label = mixup_fn(image, label_orig) - - output = model(image) - loss = criterion(image, output, label) - - #NOTE: division may be needed depending on the loss function - # Here no division is needed: - # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' - # - #loss = loss / accum_iter - - loss.backward() - - if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): - optimizer.step() - optimizer.clear_grad() - - if model_ema is not None and paddle.distributed.get_rank() == 0: + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) # output[0]: class_token, output[1]: distill_token + loss = criterion(image, output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) # output[0]: class_token, output[1]: distill_token + loss = criterion(image, output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() + + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + 
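                # NOTE: with gradient accumulation, optimizer.step() runs only once every
                # `accum_iter` micro-batches (or on the final batch of the epoch); gradients
                # from the intermediate backward() calls add up because clear_grad() is only
                # called after the step, emulating a larger effective batch size.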
optimizer.step() + optimizer.clear_grad() + + if model_ema is not None and dist.get_rank() == 0: model_ema.update(model) # average of output and kd_output, like model eval mode @@ -145,39 +167,77 @@ def train(dataloader, else: acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) - batch_size = image.shape[0] - train_loss_meter.update(loss.numpy()[0], batch_size) - train_acc_meter.update(acc.numpy()[0], batch_size) + batch_size = paddle.to_tensor(image.shape[0]) - if batch_id % debug_steps == 0: - logger.info( - f"Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + - f"Step[{batch_id:04d}/{total_batch:04d}], " + - f"Avg Loss: {train_loss_meter.avg:.4f}, " + - f"Avg Acc: {train_acc_meter.avg:.4f}") + # sync from other gpus for overall loss and acc + master_loss = loss.clone() + master_acc = acc.clone() + master_batch_size = batch_size.clone() + dist.all_reduce(master_loss) + dist.all_reduce(master_acc) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc = master_acc / dist.get_world_size() + master_train_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_train_acc_meter.update(master_acc.numpy()[0], master_batch_size.numpy()[0]) - train_time = time.time() - time_st - return train_loss_meter.avg, train_acc_meter.avg, train_time + train_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + train_acc_meter.update(acc.numpy()[0], batch_size.numpy()[0]) + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_train_loss_meter.avg:.4f}, " + + f"Avg Acc: {master_train_acc_meter.avg:.4f}") -def validate(dataloader, model, criterion, total_batch, debug_steps=100): + train_time = time.time() - time_st + return (train_loss_meter.avg, + train_acc_meter.avg, + master_train_loss_meter.avg, + master_train_acc_meter.avg, + train_time) + + +def validate(dataloader, + model, + criterion, + total_batch, + debug_steps=100, + local_logger=None, + master_logger=None): """Validation for whole dataset Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info + debug_steps: int, num of iters to log info, default: 100 + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None Returns: - val_loss_meter.avg - val_acc1_meter.avg - val_acc5_meter.avg - val_time + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + master_val_loss_meter.avg: float, average loss on all processes/gpus + master_val_acc1_meter.avg: float, average top1 accuracy on all processes/gpus + master_val_acc5_meter.avg: float, average top5 accuracy on all processes/gpus + val_time: float, validation time """ model.eval() val_loss_meter = AverageMeter() val_acc1_meter = AverageMeter() val_acc5_meter = AverageMeter() + master_val_loss_meter = AverageMeter() + master_val_acc1_meter = AverageMeter() + master_val_acc5_meter = 
AverageMeter() time_st = time.time() with paddle.no_grad(): @@ -192,60 +252,104 @@ def validate(dataloader, model, criterion, total_batch, debug_steps=100): acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) - dist.all_reduce(loss) - dist.all_reduce(acc1) - dist.all_reduce(acc5) - loss = loss / dist.get_world_size() - acc1 = acc1 / dist.get_world_size() - acc5 = acc5 / dist.get_world_size() - batch_size = paddle.to_tensor(image.shape[0]) - dist.all_reduce(batch_size) + + master_loss = loss.clone() + master_acc1 = acc1.clone() + master_acc5 = acc5.clone() + master_batch_size = batch_size.clone() + + dist.all_reduce(master_loss) + dist.all_reduce(master_acc1) + dist.all_reduce(master_acc5) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc1 = master_acc1 / dist.get_world_size() + master_acc5 = master_acc5 / dist.get_world_size() + + master_val_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc1_meter.update(master_acc1.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc5_meter.update(master_acc5.numpy()[0], master_batch_size.numpy()[0]) val_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) val_acc1_meter.update(acc1.numpy()[0], batch_size.numpy()[0]) val_acc5_meter.update(acc5.numpy()[0], batch_size.numpy()[0]) if batch_id % debug_steps == 0: - logger.info( - f"Val Step[{batch_id:04d}/{total_batch:04d}], " + - f"Avg Loss: {val_loss_meter.avg:.4f}, " + - f"Avg Acc@1: {val_acc1_meter.avg:.4f}, "+ - f"Avg Acc@5: {val_acc5_meter.avg:.4f}") - + if local_logger: + local_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {master_val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {master_val_acc5_meter.avg:.4f}") val_time = time.time() - time_st - return val_loss_meter.avg, val_acc1_meter.avg, val_acc5_meter.avg, val_time + return (val_loss_meter.avg, + val_acc1_meter.avg, + val_acc5_meter.avg, + master_val_loss_meter.avg, + master_val_acc1_meter.avg, + master_val_acc5_meter.avg, + val_time) def main_worker(*args): - # 0. Preparation + # STEP 0: Preparation + config = args[0] dist.init_parallel_env() last_epoch = config.TRAIN.LAST_EPOCH - world_size = paddle.distributed.get_world_size() - local_rank = paddle.distributed.get_rank() - logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + world_size = dist.get_world_size() + local_rank = dist.get_rank() seed = config.SEED + local_rank paddle.seed(seed) np.random.seed(seed) random.seed(seed) - # 1. Create model + # logger for each process/gpu + local_logger = get_logger( + filename=os.path.join(config.SAVE, 'log_{}.txt'.format(local_rank)), + logger_name='local_logger') + # overall logger + if local_rank == 0: + master_logger = get_logger( + filename=os.path.join(config.SAVE, 'log.txt'), + logger_name='master_logger') + master_logger.info(f'\n{config}') + else: + master_logger = None + local_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + if local_rank == 0: + master_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + + # STEP 1: Create model model = build_model(config) - # 8. 
Define model ema + # define model ema model_ema = None - if not config.EVAL: # only apply when training - if config.TRAIN.MODEL_EMA and local_rank == 0: - model_ema = ModelEma(model, decay=config.TRAIN.MODEL_EMA_DECAY) + if not config.EVAL and config.TRAIN.MODEL_EMA and local_rank == 0: + model_ema = ModelEma(model, decay=config.TRAIN.MODEL_EMA_DECAY) model = paddle.DataParallel(model) - # 2. Create train and val dataloader - dataset_train, dataset_val = args[0], args[1] - dataloader_train = get_dataloader(config, dataset_train, 'train', True) + + # STEP 2: Create train and val dataloader + dataset_train, dataset_val = args[1], args[2] + # Create training dataloader + if not config.EVAL: + dataloader_train = get_dataloader(config, dataset_train, 'train', True) + total_batch_train = len(dataloader_train) + local_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + if local_rank == 0: + master_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + # Create validation dataloader dataloader_val = get_dataloader(config, dataset_val, 'test', True) - total_batch_train = len(dataloader_train) total_batch_val = len(dataloader_val) - logging.info(f'----- Total # of train batch (single gpu): {total_batch_train}') - logging.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + local_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + if local_rank == 0: + master_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') - # 3. Define mixup function + # STEP 3: Define Mixup function mixup_fn = None if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, @@ -254,36 +358,60 @@ def main_worker(*args): prob=config.TRAIN.MIXUP_PROB, switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, mode=config.TRAIN.MIXUP_MODE, - label_smoothing=config.TRAIN.SMOOTHING) - # 4. Define criterion + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion if config.TRAIN.MIXUP_PROB > 0.: criterion = SoftTargetCrossEntropyLoss() elif config.TRAIN.SMOOTHING: criterion = LabelSmoothingCrossEntropyLoss() else: criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() - val_criterion = nn.CrossEntropyLoss() # 5. Create Teacher model teacher_model = None if not config.EVAL: if config.TRAIN.DISTILLATION_TYPE != 'none': - logging.info(f'Creating teacher model: {config.TRAIN.TEACHER_MODEL}') + local_logger.info(f'Creating teacher model: {config.TRAIN.TEACHER_MODEL}') teacher_model = build_teacher_model() assert os.path.isfile(config.TRAIN.TEACHER_MODEL + '.pdparams') teacher_model_state = paddle.load(config.TRAIN.TEACHER_MODEL + '.pdparams') teacher_model.set_dict(teacher_model_state) teacher_model.eval() - logger.info(f"----- Load teacher model state from {config.TRAIN.TEACHER_MODEL}") - # wrap the criterion: - criterion = DistillationLoss(criterion, - teacher_model, - config.TRAIN.DISTILLATION_TYPE, - config.TRAIN.DISTILLATION_ALPHA, - config.TRAIN.DISTILLATION_TAU) - - # 6. 
Define optimizer and lr_scheduler + teacher_model = paddle.DataParallel(teacher_model) + local_logger.info(f"----- Load teacher model state from {config.TRAIN.TEACHER_MODEL}") + # wrap the criterion: + criterion = DistillationLoss(criterion, + teacher_model, + config.TRAIN.DISTILLATION_TYPE, + config.TRAIN.DISTILLATION_ALPHA, + config.TRAIN.DISTILLATION_TAU) + else: + raise ValueError('Distillation type cannot be None') + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + scheduler = None if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, @@ -305,7 +433,9 @@ def main_worker(*args): gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, last_epoch=last_epoch) else: - logging.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + local_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") if config.TRAIN.OPTIMIZER.NAME == "SGD": @@ -332,85 +462,132 @@ def main_worker(*args): weight_decay=config.TRAIN.WEIGHT_DECAY, epsilon=config.TRAIN.OPTIMIZER.EPS, grad_clip=clip, - #apply_decay_param_fun=get_exclude_from_weight_decay_fn(['pos_embed', 'cls_token']), + #apply_decay_param_fun=get_exclude_from_weight_decay_fn([ + # 'absolute_pos_embed', 'relative_position_bias_table']), ) else: - logging.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + local_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") - # 7. 
Load pretrained model / load resumt model and optimizer states + # STEP 6: Load pretrained model / load resumt model and optimizer states if config.MODEL.PRETRAINED: if (config.MODEL.PRETRAINED).endswith('.pdparams'): raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') model.set_dict(model_state) - logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + local_logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + if local_rank == 0: + master_logger.info( + f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") if config.MODEL.RESUME: - assert os.path.isfile(config.MODEL.RESUME+'.pdparams') is True - assert os.path.isfile(config.MODEL.RESUME+'.pdopt') is True - model_state = paddle.load(config.MODEL.RESUME+'.pdparams') + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') model.set_dict(model_state) opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') optimizer.set_state_dict(opt_state) - logger.info( - f"----- Resume Training: Load model and optmizer states from {config.MODEL.RESUME}") + local_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + if local_rank == 0: + master_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") # load ema model - if model_ema is not None and os.path.isfile(config.MODEL.RESUME_EMA+'.pdparams'): - model_ema_state = paddle.load(config.MODEL.RESUME_EMA+'.pdparams') - model_ema.set_dict(model_ema_state) - logger.info(f"----- Load model ema from {config.MODEL.RESUME_EMA}") - # 8. Validation + if model_ema is not None and os.path.isfile(config.MODEL.RESUME + '-EMA.pdparams'): + model_ema_state = paddle.load(config.MODEL.RESUME + '-EMA.pdparams') + model_ema.module.set_state_dict(model_ema_state) + local_logger.info(f'----- Load model ema from {config.MODEL.RESUME}-EMA.pdparams') + if local_rank == 0: + master_logger.info(f'----- Load model ema from {config.MODEL.RESUME}-EMA.pdparams') + + # STEP 7: Validation (eval mode) if config.EVAL: - logger.info('----- Start Validating') - val_loss, val_acc1, val_acc5, val_time = validate( + local_logger.info('----- Start Validating') + if local_rank == 0: + master_logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=val_criterion, + criterion=criterion_val, total_batch=total_batch_val, - debug_steps=config.REPORT_FREQ) - logger.info(f"Validation Loss: {val_loss:.4f}, " + - f"Validation Acc@1: {val_acc1:.4f}, " + - f"Validation Acc@5: {val_acc5:.4f}, " + - f"time: {val_time:.2f}") + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") return - # 8. 
Start training and validation - logging.info(f"Start training from epoch {last_epoch+1}.") + # STEP 8: Start training and validation (train mode) + local_logger.info(f"Start training from epoch {last_epoch+1}.") + if local_rank == 0: + master_logger.info(f"Start training from epoch {last_epoch+1}.") for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): # train - logging.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") - train_loss, train_acc, train_time = train(dataloader=dataloader_train, - model=model, - criterion=criterion, - optimizer=optimizer, - epoch=epoch, - total_batch=total_batch_train, - debug_steps=config.REPORT_FREQ, - accum_iter=config.TRAIN.ACCUM_ITER, - model_ema=model_ema, - mixup_fn=mixup_fn) + local_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + if local_rank == 0: + master_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + train_loss, train_acc, avg_loss, avg_acc, train_time = train( + dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=total_batch_train, + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + model_ema=model_ema, + mixup_fn=mixup_fn, + amp=config.AMP, + local_logger=local_logger, + master_logger=master_logger) + scheduler.step() - logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + - f"Train Loss: {train_loss:.4f}, " + - f"Train Acc: {train_acc:.4f}, " + - f"time: {train_time:.2f}") + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {avg_loss:.4f}, " + + f"Train Acc: {avg_acc:.4f}, " + + f"time: {train_time:.2f}") + # validation if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: - logger.info(f'----- Validation after Epoch: {epoch}') - val_loss, val_acc1, val_acc5, val_time = validate( + local_logger.info(f'----- Validation after Epoch: {epoch}') + if local_rank == 0: + master_logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=val_criterion, + criterion=criterion_val, total_batch=total_batch_val, - debug_steps=config.REPORT_FREQ) - logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + - f"Validation Loss: {val_loss:.4f}, " + - f"Validation Acc@1: {val_acc1:.4f}, " + - f"Validation Acc@5: {val_acc5:.4f}, " + - f"time: {val_time:.2f}") + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") # model save if local_rank == 0: if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: @@ -418,21 +595,38 @@ def main_worker(*args): config.SAVE, 
f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") paddle.save(model.state_dict(), model_path + '.pdparams') paddle.save(optimizer.state_dict(), model_path + '.pdopt') - logger.info(f"----- Save model: {model_path}.pdparams") - logger.info(f"----- Save optim: {model_path}.pdopt") + master_logger.info(f"----- Save model: {model_path}.pdparams") + master_logger.info(f"----- Save optim: {model_path}.pdopt") if model_ema is not None: model_ema_path = os.path.join( config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}-EMA") paddle.save(model_ema.state_dict(), model_ema_path + '.pdparams') - logger.info(f"----- Save ema model: {model_ema_path}.pdparams") + master_logger.info(f"----- Save ema model: {model_ema_path}.pdparams") def main(): - # Build dataset - dataset_train = get_dataset(config, mode='train') + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + + # get dataset and start DDP + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + else: + dataset_train = None dataset_val = get_dataset(config, mode='val') config.NGPUS = len(paddle.static.cuda_places()) if config.NGPUS == -1 else config.NGPUS - dist.spawn(main_worker, args=(dataset_train, dataset_val, ), nprocs=config.NGPUS) + dist.spawn(main_worker, args=(config, dataset_train, dataset_val, ), nprocs=config.NGPUS) if __name__ == "__main__": diff --git a/image_classification/DeiT/main_single_gpu.py b/image_classification/DeiT/main_single_gpu.py index 0834dc11..5ea51051 100644 --- a/image_classification/DeiT/main_single_gpu.py +++ b/image_classification/DeiT/main_single_gpu.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
@@ -27,10 +27,9 @@ import paddle.nn.functional as F from datasets import get_dataloader from datasets import get_dataset -from deit import build_deit as build_model -from regnet import build_regnet as build_teacher_model from utils import AverageMeter from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn from config import get_config from config import update_config from mixup import Mixup @@ -38,49 +37,48 @@ from losses import SoftTargetCrossEntropyLoss from losses import DistillationLoss from model_ema import ModelEma +from deit import build_deit as build_model +from regnet import build_regnet as build_teacher_model -parser = argparse.ArgumentParser('DeiT') -parser.add_argument('-cfg', type=str, default=None) -parser.add_argument('-dataset', type=str, default=None) -parser.add_argument('-batch_size', type=int, default=None) -parser.add_argument('-image_size', type=int, default=None) -parser.add_argument('-data_path', type=str, default=None) -parser.add_argument('-ngpus', type=int, default=None) -parser.add_argument('-pretrained', type=str, default=None) -parser.add_argument('-resume', type=str, default=None) -parser.add_argument('-teacher_model', type=str, default=None) -parser.add_argument('-last_epoch', type=int, default=None) -parser.add_argument('-eval', action='store_true') -args = parser.parse_args() - - -log_format = "%(asctime)s %(message)s" -logging.basicConfig(stream=sys.stdout, level=logging.INFO, - format=log_format, datefmt="%m%d %I:%M:%S %p") - -# get default config -config = get_config() -# update config by arguments -config = update_config(config, args) - -# set output folder -if not config.EVAL: - config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) -else: - config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) - -#config.freeze() - -if not os.path.exists(config.SAVE): - os.makedirs(config.SAVE, exist_ok=True) - -# set logging format -logger = logging.getLogger() -fh = logging.FileHandler(os.path.join(config.SAVE, 'log.txt')) -fh.setFormatter(logging.Formatter(log_format)) -logger.addHandler(fh) -logger.info(f'config= {config}') +def get_arguments(): + """return arguments, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('DeiT') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-teacher_model', type=str, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, 
+ format=log_format, datefmt="%m%d %I:%M:%S %p") + # a different name is needed when creating multiple loggers in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger def train(dataloader, @@ -88,32 +86,42 @@ def train(dataloader, criterion, optimizer, epoch, + total_epochs, total_batch, debug_steps=100, accum_iter=1, model_ema=None, - mixup_fn=None): + mixup_fn=None, + amp=False, + logger=None): """Training for one epoch Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion epoch: int, current epoch - total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info - accum_iter: int, num of iters for accumulating gradients - mode_ema: ModelEma, model moving average instance - mixup_fn: Mixup, mixup instance + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + model_ema: ModelEma, model moving average instance + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mixed precision training, default: False + logger: logger for logging, default: None Returns: - train_loss_meter.avg - train_acc_meter.avg - train_time + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + train_time: float, training time """ model.train() train_loss_meter = AverageMeter() train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) time_st = time.time() + for batch_id, data in enumerate(dataloader): image = data[0] label = data[1] @@ -121,20 +129,28 @@ def train(dataloader, if mixup_fn is not None: image, label = mixup_fn(image, label_orig) - - output = model(image) - loss = criterion(image, output, label) - - #NOTE: division may be needed depending on the loss function - # Here no division is needed: - # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' - #loss = loss / accum_iter - - loss.backward() - - if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): - optimizer.step() - optimizer.clear_grad() + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) # output[0]: class_token, output[1]: distill_token + loss = criterion(image, output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) # output[0]: class_token, output[1]: distill_token + loss = criterion(image, output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() + + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() if model_ema is not None: model_ema.update(model) @@ -150,9 +166,9 @@ def train(dataloader, train_loss_meter.update(loss.numpy()[0], batch_size) train_acc_meter.update(acc.numpy()[0], batch_size) - if batch_id % debug_steps == 0: + if logger and batch_id % debug_steps == 0: 
logger.info( - f"Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + f"Step[{batch_id:04d}/{total_batch:04d}], " + f"Avg Loss: {train_loss_meter.avg:.4f}, " + f"Avg Acc: {train_acc_meter.avg:.4f}") @@ -161,19 +177,20 @@ def train(dataloader, return train_loss_meter.avg, train_acc_meter.avg, train_time -def validate(dataloader, model, criterion, total_batch, debug_steps=100): +def validate(dataloader, model, criterion, total_batch, debug_steps=100, logger=None): """Validation for whole dataset Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion - total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + logger: logger for logging, default: None Returns: - val_loss_meter.avg - val_acc1_meter.avg - val_acc5_meter.avg - val_time + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + val_time: float, validation time """ model.eval() val_loss_meter = AverageMeter() @@ -198,7 +215,7 @@ def validate(dataloader, model, criterion, total_batch, debug_steps=100): val_acc1_meter.update(acc1.numpy()[0], batch_size) val_acc5_meter.update(acc5.numpy()[0], batch_size) - if batch_id % debug_steps == 0: + if logger and batch_id % debug_steps == 0: logger.info( f"Val Step[{batch_id:04d}/{total_batch:04d}], " + f"Avg Loss: {val_loss_meter.avg:.4f}, " + @@ -210,26 +227,42 @@ def validate(dataloader, model, criterion, total_batch, debug_steps=100): def main(): - # 0. Preparation + # STEP 0: Preparation + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) last_epoch = config.TRAIN.LAST_EPOCH seed = config.SEED paddle.seed(seed) np.random.seed(seed) random.seed(seed) - #paddle.set_device('gpu:0') - # 1. Create model + logger = get_logger(filename=os.path.join(config.SAVE, 'log.txt')) + logger.info(f'\n{config}') + + # STEP 1: Create model model = build_model(config) - # 2. Define model ema + # define model ema model_ema = None - if not config.EVAL:# only apply ema when training - if config.TRAIN.MODEL_EMA: - model_ema = ModelEma(model, decay=config.TRAIN.MODEL_EMA_DECAY) - # 3. Create train and val dataloader - dataset_train = get_dataset(config, mode='train') + if not config.EVAL and config.TRAIN.MODEL_EMA: + model_ema = ModelEma(model, decay=config.TRAIN.MODEL_EMA_DECAY) + + # STEP 2: Create train and val dataloader + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + dataloader_train = get_dataloader(config, dataset_train, 'train', False) dataset_val = get_dataset(config, mode='val') - dataloader_train = get_dataloader(config, dataset_train, 'train', False) dataloader_val = get_dataloader(config, dataset_val, 'val', False) - # 4. 
Define mixup function + + + # STEP 3: Define Mixup function mixup_fn = None if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, @@ -238,34 +271,59 @@ def main(): prob=config.TRAIN.MIXUP_PROB, switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, mode=config.TRAIN.MIXUP_MODE, - label_smoothing=config.TRAIN.SMOOTHING) - # 5. Define criterion + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion if config.TRAIN.MIXUP_PROB > 0.: criterion = SoftTargetCrossEntropyLoss() elif config.TRAIN.SMOOTHING: criterion = LabelSmoothingCrossEntropyLoss() else: criterion = nn.CrossEntropyLoss() - # only use cross entropy for val - val_criterion = nn.CrossEntropyLoss() - # 6. Create Teacher model + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Create Teacher model teacher_model = None if not config.EVAL: if config.TRAIN.DISTILLATION_TYPE != 'none': - logging.info(f'Creating teacher model: {config.TRAIN.TEACHER_MODEL}') + logger.info(f'Creating teacher model: {config.TRAIN.TEACHER_MODEL}') teacher_model = build_teacher_model() assert os.path.isfile(config.TRAIN.TEACHER_MODEL + '.pdparams') teacher_model_state = paddle.load(config.TRAIN.TEACHER_MODEL + '.pdparams') teacher_model.set_dict(teacher_model_state) teacher_model.eval() logger.info(f"----- Load teacher model state from {config.TRAIN.TEACHER_MODEL}") - # wrap the criterion: - criterion = DistillationLoss(criterion, - teacher_model, - config.TRAIN.DISTILLATION_TYPE, - config.TRAIN.DISTILLATION_ALPHA, - config.TRAIN.DISTILLATION_TAU) - # 7. Define lr_scheduler + # wrap the criterion: + criterion = DistillationLoss(criterion, + teacher_model, + config.TRAIN.DISTILLATION_TYPE, + config.TRAIN.DISTILLATION_ALPHA, + config.TRAIN.DISTILLATION_TAU) + else: + logger.fatal('Distillation type cannot be None') + raise ValueError('Distillation type cannot be None') + + # STEP 6: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from official code) + + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr scheduler = None if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, @@ -287,9 +345,9 @@ def main(): gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, last_epoch=last_epoch) else: - logging.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") - # 8. 
Define optimizer + if config.TRAIN.OPTIMIZER.NAME == "SGD": if config.TRAIN.GRAD_CLIP: clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) @@ -309,64 +367,76 @@ def main(): optimizer = paddle.optimizer.AdamW( parameters=model.parameters(), learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, - weight_decay=config.TRAIN.WEIGHT_DECAY, beta1=config.TRAIN.OPTIMIZER.BETAS[0], beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, epsilon=config.TRAIN.OPTIMIZER.EPS, - grad_clip=clip) + grad_clip=clip, + #apply_decay_param_fun=get_exclude_from_weight_decay_fn([ + # 'absolute_pos_embed', 'relative_position_bias_table']), + ) else: - logging.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") - # 9. Load pretrained model or load resume model and optimizer states + + # STEP 7: Load pretrained model or load resume model and optimizer states if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') model.set_dict(model_state) logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") if config.MODEL.RESUME: - assert os.path.isfile(config.MODEL.RESUME+'.pdparams') is True - assert os.path.isfile(config.MODEL.RESUME+'.pdopt') is True - model_state = paddle.load(config.MODEL.RESUME+'.pdparams') + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') model.set_dict(model_state) - opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') - optimizer.set_dict(opt_state) + opt_state = paddle.load(config.MODEL.RESUME + '.pdopt') + optimizer.set_state_dict(opt_state) logger.info( f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") - if model_ema is not None and os.path.isfile(config.MODEL.RESUME_EMA+'.pdparams'): - model_ema_state = paddle.load(config.MODEL.RESUME_EMA+'.pdparams') - model_ema.set_dict(model_ema_state) - logger.info(f"----- Load model ema from {config.MODEL.RESUME_EMA}") - - # 10. Validation + # load ema model + if model_ema is not None and os.path.isfile(config.MODEL.RESUME + '-EMA.pdparams'): + model_ema_state = paddle.load(config.MODEL.RESUME + '-EMA.pdparams') + model_ema.module.set_state_dict(model_ema_state) + logger.info(f'----- Load model ema from {config.MODEL.RESUME}-EMA.pdparams') + + # STEP 8: Validation (eval mode) if config.EVAL: logger.info('----- Start Validating') val_loss, val_acc1, val_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=val_criterion, + criterion=criterion_val, total_batch=len(dataloader_val), - debug_steps=config.REPORT_FREQ) + debug_steps=config.REPORT_FREQ, + logger=logger) logger.info(f"Validation Loss: {val_loss:.4f}, " + f"Validation Acc@1: {val_acc1:.4f}, " + f"Validation Acc@5: {val_acc5:.4f}, " + f"time: {val_time:.2f}") return - # 10. 
Start training and validation - logging.info(f"Start training from epoch {last_epoch+1}.") + + # STEP 9: Start training and validation (train mode) + logger.info(f"Start training from epoch {last_epoch+1}.") for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): # train - logging.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") train_loss, train_acc, train_time = train(dataloader=dataloader_train, model=model, criterion=criterion, optimizer=optimizer, epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, total_batch=len(dataloader_train), debug_steps=config.REPORT_FREQ, accum_iter=config.TRAIN.ACCUM_ITER, model_ema=model_ema, - mixup_fn=mixup_fn) + mixup_fn=mixup_fn, + amp=config.AMP, + logger=logger) scheduler.step() logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + f"Train Loss: {train_loss:.4f}, " + @@ -378,9 +448,10 @@ def main(): val_loss, val_acc1, val_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=val_criterion, + criterion=criterion_val, total_batch=len(dataloader_val), - debug_steps=config.REPORT_FREQ) + debug_steps=config.REPORT_FREQ, + logger=logger) logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + f"Validation Loss: {val_loss:.4f}, " + f"Validation Acc@1: {val_acc1:.4f}, " + @@ -394,12 +465,12 @@ def main(): paddle.save(optimizer.state_dict(), model_path + '.pdopt') logger.info(f"----- Save model: {model_path}.pdparams") logger.info(f"----- Save optim: {model_path}.pdopt") - # save model ema if model_ema is not None: model_ema_path = os.path.join( config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}-EMA") paddle.save(model_ema.state_dict(), model_ema_path + '.pdparams') logger.info(f"----- Save ema model: {model_ema_path}.pdparams") + if __name__ == "__main__": main() diff --git a/image_classification/DeiT/model_ema.py b/image_classification/DeiT/model_ema.py index 389ab685..8a636765 100644 --- a/image_classification/DeiT/model_ema.py +++ b/image_classification/DeiT/model_ema.py @@ -56,3 +56,6 @@ def update(self, model): def set(self, model): self._update(model, update_fn=lambda e, m: m) + def state_dict(self): + return self.module.state_dict() + diff --git a/image_classification/DeiT/nohup.out b/image_classification/DeiT/nohup.out new file mode 100644 index 00000000..8ba00866 --- /dev/null +++ b/image_classification/DeiT/nohup.out @@ -0,0 +1,10513 @@ +merging config from ./configs/deit_tiny_patch16_224.yaml +----- Imagenet2012 image train list len = 1281167 +----- Imagenet2012 image val list len = 50000 +server not ready, wait 3 sec to retry... +not ready endpoints:['127.0.0.1:10052', '127.0.0.1:16573', '127.0.0.1:58903'] +I1019 12:36:27.022415 26349 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:10052 successful. +server not ready, wait 3 sec to retry... +not ready endpoints:['127.0.0.1:16573', '127.0.0.1:58903'] +I1019 12:36:29.410305 26372 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:16573 successful. +server not ready, wait 3 sec to retry... +not ready endpoints:['127.0.0.1:58903'] +I1019 12:36:31.798696 26398 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:58903 successful. 
+I1019 12:36:33.241878 26349 nccl_context.cc:74] init nccl context nranks: 4 local rank: 1 gpu id: 1 ring id: 0 +I1019 12:36:33.241895 26325 nccl_context.cc:74] init nccl context nranks: 4 local rank: 0 gpu id: 0 ring id: 0 +I1019 12:36:33.241914 26372 nccl_context.cc:74] init nccl context nranks: 4 local rank: 2 gpu id: 2 ring id: 0 +I1019 12:36:33.241920 26398 nccl_context.cc:74] init nccl context nranks: 4 local rank: 3 gpu id: 3 ring id: 0 +W1019 12:36:35.282066 26325 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1019 12:36:35.282845 26398 device_context.cc:447] Please NOTE: device: 3, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1019 12:36:35.282909 26372 device_context.cc:447] Please NOTE: device: 2, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1019 12:36:35.282928 26349 device_context.cc:447] Please NOTE: device: 1, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1019 12:36:35.340312 26325 device_context.cc:465] device: 0, cuDNN Version: 7.6. +W1019 12:36:35.340323 26349 device_context.cc:465] device: 1, cuDNN Version: 7.6. +W1019 12:36:35.340323 26398 device_context.cc:465] device: 3, cuDNN Version: 7.6. +W1019 12:36:35.340334 26372 device_context.cc:465] device: 2, cuDNN Version: 7.6. +INFO:local_logger:----- world_size = 4, local_rank = 2 +INFO:master_logger: +AMP: True +AUG: + AUTO_AUGMENT: rand-m9-mstd0.5-inc1 + COLOR_JITTER: 0.4 + CUTMIX: 1.0 + CUTMIX_MINMAX: None + MIXUP: 0.8 + MIXUP_MODE: batch + MIXUP_PROB: 1.0 + MIXUP_SWITCH_PROB: 0.5 + RE_COUNT: 1 + RE_MODE: pixel + RE_PROB: 0.25 +BASE: [''] +DATA: + BATCH_SIZE: 256 + BATCH_SIZE_EVAL: 8 + CROP_PCT: 0.875 + DATASET: imagenet2012 + DATA_PATH: /dataset/imagenet + IMAGE_SIZE: 224 + NUM_WORKERS: 2 +EVAL: False +LOCAL_RANK: 0 +MODEL: + ATTENTION_DROPOUT: 0.0 + DROPOUT: 0.0 + DROPPATH: 0.1 + NAME: deit_tiny_patch16_224 + NUM_CLASSES: 1000 + PRETRAINED: None + RESUME: None + RESUME_EMA: None + TRANS: + DEPTH: 12 + EMBED_DIM: 192 + INIT_VALUES: 1e-05 + IN_CHANNELS: 3 + MLP_RATIO: 4.0 + NUM_HEADS: 3 + PATCH_SIZE: 16 + QKV_BIAS: True + TYPE: DeiT +NGPUS: 4 +REPORT_FREQ: 50 +SAVE: ./output/train-20211019-12-36-17 +SAVE_FREQ: 1 +SEED: 42 +TAG: default +TRAIN: + ACCUM_ITER: 1 + AUTO_AUGMENT: True + BASE_LR: 0.0005 + COLOR_JITTER: 0.4 + CUTMIX_ALPHA: 1.0 + CUTMIX_MINMAX: None + DISTILLATION_ALPHA: 0.5 + DISTILLATION_TAU: 1.0 + DISTILLATION_TYPE: hard + END_LR: 1e-05 + GRAD_CLIP: None + LAST_EPOCH: 0 + LR_SCHEDULER: + DECAY_EPOCHS: 30 + DECAY_RATE: 0.1 + MILESTONES: 30, 60, 90 + NAME: warmupcosine + MIXUP_ALPHA: 0.8 + MIXUP_MODE: batch + MIXUP_PROB: 1.0 + MIXUP_SWITCH_PROB: 0.5 + MODEL_EMA: True + MODEL_EMA_DECAY: 0.99996 + NUM_EPOCHS: 300 + OPTIMIZER: + BETAS: (0.9, 0.999) + EPS: 1e-08 + MOMENTUM: 0.9 + NAME: AdamW + RANDOM_ERASE_COUNT: 1 + RANDOM_ERASE_MODE: pixel + RANDOM_ERASE_PROB: 0.25 + RANDOM_ERASE_SPLIT: False + SMOOTHING: 0.1 + TEACHER_MODEL: ./regnety_160 + WARMUP_EPOCHS: 5 + WARMUP_START_LR: 1e-06 + WEIGHT_DECAY: 0.05 +VALIDATE_FREQ: 10 +INFO:local_logger:----- world_size = 4, local_rank = 0 +INFO:master_logger:----- world_size = 4, local_rank = 0 +INFO:local_logger:----- world_size = 4, local_rank = 1 +INFO:local_logger:----- world_size = 4, local_rank = 3 +INFO:local_logger:----- Total # of train batch (single gpu): 1252 +INFO:local_logger:----- Total # of val batch (single gpu): 1563 +INFO:local_logger:Creating teacher model: 
./regnety_160 +INFO:local_logger:----- Total # of train batch (single gpu): 1252 +INFO:local_logger:----- Total # of val batch (single gpu): 1563 +INFO:master_logger:----- Total # of train batch (single gpu): 1252 +INFO:master_logger:----- Total # of val batch (single gpu): 1563 +INFO:local_logger:Creating teacher model: ./regnety_160 +INFO:local_logger:----- Total # of train batch (single gpu): 1252 +INFO:local_logger:----- Total # of val batch (single gpu): 1563 +INFO:local_logger:Creating teacher model: ./regnety_160 +INFO:local_logger:----- Total # of train batch (single gpu): 1252 +INFO:local_logger:----- Total # of val batch (single gpu): 1563 +INFO:local_logger:Creating teacher model: ./regnety_160 +INFO:local_logger:----- Load teacher model state from ./regnety_160 +INFO:local_logger:Start training from epoch 1. +INFO:master_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000101 +INFO:master_logger:Now training epoch 1. LR=0.000101 +INFO:local_logger:----- Load teacher model state from ./regnety_160 +INFO:local_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000101 +INFO:local_logger:----- Load teacher model state from ./regnety_160 +INFO:local_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000101 +INFO:local_logger:----- Load teacher model state from ./regnety_160 +INFO:local_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000101 +/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/math_op_patch.py:248: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.float16, but right dtype is paddle.float32, the right dtype will convert to paddle.float16 + format(lhs_dtype, rhs_dtype, lhs_dtype)) +/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/math_op_patch.py:248: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.float32, but right dtype is paddle.float16, the right dtype will convert to paddle.float32 + format(lhs_dtype, rhs_dtype, lhs_dtype)) +INFO:local_logger:Epoch[001/300], Step[0000/1252], Avg Loss: 7.0747, Avg Acc: 0.0000 +/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/math_op_patch.py:248: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.float16, but right dtype is paddle.float32, the right dtype will convert to paddle.float16 + format(lhs_dtype, rhs_dtype, lhs_dtype)) +/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/math_op_patch.py:248: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.float32, but right dtype is paddle.float16, the right dtype will convert to paddle.float32 + format(lhs_dtype, rhs_dtype, lhs_dtype)) +INFO:local_logger:Epoch[001/300], Step[0000/1252], Avg Loss: 7.0167, Avg Acc: 0.0039 +/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/math_op_patch.py:248: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.float16, but right dtype is paddle.float32, the right dtype will convert to paddle.float16 + format(lhs_dtype, rhs_dtype, lhs_dtype)) +/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/math_op_patch.py:248: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.float32, but right dtype is paddle.float16, the right dtype will convert to paddle.float32 + format(lhs_dtype, 
rhs_dtype, lhs_dtype)) +INFO:local_logger:Epoch[001/300], Step[0000/1252], Avg Loss: 7.0979, Avg Acc: 0.0000 +INFO:master_logger:Epoch[001/300], Step[0000/1252], Avg Loss: 7.0672, Avg Acc: 0.0010 +/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/math_op_patch.py:248: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.float16, but right dtype is paddle.float32, the right dtype will convert to paddle.float16 + format(lhs_dtype, rhs_dtype, lhs_dtype)) +/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/math_op_patch.py:248: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.float32, but right dtype is paddle.float16, the right dtype will convert to paddle.float32 + format(lhs_dtype, rhs_dtype, lhs_dtype)) +INFO:local_logger:Epoch[001/300], Step[0000/1252], Avg Loss: 7.0795, Avg Acc: 0.0000 +INFO:local_logger:Epoch[001/300], Step[0050/1252], Avg Loss: 6.9937, Avg Acc: 0.0012 +INFO:local_logger:Epoch[001/300], Step[0050/1252], Avg Loss: 7.0011, Avg Acc: 0.0011 +INFO:master_logger:Epoch[001/300], Step[0050/1252], Avg Loss: 6.9968, Avg Acc: 0.0013 +INFO:local_logger:Epoch[001/300], Step[0050/1252], Avg Loss: 6.9972, Avg Acc: 0.0018 +INFO:local_logger:Epoch[001/300], Step[0050/1252], Avg Loss: 6.9953, Avg Acc: 0.0010 +INFO:local_logger:Epoch[001/300], Step[0100/1252], Avg Loss: 6.9623, Avg Acc: 0.0013 +INFO:local_logger:Epoch[001/300], Step[0100/1252], Avg Loss: 6.9632, Avg Acc: 0.0011 +INFO:local_logger:Epoch[001/300], Step[0100/1252], Avg Loss: 6.9641, Avg Acc: 0.0017 +INFO:local_logger:Epoch[001/300], Step[0100/1252], Avg Loss: 6.9661, Avg Acc: 0.0009 +INFO:master_logger:Epoch[001/300], Step[0100/1252], Avg Loss: 6.9639, Avg Acc: 0.0012 +INFO:local_logger:Epoch[001/300], Step[0150/1252], Avg Loss: 6.9439, Avg Acc: 0.0012 +INFO:local_logger:Epoch[001/300], Step[0150/1252], Avg Loss: 6.9460, Avg Acc: 0.0016 +INFO:local_logger:Epoch[001/300], Step[0150/1252], Avg Loss: 6.9459, Avg Acc: 0.0014 +INFO:local_logger:Epoch[001/300], Step[0150/1252], Avg Loss: 6.9476, Avg Acc: 0.0009 +INFO:master_logger:Epoch[001/300], Step[0150/1252], Avg Loss: 6.9459, Avg Acc: 0.0013 +INFO:local_logger:Epoch[001/300], Step[0200/1252], Avg Loss: 6.9338, Avg Acc: 0.0013 +INFO:local_logger:Epoch[001/300], Step[0200/1252], Avg Loss: 6.9357, Avg Acc: 0.0016 +INFO:local_logger:Epoch[001/300], Step[0200/1252], Avg Loss: 6.9349, Avg Acc: 0.0015 +INFO:local_logger:Epoch[001/300], Step[0200/1252], Avg Loss: 6.9358, Avg Acc: 0.0011 +INFO:master_logger:Epoch[001/300], Step[0200/1252], Avg Loss: 6.9350, Avg Acc: 0.0014 +INFO:local_logger:Epoch[001/300], Step[0250/1252], Avg Loss: 6.9268, Avg Acc: 0.0018 +INFO:local_logger:Epoch[001/300], Step[0250/1252], Avg Loss: 6.9270, Avg Acc: 0.0013 +INFO:local_logger:Epoch[001/300], Step[0250/1252], Avg Loss: 6.9279, Avg Acc: 0.0016 +INFO:local_logger:Epoch[001/300], Step[0250/1252], Avg Loss: 6.9263, Avg Acc: 0.0014 +INFO:master_logger:Epoch[001/300], Step[0250/1252], Avg Loss: 6.9270, Avg Acc: 0.0015 +INFO:local_logger:Epoch[001/300], Step[0300/1252], Avg Loss: 6.9198, Avg Acc: 0.0016 +INFO:local_logger:Epoch[001/300], Step[0300/1252], Avg Loss: 6.9217, Avg Acc: 0.0016 +INFO:local_logger:Epoch[001/300], Step[0300/1252], Avg Loss: 6.9206, Avg Acc: 0.0014 +INFO:local_logger:Epoch[001/300], Step[0300/1252], Avg Loss: 6.9204, Avg Acc: 0.0018 +INFO:master_logger:Epoch[001/300], Step[0300/1252], Avg Loss: 6.9206, Avg Acc: 0.0016 +INFO:local_logger:Epoch[001/300], Step[0350/1252], 
Avg Loss: 6.9156, Avg Acc: 0.0016 +INFO:local_logger:Epoch[001/300], Step[0350/1252], Avg Loss: 6.9140, Avg Acc: 0.0017 +INFO:local_logger:Epoch[001/300], Step[0350/1252], Avg Loss: 6.9146, Avg Acc: 0.0014 +INFO:local_logger:Epoch[001/300], Step[0350/1252], Avg Loss: 6.9145, Avg Acc: 0.0018 +INFO:master_logger:Epoch[001/300], Step[0350/1252], Avg Loss: 6.9147, Avg Acc: 0.0016 +INFO:local_logger:Epoch[001/300], Step[0400/1252], Avg Loss: 6.9083, Avg Acc: 0.0018 +INFO:local_logger:Epoch[001/300], Step[0400/1252], Avg Loss: 6.9080, Avg Acc: 0.0016 +INFO:local_logger:Epoch[001/300], Step[0400/1252], Avg Loss: 6.9082, Avg Acc: 0.0020 +INFO:master_logger:Epoch[001/300], Step[0400/1252], Avg Loss: 6.9085, Avg Acc: 0.0018 +INFO:local_logger:Epoch[001/300], Step[0400/1252], Avg Loss: 6.9098, Avg Acc: 0.0017 +INFO:local_logger:Epoch[001/300], Step[0450/1252], Avg Loss: 6.9015, Avg Acc: 0.0016 +INFO:local_logger:Epoch[001/300], Step[0450/1252], Avg Loss: 6.9023, Avg Acc: 0.0021 +INFO:local_logger:Epoch[001/300], Step[0450/1252], Avg Loss: 6.9012, Avg Acc: 0.0021 +INFO:local_logger:Epoch[001/300], Step[0450/1252], Avg Loss: 6.9026, Avg Acc: 0.0018 +INFO:master_logger:Epoch[001/300], Step[0450/1252], Avg Loss: 6.9019, Avg Acc: 0.0019 +INFO:local_logger:Epoch[001/300], Step[0500/1252], Avg Loss: 6.8946, Avg Acc: 0.0021 +INFO:local_logger:Epoch[001/300], Step[0500/1252], Avg Loss: 6.8959, Avg Acc: 0.0020 +INFO:local_logger:Epoch[001/300], Step[0500/1252], Avg Loss: 6.8948, Avg Acc: 0.0018 +INFO:local_logger:Epoch[001/300], Step[0500/1252], Avg Loss: 6.8963, Avg Acc: 0.0021 +INFO:master_logger:Epoch[001/300], Step[0500/1252], Avg Loss: 6.8954, Avg Acc: 0.0020 +INFO:local_logger:Epoch[001/300], Step[0550/1252], Avg Loss: 6.8874, Avg Acc: 0.0022 +INFO:local_logger:Epoch[001/300], Step[0550/1252], Avg Loss: 6.8896, Avg Acc: 0.0021 +INFO:local_logger:Epoch[001/300], Step[0550/1252], Avg Loss: 6.8882, Avg Acc: 0.0020 +INFO:local_logger:Epoch[001/300], Step[0550/1252], Avg Loss: 6.8898, Avg Acc: 0.0022 +INFO:master_logger:Epoch[001/300], Step[0550/1252], Avg Loss: 6.8888, Avg Acc: 0.0021 +INFO:local_logger:Epoch[001/300], Step[0600/1252], Avg Loss: 6.8801, Avg Acc: 0.0024 +INFO:local_logger:Epoch[001/300], Step[0600/1252], Avg Loss: 6.8809, Avg Acc: 0.0021 +INFO:local_logger:Epoch[001/300], Step[0600/1252], Avg Loss: 6.8826, Avg Acc: 0.0023 +INFO:local_logger:Epoch[001/300], Step[0600/1252], Avg Loss: 6.8820, Avg Acc: 0.0023 +INFO:master_logger:Epoch[001/300], Step[0600/1252], Avg Loss: 6.8814, Avg Acc: 0.0023 +INFO:local_logger:Epoch[001/300], Step[0650/1252], Avg Loss: 6.8726, Avg Acc: 0.0025 +INFO:local_logger:Epoch[001/300], Step[0650/1252], Avg Loss: 6.8742, Avg Acc: 0.0024 +INFO:local_logger:Epoch[001/300], Step[0650/1252], Avg Loss: 6.8752, Avg Acc: 0.0024 +INFO:local_logger:Epoch[001/300], Step[0650/1252], Avg Loss: 6.8738, Avg Acc: 0.0023 +INFO:master_logger:Epoch[001/300], Step[0650/1252], Avg Loss: 6.8739, Avg Acc: 0.0024 +INFO:local_logger:Epoch[001/300], Step[0700/1252], Avg Loss: 6.8644, Avg Acc: 0.0027 +INFO:local_logger:Epoch[001/300], Step[0700/1252], Avg Loss: 6.8658, Avg Acc: 0.0024 +INFO:local_logger:Epoch[001/300], Step[0700/1252], Avg Loss: 6.8673, Avg Acc: 0.0026 +INFO:local_logger:Epoch[001/300], Step[0700/1252], Avg Loss: 6.8663, Avg Acc: 0.0025 +INFO:master_logger:Epoch[001/300], Step[0700/1252], Avg Loss: 6.8659, Avg Acc: 0.0025 +INFO:local_logger:Epoch[001/300], Step[0750/1252], Avg Loss: 6.8557, Avg Acc: 0.0029 +INFO:local_logger:Epoch[001/300], Step[0750/1252], Avg Loss: 6.8571, 
Avg Acc: 0.0026 +INFO:local_logger:Epoch[001/300], Step[0750/1252], Avg Loss: 6.8582, Avg Acc: 0.0028 +INFO:local_logger:Epoch[001/300], Step[0750/1252], Avg Loss: 6.8579, Avg Acc: 0.0027 +INFO:master_logger:Epoch[001/300], Step[0750/1252], Avg Loss: 6.8572, Avg Acc: 0.0028 +INFO:local_logger:Epoch[001/300], Step[0800/1252], Avg Loss: 6.8478, Avg Acc: 0.0031 +INFO:master_logger:Epoch[001/300], Step[0800/1252], Avg Loss: 6.8488, Avg Acc: 0.0029 +INFO:local_logger:Epoch[001/300], Step[0800/1252], Avg Loss: 6.8483, Avg Acc: 0.0027 +INFO:local_logger:Epoch[001/300], Step[0800/1252], Avg Loss: 6.8494, Avg Acc: 0.0029 +INFO:local_logger:Epoch[001/300], Step[0800/1252], Avg Loss: 6.8496, Avg Acc: 0.0029 +INFO:local_logger:Epoch[001/300], Step[0850/1252], Avg Loss: 6.8403, Avg Acc: 0.0032 +INFO:local_logger:Epoch[001/300], Step[0850/1252], Avg Loss: 6.8406, Avg Acc: 0.0030 +INFO:local_logger:Epoch[001/300], Step[0850/1252], Avg Loss: 6.8391, Avg Acc: 0.0030 +INFO:local_logger:Epoch[001/300], Step[0850/1252], Avg Loss: 6.8409, Avg Acc: 0.0031 +INFO:master_logger:Epoch[001/300], Step[0850/1252], Avg Loss: 6.8402, Avg Acc: 0.0031 +INFO:local_logger:Epoch[001/300], Step[0900/1252], Avg Loss: 6.8315, Avg Acc: 0.0032 +INFO:local_logger:Epoch[001/300], Step[0900/1252], Avg Loss: 6.8319, Avg Acc: 0.0034 +INFO:local_logger:Epoch[001/300], Step[0900/1252], Avg Loss: 6.8300, Avg Acc: 0.0032 +INFO:local_logger:Epoch[001/300], Step[0900/1252], Avg Loss: 6.8317, Avg Acc: 0.0032 +INFO:master_logger:Epoch[001/300], Step[0900/1252], Avg Loss: 6.8313, Avg Acc: 0.0032 +INFO:local_logger:Epoch[001/300], Step[0950/1252], Avg Loss: 6.8234, Avg Acc: 0.0035 +INFO:local_logger:Epoch[001/300], Step[0950/1252], Avg Loss: 6.8222, Avg Acc: 0.0034 +INFO:local_logger:Epoch[001/300], Step[0950/1252], Avg Loss: 6.8236, Avg Acc: 0.0035 +INFO:local_logger:Epoch[001/300], Step[0950/1252], Avg Loss: 6.8217, Avg Acc: 0.0033 +INFO:master_logger:Epoch[001/300], Step[0950/1252], Avg Loss: 6.8227, Avg Acc: 0.0034 +INFO:local_logger:Epoch[001/300], Step[1000/1252], Avg Loss: 6.8126, Avg Acc: 0.0035 +INFO:local_logger:Epoch[001/300], Step[1000/1252], Avg Loss: 6.8143, Avg Acc: 0.0037 +INFO:local_logger:Epoch[001/300], Step[1000/1252], Avg Loss: 6.8143, Avg Acc: 0.0036 +INFO:local_logger:Epoch[001/300], Step[1000/1252], Avg Loss: 6.8143, Avg Acc: 0.0035 +INFO:master_logger:Epoch[001/300], Step[1000/1252], Avg Loss: 6.8139, Avg Acc: 0.0036 +INFO:local_logger:Epoch[001/300], Step[1050/1252], Avg Loss: 6.8049, Avg Acc: 0.0038 +INFO:local_logger:Epoch[001/300], Step[1050/1252], Avg Loss: 6.8040, Avg Acc: 0.0038 +INFO:local_logger:Epoch[001/300], Step[1050/1252], Avg Loss: 6.8063, Avg Acc: 0.0037 +INFO:local_logger:Epoch[001/300], Step[1050/1252], Avg Loss: 6.8050, Avg Acc: 0.0038 +INFO:master_logger:Epoch[001/300], Step[1050/1252], Avg Loss: 6.8051, Avg Acc: 0.0038 +INFO:local_logger:Epoch[001/300], Step[1100/1252], Avg Loss: 6.7951, Avg Acc: 0.0041 +INFO:local_logger:Epoch[001/300], Step[1100/1252], Avg Loss: 6.7974, Avg Acc: 0.0040 +INFO:local_logger:Epoch[001/300], Step[1100/1252], Avg Loss: 6.7961, Avg Acc: 0.0040 +INFO:local_logger:Epoch[001/300], Step[1100/1252], Avg Loss: 6.7961, Avg Acc: 0.0040 +INFO:master_logger:Epoch[001/300], Step[1100/1252], Avg Loss: 6.7962, Avg Acc: 0.0040 +INFO:local_logger:Epoch[001/300], Step[1150/1252], Avg Loss: 6.7870, Avg Acc: 0.0042 +INFO:local_logger:Epoch[001/300], Step[1150/1252], Avg Loss: 6.7871, Avg Acc: 0.0043 +INFO:master_logger:Epoch[001/300], Step[1150/1252], Avg Loss: 6.7871, Avg Acc: 0.0042 
+INFO:local_logger:Epoch[001/300], Step[1150/1252], Avg Loss: 6.7877, Avg Acc: 0.0042 +INFO:local_logger:Epoch[001/300], Step[1150/1252], Avg Loss: 6.7865, Avg Acc: 0.0042 +INFO:local_logger:Epoch[001/300], Step[1200/1252], Avg Loss: 6.7789, Avg Acc: 0.0044 +INFO:local_logger:Epoch[001/300], Step[1200/1252], Avg Loss: 6.7779, Avg Acc: 0.0043 +INFO:local_logger:Epoch[001/300], Step[1200/1252], Avg Loss: 6.7786, Avg Acc: 0.0043 +INFO:local_logger:Epoch[001/300], Step[1200/1252], Avg Loss: 6.7785, Avg Acc: 0.0044 +INFO:master_logger:Epoch[001/300], Step[1200/1252], Avg Loss: 6.7785, Avg Acc: 0.0044 +INFO:local_logger:Epoch[001/300], Step[1250/1252], Avg Loss: 6.7715, Avg Acc: 0.0046 +INFO:local_logger:Epoch[001/300], Step[1250/1252], Avg Loss: 6.7707, Avg Acc: 0.0045 +INFO:local_logger:Epoch[001/300], Step[1250/1252], Avg Loss: 6.7702, Avg Acc: 0.0046 +INFO:local_logger:Epoch[001/300], Step[1250/1252], Avg Loss: 6.7706, Avg Acc: 0.0044 +INFO:master_logger:Epoch[001/300], Step[1250/1252], Avg Loss: 6.7708, Avg Acc: 0.0045 +INFO:local_logger:----- Epoch[001/300], Train Loss: 6.7707, Train Acc: 0.0045, time: 2205.65 +INFO:local_logger:----- Epoch[001/300], Train Loss: 6.7705, Train Acc: 0.0044, time: 2205.65 +INFO:local_logger:Now training epoch 2. LR=0.000201 +INFO:master_logger:----- Epoch[001/300], Train Loss: 6.7707, Train Acc: 0.0045, time: 2205.65 +INFO:local_logger:----- Epoch[001/300], Train Loss: 6.7715, Train Acc: 0.0046, time: 2205.65 +INFO:local_logger:----- Epoch[001/300], Train Loss: 6.7702, Train Acc: 0.0046, time: 2205.65 +INFO:local_logger:Now training epoch 2. LR=0.000201 +INFO:local_logger:Now training epoch 2. LR=0.000201 +INFO:master_logger:----- Save model: ./output/train-20211019-12-36-17/DeiT-Epoch-1-Loss-6.770543940747088.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211019-12-36-17/DeiT-Epoch-1-Loss-6.770543940747088.pdopt +INFO:master_logger:----- Save ema model: ./output/train-20211019-12-36-17/DeiT-Epoch-1-Loss-6.770543940747088-EMA.pdparams +INFO:local_logger:Now training epoch 2. LR=0.000201 +INFO:master_logger:Now training epoch 2. 
LR=0.000201 +Traceback (most recent call last): + File "main_multi_gpu.py", line 619, in + main() + File "main_multi_gpu.py", line 615, in main + dist.spawn(main_worker, args=(config, dataset_train, dataset_val, ), nprocs=config.NGPUS) + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/distributed/spawn.py", line 502, in spawn + while not context.join(): + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/distributed/spawn.py", line 312, in join + self._throw_exception(error_index) + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/distributed/spawn.py", line 330, in _throw_exception + raise Exception(msg) +Exception: + +---------------------------------------------- +Process 2 terminated with the following error: +---------------------------------------------- + +Traceback (most recent call last): + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/distributed/spawn.py", line 261, in _func_wrapper + result = func(*args) + File "/workspace/ppvit_github/PaddleViT_Train/PaddleViT/image_classification/DeiT/main_multi_gpu.py", line 542, in main_worker + master_logger=master_logger) + File "/workspace/ppvit_github/PaddleViT_Train/PaddleViT/image_classification/DeiT/main_multi_gpu.py", line 139, in train + loss = criterion(image, output, label) + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 914, in __call__ + outputs = self.forward(*inputs, **kwargs) + File "/workspace/ppvit_github/PaddleViT_Train/PaddleViT/image_classification/DeiT/losses.py", line 110, in forward + teacher_outputs = self.teacher_model(inputs) + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 914, in __call__ + outputs = self.forward(*inputs, **kwargs) + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/parallel.py", line 695, in forward + outputs = self._layers(*inputs, **kwargs) + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 914, in __call__ + outputs = self.forward(*inputs, **kwargs) + File "/workspace/ppvit_github/PaddleViT_Train/PaddleViT/image_classification/DeiT/regnet.py", line 239, in forward + x = self.forward_features(x) + File "/workspace/ppvit_github/PaddleViT_Train/PaddleViT/image_classification/DeiT/regnet.py", line 235, in forward_features + x = stage(x) + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 914, in __call__ + outputs = self.forward(*inputs, **kwargs) + File "/workspace/ppvit_github/PaddleViT_Train/PaddleViT/image_classification/DeiT/regnet.py", line 150, in forward + x = block(x) + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 914, in __call__ + outputs = self.forward(*inputs, **kwargs) + File "/workspace/ppvit_github/PaddleViT_Train/PaddleViT/image_classification/DeiT/regnet.py", line 106, in forward + out = self.bn1(out) + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 914, in __call__ + outputs = self.forward(*inputs, **kwargs) + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/nn/layer/norm.py", line 653, in forward + use_global_stats=self._use_global_stats) + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/nn/functional/norm.py", line 191, in batch_norm + *attrs) +SystemError: (Fatal) Operator batch_norm raises an paddle::memory::allocation::BadAlloc exception. 
+The exception content is +:ResourceExhaustedError: + +Out of memory error on GPU 2. Cannot allocate 1.339844GB memory on GPU 2, 14.467896GB memory has been allocated and available memory is only 1.313843GB. + +Please check whether there is any other process using GPU 2. +1. If yes, please stop them, or start PaddlePaddle on another GPU. +2. If no, please decrease the batch size of your model. + + (at /paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:79) +. (at /paddle/paddle/fluid/imperative/tracer.cc:221) + + +merging config from ./configs/deit_tiny_patch16_224.yaml +----- Imagenet2012 image train list len = 1281167 +----- Imagenet2012 image val list len = 50000 +server not ready, wait 3 sec to retry... +not ready endpoints:['127.0.0.1:54867', '127.0.0.1:51844', '127.0.0.1:60094'] +I1019 15:35:05.644764 10424 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:54867 successful. +server not ready, wait 3 sec to retry... +not ready endpoints:['127.0.0.1:51844', '127.0.0.1:60094'] +I1019 15:35:08.408504 10445 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:51844 successful. +server not ready, wait 3 sec to retry... +not ready endpoints:['127.0.0.1:60094'] +I1019 15:35:10.916846 10462 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:60094 successful. +I1019 15:35:12.045945 10424 nccl_context.cc:74] init nccl context nranks: 4 local rank: 1 gpu id: 1 ring id: 0 +I1019 15:35:12.045961 10407 nccl_context.cc:74] init nccl context nranks: 4 local rank: 0 gpu id: 0 ring id: 0 +I1019 15:35:12.045974 10445 nccl_context.cc:74] init nccl context nranks: 4 local rank: 2 gpu id: 2 ring id: 0 +I1019 15:35:12.045984 10462 nccl_context.cc:74] init nccl context nranks: 4 local rank: 3 gpu id: 3 ring id: 0 +W1019 15:35:14.493502 10462 device_context.cc:447] Please NOTE: device: 3, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1019 15:35:14.493502 10445 device_context.cc:447] Please NOTE: device: 2, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1019 15:35:14.493502 10424 device_context.cc:447] Please NOTE: device: 1, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1019 15:35:14.493598 10407 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1019 15:35:14.500733 10424 device_context.cc:465] device: 1, cuDNN Version: 7.6. +W1019 15:35:14.500741 10445 device_context.cc:465] device: 2, cuDNN Version: 7.6. +W1019 15:35:14.500741 10462 device_context.cc:465] device: 3, cuDNN Version: 7.6. +W1019 15:35:14.500762 10407 device_context.cc:465] device: 0, cuDNN Version: 7.6. 
+INFO:local_logger:----- world_size = 4, local_rank = 1 +INFO:local_logger:----- world_size = 4, local_rank = 2 +INFO:master_logger: +AMP: True +AUG: + AUTO_AUGMENT: rand-m9-mstd0.5-inc1 + COLOR_JITTER: 0.4 + CUTMIX: 1.0 + CUTMIX_MINMAX: None + MIXUP: 0.8 + MIXUP_MODE: batch + MIXUP_PROB: 1.0 + MIXUP_SWITCH_PROB: 0.5 + RE_COUNT: 1 + RE_MODE: pixel + RE_PROB: 0.25 +BASE: [''] +DATA: + BATCH_SIZE: 200 + BATCH_SIZE_EVAL: 8 + CROP_PCT: 0.875 + DATASET: imagenet2012 + DATA_PATH: /dataset/imagenet + IMAGE_SIZE: 224 + NUM_WORKERS: 2 +EVAL: False +LOCAL_RANK: 0 +MODEL: + ATTENTION_DROPOUT: 0.0 + DROPOUT: 0.0 + DROPPATH: 0.1 + NAME: deit_tiny_patch16_224 + NUM_CLASSES: 1000 + PRETRAINED: None + RESUME: None + RESUME_EMA: None + TRANS: + DEPTH: 12 + EMBED_DIM: 192 + INIT_VALUES: 1e-05 + IN_CHANNELS: 3 + MLP_RATIO: 4.0 + NUM_HEADS: 3 + PATCH_SIZE: 16 + QKV_BIAS: True + TYPE: DeiT +NGPUS: 4 +REPORT_FREQ: 50 +SAVE: ./output/train-20211019-15-34-56 +SAVE_FREQ: 1 +SEED: 42 +TAG: default +TRAIN: + ACCUM_ITER: 1 + AUTO_AUGMENT: True + BASE_LR: 0.0005 + COLOR_JITTER: 0.4 + CUTMIX_ALPHA: 1.0 + CUTMIX_MINMAX: None + DISTILLATION_ALPHA: 0.5 + DISTILLATION_TAU: 1.0 + DISTILLATION_TYPE: hard + END_LR: 1e-05 + GRAD_CLIP: None + LAST_EPOCH: 0 + LR_SCHEDULER: + DECAY_EPOCHS: 30 + DECAY_RATE: 0.1 + MILESTONES: 30, 60, 90 + NAME: warmupcosine + MIXUP_ALPHA: 0.8 + MIXUP_MODE: batch + MIXUP_PROB: 1.0 + MIXUP_SWITCH_PROB: 0.5 + MODEL_EMA: True + MODEL_EMA_DECAY: 0.99996 + NUM_EPOCHS: 300 + OPTIMIZER: + BETAS: (0.9, 0.999) + EPS: 1e-08 + MOMENTUM: 0.9 + NAME: AdamW + RANDOM_ERASE_COUNT: 1 + RANDOM_ERASE_MODE: pixel + RANDOM_ERASE_PROB: 0.25 + RANDOM_ERASE_SPLIT: False + SMOOTHING: 0.1 + TEACHER_MODEL: ./regnety_160 + WARMUP_EPOCHS: 5 + WARMUP_START_LR: 1e-06 + WEIGHT_DECAY: 0.05 +VALIDATE_FREQ: 10 +INFO:local_logger:----- world_size = 4, local_rank = 0 +INFO:master_logger:----- world_size = 4, local_rank = 0 +INFO:local_logger:----- world_size = 4, local_rank = 3 +INFO:local_logger:----- Total # of train batch (single gpu): 1602 +INFO:local_logger:----- Total # of val batch (single gpu): 1563 +INFO:local_logger:Creating teacher model: ./regnety_160 +INFO:local_logger:----- Total # of train batch (single gpu): 1602 +INFO:local_logger:----- Total # of val batch (single gpu): 1563 +INFO:local_logger:Creating teacher model: ./regnety_160 +INFO:local_logger:----- Total # of train batch (single gpu): 1602 +INFO:local_logger:----- Total # of val batch (single gpu): 1563 +INFO:local_logger:Creating teacher model: ./regnety_160 +INFO:local_logger:----- Total # of train batch (single gpu): 1602 +INFO:local_logger:----- Total # of val batch (single gpu): 1563 +INFO:master_logger:----- Total # of train batch (single gpu): 1602 +INFO:master_logger:----- Total # of val batch (single gpu): 1563 +INFO:local_logger:Creating teacher model: ./regnety_160 +INFO:local_logger:----- Load teacher model state from ./regnety_160 +INFO:local_logger:----- Load teacher model state from ./regnety_160 +INFO:local_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000079 +INFO:local_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000079 +INFO:local_logger:----- Load teacher model state from ./regnety_160 +INFO:local_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000079 +INFO:local_logger:----- Load teacher model state from ./regnety_160 +INFO:local_logger:Start training from epoch 1. +INFO:master_logger:Start training from epoch 1. 
+INFO:local_logger:Now training epoch 1. LR=0.000079 +INFO:master_logger:Now training epoch 1. LR=0.000079 +ERROR: Unexpected BUS error encountered in DataLoader worker. This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough +ERROR: Unexpected BUS error encountered in DataLoader worker. This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough +ERROR: Unexpected BUS error encountered in DataLoader worker. This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough +/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/math_op_patch.py:248: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.float16, but right dtype is paddle.float32, the right dtype will convert to paddle.float16 + format(lhs_dtype, rhs_dtype, lhs_dtype)) +/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/math_op_patch.py:248: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.float32, but right dtype is paddle.float16, the right dtype will convert to paddle.float32 + format(lhs_dtype, rhs_dtype, lhs_dtype)) +ERROR: Unexpected BUS error encountered in DataLoader worker. This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough +ERROR: Unexpected BUS error encountered in DataLoader worker. This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough +ERROR: Unexpected BUS error encountered in DataLoader worker. This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough + + +-------------------------------------- +C++ Traceback (most recent call last): +-------------------------------------- +No stack trace in paddle, may be caused by external reasons. + +---------------------- +Error Message Summary: +---------------------- +FatalError: `Termination signal` is detected by the operating system. 
+ [TimeInfo: *** Aborted at 1634628940 (unix time) try "date -d @1634628940" if you are using GNU date ***] + [SignalInfo: *** SIGTERM (@0x2889) received by PID 10407 (TID 0x7f70fdb42700) from PID 10377 ***] + +Exception in thread Thread-1: +Traceback (most recent call last): + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 583, in _get_data + data = self._data_queue.get(timeout=self._timeout) + File "/opt/conda/envs/py36/lib/python3.6/multiprocessing/queues.py", line 105, in get + raise Empty +queue.Empty + +During handling of the above exception, another exception occurred: + +Traceback (most recent call last): + File "/opt/conda/envs/py36/lib/python3.6/threading.py", line 916, in _bootstrap_inner + self.run() + File "/opt/conda/envs/py36/lib/python3.6/threading.py", line 864, in run + self._target(*self._args, **self._kwargs) + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 505, in _thread_loop + batch = self._get_data() + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 599, in _get_data + "pids: {}".format(len(failed_workers), pids)) +RuntimeError: DataLoader 2 workers exit unexpectedly, pids: 10533, 10595 + + + +-------------------------------------- +C++ Traceback (most recent call last): +-------------------------------------- +No stack trace in paddle, may be caused by external reasons. + +---------------------- +Error Message Summary: +---------------------- +FatalError: `Termination signal` is detected by the operating system. + [TimeInfo: *** Aborted at 1634628941 (unix time) try "date -d @1634628941" if you are using GNU date ***] + [SignalInfo: *** SIGTERM (@0x2889) received by PID 10424 (TID 0x7f54cc002700) from PID 10377 ***] + +/opt/conda/envs/py36/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown + len(cache)) +Traceback (most recent call last): + File "main_multi_gpu.py", line 619, in + main() + File "main_multi_gpu.py", line 615, in main + dist.spawn(main_worker, args=(config, dataset_train, dataset_val, ), nprocs=config.NGPUS) + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/distributed/spawn.py", line 502, in spawn + while not context.join(): + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/distributed/spawn.py", line 312, in join + self._throw_exception(error_index) + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/distributed/spawn.py", line 330, in _throw_exception + raise Exception(msg) +Exception: + +---------------------------------------------- +Process 2 terminated with the following error: +---------------------------------------------- + +Traceback (most recent call last): + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/distributed/spawn.py", line 261, in _func_wrapper + result = func(*args) + File "/workspace/ppvit_github/PaddleViT_Train/PaddleViT/image_classification/DeiT/main_multi_gpu.py", line 542, in main_worker + master_logger=master_logger) + File "/workspace/ppvit_github/PaddleViT_Train/PaddleViT/image_classification/DeiT/main_multi_gpu.py", line 128, in train + for batch_id, data in enumerate(dataloader): + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 697, in __next__ + data = self._reader.read_next_var_list() + File 
"/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/multiprocess_utils.py", line 134, in __handler__ + core._throw_error_if_process_failed() +SystemError: (Fatal) DataLoader process (pid 1. If run DataLoader by DataLoader.from_generator(...), queue capacity is set by from_generator(..., capacity=xx, ...). + 2. If run DataLoader by DataLoader(dataset, ...), queue capacity is set as 2 times of the max value of num_workers and len(places). + 3. If run by DataLoader(dataset, ..., use_shared_memory=True), set use_shared_memory=False for not using shared memory.) exited is killed by signal: 10593. + It may be caused by insufficient shared storage space. This problem usually occurs when using docker as a development environment. + Please use command `df -h` to check the storage space of `/dev/shm`. Shared storage space needs to be greater than (DataLoader Num * DataLoader queue capacity * 1 batch data size). + You can solve this problem by increasing the shared storage space or reducing the queue capacity appropriately. +Bus error (at /paddle/paddle/fluid/imperative/data_loader.cc:177) + + +/opt/conda/envs/py36/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown + len(cache)) +/opt/conda/envs/py36/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown + len(cache)) +/opt/conda/envs/py36/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown + len(cache)) +merging config from ./configs/deit_tiny_patch16_224.yaml +----- Imagenet2012 image train list len = 1281167 +----- Imagenet2012 image val list len = 50000 +server not ready, wait 3 sec to retry... +not ready endpoints:['127.0.0.1:21409', '127.0.0.1:55908', '127.0.0.1:21120'] +I1019 16:30:08.941134 14913 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:21409 successful. +server not ready, wait 3 sec to retry... +not ready endpoints:['127.0.0.1:55908', '127.0.0.1:21120'] +I1019 16:30:11.628895 14928 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:55908 successful. +server not ready, wait 3 sec to retry... +not ready endpoints:['127.0.0.1:21120'] +W1019 16:30:13.932492 14945 gen_comm_id_helper.cc:129] bind addr=127.0.0.1:21120 failed 1 times with reason: Address already in use retry after 0.5 seconds +W1019 16:30:14.432631 14945 gen_comm_id_helper.cc:129] bind addr=127.0.0.1:21120 failed 2 times with reason: Address already in use retry after 1 seconds +W1019 16:30:15.432929 14945 gen_comm_id_helper.cc:129] bind addr=127.0.0.1:21120 failed 3 times with reason: Address already in use retry after 1.5 seconds +server not ready, wait 3 sec to retry... +not ready endpoints:['127.0.0.1:21120'] +W1019 16:30:16.933054 14945 gen_comm_id_helper.cc:129] bind addr=127.0.0.1:21120 failed 4 times with reason: Address already in use retry after 2 seconds +server not ready, wait 3 sec to retry... +not ready endpoints:['127.0.0.1:21120'] +W1019 16:30:18.933182 14945 gen_comm_id_helper.cc:129] bind addr=127.0.0.1:21120 failed 5 times with reason: Address already in use retry after 2.5 seconds +W1019 16:30:21.433308 14945 gen_comm_id_helper.cc:129] bind addr=127.0.0.1:21120 failed 6 times with reason: Address already in use retry after 3 seconds +server not ready, wait 3 sec to retry... 
+not ready endpoints:['127.0.0.1:21120'] +[... the same "server not ready, wait 3 sec to retry..." message and "bind addr=127.0.0.1:21120 failed N times with reason: Address already in use retry after 3 seconds" warning repeat for attempts 7 through 34 (timestamps 16:30:24 through 16:31:45), one retry every 3 seconds ...]
+not ready endpoints:['127.0.0.1:21120'] +W1019 16:31:48.436884 14945 gen_comm_id_helper.cc:129] bind addr=127.0.0.1:21120 failed 35 times with reason: Address already in use retry after 3 seconds +server not ready, wait 3 sec to retry... +not ready endpoints:['127.0.0.1:21120'] +W1019 16:31:51.437003 14945 gen_comm_id_helper.cc:129] bind addr=127.0.0.1:21120 failed 36 times with reason: Address already in use retry after 3 seconds +server not ready, wait 3 sec to retry... +not ready endpoints:['127.0.0.1:21120'] +I1019 16:31:54.437127 14945 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:21120 successful. +I1019 16:31:54.781250 14928 nccl_context.cc:74] init nccl context nranks: 4 local rank: 2 gpu id: 2 ring id: 0 +I1019 16:31:54.781248 14913 nccl_context.cc:74] init nccl context nranks: 4 local rank: 1 gpu id: 1 ring id: 0 +I1019 16:31:54.781278 14945 nccl_context.cc:74] init nccl context nranks: 4 local rank: 3 gpu id: 3 ring id: 0 +I1019 16:31:54.781304 14896 nccl_context.cc:74] init nccl context nranks: 4 local rank: 0 gpu id: 0 ring id: 0 +W1019 16:31:56.982764 14896 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1019 16:31:56.983193 14928 device_context.cc:447] Please NOTE: device: 2, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1019 16:31:56.985651 14945 device_context.cc:447] Please NOTE: device: 3, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1019 16:31:56.985658 14913 device_context.cc:447] Please NOTE: device: 1, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1019 16:31:56.987751 14928 device_context.cc:465] device: 2, cuDNN Version: 7.6. +W1019 16:31:56.989671 14896 device_context.cc:465] device: 0, cuDNN Version: 7.6. +W1019 16:31:56.990126 14945 device_context.cc:465] device: 3, cuDNN Version: 7.6. +W1019 16:31:56.990132 14913 device_context.cc:465] device: 1, cuDNN Version: 7.6. 
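
The crash above is the shared-memory failure that the DataLoader error text describes: with use_shared_memory enabled, worker processes stage batches in /dev/shm, and an undersized shm mount (typical for a default Docker container) produces the BUS errors and the SIGTERM cascade. Below is a minimal sketch of the workaround the message itself names, using only public paddle.io.DataLoader arguments; the toy dataset is a placeholder, not the repo's ImageNet pipeline, and the batch size / worker count simply mirror the DATA section of this run's config.

```python
import numpy as np
import paddle
from paddle.io import Dataset, DataLoader

class ToyImageNet(Dataset):
    """Placeholder dataset standing in for the repo's ImageNet loader."""
    def __init__(self, n=64):
        self.n = n
    def __getitem__(self, idx):
        image = np.random.rand(3, 224, 224).astype('float32')
        label = np.random.randint(0, 1000)
        return image, label
    def __len__(self):
        return self.n

# Options suggested by the error text:
#   * use_shared_memory=False bypasses /dev/shm entirely (slower, but avoids the BUS error), or
#   * keep shared memory on but reduce the pressure on /dev/shm (smaller batch, fewer workers),
#     or enlarge the container's shm segment (e.g. docker run --shm-size=1g).
loader = DataLoader(ToyImageNet(),
                    batch_size=200,          # DATA.BATCH_SIZE in this log
                    num_workers=1,           # DATA.NUM_WORKERS in this log
                    shuffle=True,
                    use_shared_memory=False)

for images, labels in loader:
    print(images.shape, labels.shape)   # one batch of 200 images
    break
```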
+INFO:local_logger:----- world_size = 4, local_rank = 2 +INFO:master_logger: +AMP: True +AUG: + AUTO_AUGMENT: rand-m9-mstd0.5-inc1 + COLOR_JITTER: 0.4 + CUTMIX: 1.0 + CUTMIX_MINMAX: None + MIXUP: 0.8 + MIXUP_MODE: batch + MIXUP_PROB: 1.0 + MIXUP_SWITCH_PROB: 0.5 + RE_COUNT: 1 + RE_MODE: pixel + RE_PROB: 0.25 +BASE: [''] +DATA: + BATCH_SIZE: 200 + BATCH_SIZE_EVAL: 8 + CROP_PCT: 0.875 + DATASET: imagenet2012 + DATA_PATH: /dataset/imagenet + IMAGE_SIZE: 224 + NUM_WORKERS: 1 +EVAL: False +LOCAL_RANK: 0 +MODEL: + ATTENTION_DROPOUT: 0.0 + DROPOUT: 0.0 + DROPPATH: 0.1 + NAME: deit_tiny_patch16_224 + NUM_CLASSES: 1000 + PRETRAINED: None + RESUME: None + RESUME_EMA: None + TRANS: + DEPTH: 12 + EMBED_DIM: 192 + INIT_VALUES: 1e-05 + IN_CHANNELS: 3 + MLP_RATIO: 4.0 + NUM_HEADS: 3 + PATCH_SIZE: 16 + QKV_BIAS: True + TYPE: DeiT +NGPUS: 4 +REPORT_FREQ: 50 +SAVE: ./output/train-20211019-16-29-59 +SAVE_FREQ: 1 +SEED: 42 +TAG: default +TRAIN: + ACCUM_ITER: 1 + AUTO_AUGMENT: True + BASE_LR: 0.0005 + COLOR_JITTER: 0.4 + CUTMIX_ALPHA: 1.0 + CUTMIX_MINMAX: None + DISTILLATION_ALPHA: 0.5 + DISTILLATION_TAU: 1.0 + DISTILLATION_TYPE: hard + END_LR: 1e-05 + GRAD_CLIP: None + LAST_EPOCH: 0 + LR_SCHEDULER: + DECAY_EPOCHS: 30 + DECAY_RATE: 0.1 + MILESTONES: 30, 60, 90 + NAME: warmupcosine + MIXUP_ALPHA: 0.8 + MIXUP_MODE: batch + MIXUP_PROB: 1.0 + MIXUP_SWITCH_PROB: 0.5 + MODEL_EMA: True + MODEL_EMA_DECAY: 0.99996 + NUM_EPOCHS: 300 + OPTIMIZER: + BETAS: (0.9, 0.999) + EPS: 1e-08 + MOMENTUM: 0.9 + NAME: AdamW + RANDOM_ERASE_COUNT: 1 + RANDOM_ERASE_MODE: pixel + RANDOM_ERASE_PROB: 0.25 + RANDOM_ERASE_SPLIT: False + SMOOTHING: 0.1 + TEACHER_MODEL: ./regnety_160 + WARMUP_EPOCHS: 5 + WARMUP_START_LR: 1e-06 + WEIGHT_DECAY: 0.05 +VALIDATE_FREQ: 10 +INFO:local_logger:----- world_size = 4, local_rank = 0 +INFO:master_logger:----- world_size = 4, local_rank = 0 +INFO:local_logger:----- world_size = 4, local_rank = 1 +INFO:local_logger:----- world_size = 4, local_rank = 3 +INFO:local_logger:----- Total # of train batch (single gpu): 1602 +INFO:local_logger:----- Total # of val batch (single gpu): 1563 +INFO:local_logger:Creating teacher model: ./regnety_160 +INFO:local_logger:----- Total # of train batch (single gpu): 1602 +INFO:local_logger:----- Total # of val batch (single gpu): 1563 +INFO:master_logger:----- Total # of train batch (single gpu): 1602 +INFO:master_logger:----- Total # of val batch (single gpu): 1563 +INFO:local_logger:Creating teacher model: ./regnety_160 +INFO:local_logger:----- Total # of train batch (single gpu): 1602 +INFO:local_logger:----- Total # of val batch (single gpu): 1563 +INFO:local_logger:Creating teacher model: ./regnety_160 +INFO:local_logger:----- Total # of train batch (single gpu): 1602 +INFO:local_logger:----- Total # of val batch (single gpu): 1563 +INFO:local_logger:Creating teacher model: ./regnety_160 +INFO:local_logger:----- Load teacher model state from ./regnety_160 +INFO:local_logger:----- Load teacher model state from ./regnety_160 +INFO:local_logger:----- Load teacher model state from ./regnety_160 +INFO:local_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000079 +INFO:local_logger:Start training from epoch 1. +INFO:local_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000079 +INFO:local_logger:Now training epoch 1. LR=0.000079 +INFO:local_logger:----- Load teacher model state from ./regnety_160 +INFO:local_logger:Start training from epoch 1. +INFO:master_logger:Start training from epoch 1. 
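
For reference, the DISTILLATION_* entries in the config dump above drive a DeiT-style distillation objective: the ./regnety_160 teacher is created and loaded per process, its forward pass happens inside the criterion (see the losses.py frames in the traceback further down), and with DISTILLATION_TYPE hard the distillation head is trained against the teacher's argmax prediction, blended with the ordinary classification loss by DISTILLATION_ALPHA. The snippet below is a rough sketch of that objective, not the repo's actual losses.py (whose signature and soft-target handling differ); the toy tensors are only for illustration.

```python
import paddle
import paddle.nn.functional as F

def hard_distillation_loss(cls_logits, dist_logits, labels, teacher_logits, alpha=0.5):
    """DeiT-style hard distillation: blend cross-entropy on the class token with
    cross-entropy on the distillation token against the teacher's argmax.
    alpha plays the role of DISTILLATION_ALPHA in the config above."""
    base_loss = F.cross_entropy(cls_logits, labels)
    teacher_targets = paddle.argmax(teacher_logits, axis=-1)
    dist_loss = F.cross_entropy(dist_logits, teacher_targets)
    return (1.0 - alpha) * base_loss + alpha * dist_loss

# Toy usage: a 4-sample batch over the 1000 ImageNet classes.
cls_logits = paddle.randn([4, 1000])
dist_logits = paddle.randn([4, 1000])
teacher_logits = paddle.randn([4, 1000])
labels = paddle.randint(0, 1000, [4])
print(hard_distillation_loss(cls_logits, dist_logits, labels, teacher_logits))
```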
+INFO:local_logger:Now training epoch 1. LR=0.000079 +INFO:master_logger:Now training epoch 1. LR=0.000079 +ERROR: Unexpected BUS error encountered in DataLoader worker. This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough +ERROR: Unexpected BUS error encountered in DataLoader worker. This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough +ERROR: Unexpected BUS error encountered in DataLoader worker. This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough +/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/math_op_patch.py:248: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.float16, but right dtype is paddle.float32, the right dtype will convert to paddle.float16 + format(lhs_dtype, rhs_dtype, lhs_dtype)) +/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/math_op_patch.py:248: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.float32, but right dtype is paddle.float16, the right dtype will convert to paddle.float32 + format(lhs_dtype, rhs_dtype, lhs_dtype)) +/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/math_op_patch.py:248: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.float16, but right dtype is paddle.float32, the right dtype will convert to paddle.float16 + format(lhs_dtype, rhs_dtype, lhs_dtype)) +/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/math_op_patch.py:248: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.float32, but right dtype is paddle.float16, the right dtype will convert to paddle.float32 + format(lhs_dtype, rhs_dtype, lhs_dtype)) +/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/math_op_patch.py:248: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.float16, but right dtype is paddle.float32, the right dtype will convert to paddle.float16 + format(lhs_dtype, rhs_dtype, lhs_dtype)) +/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/math_op_patch.py:248: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.float32, but right dtype is paddle.float16, the right dtype will convert to paddle.float32 + format(lhs_dtype, rhs_dtype, lhs_dtype)) +ERROR: Unexpected BUS error encountered in DataLoader worker. 
This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough +/opt/conda/envs/py36/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 11 leaked semaphores to clean up at shutdown + len(cache)) +Traceback (most recent call last): + File "main_multi_gpu.py", line 619, in + main() + File "main_multi_gpu.py", line 615, in main + dist.spawn(main_worker, args=(config, dataset_train, dataset_val, ), nprocs=config.NGPUS) + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/distributed/spawn.py", line 502, in spawn + while not context.join(): + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/distributed/spawn.py", line 312, in join + self._throw_exception(error_index) + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/distributed/spawn.py", line 330, in _throw_exception + raise Exception(msg) +Exception: + +---------------------------------------------- +Process 1 terminated with the following error: +---------------------------------------------- + +Traceback (most recent call last): + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/distributed/spawn.py", line 261, in _func_wrapper + result = func(*args) + File "/workspace/ppvit_github/PaddleViT_Train/PaddleViT/image_classification/DeiT/main_multi_gpu.py", line 542, in main_worker + master_logger=master_logger) + File "/workspace/ppvit_github/PaddleViT_Train/PaddleViT/image_classification/DeiT/main_multi_gpu.py", line 139, in train + loss = criterion(image, output, label) + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 914, in __call__ + outputs = self.forward(*inputs, **kwargs) + File "/workspace/ppvit_github/PaddleViT_Train/PaddleViT/image_classification/DeiT/losses.py", line 110, in forward + teacher_outputs = self.teacher_model(inputs) + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 914, in __call__ + outputs = self.forward(*inputs, **kwargs) + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/parallel.py", line 695, in forward + outputs = self._layers(*inputs, **kwargs) + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 914, in __call__ + outputs = self.forward(*inputs, **kwargs) + File "/workspace/ppvit_github/PaddleViT_Train/PaddleViT/image_classification/DeiT/regnet.py", line 239, in forward + x = self.forward_features(x) + File "/workspace/ppvit_github/PaddleViT_Train/PaddleViT/image_classification/DeiT/regnet.py", line 235, in forward_features + x = stage(x) + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 914, in __call__ + outputs = self.forward(*inputs, **kwargs) + File "/workspace/ppvit_github/PaddleViT_Train/PaddleViT/image_classification/DeiT/regnet.py", line 150, in forward + x = block(x) + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 914, in __call__ + outputs = self.forward(*inputs, **kwargs) + File "/workspace/ppvit_github/PaddleViT_Train/PaddleViT/image_classification/DeiT/regnet.py", line 113, in forward + out = self.se(out) + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 914, in __call__ + outputs = self.forward(*inputs, **kwargs) + File "/workspace/ppvit_github/PaddleViT_Train/PaddleViT/image_classification/DeiT/regnet.py", line 35, in 
forward + out = self.conv1_1x1(out) + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 892, in __call__ + with param_guard(self._parameters), param_guard(self._buffers): + File "/opt/conda/envs/py36/lib/python3.6/site-packages/decorator.py", line 232, in fun + return caller(func, *(extras + args), **kw) + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/wrapped_decorator.py", line 24, in __impl__ + wrapped_func = decorator_func(func) + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/multiprocess_utils.py", line 134, in __handler__ + core._throw_error_if_process_failed() +SystemError: (Fatal) DataLoader process (pid 1. If run DataLoader by DataLoader.from_generator(...), queue capacity is set by from_generator(..., capacity=xx, ...). + 2. If run DataLoader by DataLoader(dataset, ...), queue capacity is set as 2 times of the max value of num_workers and len(places). + 3. If run by DataLoader(dataset, ..., use_shared_memory=True), set use_shared_memory=False for not using shared memory.) exited is killed by signal: 15116. + It may be caused by insufficient shared storage space. This problem usually occurs when using docker as a development environment. + Please use command `df -h` to check the storage space of `/dev/shm`. Shared storage space needs to be greater than (DataLoader Num * DataLoader queue capacity * 1 batch data size). + You can solve this problem by increasing the shared storage space or reducing the queue capacity appropriately. +Bus error (at /paddle/paddle/fluid/imperative/data_loader.cc:177) + + +merging config from ./configs/deit_tiny_patch16_224.yaml +----- Imagenet2012 image train list len = 1281167 +----- Imagenet2012 image val list len = 50000 +server not ready, wait 3 sec to retry... +not ready endpoints:['127.0.0.1:12329', '127.0.0.1:26276', '127.0.0.1:45068'] +I1019 17:26:50.063747 19357 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:12329 successful. +server not ready, wait 3 sec to retry... +not ready endpoints:['127.0.0.1:26276', '127.0.0.1:45068'] +I1019 17:26:52.344933 19375 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:26276 successful. +server not ready, wait 3 sec to retry... +not ready endpoints:['127.0.0.1:45068'] +I1019 17:26:55.107314 19390 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:45068 successful. +I1019 17:26:56.396499 19375 nccl_context.cc:74] init nccl context nranks: 4 local rank: 2 gpu id: 2 ring id: 0 +I1019 17:26:56.396499 19357 nccl_context.cc:74] init nccl context nranks: 4 local rank: 1 gpu id: 1 ring id: 0 +I1019 17:26:56.396505 19390 nccl_context.cc:74] init nccl context nranks: 4 local rank: 3 gpu id: 3 ring id: 0 +I1019 17:26:56.396515 19341 nccl_context.cc:74] init nccl context nranks: 4 local rank: 0 gpu id: 0 ring id: 0 +W1019 17:26:58.250305 19341 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1019 17:26:58.250298 19357 device_context.cc:447] Please NOTE: device: 1, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1019 17:26:58.252547 19375 device_context.cc:447] Please NOTE: device: 2, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1019 17:26:58.252678 19390 device_context.cc:447] Please NOTE: device: 3, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1019 17:26:58.255209 19357 device_context.cc:465] device: 1, cuDNN Version: 7.6. 
+W1019 17:26:58.255213 19341 device_context.cc:465] device: 0, cuDNN Version: 7.6. +W1019 17:26:58.256739 19375 device_context.cc:465] device: 2, cuDNN Version: 7.6. +W1019 17:26:58.257006 19390 device_context.cc:465] device: 3, cuDNN Version: 7.6. +INFO:local_logger:----- world_size = 4, local_rank = 1 +INFO:master_logger: +AMP: True +AUG: + AUTO_AUGMENT: rand-m9-mstd0.5-inc1 + COLOR_JITTER: 0.4 + CUTMIX: 1.0 + CUTMIX_MINMAX: None + MIXUP: 0.8 + MIXUP_MODE: batch + MIXUP_PROB: 1.0 + MIXUP_SWITCH_PROB: 0.5 + RE_COUNT: 1 + RE_MODE: pixel + RE_PROB: 0.25 +BASE: [''] +DATA: + BATCH_SIZE: 200 + BATCH_SIZE_EVAL: 8 + CROP_PCT: 0.875 + DATASET: imagenet2012 + DATA_PATH: /dataset/imagenet + IMAGE_SIZE: 224 + NUM_WORKERS: 1 +EVAL: False +LOCAL_RANK: 0 +MODEL: + ATTENTION_DROPOUT: 0.0 + DROPOUT: 0.0 + DROPPATH: 0.1 + NAME: deit_tiny_patch16_224 + NUM_CLASSES: 1000 + PRETRAINED: None + RESUME: None + RESUME_EMA: None + TRANS: + DEPTH: 12 + EMBED_DIM: 192 + INIT_VALUES: 1e-05 + IN_CHANNELS: 3 + MLP_RATIO: 4.0 + NUM_HEADS: 3 + PATCH_SIZE: 16 + QKV_BIAS: True + TYPE: DeiT +NGPUS: 4 +REPORT_FREQ: 50 +SAVE: ./output/train-20211019-17-26-40 +SAVE_FREQ: 1 +SEED: 42 +TAG: default +TRAIN: + ACCUM_ITER: 1 + AUTO_AUGMENT: True + BASE_LR: 0.0005 + COLOR_JITTER: 0.4 + CUTMIX_ALPHA: 1.0 + CUTMIX_MINMAX: None + DISTILLATION_ALPHA: 0.5 + DISTILLATION_TAU: 1.0 + DISTILLATION_TYPE: hard + END_LR: 1e-05 + GRAD_CLIP: None + LAST_EPOCH: 0 + LR_SCHEDULER: + DECAY_EPOCHS: 30 + DECAY_RATE: 0.1 + MILESTONES: 30, 60, 90 + NAME: warmupcosine + MIXUP_ALPHA: 0.8 + MIXUP_MODE: batch + MIXUP_PROB: 1.0 + MIXUP_SWITCH_PROB: 0.5 + MODEL_EMA: True + MODEL_EMA_DECAY: 0.99996 + NUM_EPOCHS: 300 + OPTIMIZER: + BETAS: (0.9, 0.999) + EPS: 1e-08 + MOMENTUM: 0.9 + NAME: AdamW + RANDOM_ERASE_COUNT: 1 + RANDOM_ERASE_MODE: pixel + RANDOM_ERASE_PROB: 0.25 + RANDOM_ERASE_SPLIT: False + SMOOTHING: 0.1 + TEACHER_MODEL: ./regnety_160 + WARMUP_EPOCHS: 5 + WARMUP_START_LR: 1e-06 + WEIGHT_DECAY: 0.05 +VALIDATE_FREQ: 10 +INFO:local_logger:----- world_size = 4, local_rank = 0 +INFO:master_logger:----- world_size = 4, local_rank = 0 +INFO:local_logger:----- world_size = 4, local_rank = 3 +INFO:local_logger:----- world_size = 4, local_rank = 2 +INFO:local_logger:----- Total # of train batch (single gpu): 1602 +INFO:local_logger:----- Total # of val batch (single gpu): 1563 +INFO:local_logger:Creating teacher model: ./regnety_160 +INFO:local_logger:----- Total # of train batch (single gpu): 1602 +INFO:local_logger:----- Total # of val batch (single gpu): 1563 +INFO:master_logger:----- Total # of train batch (single gpu): 1602 +INFO:master_logger:----- Total # of val batch (single gpu): 1563 +INFO:local_logger:Creating teacher model: ./regnety_160 +INFO:local_logger:----- Total # of train batch (single gpu): 1602 +INFO:local_logger:----- Total # of val batch (single gpu): 1563 +INFO:local_logger:Creating teacher model: ./regnety_160 +INFO:local_logger:----- Total # of train batch (single gpu): 1602 +INFO:local_logger:----- Total # of val batch (single gpu): 1563 +INFO:local_logger:Creating teacher model: ./regnety_160 +INFO:local_logger:----- Load teacher model state from ./regnety_160 +INFO:local_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000079 +INFO:local_logger:----- Load teacher model state from ./regnety_160 +INFO:local_logger:Start training from epoch 1. +INFO:master_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000079 +INFO:master_logger:Now training epoch 1. 
LR=0.000079 +INFO:local_logger:----- Load teacher model state from ./regnety_160 +INFO:local_logger:----- Load teacher model state from ./regnety_160 +INFO:local_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000079 +INFO:local_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000079 +ERROR: Unexpected BUS error encountered in DataLoader worker. This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough +ERROR: Unexpected BUS error encountered in DataLoader worker. This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough +ERROR: Unexpected BUS error encountered in DataLoader worker. This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough +ERROR: Unexpected BUS error encountered in DataLoader worker. This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough +Exception in thread Thread-1: +Traceback (most recent call last): + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 583, in _get_data + data = self._data_queue.get(timeout=self._timeout) + File "/opt/conda/envs/py36/lib/python3.6/multiprocessing/queues.py", line 105, in get + raise Empty +queue.Empty + +During handling of the above exception, another exception occurred: + +Traceback (most recent call last): + File "/opt/conda/envs/py36/lib/python3.6/threading.py", line 916, in _bootstrap_inner + self.run() + File "/opt/conda/envs/py36/lib/python3.6/threading.py", line 864, in run + self._target(*self._args, **self._kwargs) + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 505, in _thread_loop + batch = self._get_data() + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 599, in _get_data + "pids: {}".format(len(failed_workers), pids)) +RuntimeError: DataLoader 1 workers exit unexpectedly, pids: 19442 + +/opt/conda/envs/py36/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 11 leaked semaphores to clean up at shutdown + len(cache)) +/opt/conda/envs/py36/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 11 leaked semaphores to clean up at shutdown + len(cache)) +/opt/conda/envs/py36/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 11 leaked semaphores to clean up at shutdown + len(cache)) +Traceback (most recent call last): + File "main_multi_gpu.py", line 619, in + main() + File "main_multi_gpu.py", line 615, in main + dist.spawn(main_worker, args=(config, dataset_train, dataset_val, ), nprocs=config.NGPUS) + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/distributed/spawn.py", line 502, in spawn + while not context.join(): + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/distributed/spawn.py", line 312, in join + self._throw_exception(error_index) + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/distributed/spawn.py", line 330, in _throw_exception + raise Exception(msg) +Exception: + +---------------------------------------------- +Process 2 terminated 
with the following error: +---------------------------------------------- + +Traceback (most recent call last): + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/distributed/spawn.py", line 261, in _func_wrapper + result = func(*args) + File "/workspace/ppvit_github/PaddleViT_Train/PaddleViT/image_classification/DeiT/main_multi_gpu.py", line 542, in main_worker + master_logger=master_logger) + File "/workspace/ppvit_github/PaddleViT_Train/PaddleViT/image_classification/DeiT/main_multi_gpu.py", line 128, in train + for batch_id, data in enumerate(dataloader): + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 697, in __next__ + data = self._reader.read_next_var_list() +SystemError: (Fatal) Blocking queue is killed because the data reader raises an exception. + [Hint: Expected killed_ != true, but received killed_:1 == true:1.] (at /paddle/paddle/fluid/operators/reader/blocking_queue.h:166) + + +merging config from ./configs/deit_tiny_patch16_224.yaml +----- Imagenet2012 image train list len = 1281167 +----- Imagenet2012 image val list len = 50000 +server not ready, wait 3 sec to retry... +not ready endpoints:['127.0.0.1:15092', '127.0.0.1:10405', '127.0.0.1:36726'] +I1019 17:32:50.348035 19967 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:15092 successful. +server not ready, wait 3 sec to retry... +not ready endpoints:['127.0.0.1:10405', '127.0.0.1:36726'] +I1019 17:32:52.781280 19982 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:10405 successful. +server not ready, wait 3 sec to retry... +not ready endpoints:['127.0.0.1:36726'] +I1019 17:32:54.970005 19999 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:36726 successful. +I1019 17:32:56.910451 19967 nccl_context.cc:74] init nccl context nranks: 4 local rank: 1 gpu id: 1 ring id: 0 +I1019 17:32:56.910456 19982 nccl_context.cc:74] init nccl context nranks: 4 local rank: 2 gpu id: 2 ring id: 0 +I1019 17:32:56.910470 19950 nccl_context.cc:74] init nccl context nranks: 4 local rank: 0 gpu id: 0 ring id: 0 +I1019 17:32:56.910496 19999 nccl_context.cc:74] init nccl context nranks: 4 local rank: 3 gpu id: 3 ring id: 0 +W1019 17:32:58.500495 19950 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1019 17:32:58.500913 19967 device_context.cc:447] Please NOTE: device: 1, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1019 17:32:58.500931 19982 device_context.cc:447] Please NOTE: device: 2, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1019 17:32:58.501097 19999 device_context.cc:447] Please NOTE: device: 3, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1019 17:32:58.505395 19982 device_context.cc:465] device: 2, cuDNN Version: 7.6. +W1019 17:32:58.505403 19967 device_context.cc:465] device: 1, cuDNN Version: 7.6. +W1019 17:32:58.505429 19999 device_context.cc:465] device: 3, cuDNN Version: 7.6. +W1019 17:32:58.505970 19950 device_context.cc:465] device: 0, cuDNN Version: 7.6. 
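
A quick way to sanity-check the /dev/shm advice printed by the two crashes above: the error text sizes the requirement as DataLoader count x queue capacity x one batch of data. With this run's 200-image batches of 3x224x224 float32 tensors, a single batch is already roughly 115 MiB, far above the 64 MB /dev/shm that Docker mounts by default. The check below is a back-of-the-envelope sketch: the queue capacity of 2 and the "one loader per spawned GPU worker" count are assumptions read off the error message and NGPUS, not values reported in this log.

```python
import shutil

# Rough /dev/shm sizing per the Paddle error message:
#   required > num_loaders * queue_capacity * bytes_per_batch
batch_size, channels, height, width = 200, 3, 224, 224        # DATA section of this run's config
bytes_per_batch = batch_size * channels * height * width * 4  # float32 images
num_loaders = 4        # assumed: one DataLoader per spawned GPU worker (NGPUS: 4)
queue_capacity = 2     # assumed: 2 * max(num_workers=1, len(places)=1)
required = num_loaders * queue_capacity * bytes_per_batch

usage = shutil.disk_usage('/dev/shm')  # same information `df -h /dev/shm` reports (Linux)
print(f"one batch   ~ {bytes_per_batch / 2**20:.0f} MiB")
print(f"required    ~ {required / 2**20:.0f} MiB")
print("/dev/shm free", usage.free // 2**20, "MiB ->",
      "OK" if usage.free > required else "increase --shm-size or set use_shared_memory=False")
```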
+INFO:local_logger:----- world_size = 4, local_rank = 1 +INFO:master_logger: +AMP: True +AUG: + AUTO_AUGMENT: rand-m9-mstd0.5-inc1 + COLOR_JITTER: 0.4 + CUTMIX: 1.0 + CUTMIX_MINMAX: None + MIXUP: 0.8 + MIXUP_MODE: batch + MIXUP_PROB: 1.0 + MIXUP_SWITCH_PROB: 0.5 + RE_COUNT: 1 + RE_MODE: pixel + RE_PROB: 0.25 +BASE: [''] +DATA: + BATCH_SIZE: 200 + BATCH_SIZE_EVAL: 8 + CROP_PCT: 0.875 + DATASET: imagenet2012 + DATA_PATH: /dataset/imagenet + IMAGE_SIZE: 224 + NUM_WORKERS: 1 +EVAL: False +LOCAL_RANK: 0 +MODEL: + ATTENTION_DROPOUT: 0.0 + DROPOUT: 0.0 + DROPPATH: 0.1 + NAME: deit_tiny_patch16_224 + NUM_CLASSES: 1000 + PRETRAINED: None + RESUME: None + RESUME_EMA: None + TRANS: + DEPTH: 12 + EMBED_DIM: 192 + INIT_VALUES: 1e-05 + IN_CHANNELS: 3 + MLP_RATIO: 4.0 + NUM_HEADS: 3 + PATCH_SIZE: 16 + QKV_BIAS: True + TYPE: DeiT +NGPUS: 4 +REPORT_FREQ: 50 +SAVE: ./output/train-20211019-17-32-41 +SAVE_FREQ: 1 +SEED: 42 +TAG: default +TRAIN: + ACCUM_ITER: 1 + AUTO_AUGMENT: True + BASE_LR: 0.0005 + COLOR_JITTER: 0.4 + CUTMIX_ALPHA: 1.0 + CUTMIX_MINMAX: None + DISTILLATION_ALPHA: 0.5 + DISTILLATION_TAU: 1.0 + DISTILLATION_TYPE: hard + END_LR: 1e-05 + GRAD_CLIP: None + LAST_EPOCH: 0 + LR_SCHEDULER: + DECAY_EPOCHS: 30 + DECAY_RATE: 0.1 + MILESTONES: 30, 60, 90 + NAME: warmupcosine + MIXUP_ALPHA: 0.8 + MIXUP_MODE: batch + MIXUP_PROB: 1.0 + MIXUP_SWITCH_PROB: 0.5 + MODEL_EMA: True + MODEL_EMA_DECAY: 0.99996 + NUM_EPOCHS: 300 + OPTIMIZER: + BETAS: (0.9, 0.999) + EPS: 1e-08 + MOMENTUM: 0.9 + NAME: AdamW + RANDOM_ERASE_COUNT: 1 + RANDOM_ERASE_MODE: pixel + RANDOM_ERASE_PROB: 0.25 + RANDOM_ERASE_SPLIT: False + SMOOTHING: 0.1 + TEACHER_MODEL: ./regnety_160 + WARMUP_EPOCHS: 5 + WARMUP_START_LR: 1e-06 + WEIGHT_DECAY: 0.05 +VALIDATE_FREQ: 10 +INFO:local_logger:----- world_size = 4, local_rank = 0 +INFO:master_logger:----- world_size = 4, local_rank = 0 +INFO:local_logger:----- world_size = 4, local_rank = 2 +INFO:local_logger:----- world_size = 4, local_rank = 3 +INFO:local_logger:----- Total # of train batch (single gpu): 1602 +INFO:local_logger:----- Total # of val batch (single gpu): 1563 +INFO:local_logger:Creating teacher model: ./regnety_160 +INFO:local_logger:----- Total # of train batch (single gpu): 1602 +INFO:local_logger:----- Total # of val batch (single gpu): 1563 +INFO:local_logger:Creating teacher model: ./regnety_160 +INFO:local_logger:----- Total # of train batch (single gpu): 1602 +INFO:local_logger:----- Total # of val batch (single gpu): 1563 +INFO:local_logger:Creating teacher model: ./regnety_160 +INFO:local_logger:----- Total # of train batch (single gpu): 1602 +INFO:local_logger:----- Total # of val batch (single gpu): 1563 +INFO:master_logger:----- Total # of train batch (single gpu): 1602 +INFO:master_logger:----- Total # of val batch (single gpu): 1563 +INFO:local_logger:Creating teacher model: ./regnety_160 +INFO:local_logger:----- Load teacher model state from ./regnety_160 +INFO:local_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000079 +INFO:local_logger:----- Load teacher model state from ./regnety_160 +INFO:local_logger:----- Load teacher model state from ./regnety_160 +INFO:local_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000079 +INFO:local_logger:Start training from epoch 1. +INFO:master_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000079 +INFO:master_logger:Now training epoch 1. 
LR=0.000079 +INFO:local_logger:----- Load teacher model state from ./regnety_160 +INFO:local_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000079 +/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/math_op_patch.py:248: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.float16, but right dtype is paddle.float32, the right dtype will convert to paddle.float16 + format(lhs_dtype, rhs_dtype, lhs_dtype)) +/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/math_op_patch.py:248: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.float32, but right dtype is paddle.float16, the right dtype will convert to paddle.float32 + format(lhs_dtype, rhs_dtype, lhs_dtype)) +INFO:local_logger:Epoch[001/300], Step[0000/1602], Avg Loss: 7.0708, Avg Acc: 0.0000 +/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/math_op_patch.py:248: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.float16, but right dtype is paddle.float32, the right dtype will convert to paddle.float16 + format(lhs_dtype, rhs_dtype, lhs_dtype)) +/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/math_op_patch.py:248: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.float32, but right dtype is paddle.float16, the right dtype will convert to paddle.float32 + format(lhs_dtype, rhs_dtype, lhs_dtype)) +INFO:local_logger:Epoch[001/300], Step[0000/1602], Avg Loss: 7.1262, Avg Acc: 0.0000 +/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/math_op_patch.py:248: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.float16, but right dtype is paddle.float32, the right dtype will convert to paddle.float16 + format(lhs_dtype, rhs_dtype, lhs_dtype)) +/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/math_op_patch.py:248: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.float32, but right dtype is paddle.float16, the right dtype will convert to paddle.float32 + format(lhs_dtype, rhs_dtype, lhs_dtype)) +INFO:local_logger:Epoch[001/300], Step[0000/1602], Avg Loss: 7.0283, Avg Acc: 0.0000 +/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/math_op_patch.py:248: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.float16, but right dtype is paddle.float32, the right dtype will convert to paddle.float16 + format(lhs_dtype, rhs_dtype, lhs_dtype)) +/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/math_op_patch.py:248: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.float32, but right dtype is paddle.float16, the right dtype will convert to paddle.float32 + format(lhs_dtype, rhs_dtype, lhs_dtype)) +INFO:local_logger:Epoch[001/300], Step[0000/1602], Avg Loss: 7.0642, Avg Acc: 0.0000 +INFO:master_logger:Epoch[001/300], Step[0000/1602], Avg Loss: 7.0724, Avg Acc: 0.0000 +INFO:local_logger:Epoch[001/300], Step[0050/1602], Avg Loss: 7.0130, Avg Acc: 0.0016 +INFO:local_logger:Epoch[001/300], Step[0050/1602], Avg Loss: 7.0076, Avg Acc: 0.0011 +INFO:local_logger:Epoch[001/300], Step[0050/1602], Avg Loss: 7.0059, Avg Acc: 0.0009 +INFO:master_logger:Epoch[001/300], Step[0050/1602], Avg Loss: 7.0113, Avg Acc: 0.0011 +INFO:local_logger:Epoch[001/300], Step[0050/1602], Avg 
Loss: 7.0189, Avg Acc: 0.0010 +INFO:local_logger:Epoch[001/300], Step[0100/1602], Avg Loss: 6.9807, Avg Acc: 0.0014 +INFO:local_logger:Epoch[001/300], Step[0100/1602], Avg Loss: 6.9762, Avg Acc: 0.0011 +INFO:local_logger:Epoch[001/300], Step[0100/1602], Avg Loss: 6.9792, Avg Acc: 0.0013 +INFO:master_logger:Epoch[001/300], Step[0100/1602], Avg Loss: 6.9797, Avg Acc: 0.0012 +INFO:local_logger:Epoch[001/300], Step[0100/1602], Avg Loss: 6.9829, Avg Acc: 0.0009 +INFO:local_logger:Epoch[001/300], Step[0150/1602], Avg Loss: 6.9615, Avg Acc: 0.0011 +INFO:local_logger:Epoch[001/300], Step[0150/1602], Avg Loss: 6.9606, Avg Acc: 0.0014 +INFO:local_logger:Epoch[001/300], Step[0150/1602], Avg Loss: 6.9580, Avg Acc: 0.0011 +INFO:local_logger:Epoch[001/300], Step[0150/1602], Avg Loss: 6.9586, Avg Acc: 0.0014 +INFO:master_logger:Epoch[001/300], Step[0150/1602], Avg Loss: 6.9597, Avg Acc: 0.0012 +INFO:local_logger:Epoch[001/300], Step[0200/1602], Avg Loss: 6.9460, Avg Acc: 0.0013 +INFO:local_logger:Epoch[001/300], Step[0200/1602], Avg Loss: 6.9465, Avg Acc: 0.0014 +INFO:local_logger:Epoch[001/300], Step[0200/1602], Avg Loss: 6.9474, Avg Acc: 0.0014 +INFO:local_logger:Epoch[001/300], Step[0200/1602], Avg Loss: 6.9480, Avg Acc: 0.0013 +INFO:master_logger:Epoch[001/300], Step[0200/1602], Avg Loss: 6.9470, Avg Acc: 0.0013 +INFO:local_logger:Epoch[001/300], Step[0250/1602], Avg Loss: 6.9377, Avg Acc: 0.0014 +INFO:local_logger:Epoch[001/300], Step[0250/1602], Avg Loss: 6.9382, Avg Acc: 0.0014 +INFO:local_logger:Epoch[001/300], Step[0250/1602], Avg Loss: 6.9394, Avg Acc: 0.0012 +INFO:local_logger:Epoch[001/300], Step[0250/1602], Avg Loss: 6.9375, Avg Acc: 0.0013 +INFO:master_logger:Epoch[001/300], Step[0250/1602], Avg Loss: 6.9382, Avg Acc: 0.0013 +INFO:local_logger:Epoch[001/300], Step[0300/1602], Avg Loss: 6.9307, Avg Acc: 0.0013 +INFO:local_logger:Epoch[001/300], Step[0300/1602], Avg Loss: 6.9324, Avg Acc: 0.0013 +INFO:local_logger:Epoch[001/300], Step[0300/1602], Avg Loss: 6.9310, Avg Acc: 0.0014 +INFO:local_logger:Epoch[001/300], Step[0300/1602], Avg Loss: 6.9319, Avg Acc: 0.0015 +INFO:master_logger:Epoch[001/300], Step[0300/1602], Avg Loss: 6.9315, Avg Acc: 0.0014 +INFO:local_logger:Epoch[001/300], Step[0350/1602], Avg Loss: 6.9255, Avg Acc: 0.0015 +INFO:local_logger:Epoch[001/300], Step[0350/1602], Avg Loss: 6.9251, Avg Acc: 0.0014 +INFO:local_logger:Epoch[001/300], Step[0350/1602], Avg Loss: 6.9265, Avg Acc: 0.0016 +INFO:local_logger:Epoch[001/300], Step[0350/1602], Avg Loss: 6.9269, Avg Acc: 0.0014 +INFO:master_logger:Epoch[001/300], Step[0350/1602], Avg Loss: 6.9260, Avg Acc: 0.0015 +INFO:local_logger:Epoch[001/300], Step[0400/1602], Avg Loss: 6.9208, Avg Acc: 0.0014 +INFO:local_logger:Epoch[001/300], Step[0400/1602], Avg Loss: 6.9203, Avg Acc: 0.0014 +INFO:local_logger:Epoch[001/300], Step[0400/1602], Avg Loss: 6.9219, Avg Acc: 0.0017 +INFO:local_logger:Epoch[001/300], Step[0400/1602], Avg Loss: 6.9221, Avg Acc: 0.0014 +INFO:master_logger:Epoch[001/300], Step[0400/1602], Avg Loss: 6.9213, Avg Acc: 0.0015 +INFO:local_logger:Epoch[001/300], Step[0450/1602], Avg Loss: 6.9162, Avg Acc: 0.0015 +INFO:local_logger:Epoch[001/300], Step[0450/1602], Avg Loss: 6.9179, Avg Acc: 0.0016 +INFO:local_logger:Epoch[001/300], Step[0450/1602], Avg Loss: 6.9158, Avg Acc: 0.0015 +INFO:local_logger:Epoch[001/300], Step[0450/1602], Avg Loss: 6.9177, Avg Acc: 0.0018 +INFO:master_logger:Epoch[001/300], Step[0450/1602], Avg Loss: 6.9169, Avg Acc: 0.0016 +INFO:local_logger:Epoch[001/300], Step[0500/1602], Avg Loss: 6.9114, Avg 
Acc: 0.0015 +INFO:local_logger:Epoch[001/300], Step[0500/1602], Avg Loss: 6.9139, Avg Acc: 0.0016 +INFO:local_logger:Epoch[001/300], Step[0500/1602], Avg Loss: 6.9132, Avg Acc: 0.0018 +INFO:local_logger:Epoch[001/300], Step[0500/1602], Avg Loss: 6.9119, Avg Acc: 0.0017 +INFO:master_logger:Epoch[001/300], Step[0500/1602], Avg Loss: 6.9126, Avg Acc: 0.0017 +INFO:local_logger:Epoch[001/300], Step[0550/1602], Avg Loss: 6.9072, Avg Acc: 0.0017 +INFO:local_logger:Epoch[001/300], Step[0550/1602], Avg Loss: 6.9092, Avg Acc: 0.0017 +INFO:local_logger:Epoch[001/300], Step[0550/1602], Avg Loss: 6.9078, Avg Acc: 0.0018 +INFO:local_logger:Epoch[001/300], Step[0550/1602], Avg Loss: 6.9084, Avg Acc: 0.0018 +INFO:master_logger:Epoch[001/300], Step[0550/1602], Avg Loss: 6.9081, Avg Acc: 0.0017 +INFO:local_logger:Epoch[001/300], Step[0600/1602], Avg Loss: 6.9021, Avg Acc: 0.0018 +INFO:local_logger:Epoch[001/300], Step[0600/1602], Avg Loss: 6.9038, Avg Acc: 0.0020 +INFO:local_logger:Epoch[001/300], Step[0600/1602], Avg Loss: 6.9032, Avg Acc: 0.0018 +INFO:local_logger:Epoch[001/300], Step[0600/1602], Avg Loss: 6.9045, Avg Acc: 0.0018 +INFO:master_logger:Epoch[001/300], Step[0600/1602], Avg Loss: 6.9034, Avg Acc: 0.0019 +INFO:local_logger:Epoch[001/300], Step[0650/1602], Avg Loss: 6.8977, Avg Acc: 0.0020 +INFO:local_logger:Epoch[001/300], Step[0650/1602], Avg Loss: 6.8996, Avg Acc: 0.0019 +INFO:local_logger:Epoch[001/300], Step[0650/1602], Avg Loss: 6.8984, Avg Acc: 0.0019 +INFO:local_logger:Epoch[001/300], Step[0650/1602], Avg Loss: 6.8992, Avg Acc: 0.0021 +INFO:master_logger:Epoch[001/300], Step[0650/1602], Avg Loss: 6.8987, Avg Acc: 0.0020 +INFO:local_logger:Epoch[001/300], Step[0700/1602], Avg Loss: 6.8922, Avg Acc: 0.0021 +INFO:local_logger:Epoch[001/300], Step[0700/1602], Avg Loss: 6.8940, Avg Acc: 0.0022 +INFO:local_logger:Epoch[001/300], Step[0700/1602], Avg Loss: 6.8931, Avg Acc: 0.0020 +INFO:local_logger:Epoch[001/300], Step[0700/1602], Avg Loss: 6.8950, Avg Acc: 0.0020 +INFO:master_logger:Epoch[001/300], Step[0700/1602], Avg Loss: 6.8936, Avg Acc: 0.0021 +INFO:local_logger:Epoch[001/300], Step[0750/1602], Avg Loss: 6.8900, Avg Acc: 0.0022 +INFO:local_logger:Epoch[001/300], Step[0750/1602], Avg Loss: 6.8872, Avg Acc: 0.0023 +INFO:local_logger:Epoch[001/300], Step[0750/1602], Avg Loss: 6.8887, Avg Acc: 0.0023 +INFO:local_logger:Epoch[001/300], Step[0750/1602], Avg Loss: 6.8883, Avg Acc: 0.0021 +INFO:master_logger:Epoch[001/300], Step[0750/1602], Avg Loss: 6.8885, Avg Acc: 0.0022 +INFO:local_logger:Epoch[001/300], Step[0800/1602], Avg Loss: 6.8823, Avg Acc: 0.0024 +INFO:local_logger:Epoch[001/300], Step[0800/1602], Avg Loss: 6.8828, Avg Acc: 0.0024 +INFO:local_logger:Epoch[001/300], Step[0800/1602], Avg Loss: 6.8833, Avg Acc: 0.0022 +INFO:local_logger:Epoch[001/300], Step[0800/1602], Avg Loss: 6.8847, Avg Acc: 0.0023 +INFO:master_logger:Epoch[001/300], Step[0800/1602], Avg Loss: 6.8833, Avg Acc: 0.0023 +INFO:local_logger:Epoch[001/300], Step[0850/1602], Avg Loss: 6.8791, Avg Acc: 0.0024 +INFO:local_logger:Epoch[001/300], Step[0850/1602], Avg Loss: 6.8778, Avg Acc: 0.0023 +INFO:local_logger:Epoch[001/300], Step[0850/1602], Avg Loss: 6.8774, Avg Acc: 0.0025 +INFO:local_logger:Epoch[001/300], Step[0850/1602], Avg Loss: 6.8772, Avg Acc: 0.0024 +INFO:master_logger:Epoch[001/300], Step[0850/1602], Avg Loss: 6.8779, Avg Acc: 0.0024 +INFO:local_logger:Epoch[001/300], Step[0900/1602], Avg Loss: 6.8718, Avg Acc: 0.0025 +INFO:local_logger:Epoch[001/300], Step[0900/1602], Avg Loss: 6.8718, Avg Acc: 0.0026 
+INFO:local_logger:Epoch[001/300], Step[0900/1602], Avg Loss: 6.8718, Avg Acc: 0.0025 +INFO:master_logger:Epoch[001/300], Step[0900/1602], Avg Loss: 6.8723, Avg Acc: 0.0025 +INFO:local_logger:Epoch[001/300], Step[0900/1602], Avg Loss: 6.8737, Avg Acc: 0.0026 +INFO:local_logger:Epoch[001/300], Step[0950/1602], Avg Loss: 6.8659, Avg Acc: 0.0027 +INFO:local_logger:Epoch[001/300], Step[0950/1602], Avg Loss: 6.8665, Avg Acc: 0.0026 +INFO:local_logger:Epoch[001/300], Step[0950/1602], Avg Loss: 6.8681, Avg Acc: 0.0027 +INFO:local_logger:Epoch[001/300], Step[0950/1602], Avg Loss: 6.8662, Avg Acc: 0.0027 +INFO:master_logger:Epoch[001/300], Step[0950/1602], Avg Loss: 6.8667, Avg Acc: 0.0027 +INFO:local_logger:Epoch[001/300], Step[1000/1602], Avg Loss: 6.8603, Avg Acc: 0.0027 +INFO:local_logger:Epoch[001/300], Step[1000/1602], Avg Loss: 6.8599, Avg Acc: 0.0029 +INFO:local_logger:Epoch[001/300], Step[1000/1602], Avg Loss: 6.8600, Avg Acc: 0.0028 +INFO:master_logger:Epoch[001/300], Step[1000/1602], Avg Loss: 6.8606, Avg Acc: 0.0028 +INFO:local_logger:Epoch[001/300], Step[1000/1602], Avg Loss: 6.8623, Avg Acc: 0.0028 +INFO:local_logger:Epoch[001/300], Step[1050/1602], Avg Loss: 6.8536, Avg Acc: 0.0029 +INFO:local_logger:Epoch[001/300], Step[1050/1602], Avg Loss: 6.8539, Avg Acc: 0.0029 +INFO:local_logger:Epoch[001/300], Step[1050/1602], Avg Loss: 6.8543, Avg Acc: 0.0030 +INFO:local_logger:Epoch[001/300], Step[1050/1602], Avg Loss: 6.8571, Avg Acc: 0.0029 +INFO:master_logger:Epoch[001/300], Step[1050/1602], Avg Loss: 6.8547, Avg Acc: 0.0029 +INFO:local_logger:Epoch[001/300], Step[1100/1602], Avg Loss: 6.8474, Avg Acc: 0.0030 +INFO:local_logger:Epoch[001/300], Step[1100/1602], Avg Loss: 6.8479, Avg Acc: 0.0029 +INFO:local_logger:Epoch[001/300], Step[1100/1602], Avg Loss: 6.8482, Avg Acc: 0.0032 +INFO:master_logger:Epoch[001/300], Step[1100/1602], Avg Loss: 6.8485, Avg Acc: 0.0030 +INFO:local_logger:Epoch[001/300], Step[1100/1602], Avg Loss: 6.8506, Avg Acc: 0.0030 +INFO:local_logger:Epoch[001/300], Step[1150/1602], Avg Loss: 6.8410, Avg Acc: 0.0032 +INFO:local_logger:Epoch[001/300], Step[1150/1602], Avg Loss: 6.8409, Avg Acc: 0.0031 +INFO:local_logger:Epoch[001/300], Step[1150/1602], Avg Loss: 6.8437, Avg Acc: 0.0031 +INFO:local_logger:Epoch[001/300], Step[1150/1602], Avg Loss: 6.8418, Avg Acc: 0.0033 +INFO:master_logger:Epoch[001/300], Step[1150/1602], Avg Loss: 6.8418, Avg Acc: 0.0032 +INFO:local_logger:Epoch[001/300], Step[1200/1602], Avg Loss: 6.8369, Avg Acc: 0.0033 +INFO:local_logger:Epoch[001/300], Step[1200/1602], Avg Loss: 6.8345, Avg Acc: 0.0033 +INFO:local_logger:Epoch[001/300], Step[1200/1602], Avg Loss: 6.8355, Avg Acc: 0.0035 +INFO:local_logger:Epoch[001/300], Step[1200/1602], Avg Loss: 6.8346, Avg Acc: 0.0033 +INFO:master_logger:Epoch[001/300], Step[1200/1602], Avg Loss: 6.8354, Avg Acc: 0.0033 +INFO:local_logger:Epoch[001/300], Step[1250/1602], Avg Loss: 6.8308, Avg Acc: 0.0034 +INFO:local_logger:Epoch[001/300], Step[1250/1602], Avg Loss: 6.8281, Avg Acc: 0.0034 +INFO:local_logger:Epoch[001/300], Step[1250/1602], Avg Loss: 6.8290, Avg Acc: 0.0036 +INFO:local_logger:Epoch[001/300], Step[1250/1602], Avg Loss: 6.8281, Avg Acc: 0.0034 +INFO:master_logger:Epoch[001/300], Step[1250/1602], Avg Loss: 6.8290, Avg Acc: 0.0034 +INFO:local_logger:Epoch[001/300], Step[1300/1602], Avg Loss: 6.8225, Avg Acc: 0.0037 +INFO:local_logger:Epoch[001/300], Step[1300/1602], Avg Loss: 6.8220, Avg Acc: 0.0035 +INFO:local_logger:Epoch[001/300], Step[1300/1602], Avg Loss: 6.8217, Avg Acc: 0.0035 
+INFO:master_logger:Epoch[001/300], Step[1300/1602], Avg Loss: 6.8226, Avg Acc: 0.0036
[... per-step running averages (master_logger plus the four per-GPU local_logger processes, reported every 50 steps) omitted ...]
+INFO:master_logger:Epoch[001/300], Step[1600/1602], Avg Loss: 6.7855, Avg Acc: 0.0044
+INFO:master_logger:----- Epoch[001/300], Train Loss: 6.7854, Train Acc: 0.0044, time: 3734.22
+INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-1-Loss-6.784782145044278.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-1-Loss-6.784782145044278.pdopt
+INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-1-Loss-6.784782145044278-EMA.pdparams
+INFO:master_logger:Now training epoch 2. LR=0.000157
+INFO:master_logger:Epoch[002/300], Step[0000/1602], Avg Loss: 6.5941, Avg Acc: 0.0150
[... per-step log entries omitted ...]
+INFO:master_logger:Epoch[002/300], Step[1600/1602], Avg Loss: 6.4550, Avg Acc: 0.0119
+INFO:master_logger:----- Epoch[002/300], Train Loss: 6.4551, Train Acc: 0.0119, time: 3698.63
+INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-2-Loss-6.460342179315635.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-2-Loss-6.460342179315635.pdopt
+INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-2-Loss-6.460342179315635-EMA.pdparams
+INFO:master_logger:Now training epoch 3. LR=0.000235
+INFO:master_logger:Epoch[003/300], Step[0000/1602], Avg Loss: 6.2954, Avg Acc: 0.0137
[... per-step log entries omitted ...]
+INFO:master_logger:Epoch[003/300], Step[1600/1602], Avg Loss: 6.2782, Avg Acc: 0.0184
+INFO:master_logger:----- Epoch[003/300], Train Loss: 6.2782, Train Acc: 0.0184, time: 3689.32
+INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-3-Loss-6.282672873921686.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-3-Loss-6.282672873921686.pdopt
+INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-3-Loss-6.282672873921686-EMA.pdparams
+INFO:master_logger:Now training epoch 4. LR=0.000313
+INFO:master_logger:Epoch[004/300], Step[0000/1602], Avg Loss: 6.3012, Avg Acc: 0.0312
[... per-step log entries omitted ...]
+INFO:master_logger:Epoch[004/300], Step[1600/1602], Avg Loss: 6.1102, Avg Acc: 0.0263
+INFO:master_logger:----- Epoch[004/300], Train Loss: 6.1102, Train Acc: 0.0263, time: 3705.98
+INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-4-Loss-6.109469821331523.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-4-Loss-6.109469821331523.pdopt
+INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-4-Loss-6.109469821331523-EMA.pdparams
+INFO:master_logger:Now training epoch 5. LR=0.000391
+INFO:master_logger:Epoch[005/300], Step[0000/1602], Avg Loss: 5.7856, Avg Acc: 0.0300
[... per-step log entries omitted ...]
+INFO:master_logger:Epoch[005/300], Step[1600/1602], Avg Loss: 5.9418, Avg Acc: 0.0364
+INFO:master_logger:----- Epoch[005/300], Train Loss: 5.9418, Train Acc: 0.0364, time: 3707.89
+INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-5-Loss-5.944693184068049.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-5-Loss-5.944693184068049.pdopt
+INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-5-Loss-5.944693184068049-EMA.pdparams
+INFO:master_logger:Now training epoch 6. LR=0.000391
+INFO:master_logger:Epoch[006/300], Step[0000/1602], Avg Loss: 5.7462, Avg Acc: 0.0575
[... per-step log entries omitted ...]
+INFO:master_logger:Epoch[006/300], Step[0750/1602], Avg Loss: 5.8079, Avg Acc: 0.0447
+INFO:local_logger:Epoch[006/300], Step[0800/1602], Avg Loss: 5.8054, Avg Acc: 0.0445
+INFO:local_logger:Epoch[006/300], Step[0800/1602], Avg Loss: 5.7900, Avg Acc: 0.0459
+INFO:local_logger:Epoch[006/300], Step[0800/1602], Avg Loss: 5.7954, Avg Acc: 0.0477 +INFO:master_logger:Epoch[006/300], Step[0800/1602], Avg Loss: 5.8036, Avg Acc: 0.0448 +INFO:local_logger:Epoch[006/300], Step[0800/1602], Avg Loss: 5.8237, Avg Acc: 0.0413 +INFO:local_logger:Epoch[006/300], Step[0850/1602], Avg Loss: 5.8016, Avg Acc: 0.0444 +INFO:local_logger:Epoch[006/300], Step[0850/1602], Avg Loss: 5.7915, Avg Acc: 0.0460 +INFO:local_logger:Epoch[006/300], Step[0850/1602], Avg Loss: 5.7913, Avg Acc: 0.0474 +INFO:local_logger:Epoch[006/300], Step[0850/1602], Avg Loss: 5.8201, Avg Acc: 0.0418 +INFO:master_logger:Epoch[006/300], Step[0850/1602], Avg Loss: 5.8011, Avg Acc: 0.0449 +INFO:local_logger:Epoch[006/300], Step[0900/1602], Avg Loss: 5.8001, Avg Acc: 0.0449 +INFO:local_logger:Epoch[006/300], Step[0900/1602], Avg Loss: 5.7882, Avg Acc: 0.0463 +INFO:local_logger:Epoch[006/300], Step[0900/1602], Avg Loss: 5.7871, Avg Acc: 0.0473 +INFO:master_logger:Epoch[006/300], Step[0900/1602], Avg Loss: 5.7962, Avg Acc: 0.0452 +INFO:local_logger:Epoch[006/300], Step[0900/1602], Avg Loss: 5.8094, Avg Acc: 0.0421 +INFO:local_logger:Epoch[006/300], Step[0950/1602], Avg Loss: 5.7941, Avg Acc: 0.0451 +INFO:local_logger:Epoch[006/300], Step[0950/1602], Avg Loss: 5.8050, Avg Acc: 0.0427 +INFO:local_logger:Epoch[006/300], Step[0950/1602], Avg Loss: 5.7849, Avg Acc: 0.0466 +INFO:local_logger:Epoch[006/300], Step[0950/1602], Avg Loss: 5.7898, Avg Acc: 0.0471 +INFO:master_logger:Epoch[006/300], Step[0950/1602], Avg Loss: 5.7934, Avg Acc: 0.0454 +INFO:local_logger:Epoch[006/300], Step[1000/1602], Avg Loss: 5.7799, Avg Acc: 0.0470 +INFO:local_logger:Epoch[006/300], Step[1000/1602], Avg Loss: 5.7905, Avg Acc: 0.0452 +INFO:local_logger:Epoch[006/300], Step[1000/1602], Avg Loss: 5.7822, Avg Acc: 0.0474 +INFO:local_logger:Epoch[006/300], Step[1000/1602], Avg Loss: 5.8011, Avg Acc: 0.0429 +INFO:master_logger:Epoch[006/300], Step[1000/1602], Avg Loss: 5.7884, Avg Acc: 0.0457 +INFO:local_logger:Epoch[006/300], Step[1050/1602], Avg Loss: 5.7861, Avg Acc: 0.0458 +INFO:local_logger:Epoch[006/300], Step[1050/1602], Avg Loss: 5.7938, Avg Acc: 0.0434 +INFO:local_logger:Epoch[006/300], Step[1050/1602], Avg Loss: 5.7769, Avg Acc: 0.0475 +INFO:master_logger:Epoch[006/300], Step[1050/1602], Avg Loss: 5.7833, Avg Acc: 0.0461 +INFO:local_logger:Epoch[006/300], Step[1050/1602], Avg Loss: 5.7762, Avg Acc: 0.0476 +INFO:local_logger:Epoch[006/300], Step[1100/1602], Avg Loss: 5.7825, Avg Acc: 0.0461 +INFO:local_logger:Epoch[006/300], Step[1100/1602], Avg Loss: 5.7716, Avg Acc: 0.0478 +INFO:local_logger:Epoch[006/300], Step[1100/1602], Avg Loss: 5.7739, Avg Acc: 0.0476 +INFO:local_logger:Epoch[006/300], Step[1100/1602], Avg Loss: 5.7922, Avg Acc: 0.0436 +INFO:master_logger:Epoch[006/300], Step[1100/1602], Avg Loss: 5.7801, Avg Acc: 0.0463 +INFO:local_logger:Epoch[006/300], Step[1150/1602], Avg Loss: 5.7759, Avg Acc: 0.0467 +INFO:local_logger:Epoch[006/300], Step[1150/1602], Avg Loss: 5.7720, Avg Acc: 0.0478 +INFO:local_logger:Epoch[006/300], Step[1150/1602], Avg Loss: 5.7887, Avg Acc: 0.0439 +INFO:master_logger:Epoch[006/300], Step[1150/1602], Avg Loss: 5.7768, Avg Acc: 0.0465 +INFO:local_logger:Epoch[006/300], Step[1150/1602], Avg Loss: 5.7705, Avg Acc: 0.0477 +INFO:local_logger:Epoch[006/300], Step[1200/1602], Avg Loss: 5.7728, Avg Acc: 0.0468 +INFO:local_logger:Epoch[006/300], Step[1200/1602], Avg Loss: 5.7670, Avg Acc: 0.0484 +INFO:local_logger:Epoch[006/300], Step[1200/1602], Avg Loss: 5.7660, Avg Acc: 0.0481 
+INFO:master_logger:Epoch[006/300], Step[1200/1602], Avg Loss: 5.7726, Avg Acc: 0.0469 +INFO:local_logger:Epoch[006/300], Step[1200/1602], Avg Loss: 5.7845, Avg Acc: 0.0443 +INFO:local_logger:Epoch[006/300], Step[1250/1602], Avg Loss: 5.7634, Avg Acc: 0.0487 +INFO:local_logger:Epoch[006/300], Step[1250/1602], Avg Loss: 5.7652, Avg Acc: 0.0484 +INFO:local_logger:Epoch[006/300], Step[1250/1602], Avg Loss: 5.7799, Avg Acc: 0.0447 +INFO:local_logger:Epoch[006/300], Step[1250/1602], Avg Loss: 5.7729, Avg Acc: 0.0470 +INFO:master_logger:Epoch[006/300], Step[1250/1602], Avg Loss: 5.7704, Avg Acc: 0.0472 +INFO:local_logger:Epoch[006/300], Step[1300/1602], Avg Loss: 5.7704, Avg Acc: 0.0472 +INFO:local_logger:Epoch[006/300], Step[1300/1602], Avg Loss: 5.7766, Avg Acc: 0.0448 +INFO:local_logger:Epoch[006/300], Step[1300/1602], Avg Loss: 5.7601, Avg Acc: 0.0492 +INFO:local_logger:Epoch[006/300], Step[1300/1602], Avg Loss: 5.7595, Avg Acc: 0.0485 +INFO:master_logger:Epoch[006/300], Step[1300/1602], Avg Loss: 5.7667, Avg Acc: 0.0474 +INFO:local_logger:Epoch[006/300], Step[1350/1602], Avg Loss: 5.7727, Avg Acc: 0.0454 +INFO:local_logger:Epoch[006/300], Step[1350/1602], Avg Loss: 5.7670, Avg Acc: 0.0470 +INFO:local_logger:Epoch[006/300], Step[1350/1602], Avg Loss: 5.7560, Avg Acc: 0.0486 +INFO:local_logger:Epoch[006/300], Step[1350/1602], Avg Loss: 5.7595, Avg Acc: 0.0494 +INFO:master_logger:Epoch[006/300], Step[1350/1602], Avg Loss: 5.7638, Avg Acc: 0.0476 +INFO:local_logger:Epoch[006/300], Step[1400/1602], Avg Loss: 5.7691, Avg Acc: 0.0451 +INFO:local_logger:Epoch[006/300], Step[1400/1602], Avg Loss: 5.7561, Avg Acc: 0.0486 +INFO:local_logger:Epoch[006/300], Step[1400/1602], Avg Loss: 5.7567, Avg Acc: 0.0495 +INFO:local_logger:Epoch[006/300], Step[1400/1602], Avg Loss: 5.7669, Avg Acc: 0.0470 +INFO:master_logger:Epoch[006/300], Step[1400/1602], Avg Loss: 5.7622, Avg Acc: 0.0476 +INFO:local_logger:Epoch[006/300], Step[1450/1602], Avg Loss: 5.7639, Avg Acc: 0.0470 +INFO:master_logger:Epoch[006/300], Step[1450/1602], Avg Loss: 5.7589, Avg Acc: 0.0479 +INFO:local_logger:Epoch[006/300], Step[1450/1602], Avg Loss: 5.7540, Avg Acc: 0.0487 +INFO:local_logger:Epoch[006/300], Step[1450/1602], Avg Loss: 5.7522, Avg Acc: 0.0500 +INFO:local_logger:Epoch[006/300], Step[1450/1602], Avg Loss: 5.7657, Avg Acc: 0.0457 +INFO:local_logger:Epoch[006/300], Step[1500/1602], Avg Loss: 5.7654, Avg Acc: 0.0459 +INFO:local_logger:Epoch[006/300], Step[1500/1602], Avg Loss: 5.7496, Avg Acc: 0.0491 +INFO:local_logger:Epoch[006/300], Step[1500/1602], Avg Loss: 5.7476, Avg Acc: 0.0498 +INFO:local_logger:Epoch[006/300], Step[1500/1602], Avg Loss: 5.7627, Avg Acc: 0.0470 +INFO:master_logger:Epoch[006/300], Step[1500/1602], Avg Loss: 5.7563, Avg Acc: 0.0479 +INFO:local_logger:Epoch[006/300], Step[1550/1602], Avg Loss: 5.7614, Avg Acc: 0.0473 +INFO:local_logger:Epoch[006/300], Step[1550/1602], Avg Loss: 5.7626, Avg Acc: 0.0462 +INFO:local_logger:Epoch[006/300], Step[1550/1602], Avg Loss: 5.7466, Avg Acc: 0.0498 +INFO:local_logger:Epoch[006/300], Step[1550/1602], Avg Loss: 5.7493, Avg Acc: 0.0493 +INFO:master_logger:Epoch[006/300], Step[1550/1602], Avg Loss: 5.7550, Avg Acc: 0.0482 +INFO:local_logger:Epoch[006/300], Step[1600/1602], Avg Loss: 5.7601, Avg Acc: 0.0462 +INFO:local_logger:Epoch[006/300], Step[1600/1602], Avg Loss: 5.7607, Avg Acc: 0.0476 +INFO:local_logger:Epoch[006/300], Step[1600/1602], Avg Loss: 5.7435, Avg Acc: 0.0496 +INFO:local_logger:Epoch[006/300], Step[1600/1602], Avg Loss: 5.7462, Avg Acc: 0.0494 
+INFO:master_logger:Epoch[006/300], Step[1600/1602], Avg Loss: 5.7526, Avg Acc: 0.0482 +INFO:local_logger:----- Epoch[006/300], Train Loss: 5.7461, Train Acc: 0.0495, time: 3716.71 +INFO:local_logger:Now training epoch 7. LR=0.000391 +INFO:local_logger:----- Epoch[006/300], Train Loss: 5.7606, Train Acc: 0.0476, time: 3716.53 +INFO:master_logger:----- Epoch[006/300], Train Loss: 5.7525, Train Acc: 0.0482, time: 3716.53 +INFO:local_logger:----- Epoch[006/300], Train Loss: 5.7434, Train Acc: 0.0496, time: 3716.77 +INFO:local_logger:Now training epoch 7. LR=0.000391 +INFO:local_logger:----- Epoch[006/300], Train Loss: 5.7599, Train Acc: 0.0462, time: 3716.79 +INFO:local_logger:Now training epoch 7. LR=0.000391 +INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-6-Loss-5.76062138763371.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-6-Loss-5.76062138763371.pdopt +INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-6-Loss-5.76062138763371-EMA.pdparams +INFO:local_logger:Now training epoch 7. LR=0.000391 +INFO:master_logger:Now training epoch 7. LR=0.000391 +INFO:local_logger:Epoch[007/300], Step[0000/1602], Avg Loss: 5.1478, Avg Acc: 0.1150 +INFO:local_logger:Epoch[007/300], Step[0000/1602], Avg Loss: 5.5719, Avg Acc: 0.0800 +INFO:local_logger:Epoch[007/300], Step[0000/1602], Avg Loss: 5.1127, Avg Acc: 0.1450 +INFO:master_logger:Epoch[007/300], Step[0000/1602], Avg Loss: 5.3195, Avg Acc: 0.0862 +INFO:local_logger:Epoch[007/300], Step[0000/1602], Avg Loss: 5.4455, Avg Acc: 0.0050 +INFO:local_logger:Epoch[007/300], Step[0050/1602], Avg Loss: 5.6266, Avg Acc: 0.0529 +INFO:local_logger:Epoch[007/300], Step[0050/1602], Avg Loss: 5.6447, Avg Acc: 0.0607 +INFO:local_logger:Epoch[007/300], Step[0050/1602], Avg Loss: 5.6618, Avg Acc: 0.0553 +INFO:local_logger:Epoch[007/300], Step[0050/1602], Avg Loss: 5.6362, Avg Acc: 0.0546 +INFO:master_logger:Epoch[007/300], Step[0050/1602], Avg Loss: 5.6423, Avg Acc: 0.0559 +INFO:local_logger:Epoch[007/300], Step[0100/1602], Avg Loss: 5.6625, Avg Acc: 0.0566 +INFO:local_logger:Epoch[007/300], Step[0100/1602], Avg Loss: 5.6619, Avg Acc: 0.0557 +INFO:local_logger:Epoch[007/300], Step[0100/1602], Avg Loss: 5.6856, Avg Acc: 0.0551 +INFO:local_logger:Epoch[007/300], Step[0100/1602], Avg Loss: 5.6459, Avg Acc: 0.0550 +INFO:master_logger:Epoch[007/300], Step[0100/1602], Avg Loss: 5.6640, Avg Acc: 0.0556 +INFO:local_logger:Epoch[007/300], Step[0150/1602], Avg Loss: 5.6533, Avg Acc: 0.0496 +INFO:local_logger:Epoch[007/300], Step[0150/1602], Avg Loss: 5.6704, Avg Acc: 0.0512 +INFO:local_logger:Epoch[007/300], Step[0150/1602], Avg Loss: 5.6452, Avg Acc: 0.0588 +INFO:local_logger:Epoch[007/300], Step[0150/1602], Avg Loss: 5.6878, Avg Acc: 0.0495 +INFO:master_logger:Epoch[007/300], Step[0150/1602], Avg Loss: 5.6642, Avg Acc: 0.0523 +INFO:local_logger:Epoch[007/300], Step[0200/1602], Avg Loss: 5.6554, Avg Acc: 0.0505 +INFO:local_logger:Epoch[007/300], Step[0200/1602], Avg Loss: 5.6762, Avg Acc: 0.0515 +INFO:local_logger:Epoch[007/300], Step[0200/1602], Avg Loss: 5.6437, Avg Acc: 0.0589 +INFO:local_logger:Epoch[007/300], Step[0200/1602], Avg Loss: 5.6746, Avg Acc: 0.0511 +INFO:master_logger:Epoch[007/300], Step[0200/1602], Avg Loss: 5.6625, Avg Acc: 0.0530 +INFO:local_logger:Epoch[007/300], Step[0250/1602], Avg Loss: 5.6457, Avg Acc: 0.0511 +INFO:local_logger:Epoch[007/300], Step[0250/1602], Avg Loss: 5.6656, Avg Acc: 0.0543 +INFO:local_logger:Epoch[007/300], Step[0250/1602], 
Avg Loss: 5.6839, Avg Acc: 0.0525 +INFO:local_logger:Epoch[007/300], Step[0250/1602], Avg Loss: 5.6411, Avg Acc: 0.0573 +INFO:master_logger:Epoch[007/300], Step[0250/1602], Avg Loss: 5.6591, Avg Acc: 0.0538 +INFO:local_logger:Epoch[007/300], Step[0300/1602], Avg Loss: 5.6618, Avg Acc: 0.0538 +INFO:local_logger:Epoch[007/300], Step[0300/1602], Avg Loss: 5.6668, Avg Acc: 0.0523 +INFO:local_logger:Epoch[007/300], Step[0300/1602], Avg Loss: 5.6497, Avg Acc: 0.0510 +INFO:local_logger:Epoch[007/300], Step[0300/1602], Avg Loss: 5.6371, Avg Acc: 0.0577 +INFO:master_logger:Epoch[007/300], Step[0300/1602], Avg Loss: 5.6538, Avg Acc: 0.0537 +INFO:local_logger:Epoch[007/300], Step[0350/1602], Avg Loss: 5.6629, Avg Acc: 0.0531 +INFO:local_logger:Epoch[007/300], Step[0350/1602], Avg Loss: 5.6359, Avg Acc: 0.0558 +INFO:local_logger:Epoch[007/300], Step[0350/1602], Avg Loss: 5.6599, Avg Acc: 0.0524 +INFO:local_logger:Epoch[007/300], Step[0350/1602], Avg Loss: 5.6489, Avg Acc: 0.0519 +INFO:master_logger:Epoch[007/300], Step[0350/1602], Avg Loss: 5.6519, Avg Acc: 0.0533 +INFO:local_logger:Epoch[007/300], Step[0400/1602], Avg Loss: 5.6553, Avg Acc: 0.0536 +INFO:local_logger:Epoch[007/300], Step[0400/1602], Avg Loss: 5.6359, Avg Acc: 0.0557 +INFO:local_logger:Epoch[007/300], Step[0400/1602], Avg Loss: 5.6512, Avg Acc: 0.0536 +INFO:master_logger:Epoch[007/300], Step[0400/1602], Avg Loss: 5.6487, Avg Acc: 0.0537 +INFO:local_logger:Epoch[007/300], Step[0400/1602], Avg Loss: 5.6525, Avg Acc: 0.0519 +INFO:local_logger:Epoch[007/300], Step[0450/1602], Avg Loss: 5.6433, Avg Acc: 0.0536 +INFO:local_logger:Epoch[007/300], Step[0450/1602], Avg Loss: 5.6340, Avg Acc: 0.0544 +INFO:local_logger:Epoch[007/300], Step[0450/1602], Avg Loss: 5.6336, Avg Acc: 0.0558 +INFO:master_logger:Epoch[007/300], Step[0450/1602], Avg Loss: 5.6397, Avg Acc: 0.0541 +INFO:local_logger:Epoch[007/300], Step[0450/1602], Avg Loss: 5.6479, Avg Acc: 0.0525 +INFO:local_logger:Epoch[007/300], Step[0500/1602], Avg Loss: 5.6472, Avg Acc: 0.0523 +INFO:local_logger:Epoch[007/300], Step[0500/1602], Avg Loss: 5.6359, Avg Acc: 0.0548 +INFO:local_logger:Epoch[007/300], Step[0500/1602], Avg Loss: 5.6331, Avg Acc: 0.0560 +INFO:master_logger:Epoch[007/300], Step[0500/1602], Avg Loss: 5.6349, Avg Acc: 0.0548 +INFO:local_logger:Epoch[007/300], Step[0500/1602], Avg Loss: 5.6234, Avg Acc: 0.0562 +INFO:local_logger:Epoch[007/300], Step[0550/1602], Avg Loss: 5.6440, Avg Acc: 0.0548 +INFO:local_logger:Epoch[007/300], Step[0550/1602], Avg Loss: 5.6285, Avg Acc: 0.0557 +INFO:local_logger:Epoch[007/300], Step[0550/1602], Avg Loss: 5.6265, Avg Acc: 0.0563 +INFO:local_logger:Epoch[007/300], Step[0550/1602], Avg Loss: 5.6387, Avg Acc: 0.0534 +INFO:master_logger:Epoch[007/300], Step[0550/1602], Avg Loss: 5.6344, Avg Acc: 0.0551 +INFO:local_logger:Epoch[007/300], Step[0600/1602], Avg Loss: 5.6330, Avg Acc: 0.0547 +INFO:local_logger:Epoch[007/300], Step[0600/1602], Avg Loss: 5.6407, Avg Acc: 0.0544 +INFO:local_logger:Epoch[007/300], Step[0600/1602], Avg Loss: 5.6295, Avg Acc: 0.0557 +INFO:local_logger:Epoch[007/300], Step[0600/1602], Avg Loss: 5.6256, Avg Acc: 0.0570 +INFO:master_logger:Epoch[007/300], Step[0600/1602], Avg Loss: 5.6322, Avg Acc: 0.0554 +INFO:local_logger:Epoch[007/300], Step[0650/1602], Avg Loss: 5.6344, Avg Acc: 0.0549 +INFO:local_logger:Epoch[007/300], Step[0650/1602], Avg Loss: 5.6205, Avg Acc: 0.0572 +INFO:local_logger:Epoch[007/300], Step[0650/1602], Avg Loss: 5.6268, Avg Acc: 0.0556 +INFO:master_logger:Epoch[007/300], Step[0650/1602], Avg Loss: 5.6255, 
Avg Acc: 0.0559 +INFO:local_logger:Epoch[007/300], Step[0650/1602], Avg Loss: 5.6204, Avg Acc: 0.0557 +INFO:local_logger:Epoch[007/300], Step[0700/1602], Avg Loss: 5.6276, Avg Acc: 0.0554 +INFO:local_logger:Epoch[007/300], Step[0700/1602], Avg Loss: 5.6350, Avg Acc: 0.0552 +INFO:local_logger:Epoch[007/300], Step[0700/1602], Avg Loss: 5.6133, Avg Acc: 0.0558 +INFO:local_logger:Epoch[007/300], Step[0700/1602], Avg Loss: 5.6183, Avg Acc: 0.0571 +INFO:master_logger:Epoch[007/300], Step[0700/1602], Avg Loss: 5.6236, Avg Acc: 0.0559 +INFO:local_logger:Epoch[007/300], Step[0750/1602], Avg Loss: 5.6341, Avg Acc: 0.0552 +INFO:local_logger:Epoch[007/300], Step[0750/1602], Avg Loss: 5.6274, Avg Acc: 0.0556 +INFO:local_logger:Epoch[007/300], Step[0750/1602], Avg Loss: 5.6158, Avg Acc: 0.0560 +INFO:local_logger:Epoch[007/300], Step[0750/1602], Avg Loss: 5.6131, Avg Acc: 0.0573 +INFO:master_logger:Epoch[007/300], Step[0750/1602], Avg Loss: 5.6226, Avg Acc: 0.0560 +INFO:local_logger:Epoch[007/300], Step[0800/1602], Avg Loss: 5.6290, Avg Acc: 0.0558 +INFO:local_logger:Epoch[007/300], Step[0800/1602], Avg Loss: 5.6081, Avg Acc: 0.0574 +INFO:local_logger:Epoch[007/300], Step[0800/1602], Avg Loss: 5.6161, Avg Acc: 0.0558 +INFO:local_logger:Epoch[007/300], Step[0800/1602], Avg Loss: 5.6243, Avg Acc: 0.0561 +INFO:master_logger:Epoch[007/300], Step[0800/1602], Avg Loss: 5.6194, Avg Acc: 0.0563 +INFO:local_logger:Epoch[007/300], Step[0850/1602], Avg Loss: 5.6288, Avg Acc: 0.0557 +INFO:local_logger:Epoch[007/300], Step[0850/1602], Avg Loss: 5.6215, Avg Acc: 0.0561 +INFO:local_logger:Epoch[007/300], Step[0850/1602], Avg Loss: 5.6168, Avg Acc: 0.0558 +INFO:local_logger:Epoch[007/300], Step[0850/1602], Avg Loss: 5.6095, Avg Acc: 0.0570 +INFO:master_logger:Epoch[007/300], Step[0850/1602], Avg Loss: 5.6191, Avg Acc: 0.0562 +INFO:local_logger:Epoch[007/300], Step[0900/1602], Avg Loss: 5.6216, Avg Acc: 0.0556 +INFO:local_logger:Epoch[007/300], Step[0900/1602], Avg Loss: 5.6175, Avg Acc: 0.0564 +INFO:local_logger:Epoch[007/300], Step[0900/1602], Avg Loss: 5.6115, Avg Acc: 0.0570 +INFO:local_logger:Epoch[007/300], Step[0900/1602], Avg Loss: 5.6137, Avg Acc: 0.0564 +INFO:master_logger:Epoch[007/300], Step[0900/1602], Avg Loss: 5.6161, Avg Acc: 0.0564 +INFO:local_logger:Epoch[007/300], Step[0950/1602], Avg Loss: 5.6247, Avg Acc: 0.0556 +INFO:local_logger:Epoch[007/300], Step[0950/1602], Avg Loss: 5.6132, Avg Acc: 0.0569 +INFO:local_logger:Epoch[007/300], Step[0950/1602], Avg Loss: 5.6089, Avg Acc: 0.0569 +INFO:local_logger:Epoch[007/300], Step[0950/1602], Avg Loss: 5.6160, Avg Acc: 0.0566 +INFO:master_logger:Epoch[007/300], Step[0950/1602], Avg Loss: 5.6157, Avg Acc: 0.0565 +INFO:local_logger:Epoch[007/300], Step[1000/1602], Avg Loss: 5.6213, Avg Acc: 0.0560 +INFO:local_logger:Epoch[007/300], Step[1000/1602], Avg Loss: 5.6122, Avg Acc: 0.0569 +INFO:local_logger:Epoch[007/300], Step[1000/1602], Avg Loss: 5.6045, Avg Acc: 0.0578 +INFO:local_logger:Epoch[007/300], Step[1000/1602], Avg Loss: 5.6128, Avg Acc: 0.0572 +INFO:master_logger:Epoch[007/300], Step[1000/1602], Avg Loss: 5.6127, Avg Acc: 0.0570 +INFO:local_logger:Epoch[007/300], Step[1050/1602], Avg Loss: 5.6158, Avg Acc: 0.0565 +INFO:local_logger:Epoch[007/300], Step[1050/1602], Avg Loss: 5.6122, Avg Acc: 0.0571 +INFO:local_logger:Epoch[007/300], Step[1050/1602], Avg Loss: 5.6038, Avg Acc: 0.0577 +INFO:local_logger:Epoch[007/300], Step[1050/1602], Avg Loss: 5.6105, Avg Acc: 0.0571 +INFO:master_logger:Epoch[007/300], Step[1050/1602], Avg Loss: 5.6106, Avg Acc: 0.0571 
+INFO:local_logger:Epoch[007/300], Step[1100/1602], Avg Loss: 5.6130, Avg Acc: 0.0564 +INFO:local_logger:Epoch[007/300], Step[1100/1602], Avg Loss: 5.6081, Avg Acc: 0.0570 +INFO:local_logger:Epoch[007/300], Step[1100/1602], Avg Loss: 5.6094, Avg Acc: 0.0572 +INFO:local_logger:Epoch[007/300], Step[1100/1602], Avg Loss: 5.6053, Avg Acc: 0.0578 +INFO:master_logger:Epoch[007/300], Step[1100/1602], Avg Loss: 5.6089, Avg Acc: 0.0571 +INFO:local_logger:Epoch[007/300], Step[1150/1602], Avg Loss: 5.6041, Avg Acc: 0.0573 +INFO:local_logger:Epoch[007/300], Step[1150/1602], Avg Loss: 5.6109, Avg Acc: 0.0562 +INFO:local_logger:Epoch[007/300], Step[1150/1602], Avg Loss: 5.6052, Avg Acc: 0.0576 +INFO:local_logger:Epoch[007/300], Step[1150/1602], Avg Loss: 5.5998, Avg Acc: 0.0580 +INFO:master_logger:Epoch[007/300], Step[1150/1602], Avg Loss: 5.6050, Avg Acc: 0.0573 +INFO:local_logger:Epoch[007/300], Step[1200/1602], Avg Loss: 5.6070, Avg Acc: 0.0571 +INFO:local_logger:Epoch[007/300], Step[1200/1602], Avg Loss: 5.6007, Avg Acc: 0.0579 +INFO:local_logger:Epoch[007/300], Step[1200/1602], Avg Loss: 5.6026, Avg Acc: 0.0579 +INFO:local_logger:Epoch[007/300], Step[1200/1602], Avg Loss: 5.6016, Avg Acc: 0.0580 +INFO:master_logger:Epoch[007/300], Step[1200/1602], Avg Loss: 5.6030, Avg Acc: 0.0577 +INFO:local_logger:Epoch[007/300], Step[1250/1602], Avg Loss: 5.6088, Avg Acc: 0.0569 +INFO:local_logger:Epoch[007/300], Step[1250/1602], Avg Loss: 5.6006, Avg Acc: 0.0579 +INFO:local_logger:Epoch[007/300], Step[1250/1602], Avg Loss: 5.6013, Avg Acc: 0.0584 +INFO:local_logger:Epoch[007/300], Step[1250/1602], Avg Loss: 5.6011, Avg Acc: 0.0584 +INFO:master_logger:Epoch[007/300], Step[1250/1602], Avg Loss: 5.6030, Avg Acc: 0.0579 +INFO:local_logger:Epoch[007/300], Step[1300/1602], Avg Loss: 5.5981, Avg Acc: 0.0583 +INFO:local_logger:Epoch[007/300], Step[1300/1602], Avg Loss: 5.6035, Avg Acc: 0.0575 +INFO:local_logger:Epoch[007/300], Step[1300/1602], Avg Loss: 5.5959, Avg Acc: 0.0583 +INFO:local_logger:Epoch[007/300], Step[1300/1602], Avg Loss: 5.5946, Avg Acc: 0.0583 +INFO:master_logger:Epoch[007/300], Step[1300/1602], Avg Loss: 5.5980, Avg Acc: 0.0581 +INFO:local_logger:Epoch[007/300], Step[1350/1602], Avg Loss: 5.5913, Avg Acc: 0.0585 +INFO:local_logger:Epoch[007/300], Step[1350/1602], Avg Loss: 5.5932, Avg Acc: 0.0584 +INFO:local_logger:Epoch[007/300], Step[1350/1602], Avg Loss: 5.5975, Avg Acc: 0.0575 +INFO:local_logger:Epoch[007/300], Step[1350/1602], Avg Loss: 5.5970, Avg Acc: 0.0586 +INFO:master_logger:Epoch[007/300], Step[1350/1602], Avg Loss: 5.5948, Avg Acc: 0.0582 +INFO:local_logger:Epoch[007/300], Step[1400/1602], Avg Loss: 5.5955, Avg Acc: 0.0578 +INFO:local_logger:Epoch[007/300], Step[1400/1602], Avg Loss: 5.5868, Avg Acc: 0.0588 +INFO:local_logger:Epoch[007/300], Step[1400/1602], Avg Loss: 5.5931, Avg Acc: 0.0586 +INFO:local_logger:Epoch[007/300], Step[1400/1602], Avg Loss: 5.5908, Avg Acc: 0.0585 +INFO:master_logger:Epoch[007/300], Step[1400/1602], Avg Loss: 5.5915, Avg Acc: 0.0584 +INFO:local_logger:Epoch[007/300], Step[1450/1602], Avg Loss: 5.5931, Avg Acc: 0.0581 +INFO:local_logger:Epoch[007/300], Step[1450/1602], Avg Loss: 5.5878, Avg Acc: 0.0592 +INFO:local_logger:Epoch[007/300], Step[1450/1602], Avg Loss: 5.5858, Avg Acc: 0.0590 +INFO:local_logger:Epoch[007/300], Step[1450/1602], Avg Loss: 5.5925, Avg Acc: 0.0583 +INFO:master_logger:Epoch[007/300], Step[1450/1602], Avg Loss: 5.5898, Avg Acc: 0.0586 +INFO:local_logger:Epoch[007/300], Step[1500/1602], Avg Loss: 5.5903, Avg Acc: 0.0585 
+INFO:local_logger:Epoch[007/300], Step[1500/1602], Avg Loss: 5.5910, Avg Acc: 0.0584 +INFO:local_logger:Epoch[007/300], Step[1500/1602], Avg Loss: 5.5867, Avg Acc: 0.0593 +INFO:local_logger:Epoch[007/300], Step[1500/1602], Avg Loss: 5.5872, Avg Acc: 0.0591 +INFO:master_logger:Epoch[007/300], Step[1500/1602], Avg Loss: 5.5888, Avg Acc: 0.0588 +INFO:local_logger:Epoch[007/300], Step[1550/1602], Avg Loss: 5.5856, Avg Acc: 0.0595 +INFO:local_logger:Epoch[007/300], Step[1550/1602], Avg Loss: 5.5853, Avg Acc: 0.0592 +INFO:local_logger:Epoch[007/300], Step[1550/1602], Avg Loss: 5.5874, Avg Acc: 0.0587 +INFO:local_logger:Epoch[007/300], Step[1550/1602], Avg Loss: 5.5885, Avg Acc: 0.0587 +INFO:master_logger:Epoch[007/300], Step[1550/1602], Avg Loss: 5.5867, Avg Acc: 0.0590 +INFO:local_logger:Epoch[007/300], Step[1600/1602], Avg Loss: 5.5804, Avg Acc: 0.0597 +INFO:local_logger:Epoch[007/300], Step[1600/1602], Avg Loss: 5.5869, Avg Acc: 0.0587 +INFO:local_logger:Epoch[007/300], Step[1600/1602], Avg Loss: 5.5826, Avg Acc: 0.0592 +INFO:master_logger:Epoch[007/300], Step[1600/1602], Avg Loss: 5.5830, Avg Acc: 0.0593 +INFO:local_logger:Epoch[007/300], Step[1600/1602], Avg Loss: 5.5821, Avg Acc: 0.0597 +INFO:local_logger:----- Epoch[007/300], Train Loss: 5.5822, Train Acc: 0.0597, time: 3712.03 +INFO:local_logger:Now training epoch 8. LR=0.000391 +INFO:local_logger:----- Epoch[007/300], Train Loss: 5.5826, Train Acc: 0.0592, time: 3711.76 +INFO:master_logger:----- Epoch[007/300], Train Loss: 5.5830, Train Acc: 0.0593, time: 3711.76 +INFO:local_logger:----- Epoch[007/300], Train Loss: 5.5804, Train Acc: 0.0597, time: 3712.01 +INFO:local_logger:Now training epoch 8. LR=0.000391 +INFO:local_logger:----- Epoch[007/300], Train Loss: 5.5869, Train Acc: 0.0587, time: 3712.03 +INFO:local_logger:Now training epoch 8. LR=0.000391 +INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-7-Loss-5.58258142911152.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-7-Loss-5.58258142911152.pdopt +INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-7-Loss-5.58258142911152-EMA.pdparams +INFO:local_logger:Now training epoch 8. LR=0.000391 +INFO:master_logger:Now training epoch 8. 
LR=0.000391 +INFO:local_logger:Epoch[008/300], Step[0000/1602], Avg Loss: 6.0431, Avg Acc: 0.0400 +INFO:local_logger:Epoch[008/300], Step[0000/1602], Avg Loss: 5.9718, Avg Acc: 0.0600 +INFO:local_logger:Epoch[008/300], Step[0000/1602], Avg Loss: 6.0673, Avg Acc: 0.0350 +INFO:master_logger:Epoch[008/300], Step[0000/1602], Avg Loss: 5.9514, Avg Acc: 0.0425 +INFO:local_logger:Epoch[008/300], Step[0000/1602], Avg Loss: 5.7235, Avg Acc: 0.0350 +INFO:local_logger:Epoch[008/300], Step[0050/1602], Avg Loss: 5.5452, Avg Acc: 0.0600 +INFO:local_logger:Epoch[008/300], Step[0050/1602], Avg Loss: 5.4363, Avg Acc: 0.0763 +INFO:local_logger:Epoch[008/300], Step[0050/1602], Avg Loss: 5.5076, Avg Acc: 0.0506 +INFO:local_logger:Epoch[008/300], Step[0050/1602], Avg Loss: 5.3581, Avg Acc: 0.0695 +INFO:master_logger:Epoch[008/300], Step[0050/1602], Avg Loss: 5.4618, Avg Acc: 0.0641 +INFO:local_logger:Epoch[008/300], Step[0100/1602], Avg Loss: 5.4129, Avg Acc: 0.0750 +INFO:local_logger:Epoch[008/300], Step[0100/1602], Avg Loss: 5.5176, Avg Acc: 0.0625 +INFO:local_logger:Epoch[008/300], Step[0100/1602], Avg Loss: 5.4789, Avg Acc: 0.0669 +INFO:local_logger:Epoch[008/300], Step[0100/1602], Avg Loss: 5.5076, Avg Acc: 0.0536 +INFO:master_logger:Epoch[008/300], Step[0100/1602], Avg Loss: 5.4793, Avg Acc: 0.0645 +INFO:local_logger:Epoch[008/300], Step[0150/1602], Avg Loss: 5.4450, Avg Acc: 0.0733 +INFO:local_logger:Epoch[008/300], Step[0150/1602], Avg Loss: 5.4676, Avg Acc: 0.0709 +INFO:master_logger:Epoch[008/300], Step[0150/1602], Avg Loss: 5.4876, Avg Acc: 0.0657 +INFO:local_logger:Epoch[008/300], Step[0150/1602], Avg Loss: 5.5319, Avg Acc: 0.0614 +INFO:local_logger:Epoch[008/300], Step[0150/1602], Avg Loss: 5.5058, Avg Acc: 0.0570 +INFO:local_logger:Epoch[008/300], Step[0200/1602], Avg Loss: 5.5286, Avg Acc: 0.0631 +INFO:local_logger:Epoch[008/300], Step[0200/1602], Avg Loss: 5.4463, Avg Acc: 0.0731 +INFO:local_logger:Epoch[008/300], Step[0200/1602], Avg Loss: 5.4778, Avg Acc: 0.0641 +INFO:local_logger:Epoch[008/300], Step[0200/1602], Avg Loss: 5.4600, Avg Acc: 0.0716 +INFO:master_logger:Epoch[008/300], Step[0200/1602], Avg Loss: 5.4782, Avg Acc: 0.0680 +INFO:local_logger:Epoch[008/300], Step[0250/1602], Avg Loss: 5.4767, Avg Acc: 0.0707 +INFO:local_logger:Epoch[008/300], Step[0250/1602], Avg Loss: 5.4506, Avg Acc: 0.0739 +INFO:local_logger:Epoch[008/300], Step[0250/1602], Avg Loss: 5.5168, Avg Acc: 0.0610 +INFO:local_logger:Epoch[008/300], Step[0250/1602], Avg Loss: 5.4940, Avg Acc: 0.0637 +INFO:master_logger:Epoch[008/300], Step[0250/1602], Avg Loss: 5.4845, Avg Acc: 0.0673 +INFO:local_logger:Epoch[008/300], Step[0300/1602], Avg Loss: 5.4925, Avg Acc: 0.0651 +INFO:local_logger:Epoch[008/300], Step[0300/1602], Avg Loss: 5.4660, Avg Acc: 0.0728 +INFO:local_logger:Epoch[008/300], Step[0300/1602], Avg Loss: 5.5065, Avg Acc: 0.0619 +INFO:local_logger:Epoch[008/300], Step[0300/1602], Avg Loss: 5.4668, Avg Acc: 0.0716 +INFO:master_logger:Epoch[008/300], Step[0300/1602], Avg Loss: 5.4830, Avg Acc: 0.0679 +INFO:local_logger:Epoch[008/300], Step[0350/1602], Avg Loss: 5.4707, Avg Acc: 0.0699 +INFO:local_logger:Epoch[008/300], Step[0350/1602], Avg Loss: 5.4690, Avg Acc: 0.0695 +INFO:local_logger:Epoch[008/300], Step[0350/1602], Avg Loss: 5.4838, Avg Acc: 0.0668 +INFO:local_logger:Epoch[008/300], Step[0350/1602], Avg Loss: 5.4970, Avg Acc: 0.0622 +INFO:master_logger:Epoch[008/300], Step[0350/1602], Avg Loss: 5.4801, Avg Acc: 0.0671 +INFO:local_logger:Epoch[008/300], Step[0400/1602], Avg Loss: 5.4900, Avg Acc: 0.0626 
+INFO:local_logger:Epoch[008/300], Step[0400/1602], Avg Loss: 5.4566, Avg Acc: 0.0688 +INFO:local_logger:Epoch[008/300], Step[0400/1602], Avg Loss: 5.4892, Avg Acc: 0.0662 +INFO:local_logger:Epoch[008/300], Step[0400/1602], Avg Loss: 5.4733, Avg Acc: 0.0696 +INFO:master_logger:Epoch[008/300], Step[0400/1602], Avg Loss: 5.4773, Avg Acc: 0.0668 +INFO:local_logger:Epoch[008/300], Step[0450/1602], Avg Loss: 5.4698, Avg Acc: 0.0698 +INFO:local_logger:Epoch[008/300], Step[0450/1602], Avg Loss: 5.4532, Avg Acc: 0.0692 +INFO:local_logger:Epoch[008/300], Step[0450/1602], Avg Loss: 5.4788, Avg Acc: 0.0672 +INFO:local_logger:Epoch[008/300], Step[0450/1602], Avg Loss: 5.4849, Avg Acc: 0.0647 +INFO:master_logger:Epoch[008/300], Step[0450/1602], Avg Loss: 5.4717, Avg Acc: 0.0677 +INFO:local_logger:Epoch[008/300], Step[0500/1602], Avg Loss: 5.4714, Avg Acc: 0.0680 +INFO:local_logger:Epoch[008/300], Step[0500/1602], Avg Loss: 5.4778, Avg Acc: 0.0652 +INFO:local_logger:Epoch[008/300], Step[0500/1602], Avg Loss: 5.4599, Avg Acc: 0.0690 +INFO:local_logger:Epoch[008/300], Step[0500/1602], Avg Loss: 5.4637, Avg Acc: 0.0703 +INFO:master_logger:Epoch[008/300], Step[0500/1602], Avg Loss: 5.4682, Avg Acc: 0.0681 +INFO:local_logger:Epoch[008/300], Step[0550/1602], Avg Loss: 5.4565, Avg Acc: 0.0690 +INFO:master_logger:Epoch[008/300], Step[0550/1602], Avg Loss: 5.4650, Avg Acc: 0.0684 +INFO:local_logger:Epoch[008/300], Step[0550/1602], Avg Loss: 5.4684, Avg Acc: 0.0655 +INFO:local_logger:Epoch[008/300], Step[0550/1602], Avg Loss: 5.4697, Avg Acc: 0.0682 +INFO:local_logger:Epoch[008/300], Step[0550/1602], Avg Loss: 5.4654, Avg Acc: 0.0708 +INFO:local_logger:Epoch[008/300], Step[0600/1602], Avg Loss: 5.4653, Avg Acc: 0.0699 +INFO:local_logger:Epoch[008/300], Step[0600/1602], Avg Loss: 5.4552, Avg Acc: 0.0686 +INFO:master_logger:Epoch[008/300], Step[0600/1602], Avg Loss: 5.4617, Avg Acc: 0.0681 +INFO:local_logger:Epoch[008/300], Step[0600/1602], Avg Loss: 5.4608, Avg Acc: 0.0658 +INFO:local_logger:Epoch[008/300], Step[0600/1602], Avg Loss: 5.4656, Avg Acc: 0.0682 +INFO:local_logger:Epoch[008/300], Step[0650/1602], Avg Loss: 5.4666, Avg Acc: 0.0696 +INFO:local_logger:Epoch[008/300], Step[0650/1602], Avg Loss: 5.4592, Avg Acc: 0.0689 +INFO:local_logger:Epoch[008/300], Step[0650/1602], Avg Loss: 5.4599, Avg Acc: 0.0655 +INFO:local_logger:Epoch[008/300], Step[0650/1602], Avg Loss: 5.4630, Avg Acc: 0.0678 +INFO:master_logger:Epoch[008/300], Step[0650/1602], Avg Loss: 5.4622, Avg Acc: 0.0680 +INFO:local_logger:Epoch[008/300], Step[0700/1602], Avg Loss: 5.4569, Avg Acc: 0.0693 +INFO:master_logger:Epoch[008/300], Step[0700/1602], Avg Loss: 5.4634, Avg Acc: 0.0680 +INFO:local_logger:Epoch[008/300], Step[0700/1602], Avg Loss: 5.4742, Avg Acc: 0.0693 +INFO:local_logger:Epoch[008/300], Step[0700/1602], Avg Loss: 5.4634, Avg Acc: 0.0683 +INFO:local_logger:Epoch[008/300], Step[0700/1602], Avg Loss: 5.4593, Avg Acc: 0.0652 +INFO:local_logger:Epoch[008/300], Step[0750/1602], Avg Loss: 5.4507, Avg Acc: 0.0693 +INFO:local_logger:Epoch[008/300], Step[0750/1602], Avg Loss: 5.4571, Avg Acc: 0.0652 +INFO:local_logger:Epoch[008/300], Step[0750/1602], Avg Loss: 5.4603, Avg Acc: 0.0679 +INFO:local_logger:Epoch[008/300], Step[0750/1602], Avg Loss: 5.4770, Avg Acc: 0.0692 +INFO:master_logger:Epoch[008/300], Step[0750/1602], Avg Loss: 5.4613, Avg Acc: 0.0679 +INFO:local_logger:Epoch[008/300], Step[0800/1602], Avg Loss: 5.4531, Avg Acc: 0.0695 +INFO:local_logger:Epoch[008/300], Step[0800/1602], Avg Loss: 5.4556, Avg Acc: 0.0660 
+INFO:local_logger:Epoch[008/300], Step[0800/1602], Avg Loss: 5.4580, Avg Acc: 0.0690 +INFO:local_logger:Epoch[008/300], Step[0800/1602], Avg Loss: 5.4749, Avg Acc: 0.0688 +INFO:master_logger:Epoch[008/300], Step[0800/1602], Avg Loss: 5.4604, Avg Acc: 0.0683 +INFO:local_logger:Epoch[008/300], Step[0850/1602], Avg Loss: 5.4489, Avg Acc: 0.0691 +INFO:local_logger:Epoch[008/300], Step[0850/1602], Avg Loss: 5.4507, Avg Acc: 0.0691 +INFO:local_logger:Epoch[008/300], Step[0850/1602], Avg Loss: 5.4502, Avg Acc: 0.0659 +INFO:local_logger:Epoch[008/300], Step[0850/1602], Avg Loss: 5.4740, Avg Acc: 0.0684 +INFO:master_logger:Epoch[008/300], Step[0850/1602], Avg Loss: 5.4560, Avg Acc: 0.0681 +INFO:local_logger:Epoch[008/300], Step[0900/1602], Avg Loss: 5.4468, Avg Acc: 0.0693 +INFO:local_logger:Epoch[008/300], Step[0900/1602], Avg Loss: 5.4507, Avg Acc: 0.0664 +INFO:local_logger:Epoch[008/300], Step[0900/1602], Avg Loss: 5.4393, Avg Acc: 0.0697 +INFO:local_logger:Epoch[008/300], Step[0900/1602], Avg Loss: 5.4714, Avg Acc: 0.0690 +INFO:master_logger:Epoch[008/300], Step[0900/1602], Avg Loss: 5.4520, Avg Acc: 0.0686 +INFO:local_logger:Epoch[008/300], Step[0950/1602], Avg Loss: 5.4460, Avg Acc: 0.0690 +INFO:local_logger:Epoch[008/300], Step[0950/1602], Avg Loss: 5.4502, Avg Acc: 0.0671 +INFO:local_logger:Epoch[008/300], Step[0950/1602], Avg Loss: 5.4387, Avg Acc: 0.0695 +INFO:local_logger:Epoch[008/300], Step[0950/1602], Avg Loss: 5.4702, Avg Acc: 0.0688 +INFO:master_logger:Epoch[008/300], Step[0950/1602], Avg Loss: 5.4513, Avg Acc: 0.0686 +INFO:local_logger:Epoch[008/300], Step[1000/1602], Avg Loss: 5.4438, Avg Acc: 0.0691 +INFO:local_logger:Epoch[008/300], Step[1000/1602], Avg Loss: 5.4698, Avg Acc: 0.0693 +INFO:local_logger:Epoch[008/300], Step[1000/1602], Avg Loss: 5.4319, Avg Acc: 0.0696 +INFO:local_logger:Epoch[008/300], Step[1000/1602], Avg Loss: 5.4473, Avg Acc: 0.0675 +INFO:master_logger:Epoch[008/300], Step[1000/1602], Avg Loss: 5.4482, Avg Acc: 0.0689 +INFO:local_logger:Epoch[008/300], Step[1050/1602], Avg Loss: 5.4424, Avg Acc: 0.0692 +INFO:local_logger:Epoch[008/300], Step[1050/1602], Avg Loss: 5.4268, Avg Acc: 0.0699 +INFO:local_logger:Epoch[008/300], Step[1050/1602], Avg Loss: 5.4425, Avg Acc: 0.0676 +INFO:local_logger:Epoch[008/300], Step[1050/1602], Avg Loss: 5.4696, Avg Acc: 0.0691 +INFO:master_logger:Epoch[008/300], Step[1050/1602], Avg Loss: 5.4453, Avg Acc: 0.0690 +INFO:local_logger:Epoch[008/300], Step[1100/1602], Avg Loss: 5.4390, Avg Acc: 0.0695 +INFO:local_logger:Epoch[008/300], Step[1100/1602], Avg Loss: 5.4662, Avg Acc: 0.0696 +INFO:local_logger:Epoch[008/300], Step[1100/1602], Avg Loss: 5.4416, Avg Acc: 0.0673 +INFO:local_logger:Epoch[008/300], Step[1100/1602], Avg Loss: 5.4242, Avg Acc: 0.0699 +INFO:master_logger:Epoch[008/300], Step[1100/1602], Avg Loss: 5.4428, Avg Acc: 0.0691 +INFO:local_logger:Epoch[008/300], Step[1150/1602], Avg Loss: 5.4621, Avg Acc: 0.0703 +INFO:local_logger:Epoch[008/300], Step[1150/1602], Avg Loss: 5.4393, Avg Acc: 0.0694 +INFO:local_logger:Epoch[008/300], Step[1150/1602], Avg Loss: 5.4243, Avg Acc: 0.0699 +INFO:local_logger:Epoch[008/300], Step[1150/1602], Avg Loss: 5.4410, Avg Acc: 0.0674 +INFO:master_logger:Epoch[008/300], Step[1150/1602], Avg Loss: 5.4417, Avg Acc: 0.0692 +INFO:local_logger:Epoch[008/300], Step[1200/1602], Avg Loss: 5.4220, Avg Acc: 0.0702 +INFO:local_logger:Epoch[008/300], Step[1200/1602], Avg Loss: 5.4379, Avg Acc: 0.0700 +INFO:local_logger:Epoch[008/300], Step[1200/1602], Avg Loss: 5.4364, Avg Acc: 0.0679 
+INFO:local_logger:Epoch[008/300], Step[1200/1602], Avg Loss: 5.4596, Avg Acc: 0.0708 +INFO:master_logger:Epoch[008/300], Step[1200/1602], Avg Loss: 5.4390, Avg Acc: 0.0697 +INFO:local_logger:Epoch[008/300], Step[1250/1602], Avg Loss: 5.4370, Avg Acc: 0.0699 +INFO:master_logger:Epoch[008/300], Step[1250/1602], Avg Loss: 5.4380, Avg Acc: 0.0700 +INFO:local_logger:Epoch[008/300], Step[1250/1602], Avg Loss: 5.4376, Avg Acc: 0.0680 +INFO:local_logger:Epoch[008/300], Step[1250/1602], Avg Loss: 5.4555, Avg Acc: 0.0713 +INFO:local_logger:Epoch[008/300], Step[1250/1602], Avg Loss: 5.4218, Avg Acc: 0.0707 +INFO:local_logger:Epoch[008/300], Step[1300/1602], Avg Loss: 5.4354, Avg Acc: 0.0701 +INFO:local_logger:Epoch[008/300], Step[1300/1602], Avg Loss: 5.4494, Avg Acc: 0.0716 +INFO:local_logger:Epoch[008/300], Step[1300/1602], Avg Loss: 5.4361, Avg Acc: 0.0679 +INFO:local_logger:Epoch[008/300], Step[1300/1602], Avg Loss: 5.4223, Avg Acc: 0.0709 +INFO:master_logger:Epoch[008/300], Step[1300/1602], Avg Loss: 5.4358, Avg Acc: 0.0701 +INFO:local_logger:Epoch[008/300], Step[1350/1602], Avg Loss: 5.4481, Avg Acc: 0.0719 +INFO:local_logger:Epoch[008/300], Step[1350/1602], Avg Loss: 5.4332, Avg Acc: 0.0704 +INFO:local_logger:Epoch[008/300], Step[1350/1602], Avg Loss: 5.4338, Avg Acc: 0.0684 +INFO:local_logger:Epoch[008/300], Step[1350/1602], Avg Loss: 5.4217, Avg Acc: 0.0708 +INFO:master_logger:Epoch[008/300], Step[1350/1602], Avg Loss: 5.4342, Avg Acc: 0.0704 +INFO:local_logger:Epoch[008/300], Step[1400/1602], Avg Loss: 5.4311, Avg Acc: 0.0703 +INFO:local_logger:Epoch[008/300], Step[1400/1602], Avg Loss: 5.4311, Avg Acc: 0.0685 +INFO:local_logger:Epoch[008/300], Step[1400/1602], Avg Loss: 5.4203, Avg Acc: 0.0714 +INFO:local_logger:Epoch[008/300], Step[1400/1602], Avg Loss: 5.4444, Avg Acc: 0.0719 +INFO:master_logger:Epoch[008/300], Step[1400/1602], Avg Loss: 5.4317, Avg Acc: 0.0705 +INFO:local_logger:Epoch[008/300], Step[1450/1602], Avg Loss: 5.4271, Avg Acc: 0.0706 +INFO:local_logger:Epoch[008/300], Step[1450/1602], Avg Loss: 5.4297, Avg Acc: 0.0688 +INFO:local_logger:Epoch[008/300], Step[1450/1602], Avg Loss: 5.4159, Avg Acc: 0.0723 +INFO:local_logger:Epoch[008/300], Step[1450/1602], Avg Loss: 5.4397, Avg Acc: 0.0723 +INFO:master_logger:Epoch[008/300], Step[1450/1602], Avg Loss: 5.4281, Avg Acc: 0.0710 +INFO:local_logger:Epoch[008/300], Step[1500/1602], Avg Loss: 5.4272, Avg Acc: 0.0706 +INFO:local_logger:Epoch[008/300], Step[1500/1602], Avg Loss: 5.4146, Avg Acc: 0.0723 +INFO:local_logger:Epoch[008/300], Step[1500/1602], Avg Loss: 5.4296, Avg Acc: 0.0687 +INFO:master_logger:Epoch[008/300], Step[1500/1602], Avg Loss: 5.4270, Avg Acc: 0.0712 +INFO:local_logger:Epoch[008/300], Step[1500/1602], Avg Loss: 5.4367, Avg Acc: 0.0730 +INFO:local_logger:Epoch[008/300], Step[1550/1602], Avg Loss: 5.4357, Avg Acc: 0.0731 +INFO:local_logger:Epoch[008/300], Step[1550/1602], Avg Loss: 5.4249, Avg Acc: 0.0711 +INFO:local_logger:Epoch[008/300], Step[1550/1602], Avg Loss: 5.4297, Avg Acc: 0.0685 +INFO:local_logger:Epoch[008/300], Step[1550/1602], Avg Loss: 5.4150, Avg Acc: 0.0723 +INFO:master_logger:Epoch[008/300], Step[1550/1602], Avg Loss: 5.4263, Avg Acc: 0.0713 +INFO:local_logger:Epoch[008/300], Step[1600/1602], Avg Loss: 5.4137, Avg Acc: 0.0727 +INFO:local_logger:Epoch[008/300], Step[1600/1602], Avg Loss: 5.4246, Avg Acc: 0.0715 +INFO:local_logger:Epoch[008/300], Step[1600/1602], Avg Loss: 5.4253, Avg Acc: 0.0688 +INFO:master_logger:Epoch[008/300], Step[1600/1602], Avg Loss: 5.4249, Avg Acc: 0.0714 
+INFO:local_logger:Epoch[008/300], Step[1600/1602], Avg Loss: 5.4362, Avg Acc: 0.0726 +INFO:local_logger:----- Epoch[008/300], Train Loss: 5.4253, Train Acc: 0.0688, time: 3711.39 +INFO:local_logger:Now training epoch 9. LR=0.000390 +INFO:local_logger:----- Epoch[008/300], Train Loss: 5.4362, Train Acc: 0.0726, time: 3711.51 +INFO:local_logger:Now training epoch 9. LR=0.000390 +INFO:local_logger:----- Epoch[008/300], Train Loss: 5.4244, Train Acc: 0.0715, time: 3711.66 +INFO:master_logger:----- Epoch[008/300], Train Loss: 5.4249, Train Acc: 0.0714, time: 3711.66 +INFO:local_logger:----- Epoch[008/300], Train Loss: 5.4138, Train Acc: 0.0727, time: 3711.92 +INFO:local_logger:Now training epoch 9. LR=0.000390 +INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-8-Loss-5.424448791467366.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-8-Loss-5.424448791467366.pdopt +INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-8-Loss-5.424448791467366-EMA.pdparams +INFO:local_logger:Now training epoch 9. LR=0.000390 +INFO:master_logger:Now training epoch 9. LR=0.000390 +INFO:local_logger:Epoch[009/300], Step[0000/1602], Avg Loss: 5.1328, Avg Acc: 0.1200 +INFO:local_logger:Epoch[009/300], Step[0000/1602], Avg Loss: 5.6694, Avg Acc: 0.0250 +INFO:master_logger:Epoch[009/300], Step[0000/1602], Avg Loss: 5.5227, Avg Acc: 0.0575 +INFO:local_logger:Epoch[009/300], Step[0000/1602], Avg Loss: 5.6969, Avg Acc: 0.0500 +INFO:local_logger:Epoch[009/300], Step[0000/1602], Avg Loss: 5.5919, Avg Acc: 0.0350 +INFO:local_logger:Epoch[009/300], Step[0050/1602], Avg Loss: 5.3713, Avg Acc: 0.0745 +INFO:local_logger:Epoch[009/300], Step[0050/1602], Avg Loss: 5.3975, Avg Acc: 0.0836 +INFO:local_logger:Epoch[009/300], Step[0050/1602], Avg Loss: 5.3880, Avg Acc: 0.0814 +INFO:local_logger:Epoch[009/300], Step[0050/1602], Avg Loss: 5.3826, Avg Acc: 0.0868 +INFO:master_logger:Epoch[009/300], Step[0050/1602], Avg Loss: 5.3849, Avg Acc: 0.0816 +INFO:local_logger:Epoch[009/300], Step[0100/1602], Avg Loss: 5.3973, Avg Acc: 0.0839 +INFO:local_logger:Epoch[009/300], Step[0100/1602], Avg Loss: 5.4262, Avg Acc: 0.0760 +INFO:local_logger:Epoch[009/300], Step[0100/1602], Avg Loss: 5.3431, Avg Acc: 0.0794 +INFO:local_logger:Epoch[009/300], Step[0100/1602], Avg Loss: 5.3891, Avg Acc: 0.0725 +INFO:master_logger:Epoch[009/300], Step[0100/1602], Avg Loss: 5.3889, Avg Acc: 0.0779 +INFO:local_logger:Epoch[009/300], Step[0150/1602], Avg Loss: 5.3929, Avg Acc: 0.0748 +INFO:local_logger:Epoch[009/300], Step[0150/1602], Avg Loss: 5.3747, Avg Acc: 0.0865 +INFO:local_logger:Epoch[009/300], Step[0150/1602], Avg Loss: 5.3604, Avg Acc: 0.0822 +INFO:local_logger:Epoch[009/300], Step[0150/1602], Avg Loss: 5.3214, Avg Acc: 0.0764 +INFO:master_logger:Epoch[009/300], Step[0150/1602], Avg Loss: 5.3623, Avg Acc: 0.0800 +INFO:local_logger:Epoch[009/300], Step[0200/1602], Avg Loss: 5.3383, Avg Acc: 0.0839 +INFO:local_logger:Epoch[009/300], Step[0200/1602], Avg Loss: 5.3225, Avg Acc: 0.0775 +INFO:local_logger:Epoch[009/300], Step[0200/1602], Avg Loss: 5.3827, Avg Acc: 0.0772 +INFO:local_logger:Epoch[009/300], Step[0200/1602], Avg Loss: 5.3480, Avg Acc: 0.0801 +INFO:master_logger:Epoch[009/300], Step[0200/1602], Avg Loss: 5.3479, Avg Acc: 0.0797 +INFO:local_logger:Epoch[009/300], Step[0250/1602], Avg Loss: 5.3487, Avg Acc: 0.0829 +INFO:local_logger:Epoch[009/300], Step[0250/1602], Avg Loss: 5.3019, Avg Acc: 0.0780 +INFO:local_logger:Epoch[009/300], Step[0250/1602], 
Avg Loss: 5.3663, Avg Acc: 0.0777 +INFO:master_logger:Epoch[009/300], Step[0250/1602], Avg Loss: 5.3444, Avg Acc: 0.0789 +INFO:local_logger:Epoch[009/300], Step[0250/1602], Avg Loss: 5.3606, Avg Acc: 0.0769 +INFO:local_logger:Epoch[009/300], Step[0300/1602], Avg Loss: 5.3727, Avg Acc: 0.0794 +INFO:local_logger:Epoch[009/300], Step[0300/1602], Avg Loss: 5.3481, Avg Acc: 0.0832 +INFO:local_logger:Epoch[009/300], Step[0300/1602], Avg Loss: 5.3493, Avg Acc: 0.0777 +INFO:master_logger:Epoch[009/300], Step[0300/1602], Avg Loss: 5.3455, Avg Acc: 0.0802 +INFO:local_logger:Epoch[009/300], Step[0300/1602], Avg Loss: 5.3120, Avg Acc: 0.0805 +INFO:local_logger:Epoch[009/300], Step[0350/1602], Avg Loss: 5.3403, Avg Acc: 0.0833 +INFO:local_logger:Epoch[009/300], Step[0350/1602], Avg Loss: 5.3481, Avg Acc: 0.0820 +INFO:local_logger:Epoch[009/300], Step[0350/1602], Avg Loss: 5.3130, Avg Acc: 0.0793 +INFO:master_logger:Epoch[009/300], Step[0350/1602], Avg Loss: 5.3393, Avg Acc: 0.0806 +INFO:local_logger:Epoch[009/300], Step[0350/1602], Avg Loss: 5.3560, Avg Acc: 0.0779 +INFO:local_logger:Epoch[009/300], Step[0400/1602], Avg Loss: 5.3395, Avg Acc: 0.0825 +INFO:local_logger:Epoch[009/300], Step[0400/1602], Avg Loss: 5.3494, Avg Acc: 0.0802 +INFO:local_logger:Epoch[009/300], Step[0400/1602], Avg Loss: 5.3663, Avg Acc: 0.0787 +INFO:local_logger:Epoch[009/300], Step[0400/1602], Avg Loss: 5.3119, Avg Acc: 0.0783 +INFO:master_logger:Epoch[009/300], Step[0400/1602], Avg Loss: 5.3418, Avg Acc: 0.0799 +INFO:local_logger:Epoch[009/300], Step[0450/1602], Avg Loss: 5.3309, Avg Acc: 0.0844 +INFO:local_logger:Epoch[009/300], Step[0450/1602], Avg Loss: 5.3075, Avg Acc: 0.0784 +INFO:local_logger:Epoch[009/300], Step[0450/1602], Avg Loss: 5.3299, Avg Acc: 0.0816 +INFO:master_logger:Epoch[009/300], Step[0450/1602], Avg Loss: 5.3316, Avg Acc: 0.0811 +INFO:local_logger:Epoch[009/300], Step[0450/1602], Avg Loss: 5.3579, Avg Acc: 0.0798 +INFO:local_logger:Epoch[009/300], Step[0500/1602], Avg Loss: 5.3257, Avg Acc: 0.0847 +INFO:master_logger:Epoch[009/300], Step[0500/1602], Avg Loss: 5.3298, Avg Acc: 0.0809 +INFO:local_logger:Epoch[009/300], Step[0500/1602], Avg Loss: 5.3630, Avg Acc: 0.0791 +INFO:local_logger:Epoch[009/300], Step[0500/1602], Avg Loss: 5.3254, Avg Acc: 0.0814 +INFO:local_logger:Epoch[009/300], Step[0500/1602], Avg Loss: 5.3050, Avg Acc: 0.0784 +INFO:local_logger:Epoch[009/300], Step[0550/1602], Avg Loss: 5.3207, Avg Acc: 0.0833 +INFO:local_logger:Epoch[009/300], Step[0550/1602], Avg Loss: 5.3643, Avg Acc: 0.0797 +INFO:local_logger:Epoch[009/300], Step[0550/1602], Avg Loss: 5.3072, Avg Acc: 0.0797 +INFO:master_logger:Epoch[009/300], Step[0550/1602], Avg Loss: 5.3249, Avg Acc: 0.0816 +INFO:local_logger:Epoch[009/300], Step[0550/1602], Avg Loss: 5.3074, Avg Acc: 0.0838 +INFO:local_logger:Epoch[009/300], Step[0600/1602], Avg Loss: 5.3210, Avg Acc: 0.0827 +INFO:local_logger:Epoch[009/300], Step[0600/1602], Avg Loss: 5.3538, Avg Acc: 0.0800 +INFO:local_logger:Epoch[009/300], Step[0600/1602], Avg Loss: 5.3101, Avg Acc: 0.0832 +INFO:master_logger:Epoch[009/300], Step[0600/1602], Avg Loss: 5.3208, Avg Acc: 0.0813 +INFO:local_logger:Epoch[009/300], Step[0600/1602], Avg Loss: 5.2982, Avg Acc: 0.0795 +INFO:local_logger:Epoch[009/300], Step[0650/1602], Avg Loss: 5.3628, Avg Acc: 0.0786 +INFO:local_logger:Epoch[009/300], Step[0650/1602], Avg Loss: 5.3111, Avg Acc: 0.0827 +INFO:local_logger:Epoch[009/300], Step[0650/1602], Avg Loss: 5.3226, Avg Acc: 0.0823 +INFO:local_logger:Epoch[009/300], Step[0650/1602], Avg Loss: 5.3012, 
Avg Acc: 0.0797
[... per-step local_logger/master_logger entries for Epoch 009, Steps 0650-1600, elided ...]
+INFO:master_logger:Epoch[009/300], Step[1600/1602], Avg Loss: 5.3002, Avg Acc: 0.0824
+INFO:master_logger:----- Epoch[009/300], Train Loss: 5.3002, Train Acc: 0.0824, time: 3690.95
+INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-9-Loss-5.297480824286283.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-9-Loss-5.297480824286283.pdopt
+INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-9-Loss-5.297480824286283-EMA.pdparams
+INFO:master_logger:Now training epoch 10. LR=0.000390
[... per-step local_logger/master_logger entries for Epoch 010, Steps 0000-1600, elided ...]
+INFO:master_logger:Epoch[010/300], Step[1600/1602], Avg Loss: 5.1828, Avg Acc: 0.0923
+INFO:master_logger:----- Epoch[010/300], Train Loss: 5.1827, Train Acc: 0.0923, time: 3707.68
+INFO:master_logger:----- Validation after Epoch: 10
[... per-step validation entries for Epoch 010, Val Steps 0000-1550, elided ...]
+INFO:master_logger:Val Step[1550/1563], Avg Loss: 3.3514, Avg Acc@1: 0.3107, Avg Acc@5: 0.5579
+INFO:master_logger:----- Epoch[010/300], Validation Loss: 3.3455, Validation Acc@1: 0.3119, Validation Acc@5: 0.5589, time: 180.49
+INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-10-Loss-5.189446779206341.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-10-Loss-5.189446779206341.pdopt
+INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-10-Loss-5.189446779206341-EMA.pdparams
+INFO:master_logger:Now training epoch 11. LR=0.000390
[... per-step local_logger/master_logger entries for Epoch 011, Steps 0000-1600, elided ...]
+INFO:master_logger:Epoch[011/300], Step[1600/1602], Avg Loss: 5.0997, Avg Acc: 0.0998
+INFO:master_logger:----- Epoch[011/300], Train Loss: 5.0997, Train Acc: 0.0998, time: 3692.16
+INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-11-Loss-5.097786829372229.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-11-Loss-5.097786829372229.pdopt
+INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-11-Loss-5.097786829372229-EMA.pdparams
+INFO:master_logger:Now training epoch 12. LR=0.000390
[... per-step local_logger/master_logger entries for Epoch 012, Steps 0000-1600, elided ...]
+INFO:master_logger:Epoch[012/300], Step[1600/1602], Avg Loss: 4.9783, Avg Acc: 0.1105
+INFO:master_logger:----- Epoch[012/300], Train Loss: 4.9782, Train Acc: 0.1105, time: 3702.42
+INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-12-Loss-4.988746055176861.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-12-Loss-4.988746055176861.pdopt
+INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-12-Loss-4.988746055176861-EMA.pdparams
+INFO:master_logger:Now training epoch 13.
LR=0.000390 +INFO:local_logger:Epoch[013/300], Step[0000/1602], Avg Loss: 5.4218, Avg Acc: 0.0850 +INFO:local_logger:Epoch[013/300], Step[0000/1602], Avg Loss: 4.3067, Avg Acc: 0.2300 +INFO:master_logger:Epoch[013/300], Step[0000/1602], Avg Loss: 4.8928, Avg Acc: 0.1375 +INFO:local_logger:Epoch[013/300], Step[0000/1602], Avg Loss: 5.5367, Avg Acc: 0.0250 +INFO:local_logger:Epoch[013/300], Step[0000/1602], Avg Loss: 4.3059, Avg Acc: 0.2100 +INFO:local_logger:Epoch[013/300], Step[0050/1602], Avg Loss: 4.8193, Avg Acc: 0.1037 +INFO:local_logger:Epoch[013/300], Step[0050/1602], Avg Loss: 4.9565, Avg Acc: 0.1025 +INFO:local_logger:Epoch[013/300], Step[0050/1602], Avg Loss: 4.9447, Avg Acc: 0.1128 +INFO:local_logger:Epoch[013/300], Step[0050/1602], Avg Loss: 4.8669, Avg Acc: 0.1281 +INFO:master_logger:Epoch[013/300], Step[0050/1602], Avg Loss: 4.8968, Avg Acc: 0.1118 +INFO:local_logger:Epoch[013/300], Step[0100/1602], Avg Loss: 4.8175, Avg Acc: 0.0996 +INFO:local_logger:Epoch[013/300], Step[0100/1602], Avg Loss: 4.9573, Avg Acc: 0.1113 +INFO:local_logger:Epoch[013/300], Step[0100/1602], Avg Loss: 4.9418, Avg Acc: 0.1135 +INFO:local_logger:Epoch[013/300], Step[0100/1602], Avg Loss: 4.8894, Avg Acc: 0.1212 +INFO:master_logger:Epoch[013/300], Step[0100/1602], Avg Loss: 4.9015, Avg Acc: 0.1114 +INFO:local_logger:Epoch[013/300], Step[0150/1602], Avg Loss: 4.9458, Avg Acc: 0.1128 +INFO:local_logger:Epoch[013/300], Step[0150/1602], Avg Loss: 4.8927, Avg Acc: 0.1222 +INFO:local_logger:Epoch[013/300], Step[0150/1602], Avg Loss: 4.8889, Avg Acc: 0.1230 +INFO:master_logger:Epoch[013/300], Step[0150/1602], Avg Loss: 4.8917, Avg Acc: 0.1153 +INFO:local_logger:Epoch[013/300], Step[0150/1602], Avg Loss: 4.8393, Avg Acc: 0.1032 +INFO:local_logger:Epoch[013/300], Step[0200/1602], Avg Loss: 4.9495, Avg Acc: 0.1121 +INFO:local_logger:Epoch[013/300], Step[0200/1602], Avg Loss: 4.9047, Avg Acc: 0.1217 +INFO:local_logger:Epoch[013/300], Step[0200/1602], Avg Loss: 4.9001, Avg Acc: 0.1191 +INFO:local_logger:Epoch[013/300], Step[0200/1602], Avg Loss: 4.8188, Avg Acc: 0.1096 +INFO:master_logger:Epoch[013/300], Step[0200/1602], Avg Loss: 4.8933, Avg Acc: 0.1156 +INFO:local_logger:Epoch[013/300], Step[0250/1602], Avg Loss: 4.9373, Avg Acc: 0.1162 +INFO:master_logger:Epoch[013/300], Step[0250/1602], Avg Loss: 4.9016, Avg Acc: 0.1165 +INFO:local_logger:Epoch[013/300], Step[0250/1602], Avg Loss: 4.8962, Avg Acc: 0.1197 +INFO:local_logger:Epoch[013/300], Step[0250/1602], Avg Loss: 4.9232, Avg Acc: 0.1204 +INFO:local_logger:Epoch[013/300], Step[0250/1602], Avg Loss: 4.8496, Avg Acc: 0.1097 +INFO:local_logger:Epoch[013/300], Step[0300/1602], Avg Loss: 4.9404, Avg Acc: 0.1192 +INFO:local_logger:Epoch[013/300], Step[0300/1602], Avg Loss: 4.9361, Avg Acc: 0.1200 +INFO:local_logger:Epoch[013/300], Step[0300/1602], Avg Loss: 4.8982, Avg Acc: 0.1221 +INFO:local_logger:Epoch[013/300], Step[0300/1602], Avg Loss: 4.8689, Avg Acc: 0.1112 +INFO:master_logger:Epoch[013/300], Step[0300/1602], Avg Loss: 4.9109, Avg Acc: 0.1181 +INFO:local_logger:Epoch[013/300], Step[0350/1602], Avg Loss: 4.9269, Avg Acc: 0.1215 +INFO:master_logger:Epoch[013/300], Step[0350/1602], Avg Loss: 4.9085, Avg Acc: 0.1183 +INFO:local_logger:Epoch[013/300], Step[0350/1602], Avg Loss: 4.9225, Avg Acc: 0.1195 +INFO:local_logger:Epoch[013/300], Step[0350/1602], Avg Loss: 4.9084, Avg Acc: 0.1215 +INFO:local_logger:Epoch[013/300], Step[0350/1602], Avg Loss: 4.8759, Avg Acc: 0.1109 +INFO:local_logger:Epoch[013/300], Step[0400/1602], Avg Loss: 4.9297, Avg Acc: 0.1197 
+INFO:local_logger:Epoch[013/300], Step[0400/1602], Avg Loss: 4.9139, Avg Acc: 0.1199 +INFO:local_logger:Epoch[013/300], Step[0400/1602], Avg Loss: 4.9265, Avg Acc: 0.1200 +INFO:master_logger:Epoch[013/300], Step[0400/1602], Avg Loss: 4.9131, Avg Acc: 0.1177 +INFO:local_logger:Epoch[013/300], Step[0400/1602], Avg Loss: 4.8823, Avg Acc: 0.1113 +INFO:local_logger:Epoch[013/300], Step[0450/1602], Avg Loss: 4.9107, Avg Acc: 0.1191 +INFO:local_logger:Epoch[013/300], Step[0450/1602], Avg Loss: 4.8932, Avg Acc: 0.1103 +INFO:local_logger:Epoch[013/300], Step[0450/1602], Avg Loss: 4.9236, Avg Acc: 0.1193 +INFO:local_logger:Epoch[013/300], Step[0450/1602], Avg Loss: 4.9132, Avg Acc: 0.1195 +INFO:master_logger:Epoch[013/300], Step[0450/1602], Avg Loss: 4.9102, Avg Acc: 0.1170 +INFO:local_logger:Epoch[013/300], Step[0500/1602], Avg Loss: 4.9117, Avg Acc: 0.1194 +INFO:local_logger:Epoch[013/300], Step[0500/1602], Avg Loss: 4.9043, Avg Acc: 0.1186 +INFO:local_logger:Epoch[013/300], Step[0500/1602], Avg Loss: 4.9272, Avg Acc: 0.1186 +INFO:master_logger:Epoch[013/300], Step[0500/1602], Avg Loss: 4.9084, Avg Acc: 0.1165 +INFO:local_logger:Epoch[013/300], Step[0500/1602], Avg Loss: 4.8903, Avg Acc: 0.1093 +INFO:local_logger:Epoch[013/300], Step[0550/1602], Avg Loss: 4.9203, Avg Acc: 0.1197 +INFO:local_logger:Epoch[013/300], Step[0550/1602], Avg Loss: 4.9190, Avg Acc: 0.1188 +INFO:local_logger:Epoch[013/300], Step[0550/1602], Avg Loss: 4.9127, Avg Acc: 0.1168 +INFO:local_logger:Epoch[013/300], Step[0550/1602], Avg Loss: 4.8970, Avg Acc: 0.1102 +INFO:master_logger:Epoch[013/300], Step[0550/1602], Avg Loss: 4.9122, Avg Acc: 0.1164 +INFO:local_logger:Epoch[013/300], Step[0600/1602], Avg Loss: 4.9151, Avg Acc: 0.1170 +INFO:local_logger:Epoch[013/300], Step[0600/1602], Avg Loss: 4.9098, Avg Acc: 0.1088 +INFO:local_logger:Epoch[013/300], Step[0600/1602], Avg Loss: 4.9147, Avg Acc: 0.1193 +INFO:local_logger:Epoch[013/300], Step[0600/1602], Avg Loss: 4.9142, Avg Acc: 0.1149 +INFO:master_logger:Epoch[013/300], Step[0600/1602], Avg Loss: 4.9134, Avg Acc: 0.1150 +INFO:local_logger:Epoch[013/300], Step[0650/1602], Avg Loss: 4.9179, Avg Acc: 0.1175 +INFO:local_logger:Epoch[013/300], Step[0650/1602], Avg Loss: 4.9123, Avg Acc: 0.1190 +INFO:local_logger:Epoch[013/300], Step[0650/1602], Avg Loss: 4.9169, Avg Acc: 0.1141 +INFO:master_logger:Epoch[013/300], Step[0650/1602], Avg Loss: 4.9136, Avg Acc: 0.1151 +INFO:local_logger:Epoch[013/300], Step[0650/1602], Avg Loss: 4.9074, Avg Acc: 0.1097 +INFO:local_logger:Epoch[013/300], Step[0700/1602], Avg Loss: 4.9212, Avg Acc: 0.1151 +INFO:local_logger:Epoch[013/300], Step[0700/1602], Avg Loss: 4.9218, Avg Acc: 0.1176 +INFO:local_logger:Epoch[013/300], Step[0700/1602], Avg Loss: 4.9136, Avg Acc: 0.1176 +INFO:master_logger:Epoch[013/300], Step[0700/1602], Avg Loss: 4.9176, Avg Acc: 0.1149 +INFO:local_logger:Epoch[013/300], Step[0700/1602], Avg Loss: 4.9138, Avg Acc: 0.1094 +INFO:local_logger:Epoch[013/300], Step[0750/1602], Avg Loss: 4.9146, Avg Acc: 0.1176 +INFO:local_logger:Epoch[013/300], Step[0750/1602], Avg Loss: 4.9138, Avg Acc: 0.1140 +INFO:local_logger:Epoch[013/300], Step[0750/1602], Avg Loss: 4.9200, Avg Acc: 0.1182 +INFO:local_logger:Epoch[013/300], Step[0750/1602], Avg Loss: 4.9107, Avg Acc: 0.1088 +INFO:master_logger:Epoch[013/300], Step[0750/1602], Avg Loss: 4.9148, Avg Acc: 0.1147 +INFO:local_logger:Epoch[013/300], Step[0800/1602], Avg Loss: 4.9083, Avg Acc: 0.1170 +INFO:local_logger:Epoch[013/300], Step[0800/1602], Avg Loss: 4.9173, Avg Acc: 0.1176 
+INFO:local_logger:Epoch[013/300], Step[0800/1602], Avg Loss: 4.9070, Avg Acc: 0.1107 +INFO:local_logger:Epoch[013/300], Step[0800/1602], Avg Loss: 4.9136, Avg Acc: 0.1150 +INFO:master_logger:Epoch[013/300], Step[0800/1602], Avg Loss: 4.9116, Avg Acc: 0.1151 +INFO:local_logger:Epoch[013/300], Step[0850/1602], Avg Loss: 4.9095, Avg Acc: 0.1166 +INFO:local_logger:Epoch[013/300], Step[0850/1602], Avg Loss: 4.9131, Avg Acc: 0.1182 +INFO:local_logger:Epoch[013/300], Step[0850/1602], Avg Loss: 4.9111, Avg Acc: 0.1155 +INFO:local_logger:Epoch[013/300], Step[0850/1602], Avg Loss: 4.9079, Avg Acc: 0.1105 +INFO:master_logger:Epoch[013/300], Step[0850/1602], Avg Loss: 4.9104, Avg Acc: 0.1152 +INFO:local_logger:Epoch[013/300], Step[0900/1602], Avg Loss: 4.9097, Avg Acc: 0.1173 +INFO:local_logger:Epoch[013/300], Step[0900/1602], Avg Loss: 4.9010, Avg Acc: 0.1110 +INFO:local_logger:Epoch[013/300], Step[0900/1602], Avg Loss: 4.9163, Avg Acc: 0.1170 +INFO:local_logger:Epoch[013/300], Step[0900/1602], Avg Loss: 4.9151, Avg Acc: 0.1154 +INFO:master_logger:Epoch[013/300], Step[0900/1602], Avg Loss: 4.9105, Avg Acc: 0.1152 +INFO:local_logger:Epoch[013/300], Step[0950/1602], Avg Loss: 4.9052, Avg Acc: 0.1179 +INFO:local_logger:Epoch[013/300], Step[0950/1602], Avg Loss: 4.9174, Avg Acc: 0.1161 +INFO:local_logger:Epoch[013/300], Step[0950/1602], Avg Loss: 4.9181, Avg Acc: 0.1146 +INFO:local_logger:Epoch[013/300], Step[0950/1602], Avg Loss: 4.9052, Avg Acc: 0.1121 +INFO:master_logger:Epoch[013/300], Step[0950/1602], Avg Loss: 4.9115, Avg Acc: 0.1152 +INFO:local_logger:Epoch[013/300], Step[1000/1602], Avg Loss: 4.9075, Avg Acc: 0.1183 +INFO:local_logger:Epoch[013/300], Step[1000/1602], Avg Loss: 4.9199, Avg Acc: 0.1143 +INFO:local_logger:Epoch[013/300], Step[1000/1602], Avg Loss: 4.9085, Avg Acc: 0.1124 +INFO:local_logger:Epoch[013/300], Step[1000/1602], Avg Loss: 4.9178, Avg Acc: 0.1154 +INFO:master_logger:Epoch[013/300], Step[1000/1602], Avg Loss: 4.9134, Avg Acc: 0.1151 +INFO:local_logger:Epoch[013/300], Step[1050/1602], Avg Loss: 4.9081, Avg Acc: 0.1183 +INFO:local_logger:Epoch[013/300], Step[1050/1602], Avg Loss: 4.9145, Avg Acc: 0.1124 +INFO:local_logger:Epoch[013/300], Step[1050/1602], Avg Loss: 4.9170, Avg Acc: 0.1146 +INFO:local_logger:Epoch[013/300], Step[1050/1602], Avg Loss: 4.9144, Avg Acc: 0.1165 +INFO:master_logger:Epoch[013/300], Step[1050/1602], Avg Loss: 4.9135, Avg Acc: 0.1155 +INFO:local_logger:Epoch[013/300], Step[1100/1602], Avg Loss: 4.9079, Avg Acc: 0.1181 +INFO:master_logger:Epoch[013/300], Step[1100/1602], Avg Loss: 4.9143, Avg Acc: 0.1151 +INFO:local_logger:Epoch[013/300], Step[1100/1602], Avg Loss: 4.9214, Avg Acc: 0.1142 +INFO:local_logger:Epoch[013/300], Step[1100/1602], Avg Loss: 4.9116, Avg Acc: 0.1164 +INFO:local_logger:Epoch[013/300], Step[1100/1602], Avg Loss: 4.9162, Avg Acc: 0.1117 +INFO:local_logger:Epoch[013/300], Step[1150/1602], Avg Loss: 4.9129, Avg Acc: 0.1119 +INFO:local_logger:Epoch[013/300], Step[1150/1602], Avg Loss: 4.9082, Avg Acc: 0.1173 +INFO:local_logger:Epoch[013/300], Step[1150/1602], Avg Loss: 4.9101, Avg Acc: 0.1154 +INFO:local_logger:Epoch[013/300], Step[1150/1602], Avg Loss: 4.9235, Avg Acc: 0.1140 +INFO:master_logger:Epoch[013/300], Step[1150/1602], Avg Loss: 4.9137, Avg Acc: 0.1146 +INFO:local_logger:Epoch[013/300], Step[1200/1602], Avg Loss: 4.9105, Avg Acc: 0.1167 +INFO:local_logger:Epoch[013/300], Step[1200/1602], Avg Loss: 4.9201, Avg Acc: 0.1145 +INFO:local_logger:Epoch[013/300], Step[1200/1602], Avg Loss: 4.9065, Avg Acc: 0.1162 
+INFO:local_logger:Epoch[013/300], Step[1200/1602], Avg Loss: 4.9114, Avg Acc: 0.1115 +INFO:master_logger:Epoch[013/300], Step[1200/1602], Avg Loss: 4.9121, Avg Acc: 0.1147 +INFO:local_logger:Epoch[013/300], Step[1250/1602], Avg Loss: 4.9100, Avg Acc: 0.1117 +INFO:local_logger:Epoch[013/300], Step[1250/1602], Avg Loss: 4.9045, Avg Acc: 0.1166 +INFO:local_logger:Epoch[013/300], Step[1250/1602], Avg Loss: 4.9079, Avg Acc: 0.1161 +INFO:local_logger:Epoch[013/300], Step[1250/1602], Avg Loss: 4.9204, Avg Acc: 0.1140 +INFO:master_logger:Epoch[013/300], Step[1250/1602], Avg Loss: 4.9107, Avg Acc: 0.1146 +INFO:local_logger:Epoch[013/300], Step[1300/1602], Avg Loss: 4.9097, Avg Acc: 0.1122 +INFO:local_logger:Epoch[013/300], Step[1300/1602], Avg Loss: 4.9025, Avg Acc: 0.1166 +INFO:local_logger:Epoch[013/300], Step[1300/1602], Avg Loss: 4.9186, Avg Acc: 0.1147 +INFO:local_logger:Epoch[013/300], Step[1300/1602], Avg Loss: 4.9057, Avg Acc: 0.1164 +INFO:master_logger:Epoch[013/300], Step[1300/1602], Avg Loss: 4.9091, Avg Acc: 0.1150 +INFO:local_logger:Epoch[013/300], Step[1350/1602], Avg Loss: 4.8993, Avg Acc: 0.1176 +INFO:local_logger:Epoch[013/300], Step[1350/1602], Avg Loss: 4.9152, Avg Acc: 0.1147 +INFO:local_logger:Epoch[013/300], Step[1350/1602], Avg Loss: 4.9124, Avg Acc: 0.1114 +INFO:local_logger:Epoch[013/300], Step[1350/1602], Avg Loss: 4.9109, Avg Acc: 0.1161 +INFO:master_logger:Epoch[013/300], Step[1350/1602], Avg Loss: 4.9095, Avg Acc: 0.1149 +INFO:local_logger:Epoch[013/300], Step[1400/1602], Avg Loss: 4.9151, Avg Acc: 0.1150 +INFO:local_logger:Epoch[013/300], Step[1400/1602], Avg Loss: 4.9104, Avg Acc: 0.1162 +INFO:local_logger:Epoch[013/300], Step[1400/1602], Avg Loss: 4.9043, Avg Acc: 0.1166 +INFO:local_logger:Epoch[013/300], Step[1400/1602], Avg Loss: 4.9131, Avg Acc: 0.1112 +INFO:master_logger:Epoch[013/300], Step[1400/1602], Avg Loss: 4.9108, Avg Acc: 0.1148 +INFO:local_logger:Epoch[013/300], Step[1450/1602], Avg Loss: 4.9133, Avg Acc: 0.1119 +INFO:local_logger:Epoch[013/300], Step[1450/1602], Avg Loss: 4.9051, Avg Acc: 0.1162 +INFO:local_logger:Epoch[013/300], Step[1450/1602], Avg Loss: 4.9058, Avg Acc: 0.1149 +INFO:local_logger:Epoch[013/300], Step[1450/1602], Avg Loss: 4.9132, Avg Acc: 0.1146 +INFO:master_logger:Epoch[013/300], Step[1450/1602], Avg Loss: 4.9093, Avg Acc: 0.1144 +INFO:local_logger:Epoch[013/300], Step[1500/1602], Avg Loss: 4.9035, Avg Acc: 0.1166 +INFO:local_logger:Epoch[013/300], Step[1500/1602], Avg Loss: 4.9111, Avg Acc: 0.1148 +INFO:local_logger:Epoch[013/300], Step[1500/1602], Avg Loss: 4.9044, Avg Acc: 0.1155 +INFO:master_logger:Epoch[013/300], Step[1500/1602], Avg Loss: 4.9073, Avg Acc: 0.1150 +INFO:local_logger:Epoch[013/300], Step[1500/1602], Avg Loss: 4.9101, Avg Acc: 0.1129 +INFO:local_logger:Epoch[013/300], Step[1550/1602], Avg Loss: 4.9038, Avg Acc: 0.1165 +INFO:local_logger:Epoch[013/300], Step[1550/1602], Avg Loss: 4.9134, Avg Acc: 0.1150 +INFO:local_logger:Epoch[013/300], Step[1550/1602], Avg Loss: 4.9041, Avg Acc: 0.1154 +INFO:local_logger:Epoch[013/300], Step[1550/1602], Avg Loss: 4.9073, Avg Acc: 0.1132 +INFO:master_logger:Epoch[013/300], Step[1550/1602], Avg Loss: 4.9071, Avg Acc: 0.1150 +INFO:local_logger:Epoch[013/300], Step[1600/1602], Avg Loss: 4.9094, Avg Acc: 0.1134 +INFO:local_logger:Epoch[013/300], Step[1600/1602], Avg Loss: 4.9037, Avg Acc: 0.1162 +INFO:local_logger:Epoch[013/300], Step[1600/1602], Avg Loss: 4.9122, Avg Acc: 0.1151 +INFO:local_logger:Epoch[013/300], Step[1600/1602], Avg Loss: 4.8998, Avg Acc: 0.1174 
+INFO:master_logger:Epoch[013/300], Step[1600/1602], Avg Loss: 4.9063, Avg Acc: 0.1155 +INFO:local_logger:----- Epoch[013/300], Train Loss: 4.9038, Train Acc: 0.1161, time: 3719.47 +INFO:local_logger:Now training epoch 14. LR=0.000390 +INFO:local_logger:----- Epoch[013/300], Train Loss: 4.9093, Train Acc: 0.1135, time: 3719.47 +INFO:local_logger:----- Epoch[013/300], Train Loss: 4.8995, Train Acc: 0.1174, time: 3719.21 +INFO:local_logger:Now training epoch 14. LR=0.000390 +INFO:master_logger:----- Epoch[013/300], Train Loss: 4.9062, Train Acc: 0.1155, time: 3719.21 +INFO:local_logger:----- Epoch[013/300], Train Loss: 4.9120, Train Acc: 0.1151, time: 3719.47 +INFO:local_logger:Now training epoch 14. LR=0.000390 +INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-13-Loss-4.899523314436626.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-13-Loss-4.899523314436626.pdopt +INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-13-Loss-4.899523314436626-EMA.pdparams +INFO:local_logger:Now training epoch 14. LR=0.000390 +INFO:master_logger:Now training epoch 14. LR=0.000390 +INFO:local_logger:Epoch[014/300], Step[0000/1602], Avg Loss: 5.3834, Avg Acc: 0.0950 +INFO:local_logger:Epoch[014/300], Step[0000/1602], Avg Loss: 5.4047, Avg Acc: 0.0250 +INFO:local_logger:Epoch[014/300], Step[0000/1602], Avg Loss: 5.4770, Avg Acc: 0.0600 +INFO:local_logger:Epoch[014/300], Step[0000/1602], Avg Loss: 4.5601, Avg Acc: 0.2400 +INFO:master_logger:Epoch[014/300], Step[0000/1602], Avg Loss: 5.2063, Avg Acc: 0.1050 +INFO:local_logger:Epoch[014/300], Step[0050/1602], Avg Loss: 4.8626, Avg Acc: 0.1280 +INFO:local_logger:Epoch[014/300], Step[0050/1602], Avg Loss: 4.9034, Avg Acc: 0.1270 +INFO:local_logger:Epoch[014/300], Step[0050/1602], Avg Loss: 4.8778, Avg Acc: 0.1245 +INFO:master_logger:Epoch[014/300], Step[0050/1602], Avg Loss: 4.8565, Avg Acc: 0.1271 +INFO:local_logger:Epoch[014/300], Step[0050/1602], Avg Loss: 4.7822, Avg Acc: 0.1289 +INFO:local_logger:Epoch[014/300], Step[0100/1602], Avg Loss: 4.9123, Avg Acc: 0.1326 +INFO:local_logger:Epoch[014/300], Step[0100/1602], Avg Loss: 4.8637, Avg Acc: 0.1189 +INFO:local_logger:Epoch[014/300], Step[0100/1602], Avg Loss: 4.8253, Avg Acc: 0.1193 +INFO:local_logger:Epoch[014/300], Step[0100/1602], Avg Loss: 4.8791, Avg Acc: 0.1224 +INFO:master_logger:Epoch[014/300], Step[0100/1602], Avg Loss: 4.8701, Avg Acc: 0.1233 +INFO:local_logger:Epoch[014/300], Step[0150/1602], Avg Loss: 4.8464, Avg Acc: 0.1306 +INFO:local_logger:Epoch[014/300], Step[0150/1602], Avg Loss: 4.8334, Avg Acc: 0.1182 +INFO:local_logger:Epoch[014/300], Step[0150/1602], Avg Loss: 4.9366, Avg Acc: 0.1130 +INFO:local_logger:Epoch[014/300], Step[0150/1602], Avg Loss: 4.8856, Avg Acc: 0.1193 +INFO:master_logger:Epoch[014/300], Step[0150/1602], Avg Loss: 4.8755, Avg Acc: 0.1203 +INFO:local_logger:Epoch[014/300], Step[0200/1602], Avg Loss: 4.8615, Avg Acc: 0.1294 +INFO:local_logger:Epoch[014/300], Step[0200/1602], Avg Loss: 4.8206, Avg Acc: 0.1206 +INFO:local_logger:Epoch[014/300], Step[0200/1602], Avg Loss: 4.8627, Avg Acc: 0.1217 +INFO:local_logger:Epoch[014/300], Step[0200/1602], Avg Loss: 4.9341, Avg Acc: 0.1152 +INFO:master_logger:Epoch[014/300], Step[0200/1602], Avg Loss: 4.8697, Avg Acc: 0.1217 +INFO:local_logger:Epoch[014/300], Step[0250/1602], Avg Loss: 4.8656, Avg Acc: 0.1291 +INFO:local_logger:Epoch[014/300], Step[0250/1602], Avg Loss: 4.8438, Avg Acc: 0.1234 +INFO:local_logger:Epoch[014/300], 
Step[0250/1602], Avg Loss: 4.8461, Avg Acc: 0.1148 +INFO:local_logger:Epoch[014/300], Step[0250/1602], Avg Loss: 4.9239, Avg Acc: 0.1183 +INFO:master_logger:Epoch[014/300], Step[0250/1602], Avg Loss: 4.8699, Avg Acc: 0.1214 +INFO:local_logger:Epoch[014/300], Step[0300/1602], Avg Loss: 4.8428, Avg Acc: 0.1150 +INFO:local_logger:Epoch[014/300], Step[0300/1602], Avg Loss: 4.8417, Avg Acc: 0.1232 +INFO:local_logger:Epoch[014/300], Step[0300/1602], Avg Loss: 4.9057, Avg Acc: 0.1163 +INFO:local_logger:Epoch[014/300], Step[0300/1602], Avg Loss: 4.8706, Avg Acc: 0.1257 +INFO:master_logger:Epoch[014/300], Step[0300/1602], Avg Loss: 4.8652, Avg Acc: 0.1200 +INFO:local_logger:Epoch[014/300], Step[0350/1602], Avg Loss: 4.8762, Avg Acc: 0.1261 +INFO:local_logger:Epoch[014/300], Step[0350/1602], Avg Loss: 4.8539, Avg Acc: 0.1225 +INFO:local_logger:Epoch[014/300], Step[0350/1602], Avg Loss: 4.8989, Avg Acc: 0.1171 +INFO:local_logger:Epoch[014/300], Step[0350/1602], Avg Loss: 4.8405, Avg Acc: 0.1162 +INFO:master_logger:Epoch[014/300], Step[0350/1602], Avg Loss: 4.8674, Avg Acc: 0.1205 +INFO:local_logger:Epoch[014/300], Step[0400/1602], Avg Loss: 4.8745, Avg Acc: 0.1252 +INFO:local_logger:Epoch[014/300], Step[0400/1602], Avg Loss: 4.8905, Avg Acc: 0.1214 +INFO:local_logger:Epoch[014/300], Step[0400/1602], Avg Loss: 4.8417, Avg Acc: 0.1152 +INFO:local_logger:Epoch[014/300], Step[0400/1602], Avg Loss: 4.8309, Avg Acc: 0.1254 +INFO:master_logger:Epoch[014/300], Step[0400/1602], Avg Loss: 4.8594, Avg Acc: 0.1218 +INFO:local_logger:Epoch[014/300], Step[0450/1602], Avg Loss: 4.8809, Avg Acc: 0.1218 +INFO:local_logger:Epoch[014/300], Step[0450/1602], Avg Loss: 4.8718, Avg Acc: 0.1262 +INFO:local_logger:Epoch[014/300], Step[0450/1602], Avg Loss: 4.8390, Avg Acc: 0.1241 +INFO:local_logger:Epoch[014/300], Step[0450/1602], Avg Loss: 4.8493, Avg Acc: 0.1152 +INFO:master_logger:Epoch[014/300], Step[0450/1602], Avg Loss: 4.8603, Avg Acc: 0.1218 +INFO:local_logger:Epoch[014/300], Step[0500/1602], Avg Loss: 4.8684, Avg Acc: 0.1245 +INFO:local_logger:Epoch[014/300], Step[0500/1602], Avg Loss: 4.8502, Avg Acc: 0.1219 +INFO:local_logger:Epoch[014/300], Step[0500/1602], Avg Loss: 4.8807, Avg Acc: 0.1234 +INFO:local_logger:Epoch[014/300], Step[0500/1602], Avg Loss: 4.8486, Avg Acc: 0.1150 +INFO:master_logger:Epoch[014/300], Step[0500/1602], Avg Loss: 4.8620, Avg Acc: 0.1212 +INFO:local_logger:Epoch[014/300], Step[0550/1602], Avg Loss: 4.8832, Avg Acc: 0.1218 +INFO:local_logger:Epoch[014/300], Step[0550/1602], Avg Loss: 4.8811, Avg Acc: 0.1239 +INFO:master_logger:Epoch[014/300], Step[0550/1602], Avg Loss: 4.8667, Avg Acc: 0.1210 +INFO:local_logger:Epoch[014/300], Step[0550/1602], Avg Loss: 4.8510, Avg Acc: 0.1201 +INFO:local_logger:Epoch[014/300], Step[0550/1602], Avg Loss: 4.8517, Avg Acc: 0.1182 +INFO:local_logger:Epoch[014/300], Step[0600/1602], Avg Loss: 4.8828, Avg Acc: 0.1225 +INFO:master_logger:Epoch[014/300], Step[0600/1602], Avg Loss: 4.8655, Avg Acc: 0.1211 +INFO:local_logger:Epoch[014/300], Step[0600/1602], Avg Loss: 4.8582, Avg Acc: 0.1177 +INFO:local_logger:Epoch[014/300], Step[0600/1602], Avg Loss: 4.8443, Avg Acc: 0.1222 +INFO:local_logger:Epoch[014/300], Step[0600/1602], Avg Loss: 4.8766, Avg Acc: 0.1223 +INFO:local_logger:Epoch[014/300], Step[0650/1602], Avg Loss: 4.8807, Avg Acc: 0.1217 +INFO:local_logger:Epoch[014/300], Step[0650/1602], Avg Loss: 4.8550, Avg Acc: 0.1186 +INFO:local_logger:Epoch[014/300], Step[0650/1602], Avg Loss: 4.8423, Avg Acc: 0.1210 +INFO:local_logger:Epoch[014/300], Step[0650/1602], Avg 
Loss: 4.8720, Avg Acc: 0.1219 +INFO:master_logger:Epoch[014/300], Step[0650/1602], Avg Loss: 4.8625, Avg Acc: 0.1208 +INFO:local_logger:Epoch[014/300], Step[0700/1602], Avg Loss: 4.8749, Avg Acc: 0.1208 +INFO:local_logger:Epoch[014/300], Step[0700/1602], Avg Loss: 4.8554, Avg Acc: 0.1191 +INFO:local_logger:Epoch[014/300], Step[0700/1602], Avg Loss: 4.8379, Avg Acc: 0.1218 +INFO:local_logger:Epoch[014/300], Step[0700/1602], Avg Loss: 4.8754, Avg Acc: 0.1219 +INFO:master_logger:Epoch[014/300], Step[0700/1602], Avg Loss: 4.8609, Avg Acc: 0.1209 +INFO:local_logger:Epoch[014/300], Step[0750/1602], Avg Loss: 4.8778, Avg Acc: 0.1208 +INFO:local_logger:Epoch[014/300], Step[0750/1602], Avg Loss: 4.8762, Avg Acc: 0.1195 +INFO:local_logger:Epoch[014/300], Step[0750/1602], Avg Loss: 4.8593, Avg Acc: 0.1205 +INFO:local_logger:Epoch[014/300], Step[0750/1602], Avg Loss: 4.8432, Avg Acc: 0.1208 +INFO:master_logger:Epoch[014/300], Step[0750/1602], Avg Loss: 4.8642, Avg Acc: 0.1204 +INFO:local_logger:Epoch[014/300], Step[0800/1602], Avg Loss: 4.8445, Avg Acc: 0.1196 +INFO:local_logger:Epoch[014/300], Step[0800/1602], Avg Loss: 4.8760, Avg Acc: 0.1201 +INFO:local_logger:Epoch[014/300], Step[0800/1602], Avg Loss: 4.8723, Avg Acc: 0.1220 +INFO:local_logger:Epoch[014/300], Step[0800/1602], Avg Loss: 4.8621, Avg Acc: 0.1205 +INFO:master_logger:Epoch[014/300], Step[0800/1602], Avg Loss: 4.8637, Avg Acc: 0.1206 +INFO:local_logger:Epoch[014/300], Step[0850/1602], Avg Loss: 4.8631, Avg Acc: 0.1227 +INFO:local_logger:Epoch[014/300], Step[0850/1602], Avg Loss: 4.8673, Avg Acc: 0.1208 +INFO:local_logger:Epoch[014/300], Step[0850/1602], Avg Loss: 4.8430, Avg Acc: 0.1203 +INFO:local_logger:Epoch[014/300], Step[0850/1602], Avg Loss: 4.8547, Avg Acc: 0.1201 +INFO:master_logger:Epoch[014/300], Step[0850/1602], Avg Loss: 4.8570, Avg Acc: 0.1209 +INFO:local_logger:Epoch[014/300], Step[0900/1602], Avg Loss: 4.8614, Avg Acc: 0.1208 +INFO:local_logger:Epoch[014/300], Step[0900/1602], Avg Loss: 4.8578, Avg Acc: 0.1200 +INFO:local_logger:Epoch[014/300], Step[0900/1602], Avg Loss: 4.8398, Avg Acc: 0.1207 +INFO:master_logger:Epoch[014/300], Step[0900/1602], Avg Loss: 4.8544, Avg Acc: 0.1209 +INFO:local_logger:Epoch[014/300], Step[0900/1602], Avg Loss: 4.8585, Avg Acc: 0.1221 +INFO:local_logger:Epoch[014/300], Step[0950/1602], Avg Loss: 4.8636, Avg Acc: 0.1216 +INFO:local_logger:Epoch[014/300], Step[0950/1602], Avg Loss: 4.8618, Avg Acc: 0.1204 +INFO:local_logger:Epoch[014/300], Step[0950/1602], Avg Loss: 4.8574, Avg Acc: 0.1194 +INFO:local_logger:Epoch[014/300], Step[0950/1602], Avg Loss: 4.8420, Avg Acc: 0.1214 +INFO:master_logger:Epoch[014/300], Step[0950/1602], Avg Loss: 4.8562, Avg Acc: 0.1207 +INFO:local_logger:Epoch[014/300], Step[1000/1602], Avg Loss: 4.8587, Avg Acc: 0.1199 +INFO:local_logger:Epoch[014/300], Step[1000/1602], Avg Loss: 4.8573, Avg Acc: 0.1197 +INFO:local_logger:Epoch[014/300], Step[1000/1602], Avg Loss: 4.8345, Avg Acc: 0.1212 +INFO:local_logger:Epoch[014/300], Step[1000/1602], Avg Loss: 4.8651, Avg Acc: 0.1210 +INFO:master_logger:Epoch[014/300], Step[1000/1602], Avg Loss: 4.8539, Avg Acc: 0.1205 +INFO:local_logger:Epoch[014/300], Step[1050/1602], Avg Loss: 4.8594, Avg Acc: 0.1193 +INFO:local_logger:Epoch[014/300], Step[1050/1602], Avg Loss: 4.8607, Avg Acc: 0.1211 +INFO:local_logger:Epoch[014/300], Step[1050/1602], Avg Loss: 4.8387, Avg Acc: 0.1215 +INFO:master_logger:Epoch[014/300], Step[1050/1602], Avg Loss: 4.8539, Avg Acc: 0.1205 +INFO:local_logger:Epoch[014/300], Step[1050/1602], Avg Loss: 4.8567, Avg 
Acc: 0.1202 +INFO:local_logger:Epoch[014/300], Step[1100/1602], Avg Loss: 4.8532, Avg Acc: 0.1201 +INFO:local_logger:Epoch[014/300], Step[1100/1602], Avg Loss: 4.8396, Avg Acc: 0.1226 +INFO:local_logger:Epoch[014/300], Step[1100/1602], Avg Loss: 4.8590, Avg Acc: 0.1202 +INFO:local_logger:Epoch[014/300], Step[1100/1602], Avg Loss: 4.8626, Avg Acc: 0.1207 +INFO:master_logger:Epoch[014/300], Step[1100/1602], Avg Loss: 4.8536, Avg Acc: 0.1209 +INFO:local_logger:Epoch[014/300], Step[1150/1602], Avg Loss: 4.8566, Avg Acc: 0.1219 +INFO:local_logger:Epoch[014/300], Step[1150/1602], Avg Loss: 4.8552, Avg Acc: 0.1207 +INFO:local_logger:Epoch[014/300], Step[1150/1602], Avg Loss: 4.8505, Avg Acc: 0.1202 +INFO:local_logger:Epoch[014/300], Step[1150/1602], Avg Loss: 4.8371, Avg Acc: 0.1226 +INFO:master_logger:Epoch[014/300], Step[1150/1602], Avg Loss: 4.8499, Avg Acc: 0.1213 +INFO:local_logger:Epoch[014/300], Step[1200/1602], Avg Loss: 4.8358, Avg Acc: 0.1227 +INFO:local_logger:Epoch[014/300], Step[1200/1602], Avg Loss: 4.8547, Avg Acc: 0.1222 +INFO:local_logger:Epoch[014/300], Step[1200/1602], Avg Loss: 4.8566, Avg Acc: 0.1208 +INFO:local_logger:Epoch[014/300], Step[1200/1602], Avg Loss: 4.8464, Avg Acc: 0.1210 +INFO:master_logger:Epoch[014/300], Step[1200/1602], Avg Loss: 4.8484, Avg Acc: 0.1217 +INFO:local_logger:Epoch[014/300], Step[1250/1602], Avg Loss: 4.8533, Avg Acc: 0.1224 +INFO:local_logger:Epoch[014/300], Step[1250/1602], Avg Loss: 4.8568, Avg Acc: 0.1210 +INFO:local_logger:Epoch[014/300], Step[1250/1602], Avg Loss: 4.8497, Avg Acc: 0.1207 +INFO:local_logger:Epoch[014/300], Step[1250/1602], Avg Loss: 4.8353, Avg Acc: 0.1221 +INFO:master_logger:Epoch[014/300], Step[1250/1602], Avg Loss: 4.8488, Avg Acc: 0.1216 +INFO:local_logger:Epoch[014/300], Step[1300/1602], Avg Loss: 4.8487, Avg Acc: 0.1203 +INFO:local_logger:Epoch[014/300], Step[1300/1602], Avg Loss: 4.8583, Avg Acc: 0.1201 +INFO:local_logger:Epoch[014/300], Step[1300/1602], Avg Loss: 4.8542, Avg Acc: 0.1229 +INFO:local_logger:Epoch[014/300], Step[1300/1602], Avg Loss: 4.8371, Avg Acc: 0.1217 +INFO:master_logger:Epoch[014/300], Step[1300/1602], Avg Loss: 4.8496, Avg Acc: 0.1212 +INFO:local_logger:Epoch[014/300], Step[1350/1602], Avg Loss: 4.8590, Avg Acc: 0.1194 +INFO:local_logger:Epoch[014/300], Step[1350/1602], Avg Loss: 4.8503, Avg Acc: 0.1231 +INFO:local_logger:Epoch[014/300], Step[1350/1602], Avg Loss: 4.8344, Avg Acc: 0.1223 +INFO:local_logger:Epoch[014/300], Step[1350/1602], Avg Loss: 4.8498, Avg Acc: 0.1206 +INFO:master_logger:Epoch[014/300], Step[1350/1602], Avg Loss: 4.8484, Avg Acc: 0.1214 +INFO:local_logger:Epoch[014/300], Step[1400/1602], Avg Loss: 4.8592, Avg Acc: 0.1200 +INFO:local_logger:Epoch[014/300], Step[1400/1602], Avg Loss: 4.8487, Avg Acc: 0.1235 +INFO:local_logger:Epoch[014/300], Step[1400/1602], Avg Loss: 4.8542, Avg Acc: 0.1206 +INFO:local_logger:Epoch[014/300], Step[1400/1602], Avg Loss: 4.8301, Avg Acc: 0.1225 +INFO:master_logger:Epoch[014/300], Step[1400/1602], Avg Loss: 4.8480, Avg Acc: 0.1217 +INFO:local_logger:Epoch[014/300], Step[1450/1602], Avg Loss: 4.8585, Avg Acc: 0.1203 +INFO:local_logger:Epoch[014/300], Step[1450/1602], Avg Loss: 4.8457, Avg Acc: 0.1239 +INFO:local_logger:Epoch[014/300], Step[1450/1602], Avg Loss: 4.8283, Avg Acc: 0.1224 +INFO:master_logger:Epoch[014/300], Step[1450/1602], Avg Loss: 4.8462, Avg Acc: 0.1218 +INFO:local_logger:Epoch[014/300], Step[1450/1602], Avg Loss: 4.8523, Avg Acc: 0.1205 +INFO:local_logger:Epoch[014/300], Step[1500/1602], Avg Loss: 4.8604, Avg Acc: 0.1206 
+INFO:local_logger:Epoch[014/300], Step[1500/1602], Avg Loss: 4.8406, Avg Acc: 0.1247 +INFO:local_logger:Epoch[014/300], Step[1500/1602], Avg Loss: 4.8296, Avg Acc: 0.1222 +INFO:local_logger:Epoch[014/300], Step[1500/1602], Avg Loss: 4.8516, Avg Acc: 0.1210 +INFO:master_logger:Epoch[014/300], Step[1500/1602], Avg Loss: 4.8456, Avg Acc: 0.1221 +INFO:local_logger:Epoch[014/300], Step[1550/1602], Avg Loss: 4.8371, Avg Acc: 0.1250 +INFO:local_logger:Epoch[014/300], Step[1550/1602], Avg Loss: 4.8486, Avg Acc: 0.1219 +INFO:local_logger:Epoch[014/300], Step[1550/1602], Avg Loss: 4.8602, Avg Acc: 0.1206 +INFO:local_logger:Epoch[014/300], Step[1550/1602], Avg Loss: 4.8282, Avg Acc: 0.1233 +INFO:master_logger:Epoch[014/300], Step[1550/1602], Avg Loss: 4.8435, Avg Acc: 0.1227 +INFO:local_logger:Epoch[014/300], Step[1600/1602], Avg Loss: 4.8578, Avg Acc: 0.1209 +INFO:local_logger:Epoch[014/300], Step[1600/1602], Avg Loss: 4.8354, Avg Acc: 0.1249 +INFO:local_logger:Epoch[014/300], Step[1600/1602], Avg Loss: 4.8526, Avg Acc: 0.1218 +INFO:local_logger:Epoch[014/300], Step[1600/1602], Avg Loss: 4.8256, Avg Acc: 0.1238 +INFO:master_logger:Epoch[014/300], Step[1600/1602], Avg Loss: 4.8428, Avg Acc: 0.1228 +INFO:local_logger:----- Epoch[014/300], Train Loss: 4.8354, Train Acc: 0.1249, time: 3712.44 +INFO:local_logger:Now training epoch 15. LR=0.000390 +INFO:local_logger:----- Epoch[014/300], Train Loss: 4.8258, Train Acc: 0.1238, time: 3712.61 +INFO:local_logger:Now training epoch 15. LR=0.000390 +INFO:local_logger:----- Epoch[014/300], Train Loss: 4.8527, Train Acc: 0.1218, time: 3712.61 +INFO:local_logger:Now training epoch 15. LR=0.000390 +INFO:local_logger:----- Epoch[014/300], Train Loss: 4.8578, Train Acc: 0.1209, time: 3712.37 +INFO:master_logger:----- Epoch[014/300], Train Loss: 4.8429, Train Acc: 0.1228, time: 3712.37 +INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-14-Loss-4.85784244341433.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-14-Loss-4.85784244341433.pdopt +INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-14-Loss-4.85784244341433-EMA.pdparams +INFO:local_logger:Now training epoch 15. LR=0.000390 +INFO:master_logger:Now training epoch 15. 
LR=0.000390 +INFO:local_logger:Epoch[015/300], Step[0000/1602], Avg Loss: 5.2860, Avg Acc: 0.0150 +INFO:local_logger:Epoch[015/300], Step[0000/1602], Avg Loss: 5.4150, Avg Acc: 0.0500 +INFO:local_logger:Epoch[015/300], Step[0000/1602], Avg Loss: 5.2878, Avg Acc: 0.0150 +INFO:local_logger:Epoch[015/300], Step[0000/1602], Avg Loss: 4.4173, Avg Acc: 0.1950 +INFO:master_logger:Epoch[015/300], Step[0000/1602], Avg Loss: 5.1015, Avg Acc: 0.0687 +INFO:local_logger:Epoch[015/300], Step[0050/1602], Avg Loss: 4.7454, Avg Acc: 0.1173 +INFO:local_logger:Epoch[015/300], Step[0050/1602], Avg Loss: 4.7870, Avg Acc: 0.1250 +INFO:local_logger:Epoch[015/300], Step[0050/1602], Avg Loss: 4.9522, Avg Acc: 0.1223 +INFO:local_logger:Epoch[015/300], Step[0050/1602], Avg Loss: 4.7790, Avg Acc: 0.1200 +INFO:master_logger:Epoch[015/300], Step[0050/1602], Avg Loss: 4.8159, Avg Acc: 0.1211 +INFO:local_logger:Epoch[015/300], Step[0100/1602], Avg Loss: 4.8454, Avg Acc: 0.1236 +INFO:local_logger:Epoch[015/300], Step[0100/1602], Avg Loss: 4.7588, Avg Acc: 0.1395 +INFO:local_logger:Epoch[015/300], Step[0100/1602], Avg Loss: 4.7813, Avg Acc: 0.1295 +INFO:master_logger:Epoch[015/300], Step[0100/1602], Avg Loss: 4.8162, Avg Acc: 0.1280 +INFO:local_logger:Epoch[015/300], Step[0100/1602], Avg Loss: 4.8794, Avg Acc: 0.1197 +INFO:local_logger:Epoch[015/300], Step[0150/1602], Avg Loss: 4.8010, Avg Acc: 0.1276 +INFO:local_logger:Epoch[015/300], Step[0150/1602], Avg Loss: 4.8270, Avg Acc: 0.1264 +INFO:local_logger:Epoch[015/300], Step[0150/1602], Avg Loss: 4.7844, Avg Acc: 0.1387 +INFO:local_logger:Epoch[015/300], Step[0150/1602], Avg Loss: 4.8799, Avg Acc: 0.1209 +INFO:master_logger:Epoch[015/300], Step[0150/1602], Avg Loss: 4.8231, Avg Acc: 0.1284 +INFO:local_logger:Epoch[015/300], Step[0200/1602], Avg Loss: 4.8296, Avg Acc: 0.1268 +INFO:local_logger:Epoch[015/300], Step[0200/1602], Avg Loss: 4.7839, Avg Acc: 0.1367 +INFO:local_logger:Epoch[015/300], Step[0200/1602], Avg Loss: 4.8698, Avg Acc: 0.1253 +INFO:local_logger:Epoch[015/300], Step[0200/1602], Avg Loss: 4.8130, Avg Acc: 0.1290 +INFO:master_logger:Epoch[015/300], Step[0200/1602], Avg Loss: 4.8241, Avg Acc: 0.1295 +INFO:local_logger:Epoch[015/300], Step[0250/1602], Avg Loss: 4.8218, Avg Acc: 0.1294 +INFO:local_logger:Epoch[015/300], Step[0250/1602], Avg Loss: 4.8548, Avg Acc: 0.1268 +INFO:local_logger:Epoch[015/300], Step[0250/1602], Avg Loss: 4.8384, Avg Acc: 0.1219 +INFO:local_logger:Epoch[015/300], Step[0250/1602], Avg Loss: 4.7815, Avg Acc: 0.1357 +INFO:master_logger:Epoch[015/300], Step[0250/1602], Avg Loss: 4.8241, Avg Acc: 0.1284 +INFO:local_logger:Epoch[015/300], Step[0300/1602], Avg Loss: 4.8494, Avg Acc: 0.1234 +INFO:local_logger:Epoch[015/300], Step[0300/1602], Avg Loss: 4.8444, Avg Acc: 0.1275 +INFO:local_logger:Epoch[015/300], Step[0300/1602], Avg Loss: 4.8024, Avg Acc: 0.1356 +INFO:master_logger:Epoch[015/300], Step[0300/1602], Avg Loss: 4.8309, Avg Acc: 0.1289 +INFO:local_logger:Epoch[015/300], Step[0300/1602], Avg Loss: 4.8274, Avg Acc: 0.1291 +INFO:local_logger:Epoch[015/300], Step[0350/1602], Avg Loss: 4.8419, Avg Acc: 0.1241 +INFO:local_logger:Epoch[015/300], Step[0350/1602], Avg Loss: 4.8368, Avg Acc: 0.1281 +INFO:local_logger:Epoch[015/300], Step[0350/1602], Avg Loss: 4.8071, Avg Acc: 0.1347 +INFO:master_logger:Epoch[015/300], Step[0350/1602], Avg Loss: 4.8249, Avg Acc: 0.1292 +INFO:local_logger:Epoch[015/300], Step[0350/1602], Avg Loss: 4.8140, Avg Acc: 0.1301 +INFO:local_logger:Epoch[015/300], Step[0400/1602], Avg Loss: 4.8323, Avg Acc: 0.1258 
+INFO:local_logger:Epoch[015/300], Step[0400/1602], Avg Loss: 4.8344, Avg Acc: 0.1256 +INFO:local_logger:Epoch[015/300], Step[0400/1602], Avg Loss: 4.8237, Avg Acc: 0.1273 +INFO:master_logger:Epoch[015/300], Step[0400/1602], Avg Loss: 4.8253, Avg Acc: 0.1285 +INFO:local_logger:Epoch[015/300], Step[0400/1602], Avg Loss: 4.8108, Avg Acc: 0.1351 +INFO:local_logger:Epoch[015/300], Step[0450/1602], Avg Loss: 4.8418, Avg Acc: 0.1269 +INFO:local_logger:Epoch[015/300], Step[0450/1602], Avg Loss: 4.8172, Avg Acc: 0.1349 +INFO:local_logger:Epoch[015/300], Step[0450/1602], Avg Loss: 4.8335, Avg Acc: 0.1264 +INFO:local_logger:Epoch[015/300], Step[0450/1602], Avg Loss: 4.8054, Avg Acc: 0.1296 +INFO:master_logger:Epoch[015/300], Step[0450/1602], Avg Loss: 4.8245, Avg Acc: 0.1294 +INFO:local_logger:Epoch[015/300], Step[0500/1602], Avg Loss: 4.8371, Avg Acc: 0.1286 +INFO:local_logger:Epoch[015/300], Step[0500/1602], Avg Loss: 4.8187, Avg Acc: 0.1347 +INFO:local_logger:Epoch[015/300], Step[0500/1602], Avg Loss: 4.8171, Avg Acc: 0.1283 +INFO:local_logger:Epoch[015/300], Step[0500/1602], Avg Loss: 4.7981, Avg Acc: 0.1297 +INFO:master_logger:Epoch[015/300], Step[0500/1602], Avg Loss: 4.8177, Avg Acc: 0.1303 +INFO:local_logger:Epoch[015/300], Step[0550/1602], Avg Loss: 4.8354, Avg Acc: 0.1291 +INFO:local_logger:Epoch[015/300], Step[0550/1602], Avg Loss: 4.8263, Avg Acc: 0.1330 +INFO:local_logger:Epoch[015/300], Step[0550/1602], Avg Loss: 4.8243, Avg Acc: 0.1291 +INFO:local_logger:Epoch[015/300], Step[0550/1602], Avg Loss: 4.7923, Avg Acc: 0.1298 +INFO:master_logger:Epoch[015/300], Step[0550/1602], Avg Loss: 4.8196, Avg Acc: 0.1302 +INFO:local_logger:Epoch[015/300], Step[0600/1602], Avg Loss: 4.8225, Avg Acc: 0.1302 +INFO:local_logger:Epoch[015/300], Step[0600/1602], Avg Loss: 4.8200, Avg Acc: 0.1335 +INFO:local_logger:Epoch[015/300], Step[0600/1602], Avg Loss: 4.8220, Avg Acc: 0.1283 +INFO:master_logger:Epoch[015/300], Step[0600/1602], Avg Loss: 4.8125, Avg Acc: 0.1309 +INFO:local_logger:Epoch[015/300], Step[0600/1602], Avg Loss: 4.7857, Avg Acc: 0.1315 +INFO:local_logger:Epoch[015/300], Step[0650/1602], Avg Loss: 4.8182, Avg Acc: 0.1290 +INFO:local_logger:Epoch[015/300], Step[0650/1602], Avg Loss: 4.8289, Avg Acc: 0.1279 +INFO:local_logger:Epoch[015/300], Step[0650/1602], Avg Loss: 4.8189, Avg Acc: 0.1330 +INFO:local_logger:Epoch[015/300], Step[0650/1602], Avg Loss: 4.7823, Avg Acc: 0.1310 +INFO:master_logger:Epoch[015/300], Step[0650/1602], Avg Loss: 4.8120, Avg Acc: 0.1302 +INFO:local_logger:Epoch[015/300], Step[0700/1602], Avg Loss: 4.7847, Avg Acc: 0.1306 +INFO:local_logger:Epoch[015/300], Step[0700/1602], Avg Loss: 4.8150, Avg Acc: 0.1297 +INFO:local_logger:Epoch[015/300], Step[0700/1602], Avg Loss: 4.8233, Avg Acc: 0.1272 +INFO:local_logger:Epoch[015/300], Step[0700/1602], Avg Loss: 4.8218, Avg Acc: 0.1320 +INFO:master_logger:Epoch[015/300], Step[0700/1602], Avg Loss: 4.8112, Avg Acc: 0.1299 +INFO:local_logger:Epoch[015/300], Step[0750/1602], Avg Loss: 4.7813, Avg Acc: 0.1303 +INFO:local_logger:Epoch[015/300], Step[0750/1602], Avg Loss: 4.8102, Avg Acc: 0.1294 +INFO:local_logger:Epoch[015/300], Step[0750/1602], Avg Loss: 4.8199, Avg Acc: 0.1267 +INFO:local_logger:Epoch[015/300], Step[0750/1602], Avg Loss: 4.8177, Avg Acc: 0.1325 +INFO:master_logger:Epoch[015/300], Step[0750/1602], Avg Loss: 4.8073, Avg Acc: 0.1297 +INFO:local_logger:Epoch[015/300], Step[0800/1602], Avg Loss: 4.8182, Avg Acc: 0.1323 +INFO:local_logger:Epoch[015/300], Step[0800/1602], Avg Loss: 4.8098, Avg Acc: 0.1287 
+INFO:local_logger:Epoch[015/300], Step[0800/1602], Avg Loss: 4.8131, Avg Acc: 0.1305 +INFO:local_logger:Epoch[015/300], Step[0800/1602], Avg Loss: 4.7803, Avg Acc: 0.1305 +INFO:master_logger:Epoch[015/300], Step[0800/1602], Avg Loss: 4.8054, Avg Acc: 0.1305 +INFO:local_logger:Epoch[015/300], Step[0850/1602], Avg Loss: 4.8038, Avg Acc: 0.1318 +INFO:local_logger:Epoch[015/300], Step[0850/1602], Avg Loss: 4.8133, Avg Acc: 0.1274 +INFO:local_logger:Epoch[015/300], Step[0850/1602], Avg Loss: 4.8189, Avg Acc: 0.1308 +INFO:local_logger:Epoch[015/300], Step[0850/1602], Avg Loss: 4.7790, Avg Acc: 0.1303 +INFO:master_logger:Epoch[015/300], Step[0850/1602], Avg Loss: 4.8037, Avg Acc: 0.1301 +INFO:local_logger:Epoch[015/300], Step[0900/1602], Avg Loss: 4.8028, Avg Acc: 0.1326 +INFO:local_logger:Epoch[015/300], Step[0900/1602], Avg Loss: 4.8123, Avg Acc: 0.1298 +INFO:local_logger:Epoch[015/300], Step[0900/1602], Avg Loss: 4.8167, Avg Acc: 0.1281 +INFO:master_logger:Epoch[015/300], Step[0900/1602], Avg Loss: 4.8023, Avg Acc: 0.1303 +INFO:local_logger:Epoch[015/300], Step[0900/1602], Avg Loss: 4.7771, Avg Acc: 0.1308 +INFO:local_logger:Epoch[015/300], Step[0950/1602], Avg Loss: 4.8036, Avg Acc: 0.1316 +INFO:local_logger:Epoch[015/300], Step[0950/1602], Avg Loss: 4.7743, Avg Acc: 0.1303 +INFO:local_logger:Epoch[015/300], Step[0950/1602], Avg Loss: 4.8204, Avg Acc: 0.1282 +INFO:local_logger:Epoch[015/300], Step[0950/1602], Avg Loss: 4.8147, Avg Acc: 0.1294 +INFO:master_logger:Epoch[015/300], Step[0950/1602], Avg Loss: 4.8032, Avg Acc: 0.1299 +INFO:local_logger:Epoch[015/300], Step[1000/1602], Avg Loss: 4.8018, Avg Acc: 0.1322 +INFO:master_logger:Epoch[015/300], Step[1000/1602], Avg Loss: 4.7995, Avg Acc: 0.1306 +INFO:local_logger:Epoch[015/300], Step[1000/1602], Avg Loss: 4.8124, Avg Acc: 0.1292 +INFO:local_logger:Epoch[015/300], Step[1000/1602], Avg Loss: 4.8086, Avg Acc: 0.1304 +INFO:local_logger:Epoch[015/300], Step[1000/1602], Avg Loss: 4.7751, Avg Acc: 0.1304 +INFO:local_logger:Epoch[015/300], Step[1050/1602], Avg Loss: 4.7936, Avg Acc: 0.1315 +INFO:local_logger:Epoch[015/300], Step[1050/1602], Avg Loss: 4.7709, Avg Acc: 0.1308 +INFO:local_logger:Epoch[015/300], Step[1050/1602], Avg Loss: 4.8006, Avg Acc: 0.1311 +INFO:master_logger:Epoch[015/300], Step[1050/1602], Avg Loss: 4.7937, Avg Acc: 0.1303 +INFO:local_logger:Epoch[015/300], Step[1050/1602], Avg Loss: 4.8099, Avg Acc: 0.1279 +INFO:local_logger:Epoch[015/300], Step[1100/1602], Avg Loss: 4.7888, Avg Acc: 0.1308 +INFO:master_logger:Epoch[015/300], Step[1100/1602], Avg Loss: 4.7906, Avg Acc: 0.1302 +INFO:local_logger:Epoch[015/300], Step[1100/1602], Avg Loss: 4.7725, Avg Acc: 0.1300 +INFO:local_logger:Epoch[015/300], Step[1100/1602], Avg Loss: 4.8046, Avg Acc: 0.1289 +INFO:local_logger:Epoch[015/300], Step[1100/1602], Avg Loss: 4.7963, Avg Acc: 0.1312 +INFO:local_logger:Epoch[015/300], Step[1150/1602], Avg Loss: 4.7894, Avg Acc: 0.1305 +INFO:local_logger:Epoch[015/300], Step[1150/1602], Avg Loss: 4.7953, Avg Acc: 0.1314 +INFO:local_logger:Epoch[015/300], Step[1150/1602], Avg Loss: 4.8035, Avg Acc: 0.1289 +INFO:local_logger:Epoch[015/300], Step[1150/1602], Avg Loss: 4.7732, Avg Acc: 0.1301 +INFO:master_logger:Epoch[015/300], Step[1150/1602], Avg Loss: 4.7904, Avg Acc: 0.1302 +INFO:local_logger:Epoch[015/300], Step[1200/1602], Avg Loss: 4.7692, Avg Acc: 0.1302 +INFO:local_logger:Epoch[015/300], Step[1200/1602], Avg Loss: 4.7852, Avg Acc: 0.1307 +INFO:local_logger:Epoch[015/300], Step[1200/1602], Avg Loss: 4.7993, Avg Acc: 0.1297 
+INFO:local_logger:Epoch[015/300], Step[1200/1602], Avg Loss: 4.7943, Avg Acc: 0.1310 +INFO:master_logger:Epoch[015/300], Step[1200/1602], Avg Loss: 4.7870, Avg Acc: 0.1304 +INFO:local_logger:Epoch[015/300], Step[1250/1602], Avg Loss: 4.7851, Avg Acc: 0.1307 +INFO:master_logger:Epoch[015/300], Step[1250/1602], Avg Loss: 4.7855, Avg Acc: 0.1307 +INFO:local_logger:Epoch[015/300], Step[1250/1602], Avg Loss: 4.7675, Avg Acc: 0.1305 +INFO:local_logger:Epoch[015/300], Step[1250/1602], Avg Loss: 4.7946, Avg Acc: 0.1314 +INFO:local_logger:Epoch[015/300], Step[1250/1602], Avg Loss: 4.7946, Avg Acc: 0.1302 +INFO:local_logger:Epoch[015/300], Step[1300/1602], Avg Loss: 4.7810, Avg Acc: 0.1302 +INFO:local_logger:Epoch[015/300], Step[1300/1602], Avg Loss: 4.7677, Avg Acc: 0.1302 +INFO:local_logger:Epoch[015/300], Step[1300/1602], Avg Loss: 4.7986, Avg Acc: 0.1294 +INFO:master_logger:Epoch[015/300], Step[1300/1602], Avg Loss: 4.7848, Avg Acc: 0.1301 +INFO:local_logger:Epoch[015/300], Step[1300/1602], Avg Loss: 4.7917, Avg Acc: 0.1307 +INFO:local_logger:Epoch[015/300], Step[1350/1602], Avg Loss: 4.7812, Avg Acc: 0.1306 +INFO:local_logger:Epoch[015/300], Step[1350/1602], Avg Loss: 4.7914, Avg Acc: 0.1303 +INFO:local_logger:Epoch[015/300], Step[1350/1602], Avg Loss: 4.7957, Avg Acc: 0.1303 +INFO:master_logger:Epoch[015/300], Step[1350/1602], Avg Loss: 4.7830, Avg Acc: 0.1305 +INFO:local_logger:Epoch[015/300], Step[1350/1602], Avg Loss: 4.7639, Avg Acc: 0.1309 +INFO:local_logger:Epoch[015/300], Step[1400/1602], Avg Loss: 4.7667, Avg Acc: 0.1302 +INFO:local_logger:Epoch[015/300], Step[1400/1602], Avg Loss: 4.7942, Avg Acc: 0.1306 +INFO:local_logger:Epoch[015/300], Step[1400/1602], Avg Loss: 4.7895, Avg Acc: 0.1305 +INFO:local_logger:Epoch[015/300], Step[1400/1602], Avg Loss: 4.7830, Avg Acc: 0.1309 +INFO:master_logger:Epoch[015/300], Step[1400/1602], Avg Loss: 4.7834, Avg Acc: 0.1306 +INFO:local_logger:Epoch[015/300], Step[1450/1602], Avg Loss: 4.7825, Avg Acc: 0.1306 +INFO:master_logger:Epoch[015/300], Step[1450/1602], Avg Loss: 4.7840, Avg Acc: 0.1305 +INFO:local_logger:Epoch[015/300], Step[1450/1602], Avg Loss: 4.7696, Avg Acc: 0.1301 +INFO:local_logger:Epoch[015/300], Step[1450/1602], Avg Loss: 4.7890, Avg Acc: 0.1303 +INFO:local_logger:Epoch[015/300], Step[1450/1602], Avg Loss: 4.7950, Avg Acc: 0.1308 +INFO:local_logger:Epoch[015/300], Step[1500/1602], Avg Loss: 4.7814, Avg Acc: 0.1305 +INFO:local_logger:Epoch[015/300], Step[1500/1602], Avg Loss: 4.7673, Avg Acc: 0.1303 +INFO:local_logger:Epoch[015/300], Step[1500/1602], Avg Loss: 4.7942, Avg Acc: 0.1312 +INFO:local_logger:Epoch[015/300], Step[1500/1602], Avg Loss: 4.7864, Avg Acc: 0.1300 +INFO:master_logger:Epoch[015/300], Step[1500/1602], Avg Loss: 4.7823, Avg Acc: 0.1305 +INFO:local_logger:Epoch[015/300], Step[1550/1602], Avg Loss: 4.7812, Avg Acc: 0.1304 +INFO:local_logger:Epoch[015/300], Step[1550/1602], Avg Loss: 4.7647, Avg Acc: 0.1303 +INFO:master_logger:Epoch[015/300], Step[1550/1602], Avg Loss: 4.7806, Avg Acc: 0.1307 +INFO:local_logger:Epoch[015/300], Step[1550/1602], Avg Loss: 4.7923, Avg Acc: 0.1317 +INFO:local_logger:Epoch[015/300], Step[1550/1602], Avg Loss: 4.7840, Avg Acc: 0.1306 +INFO:local_logger:Epoch[015/300], Step[1600/1602], Avg Loss: 4.7827, Avg Acc: 0.1311 +INFO:local_logger:Epoch[015/300], Step[1600/1602], Avg Loss: 4.7672, Avg Acc: 0.1306 +INFO:local_logger:Epoch[015/300], Step[1600/1602], Avg Loss: 4.7875, Avg Acc: 0.1311 +INFO:local_logger:Epoch[015/300], Step[1600/1602], Avg Loss: 4.7823, Avg Acc: 0.1303 
+INFO:master_logger:Epoch[015/300], Step[1600/1602], Avg Loss: 4.7799, Avg Acc: 0.1308 +INFO:local_logger:----- Epoch[015/300], Train Loss: 4.7821, Train Acc: 0.1303, time: 3710.01 +INFO:master_logger:----- Epoch[015/300], Train Loss: 4.7799, Train Acc: 0.1308, time: 3710.01 +INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-15-Loss-4.782081501913841.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-15-Loss-4.782081501913841.pdopt +INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-15-Loss-4.782081501913841-EMA.pdparams +INFO:local_logger:Now training epoch 16. LR=0.000389 +INFO:master_logger:Now training epoch 16. LR=0.000389 +INFO:local_logger:----- Epoch[015/300], Train Loss: 4.7671, Train Acc: 0.1306, time: 3710.75 +INFO:local_logger:----- Epoch[015/300], Train Loss: 4.7876, Train Acc: 0.1311, time: 3710.59 +INFO:local_logger:----- Epoch[015/300], Train Loss: 4.7828, Train Acc: 0.1311, time: 3710.58 +INFO:local_logger:Now training epoch 16. LR=0.000389 +INFO:local_logger:Now training epoch 16. LR=0.000389 +INFO:local_logger:Now training epoch 16. LR=0.000389 +INFO:local_logger:Epoch[016/300], Step[0000/1602], Avg Loss: 5.2125, Avg Acc: 0.0900 +INFO:local_logger:Epoch[016/300], Step[0000/1602], Avg Loss: 5.3903, Avg Acc: 0.0300 +INFO:local_logger:Epoch[016/300], Step[0000/1602], Avg Loss: 5.2192, Avg Acc: 0.1200 +INFO:local_logger:Epoch[016/300], Step[0000/1602], Avg Loss: 3.9569, Avg Acc: 0.3250 +INFO:master_logger:Epoch[016/300], Step[0000/1602], Avg Loss: 4.9447, Avg Acc: 0.1412 +INFO:local_logger:Epoch[016/300], Step[0050/1602], Avg Loss: 4.6653, Avg Acc: 0.1152 +INFO:local_logger:Epoch[016/300], Step[0050/1602], Avg Loss: 4.7212, Avg Acc: 0.1173 +INFO:local_logger:Epoch[016/300], Step[0050/1602], Avg Loss: 4.6433, Avg Acc: 0.1580 +INFO:local_logger:Epoch[016/300], Step[0050/1602], Avg Loss: 4.7893, Avg Acc: 0.1352 +INFO:master_logger:Epoch[016/300], Step[0050/1602], Avg Loss: 4.7048, Avg Acc: 0.1314 +INFO:local_logger:Epoch[016/300], Step[0100/1602], Avg Loss: 4.7905, Avg Acc: 0.1313 +INFO:local_logger:Epoch[016/300], Step[0100/1602], Avg Loss: 4.7246, Avg Acc: 0.1303 +INFO:local_logger:Epoch[016/300], Step[0100/1602], Avg Loss: 4.6731, Avg Acc: 0.1212 +INFO:local_logger:Epoch[016/300], Step[0100/1602], Avg Loss: 4.6493, Avg Acc: 0.1477 +INFO:master_logger:Epoch[016/300], Step[0100/1602], Avg Loss: 4.7094, Avg Acc: 0.1326 +INFO:local_logger:Epoch[016/300], Step[0150/1602], Avg Loss: 4.7397, Avg Acc: 0.1316 +INFO:local_logger:Epoch[016/300], Step[0150/1602], Avg Loss: 4.7001, Avg Acc: 0.1290 +INFO:local_logger:Epoch[016/300], Step[0150/1602], Avg Loss: 4.6752, Avg Acc: 0.1288 +INFO:local_logger:Epoch[016/300], Step[0150/1602], Avg Loss: 4.6765, Avg Acc: 0.1375 +INFO:master_logger:Epoch[016/300], Step[0150/1602], Avg Loss: 4.6979, Avg Acc: 0.1317 +INFO:local_logger:Epoch[016/300], Step[0200/1602], Avg Loss: 4.7279, Avg Acc: 0.1336 +INFO:local_logger:Epoch[016/300], Step[0200/1602], Avg Loss: 4.6899, Avg Acc: 0.1295 +INFO:local_logger:Epoch[016/300], Step[0200/1602], Avg Loss: 4.6788, Avg Acc: 0.1386 +INFO:local_logger:Epoch[016/300], Step[0200/1602], Avg Loss: 4.7103, Avg Acc: 0.1304 +INFO:master_logger:Epoch[016/300], Step[0200/1602], Avg Loss: 4.7017, Avg Acc: 0.1330 +INFO:local_logger:Epoch[016/300], Step[0250/1602], Avg Loss: 4.6799, Avg Acc: 0.1302 +INFO:local_logger:Epoch[016/300], Step[0250/1602], Avg Loss: 4.6714, Avg Acc: 0.1393 +INFO:local_logger:Epoch[016/300], 
Step[0250/1602], Avg Loss: 4.6918, Avg Acc: 0.1289
+INFO:master_logger:Epoch[016/300], Step[0250/1602], Avg Loss: 4.6908, Avg Acc: 0.1331
...
+INFO:local_logger:----- Epoch[016/300], Train Loss: 4.6895, Train Acc: 0.1368, time: 3729.11
+INFO:local_logger:Now training epoch 17. LR=0.000389
+INFO:local_logger:----- Epoch[016/300], Train Loss: 4.7132, Train Acc: 0.1334, time: 3729.12
+INFO:local_logger:Now training epoch 17. LR=0.000389
+INFO:local_logger:----- Epoch[016/300], Train Loss: 4.7192, Train Acc: 0.1372, time: 3729.18
+INFO:local_logger:Now training epoch 17. LR=0.000389
+INFO:local_logger:----- Epoch[016/300], Train Loss: 4.7142, Train Acc: 0.1344, time: 3729.27
+INFO:master_logger:----- Epoch[016/300], Train Loss: 4.7090, Train Acc: 0.1355, time: 3729.27
+INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-16-Loss-4.714189625979765.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-16-Loss-4.714189625979765.pdopt
+INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-16-Loss-4.714189625979765-EMA.pdparams
+INFO:local_logger:Now training epoch 17. LR=0.000389
+INFO:master_logger:Now training epoch 17. LR=0.000389
...
+INFO:local_logger:----- Epoch[017/300], Train Loss: 4.6949, Train Acc: 0.1365, time: 3703.21
+INFO:local_logger:----- Epoch[017/300], Train Loss: 4.6688, Train Acc: 0.1434, time: 3703.27
+INFO:local_logger:Now training epoch 18. LR=0.000389
+INFO:local_logger:Now training epoch 18. LR=0.000389
+INFO:local_logger:----- Epoch[017/300], Train Loss: 4.6532, Train Acc: 0.1425, time: 3703.28
+INFO:local_logger:Now training epoch 18. LR=0.000389
+INFO:local_logger:----- Epoch[017/300], Train Loss: 4.6775, Train Acc: 0.1398, time: 3702.95
+INFO:master_logger:----- Epoch[017/300], Train Loss: 4.6736, Train Acc: 0.1405, time: 3702.95
+INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-17-Loss-4.677533307057099.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-17-Loss-4.677533307057099.pdopt
+INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-17-Loss-4.677533307057099-EMA.pdparams
+INFO:local_logger:Now training epoch 18. LR=0.000389
+INFO:master_logger:Now training epoch 18. LR=0.000389
...
+INFO:local_logger:----- Epoch[018/300], Train Loss: 4.6113, Train Acc: 0.1467, time: 3715.22
+INFO:local_logger:Now training epoch 19. LR=0.000389
+INFO:local_logger:----- Epoch[018/300], Train Loss: 4.6133, Train Acc: 0.1503, time: 3714.95
+INFO:local_logger:----- Epoch[018/300], Train Loss: 4.6188, Train Acc: 0.1496, time: 3715.22
+INFO:master_logger:----- Epoch[018/300], Train Loss: 4.6105, Train Acc: 0.1485, time: 3714.95
+INFO:local_logger:Now training epoch 19. LR=0.000389
+INFO:local_logger:----- Epoch[018/300], Train Loss: 4.5987, Train Acc: 0.1473, time: 3715.24
+INFO:local_logger:Now training epoch 19. LR=0.000389
+INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-18-Loss-4.613320102000628.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-18-Loss-4.613320102000628.pdopt
+INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-18-Loss-4.613320102000628-EMA.pdparams
+INFO:local_logger:Now training epoch 19. LR=0.000389
+INFO:master_logger:Now training epoch 19. LR=0.000389
...
+INFO:local_logger:----- Epoch[019/300], Train Loss: 4.5785, Train Acc: 0.1528, time: 3739.43
+INFO:local_logger:Now training epoch 20. LR=0.000388
+INFO:local_logger:----- Epoch[019/300], Train Loss: 4.5726, Train Acc: 0.1554, time: 3739.51
+INFO:local_logger:Now training epoch 20. LR=0.000388
+INFO:local_logger:----- Epoch[019/300], Train Loss: 4.5612, Train Acc: 0.1463, time: 3739.26
+INFO:master_logger:----- Epoch[019/300], Train Loss: 4.5685, Train Acc: 0.1518, time: 3739.26
+INFO:local_logger:----- Epoch[019/300], Train Loss: 4.5618, Train Acc: 0.1526, time: 3739.50
+INFO:local_logger:Now training epoch 20. LR=0.000388
+INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-19-Loss-4.561244097719863.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-19-Loss-4.561244097719863.pdopt
+INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-19-Loss-4.561244097719863-EMA.pdparams
+INFO:local_logger:Now training epoch 20. LR=0.000388
+INFO:master_logger:Now training epoch 20. LR=0.000388
...
+INFO:local_logger:Epoch[020/300], Step[0650/1602], Avg
Loss: 4.5425, Avg Acc: 0.1566 +INFO:master_logger:Epoch[020/300], Step[0650/1602], Avg Loss: 4.5477, Avg Acc: 0.1539 +INFO:local_logger:Epoch[020/300], Step[0700/1602], Avg Loss: 4.5556, Avg Acc: 0.1535 +INFO:local_logger:Epoch[020/300], Step[0700/1602], Avg Loss: 4.5361, Avg Acc: 0.1571 +INFO:local_logger:Epoch[020/300], Step[0700/1602], Avg Loss: 4.5446, Avg Acc: 0.1547 +INFO:local_logger:Epoch[020/300], Step[0700/1602], Avg Loss: 4.5560, Avg Acc: 0.1485 +INFO:master_logger:Epoch[020/300], Step[0700/1602], Avg Loss: 4.5481, Avg Acc: 0.1535 +INFO:local_logger:Epoch[020/300], Step[0750/1602], Avg Loss: 4.5553, Avg Acc: 0.1519 +INFO:local_logger:Epoch[020/300], Step[0750/1602], Avg Loss: 4.5500, Avg Acc: 0.1526 +INFO:local_logger:Epoch[020/300], Step[0750/1602], Avg Loss: 4.5516, Avg Acc: 0.1492 +INFO:local_logger:Epoch[020/300], Step[0750/1602], Avg Loss: 4.5318, Avg Acc: 0.1578 +INFO:master_logger:Epoch[020/300], Step[0750/1602], Avg Loss: 4.5472, Avg Acc: 0.1529 +INFO:local_logger:Epoch[020/300], Step[0800/1602], Avg Loss: 4.5506, Avg Acc: 0.1535 +INFO:local_logger:Epoch[020/300], Step[0800/1602], Avg Loss: 4.5608, Avg Acc: 0.1508 +INFO:local_logger:Epoch[020/300], Step[0800/1602], Avg Loss: 4.5450, Avg Acc: 0.1492 +INFO:master_logger:Epoch[020/300], Step[0800/1602], Avg Loss: 4.5435, Avg Acc: 0.1528 +INFO:local_logger:Epoch[020/300], Step[0800/1602], Avg Loss: 4.5177, Avg Acc: 0.1579 +INFO:local_logger:Epoch[020/300], Step[0850/1602], Avg Loss: 4.5209, Avg Acc: 0.1574 +INFO:local_logger:Epoch[020/300], Step[0850/1602], Avg Loss: 4.5545, Avg Acc: 0.1519 +INFO:local_logger:Epoch[020/300], Step[0850/1602], Avg Loss: 4.5416, Avg Acc: 0.1489 +INFO:local_logger:Epoch[020/300], Step[0850/1602], Avg Loss: 4.5531, Avg Acc: 0.1539 +INFO:master_logger:Epoch[020/300], Step[0850/1602], Avg Loss: 4.5425, Avg Acc: 0.1530 +INFO:local_logger:Epoch[020/300], Step[0900/1602], Avg Loss: 4.5189, Avg Acc: 0.1579 +INFO:local_logger:Epoch[020/300], Step[0900/1602], Avg Loss: 4.5538, Avg Acc: 0.1513 +INFO:local_logger:Epoch[020/300], Step[0900/1602], Avg Loss: 4.5499, Avg Acc: 0.1525 +INFO:master_logger:Epoch[020/300], Step[0900/1602], Avg Loss: 4.5413, Avg Acc: 0.1526 +INFO:local_logger:Epoch[020/300], Step[0900/1602], Avg Loss: 4.5426, Avg Acc: 0.1485 +INFO:local_logger:Epoch[020/300], Step[0950/1602], Avg Loss: 4.5206, Avg Acc: 0.1570 +INFO:local_logger:Epoch[020/300], Step[0950/1602], Avg Loss: 4.5562, Avg Acc: 0.1523 +INFO:local_logger:Epoch[020/300], Step[0950/1602], Avg Loss: 4.5436, Avg Acc: 0.1481 +INFO:local_logger:Epoch[020/300], Step[0950/1602], Avg Loss: 4.5493, Avg Acc: 0.1521 +INFO:master_logger:Epoch[020/300], Step[0950/1602], Avg Loss: 4.5424, Avg Acc: 0.1524 +INFO:local_logger:Epoch[020/300], Step[1000/1602], Avg Loss: 4.5527, Avg Acc: 0.1529 +INFO:local_logger:Epoch[020/300], Step[1000/1602], Avg Loss: 4.5154, Avg Acc: 0.1575 +INFO:local_logger:Epoch[020/300], Step[1000/1602], Avg Loss: 4.5523, Avg Acc: 0.1517 +INFO:local_logger:Epoch[020/300], Step[1000/1602], Avg Loss: 4.5495, Avg Acc: 0.1484 +INFO:master_logger:Epoch[020/300], Step[1000/1602], Avg Loss: 4.5425, Avg Acc: 0.1526 +INFO:local_logger:Epoch[020/300], Step[1050/1602], Avg Loss: 4.5522, Avg Acc: 0.1517 +INFO:local_logger:Epoch[020/300], Step[1050/1602], Avg Loss: 4.5189, Avg Acc: 0.1570 +INFO:local_logger:Epoch[020/300], Step[1050/1602], Avg Loss: 4.5581, Avg Acc: 0.1513 +INFO:local_logger:Epoch[020/300], Step[1050/1602], Avg Loss: 4.5514, Avg Acc: 0.1499 +INFO:master_logger:Epoch[020/300], Step[1050/1602], Avg Loss: 4.5451, Avg 
Acc: 0.1525 +INFO:local_logger:Epoch[020/300], Step[1100/1602], Avg Loss: 4.5536, Avg Acc: 0.1487 +INFO:local_logger:Epoch[020/300], Step[1100/1602], Avg Loss: 4.5535, Avg Acc: 0.1516 +INFO:local_logger:Epoch[020/300], Step[1100/1602], Avg Loss: 4.5554, Avg Acc: 0.1523 +INFO:master_logger:Epoch[020/300], Step[1100/1602], Avg Loss: 4.5452, Avg Acc: 0.1525 +INFO:local_logger:Epoch[020/300], Step[1100/1602], Avg Loss: 4.5182, Avg Acc: 0.1577 +INFO:local_logger:Epoch[020/300], Step[1150/1602], Avg Loss: 4.5502, Avg Acc: 0.1516 +INFO:local_logger:Epoch[020/300], Step[1150/1602], Avg Loss: 4.5175, Avg Acc: 0.1573 +INFO:local_logger:Epoch[020/300], Step[1150/1602], Avg Loss: 4.5495, Avg Acc: 0.1487 +INFO:local_logger:Epoch[020/300], Step[1150/1602], Avg Loss: 4.5541, Avg Acc: 0.1530 +INFO:master_logger:Epoch[020/300], Step[1150/1602], Avg Loss: 4.5428, Avg Acc: 0.1527 +INFO:local_logger:Epoch[020/300], Step[1200/1602], Avg Loss: 4.5482, Avg Acc: 0.1513 +INFO:local_logger:Epoch[020/300], Step[1200/1602], Avg Loss: 4.5200, Avg Acc: 0.1569 +INFO:local_logger:Epoch[020/300], Step[1200/1602], Avg Loss: 4.5558, Avg Acc: 0.1531 +INFO:local_logger:Epoch[020/300], Step[1200/1602], Avg Loss: 4.5523, Avg Acc: 0.1481 +INFO:master_logger:Epoch[020/300], Step[1200/1602], Avg Loss: 4.5441, Avg Acc: 0.1524 +INFO:local_logger:Epoch[020/300], Step[1250/1602], Avg Loss: 4.5474, Avg Acc: 0.1519 +INFO:local_logger:Epoch[020/300], Step[1250/1602], Avg Loss: 4.5501, Avg Acc: 0.1489 +INFO:local_logger:Epoch[020/300], Step[1250/1602], Avg Loss: 4.5193, Avg Acc: 0.1564 +INFO:master_logger:Epoch[020/300], Step[1250/1602], Avg Loss: 4.5429, Avg Acc: 0.1526 +INFO:local_logger:Epoch[020/300], Step[1250/1602], Avg Loss: 4.5547, Avg Acc: 0.1531 +INFO:local_logger:Epoch[020/300], Step[1300/1602], Avg Loss: 4.5429, Avg Acc: 0.1520 +INFO:local_logger:Epoch[020/300], Step[1300/1602], Avg Loss: 4.5492, Avg Acc: 0.1550 +INFO:local_logger:Epoch[020/300], Step[1300/1602], Avg Loss: 4.5212, Avg Acc: 0.1566 +INFO:local_logger:Epoch[020/300], Step[1300/1602], Avg Loss: 4.5491, Avg Acc: 0.1489 +INFO:master_logger:Epoch[020/300], Step[1300/1602], Avg Loss: 4.5406, Avg Acc: 0.1531 +INFO:local_logger:Epoch[020/300], Step[1350/1602], Avg Loss: 4.5225, Avg Acc: 0.1560 +INFO:local_logger:Epoch[020/300], Step[1350/1602], Avg Loss: 4.5461, Avg Acc: 0.1500 +INFO:local_logger:Epoch[020/300], Step[1350/1602], Avg Loss: 4.5429, Avg Acc: 0.1520 +INFO:local_logger:Epoch[020/300], Step[1350/1602], Avg Loss: 4.5489, Avg Acc: 0.1547 +INFO:master_logger:Epoch[020/300], Step[1350/1602], Avg Loss: 4.5401, Avg Acc: 0.1532 +INFO:local_logger:Epoch[020/300], Step[1400/1602], Avg Loss: 4.5399, Avg Acc: 0.1531 +INFO:local_logger:Epoch[020/300], Step[1400/1602], Avg Loss: 4.5252, Avg Acc: 0.1558 +INFO:local_logger:Epoch[020/300], Step[1400/1602], Avg Loss: 4.5416, Avg Acc: 0.1507 +INFO:local_logger:Epoch[020/300], Step[1400/1602], Avg Loss: 4.5484, Avg Acc: 0.1553 +INFO:master_logger:Epoch[020/300], Step[1400/1602], Avg Loss: 4.5387, Avg Acc: 0.1537 +INFO:local_logger:Epoch[020/300], Step[1450/1602], Avg Loss: 4.5366, Avg Acc: 0.1535 +INFO:local_logger:Epoch[020/300], Step[1450/1602], Avg Loss: 4.5406, Avg Acc: 0.1508 +INFO:local_logger:Epoch[020/300], Step[1450/1602], Avg Loss: 4.5191, Avg Acc: 0.1558 +INFO:local_logger:Epoch[020/300], Step[1450/1602], Avg Loss: 4.5440, Avg Acc: 0.1566 +INFO:master_logger:Epoch[020/300], Step[1450/1602], Avg Loss: 4.5351, Avg Acc: 0.1542 +INFO:local_logger:Epoch[020/300], Step[1500/1602], Avg Loss: 4.5376, Avg Acc: 0.1537 
+INFO:local_logger:Epoch[020/300], Step[1500/1602], Avg Loss: 4.5152, Avg Acc: 0.1565 +INFO:local_logger:Epoch[020/300], Step[1500/1602], Avg Loss: 4.5344, Avg Acc: 0.1518 +INFO:master_logger:Epoch[020/300], Step[1500/1602], Avg Loss: 4.5318, Avg Acc: 0.1546 +INFO:local_logger:Epoch[020/300], Step[1500/1602], Avg Loss: 4.5400, Avg Acc: 0.1563 +INFO:local_logger:Epoch[020/300], Step[1550/1602], Avg Loss: 4.5353, Avg Acc: 0.1544 +INFO:local_logger:Epoch[020/300], Step[1550/1602], Avg Loss: 4.5353, Avg Acc: 0.1508 +INFO:local_logger:Epoch[020/300], Step[1550/1602], Avg Loss: 4.5374, Avg Acc: 0.1568 +INFO:master_logger:Epoch[020/300], Step[1550/1602], Avg Loss: 4.5303, Avg Acc: 0.1548 +INFO:local_logger:Epoch[020/300], Step[1550/1602], Avg Loss: 4.5132, Avg Acc: 0.1572 +INFO:local_logger:Epoch[020/300], Step[1600/1602], Avg Loss: 4.5389, Avg Acc: 0.1542 +INFO:local_logger:Epoch[020/300], Step[1600/1602], Avg Loss: 4.5349, Avg Acc: 0.1513 +INFO:master_logger:Epoch[020/300], Step[1600/1602], Avg Loss: 4.5306, Avg Acc: 0.1549 +INFO:local_logger:Epoch[020/300], Step[1600/1602], Avg Loss: 4.5108, Avg Acc: 0.1580 +INFO:local_logger:Epoch[020/300], Step[1600/1602], Avg Loss: 4.5376, Avg Acc: 0.1561 +INFO:local_logger:----- Epoch[020/300], Train Loss: 4.5377, Train Acc: 0.1561, time: 3700.21 +INFO:local_logger:----- Validation after Epoch: 20 +INFO:local_logger:----- Epoch[020/300], Train Loss: 4.5107, Train Acc: 0.1580, time: 3700.30 +INFO:local_logger:----- Validation after Epoch: 20 +INFO:local_logger:----- Epoch[020/300], Train Loss: 4.5349, Train Acc: 0.1513, time: 3700.26 +INFO:local_logger:----- Validation after Epoch: 20 +INFO:local_logger:----- Epoch[020/300], Train Loss: 4.5389, Train Acc: 0.1542, time: 3700.01 +INFO:master_logger:----- Epoch[020/300], Train Loss: 4.5306, Train Acc: 0.1549, time: 3700.01 +INFO:local_logger:----- Validation after Epoch: 20 +INFO:master_logger:----- Validation after Epoch: 20 +INFO:local_logger:Val Step[0000/1563], Avg Loss: 0.7232, Avg Acc@1: 1.0000, Avg Acc@5: 1.0000 +INFO:local_logger:Val Step[0000/1563], Avg Loss: 1.4205, Avg Acc@1: 0.8750, Avg Acc@5: 0.8750 +INFO:local_logger:Val Step[0000/1563], Avg Loss: 1.5835, Avg Acc@1: 0.7500, Avg Acc@5: 0.8750 +INFO:local_logger:Val Step[0000/1563], Avg Loss: 0.2749, Avg Acc@1: 1.0000, Avg Acc@5: 1.0000 +INFO:master_logger:Val Step[0000/1563], Avg Loss: 1.0005, Avg Acc@1: 0.9062, Avg Acc@5: 0.9375 +INFO:local_logger:Val Step[0050/1563], Avg Loss: 1.6972, Avg Acc@1: 0.6176, Avg Acc@5: 0.8431 +INFO:master_logger:Val Step[0050/1563], Avg Loss: 1.5864, Avg Acc@1: 0.6366, Avg Acc@5: 0.8480 +INFO:local_logger:Val Step[0050/1563], Avg Loss: 1.5116, Avg Acc@1: 0.6397, Avg Acc@5: 0.8603 +INFO:local_logger:Val Step[0050/1563], Avg Loss: 1.5187, Avg Acc@1: 0.6422, Avg Acc@5: 0.8701 +INFO:local_logger:Val Step[0050/1563], Avg Loss: 1.6178, Avg Acc@1: 0.6471, Avg Acc@5: 0.8186 +INFO:local_logger:Val Step[0100/1563], Avg Loss: 2.0117, Avg Acc@1: 0.5309, Avg Acc@5: 0.7871 +INFO:local_logger:Val Step[0100/1563], Avg Loss: 2.0840, Avg Acc@1: 0.5421, Avg Acc@5: 0.7599 +INFO:local_logger:Val Step[0100/1563], Avg Loss: 2.1211, Avg Acc@1: 0.5322, Avg Acc@5: 0.7599 +INFO:local_logger:Val Step[0100/1563], Avg Loss: 2.1270, Avg Acc@1: 0.5062, Avg Acc@5: 0.7587 +INFO:master_logger:Val Step[0100/1563], Avg Loss: 2.0859, Avg Acc@1: 0.5278, Avg Acc@5: 0.7664 +INFO:local_logger:Val Step[0150/1563], Avg Loss: 1.8899, Avg Acc@1: 0.5629, Avg Acc@5: 0.7964 +INFO:local_logger:Val Step[0150/1563], Avg Loss: 1.9875, Avg Acc@1: 0.5621, Avg Acc@5: 
0.7806 +INFO:local_logger:Val Step[0150/1563], Avg Loss: 1.9230, Avg Acc@1: 0.5671, Avg Acc@5: 0.7831 +INFO:local_logger:Val Step[0150/1563], Avg Loss: 2.0113, Avg Acc@1: 0.5356, Avg Acc@5: 0.7790 +INFO:master_logger:Val Step[0150/1563], Avg Loss: 1.9529, Avg Acc@1: 0.5569, Avg Acc@5: 0.7848 +INFO:local_logger:Val Step[0200/1563], Avg Loss: 1.9343, Avg Acc@1: 0.5560, Avg Acc@5: 0.7966 +INFO:local_logger:Val Step[0200/1563], Avg Loss: 1.9594, Avg Acc@1: 0.5684, Avg Acc@5: 0.7848 +INFO:local_logger:Val Step[0200/1563], Avg Loss: 2.0367, Avg Acc@1: 0.5348, Avg Acc@5: 0.7799 +INFO:master_logger:Val Step[0200/1563], Avg Loss: 1.9845, Avg Acc@1: 0.5539, Avg Acc@5: 0.7848 +INFO:local_logger:Val Step[0200/1563], Avg Loss: 2.0075, Avg Acc@1: 0.5566, Avg Acc@5: 0.7780 +INFO:local_logger:Val Step[0250/1563], Avg Loss: 1.9507, Avg Acc@1: 0.5458, Avg Acc@5: 0.7933 +INFO:local_logger:Val Step[0250/1563], Avg Loss: 1.8735, Avg Acc@1: 0.5702, Avg Acc@5: 0.8038 +INFO:local_logger:Val Step[0250/1563], Avg Loss: 1.8728, Avg Acc@1: 0.5807, Avg Acc@5: 0.7973 +INFO:master_logger:Val Step[0250/1563], Avg Loss: 1.9093, Avg Acc@1: 0.5664, Avg Acc@5: 0.7957 +INFO:local_logger:Val Step[0250/1563], Avg Loss: 1.9402, Avg Acc@1: 0.5687, Avg Acc@5: 0.7883 +INFO:local_logger:Val Step[0300/1563], Avg Loss: 2.0342, Avg Acc@1: 0.5216, Avg Acc@5: 0.7762 +INFO:master_logger:Val Step[0300/1563], Avg Loss: 2.0005, Avg Acc@1: 0.5372, Avg Acc@5: 0.7834 +INFO:local_logger:Val Step[0300/1563], Avg Loss: 1.9689, Avg Acc@1: 0.5365, Avg Acc@5: 0.7915 +INFO:local_logger:Val Step[0300/1563], Avg Loss: 2.0273, Avg Acc@1: 0.5378, Avg Acc@5: 0.7811 +INFO:local_logger:Val Step[0300/1563], Avg Loss: 1.9716, Avg Acc@1: 0.5527, Avg Acc@5: 0.7849 +INFO:local_logger:Val Step[0350/1563], Avg Loss: 1.9776, Avg Acc@1: 0.5449, Avg Acc@5: 0.7877 +INFO:local_logger:Val Step[0350/1563], Avg Loss: 2.0099, Avg Acc@1: 0.5317, Avg Acc@5: 0.7845 +INFO:local_logger:Val Step[0350/1563], Avg Loss: 2.0412, Avg Acc@1: 0.5335, Avg Acc@5: 0.7831 +INFO:local_logger:Val Step[0350/1563], Avg Loss: 2.0445, Avg Acc@1: 0.5157, Avg Acc@5: 0.7771 +INFO:master_logger:Val Step[0350/1563], Avg Loss: 2.0183, Avg Acc@1: 0.5314, Avg Acc@5: 0.7831 +INFO:local_logger:Val Step[0400/1563], Avg Loss: 2.0568, Avg Acc@1: 0.5094, Avg Acc@5: 0.7784 +INFO:local_logger:Val Step[0400/1563], Avg Loss: 2.0473, Avg Acc@1: 0.5237, Avg Acc@5: 0.7818 +INFO:local_logger:Val Step[0400/1563], Avg Loss: 2.0261, Avg Acc@1: 0.5184, Avg Acc@5: 0.7865 +INFO:local_logger:Val Step[0400/1563], Avg Loss: 1.9881, Avg Acc@1: 0.5374, Avg Acc@5: 0.7868 +INFO:master_logger:Val Step[0400/1563], Avg Loss: 2.0296, Avg Acc@1: 0.5222, Avg Acc@5: 0.7834 +INFO:local_logger:Val Step[0450/1563], Avg Loss: 2.0484, Avg Acc@1: 0.5133, Avg Acc@5: 0.7860 +INFO:local_logger:Val Step[0450/1563], Avg Loss: 2.0660, Avg Acc@1: 0.5047, Avg Acc@5: 0.7772 +INFO:local_logger:Val Step[0450/1563], Avg Loss: 2.0216, Avg Acc@1: 0.5272, Avg Acc@5: 0.7819 +INFO:master_logger:Val Step[0450/1563], Avg Loss: 2.0501, Avg Acc@1: 0.5166, Avg Acc@5: 0.7811 +INFO:local_logger:Val Step[0450/1563], Avg Loss: 2.0643, Avg Acc@1: 0.5211, Avg Acc@5: 0.7794 +INFO:local_logger:Val Step[0500/1563], Avg Loss: 2.0385, Avg Acc@1: 0.5122, Avg Acc@5: 0.7812 +INFO:local_logger:Val Step[0500/1563], Avg Loss: 2.0041, Avg Acc@1: 0.5282, Avg Acc@5: 0.7857 +INFO:master_logger:Val Step[0500/1563], Avg Loss: 2.0340, Avg Acc@1: 0.5194, Avg Acc@5: 0.7836 +INFO:local_logger:Val Step[0500/1563], Avg Loss: 2.0574, Avg Acc@1: 0.5210, Avg Acc@5: 0.7794 +INFO:local_logger:Val 
Step[0500/1563], Avg Loss: 2.0360, Avg Acc@1: 0.5162, Avg Acc@5: 0.7879 +INFO:local_logger:Val Step[0550/1563], Avg Loss: 1.9584, Avg Acc@1: 0.5379, Avg Acc@5: 0.7920 +INFO:local_logger:Val Step[0550/1563], Avg Loss: 2.0163, Avg Acc@1: 0.5245, Avg Acc@5: 0.7897 +INFO:local_logger:Val Step[0550/1563], Avg Loss: 2.0152, Avg Acc@1: 0.5172, Avg Acc@5: 0.7852 +INFO:local_logger:Val Step[0550/1563], Avg Loss: 2.0268, Avg Acc@1: 0.5304, Avg Acc@5: 0.7856 +INFO:master_logger:Val Step[0550/1563], Avg Loss: 2.0042, Avg Acc@1: 0.5275, Avg Acc@5: 0.7881 +INFO:local_logger:Val Step[0600/1563], Avg Loss: 2.0152, Avg Acc@1: 0.5268, Avg Acc@5: 0.7872 +INFO:local_logger:Val Step[0600/1563], Avg Loss: 2.0220, Avg Acc@1: 0.5198, Avg Acc@5: 0.7841 +INFO:local_logger:Val Step[0600/1563], Avg Loss: 1.9689, Avg Acc@1: 0.5360, Avg Acc@5: 0.7897 +INFO:master_logger:Val Step[0600/1563], Avg Loss: 2.0077, Avg Acc@1: 0.5281, Avg Acc@5: 0.7872 +INFO:local_logger:Val Step[0600/1563], Avg Loss: 2.0248, Avg Acc@1: 0.5297, Avg Acc@5: 0.7876 +INFO:local_logger:Val Step[0650/1563], Avg Loss: 2.0337, Avg Acc@1: 0.5207, Avg Acc@5: 0.7817 +INFO:master_logger:Val Step[0650/1563], Avg Loss: 2.0286, Avg Acc@1: 0.5262, Avg Acc@5: 0.7847 +INFO:local_logger:Val Step[0650/1563], Avg Loss: 1.9880, Avg Acc@1: 0.5340, Avg Acc@5: 0.7890 +INFO:local_logger:Val Step[0650/1563], Avg Loss: 2.0532, Avg Acc@1: 0.5252, Avg Acc@5: 0.7849 +INFO:local_logger:Val Step[0650/1563], Avg Loss: 2.0396, Avg Acc@1: 0.5250, Avg Acc@5: 0.7830 +INFO:local_logger:Val Step[0700/1563], Avg Loss: 2.0547, Avg Acc@1: 0.5239, Avg Acc@5: 0.7789 +INFO:local_logger:Val Step[0700/1563], Avg Loss: 2.1099, Avg Acc@1: 0.5150, Avg Acc@5: 0.7746 +INFO:local_logger:Val Step[0700/1563], Avg Loss: 2.1032, Avg Acc@1: 0.5144, Avg Acc@5: 0.7732 +INFO:local_logger:Val Step[0700/1563], Avg Loss: 2.0946, Avg Acc@1: 0.5103, Avg Acc@5: 0.7701 +INFO:master_logger:Val Step[0700/1563], Avg Loss: 2.0906, Avg Acc@1: 0.5159, Avg Acc@5: 0.7742 +INFO:local_logger:Val Step[0750/1563], Avg Loss: 2.1182, Avg Acc@1: 0.5135, Avg Acc@5: 0.7681 +INFO:local_logger:Val Step[0750/1563], Avg Loss: 2.1432, Avg Acc@1: 0.5035, Avg Acc@5: 0.7607 +INFO:master_logger:Val Step[0750/1563], Avg Loss: 2.1469, Avg Acc@1: 0.5062, Avg Acc@5: 0.7633 +INFO:local_logger:Val Step[0750/1563], Avg Loss: 2.1555, Avg Acc@1: 0.5052, Avg Acc@5: 0.7623 +INFO:local_logger:Val Step[0750/1563], Avg Loss: 2.1709, Avg Acc@1: 0.5028, Avg Acc@5: 0.7620 +INFO:local_logger:Val Step[0800/1563], Avg Loss: 2.2172, Avg Acc@1: 0.4909, Avg Acc@5: 0.7478 +INFO:local_logger:Val Step[0800/1563], Avg Loss: 2.1809, Avg Acc@1: 0.5025, Avg Acc@5: 0.7583 +INFO:local_logger:Val Step[0800/1563], Avg Loss: 2.2134, Avg Acc@1: 0.4966, Avg Acc@5: 0.7534 +INFO:master_logger:Val Step[0800/1563], Avg Loss: 2.2089, Avg Acc@1: 0.4956, Avg Acc@5: 0.7532 +INFO:local_logger:Val Step[0800/1563], Avg Loss: 2.2241, Avg Acc@1: 0.4922, Avg Acc@5: 0.7533 +INFO:local_logger:Val Step[0850/1563], Avg Loss: 2.2495, Avg Acc@1: 0.4853, Avg Acc@5: 0.7434 +INFO:master_logger:Val Step[0850/1563], Avg Loss: 2.2516, Avg Acc@1: 0.4886, Avg Acc@5: 0.7465 +INFO:local_logger:Val Step[0850/1563], Avg Loss: 2.2610, Avg Acc@1: 0.4868, Avg Acc@5: 0.7488 +INFO:local_logger:Val Step[0850/1563], Avg Loss: 2.2311, Avg Acc@1: 0.4943, Avg Acc@5: 0.7488 +INFO:local_logger:Val Step[0850/1563], Avg Loss: 2.2649, Avg Acc@1: 0.4881, Avg Acc@5: 0.7452 +INFO:local_logger:Val Step[0900/1563], Avg Loss: 2.2371, Avg Acc@1: 0.4946, Avg Acc@5: 0.7465 +INFO:local_logger:Val Step[0900/1563], Avg Loss: 
2.2832, Avg Acc@1: 0.4861, Avg Acc@5: 0.7407 +INFO:local_logger:Val Step[0900/1563], Avg Loss: 2.2716, Avg Acc@1: 0.4865, Avg Acc@5: 0.7463 +INFO:local_logger:Val Step[0900/1563], Avg Loss: 2.2528, Avg Acc@1: 0.4868, Avg Acc@5: 0.7432 +INFO:master_logger:Val Step[0900/1563], Avg Loss: 2.2612, Avg Acc@1: 0.4885, Avg Acc@5: 0.7442 +INFO:local_logger:Val Step[0950/1563], Avg Loss: 2.2942, Avg Acc@1: 0.4816, Avg Acc@5: 0.7349 +INFO:local_logger:Val Step[0950/1563], Avg Loss: 2.3305, Avg Acc@1: 0.4799, Avg Acc@5: 0.7315 +INFO:master_logger:Val Step[0950/1563], Avg Loss: 2.3034, Avg Acc@1: 0.4824, Avg Acc@5: 0.7364 +INFO:local_logger:Val Step[0950/1563], Avg Loss: 2.3098, Avg Acc@1: 0.4808, Avg Acc@5: 0.7400 +INFO:local_logger:Val Step[0950/1563], Avg Loss: 2.2793, Avg Acc@1: 0.4874, Avg Acc@5: 0.7391 +INFO:local_logger:Val Step[1000/1563], Avg Loss: 2.3406, Avg Acc@1: 0.4760, Avg Acc@5: 0.7350 +INFO:local_logger:Val Step[1000/1563], Avg Loss: 2.3397, Avg Acc@1: 0.4742, Avg Acc@5: 0.7279 +INFO:master_logger:Val Step[1000/1563], Avg Loss: 2.3389, Avg Acc@1: 0.4762, Avg Acc@5: 0.7306 +INFO:local_logger:Val Step[1000/1563], Avg Loss: 2.3586, Avg Acc@1: 0.4738, Avg Acc@5: 0.7264 +INFO:local_logger:Val Step[1000/1563], Avg Loss: 2.3169, Avg Acc@1: 0.4808, Avg Acc@5: 0.7330 +INFO:local_logger:Val Step[1050/1563], Avg Loss: 2.3819, Avg Acc@1: 0.4682, Avg Acc@5: 0.7223 +INFO:local_logger:Val Step[1050/1563], Avg Loss: 2.3572, Avg Acc@1: 0.4723, Avg Acc@5: 0.7333 +INFO:local_logger:Val Step[1050/1563], Avg Loss: 2.3641, Avg Acc@1: 0.4687, Avg Acc@5: 0.7250 +INFO:local_logger:Val Step[1050/1563], Avg Loss: 2.3399, Avg Acc@1: 0.4763, Avg Acc@5: 0.7286 +INFO:master_logger:Val Step[1050/1563], Avg Loss: 2.3608, Avg Acc@1: 0.4714, Avg Acc@5: 0.7273 +INFO:local_logger:Val Step[1100/1563], Avg Loss: 2.3850, Avg Acc@1: 0.4684, Avg Acc@5: 0.7282 +INFO:local_logger:Val Step[1100/1563], Avg Loss: 2.3745, Avg Acc@1: 0.4720, Avg Acc@5: 0.7220 +INFO:local_logger:Val Step[1100/1563], Avg Loss: 2.3899, Avg Acc@1: 0.4640, Avg Acc@5: 0.7203 +INFO:master_logger:Val Step[1100/1563], Avg Loss: 2.3900, Avg Acc@1: 0.4671, Avg Acc@5: 0.7217 +INFO:local_logger:Val Step[1100/1563], Avg Loss: 2.4105, Avg Acc@1: 0.4639, Avg Acc@5: 0.7165 +INFO:local_logger:Val Step[1150/1563], Avg Loss: 2.4178, Avg Acc@1: 0.4597, Avg Acc@5: 0.7159 +INFO:local_logger:Val Step[1150/1563], Avg Loss: 2.4141, Avg Acc@1: 0.4639, Avg Acc@5: 0.7224 +INFO:master_logger:Val Step[1150/1563], Avg Loss: 2.4204, Avg Acc@1: 0.4623, Avg Acc@5: 0.7160 +INFO:local_logger:Val Step[1150/1563], Avg Loss: 2.4415, Avg Acc@1: 0.4597, Avg Acc@5: 0.7106 +INFO:local_logger:Val Step[1150/1563], Avg Loss: 2.4082, Avg Acc@1: 0.4659, Avg Acc@5: 0.7150 +INFO:local_logger:Val Step[1200/1563], Avg Loss: 2.4413, Avg Acc@1: 0.4610, Avg Acc@5: 0.7091 +INFO:local_logger:Val Step[1200/1563], Avg Loss: 2.4701, Avg Acc@1: 0.4556, Avg Acc@5: 0.7060 +INFO:local_logger:Val Step[1200/1563], Avg Loss: 2.4373, Avg Acc@1: 0.4603, Avg Acc@5: 0.7179 +INFO:local_logger:Val Step[1200/1563], Avg Loss: 2.4475, Avg Acc@1: 0.4560, Avg Acc@5: 0.7111 +INFO:master_logger:Val Step[1200/1563], Avg Loss: 2.4491, Avg Acc@1: 0.4582, Avg Acc@5: 0.7110 +INFO:local_logger:Val Step[1250/1563], Avg Loss: 2.4886, Avg Acc@1: 0.4533, Avg Acc@5: 0.7027 +INFO:local_logger:Val Step[1250/1563], Avg Loss: 2.4553, Avg Acc@1: 0.4576, Avg Acc@5: 0.7142 +INFO:local_logger:Val Step[1250/1563], Avg Loss: 2.4715, Avg Acc@1: 0.4574, Avg Acc@5: 0.7037 +INFO:local_logger:Val Step[1250/1563], Avg Loss: 2.4700, Avg Acc@1: 0.4520, Avg 
Acc@5: 0.7066 +INFO:master_logger:Val Step[1250/1563], Avg Loss: 2.4713, Avg Acc@1: 0.4551, Avg Acc@5: 0.7068 +INFO:local_logger:Val Step[1300/1563], Avg Loss: 2.4993, Avg Acc@1: 0.4533, Avg Acc@5: 0.6980 +INFO:local_logger:Val Step[1300/1563], Avg Loss: 2.5007, Avg Acc@1: 0.4504, Avg Acc@5: 0.7012 +INFO:local_logger:Val Step[1300/1563], Avg Loss: 2.4839, Avg Acc@1: 0.4486, Avg Acc@5: 0.7055 +INFO:master_logger:Val Step[1300/1563], Avg Loss: 2.4899, Avg Acc@1: 0.4515, Avg Acc@5: 0.7039 +INFO:local_logger:Val Step[1300/1563], Avg Loss: 2.4756, Avg Acc@1: 0.4535, Avg Acc@5: 0.7109 +INFO:local_logger:Val Step[1350/1563], Avg Loss: 2.5080, Avg Acc@1: 0.4484, Avg Acc@5: 0.7049 +INFO:local_logger:Val Step[1350/1563], Avg Loss: 2.5174, Avg Acc@1: 0.4425, Avg Acc@5: 0.6998 +INFO:local_logger:Val Step[1350/1563], Avg Loss: 2.5256, Avg Acc@1: 0.4481, Avg Acc@5: 0.6938 +INFO:local_logger:Val Step[1350/1563], Avg Loss: 2.5375, Avg Acc@1: 0.4435, Avg Acc@5: 0.6949 +INFO:master_logger:Val Step[1350/1563], Avg Loss: 2.5221, Avg Acc@1: 0.4456, Avg Acc@5: 0.6984 +INFO:local_logger:Val Step[1400/1563], Avg Loss: 2.5345, Avg Acc@1: 0.4378, Avg Acc@5: 0.6969 +INFO:local_logger:Val Step[1400/1563], Avg Loss: 2.5240, Avg Acc@1: 0.4456, Avg Acc@5: 0.7024 +INFO:master_logger:Val Step[1400/1563], Avg Loss: 2.5366, Avg Acc@1: 0.4423, Avg Acc@5: 0.6962 +INFO:local_logger:Val Step[1400/1563], Avg Loss: 2.5371, Avg Acc@1: 0.4458, Avg Acc@5: 0.6919 +INFO:local_logger:Val Step[1400/1563], Avg Loss: 2.5509, Avg Acc@1: 0.4398, Avg Acc@5: 0.6935 +INFO:local_logger:Val Step[1450/1563], Avg Loss: 2.5387, Avg Acc@1: 0.4459, Avg Acc@5: 0.6917 +INFO:local_logger:Val Step[1450/1563], Avg Loss: 2.5286, Avg Acc@1: 0.4442, Avg Acc@5: 0.7012 +INFO:local_logger:Val Step[1450/1563], Avg Loss: 2.5585, Avg Acc@1: 0.4394, Avg Acc@5: 0.6921 +INFO:local_logger:Val Step[1450/1563], Avg Loss: 2.5338, Avg Acc@1: 0.4376, Avg Acc@5: 0.6965 +INFO:master_logger:Val Step[1450/1563], Avg Loss: 2.5399, Avg Acc@1: 0.4418, Avg Acc@5: 0.6954 +INFO:local_logger:Val Step[1500/1563], Avg Loss: 2.4970, Avg Acc@1: 0.4509, Avg Acc@5: 0.7067 +INFO:local_logger:Val Step[1500/1563], Avg Loss: 2.5130, Avg Acc@1: 0.4415, Avg Acc@5: 0.7008 +INFO:local_logger:Val Step[1500/1563], Avg Loss: 2.5353, Avg Acc@1: 0.4442, Avg Acc@5: 0.6956 +INFO:master_logger:Val Step[1500/1563], Avg Loss: 2.5149, Avg Acc@1: 0.4467, Avg Acc@5: 0.6998 +INFO:local_logger:Val Step[1500/1563], Avg Loss: 2.5143, Avg Acc@1: 0.4502, Avg Acc@5: 0.6962 +INFO:local_logger:Val Step[1550/1563], Avg Loss: 2.4945, Avg Acc@1: 0.4458, Avg Acc@5: 0.7040 +INFO:master_logger:Val Step[1550/1563], Avg Loss: 2.5018, Avg Acc@1: 0.4496, Avg Acc@5: 0.7019 +INFO:local_logger:Val Step[1550/1563], Avg Loss: 2.5188, Avg Acc@1: 0.4480, Avg Acc@5: 0.6977 +INFO:local_logger:Val Step[1550/1563], Avg Loss: 2.4881, Avg Acc@1: 0.4526, Avg Acc@5: 0.7083 +INFO:local_logger:Val Step[1550/1563], Avg Loss: 2.5058, Avg Acc@1: 0.4520, Avg Acc@5: 0.6977 +INFO:local_logger:----- Epoch[020/300], Validation Loss: 2.4847, Validation Acc@1: 0.4537, Validation Acc@5: 0.7090, time: 178.98 +INFO:local_logger:Now training epoch 21. LR=0.000388 +INFO:local_logger:----- Epoch[020/300], Validation Loss: 2.5017, Validation Acc@1: 0.4533, Validation Acc@5: 0.6980, time: 178.98 +INFO:local_logger:Now training epoch 21. 
LR=0.000388 +INFO:local_logger:----- Epoch[020/300], Validation Loss: 2.4884, Validation Acc@1: 0.4470, Validation Acc@5: 0.7047, time: 178.92 +INFO:master_logger:----- Epoch[020/300], Validation Loss: 2.4969, Validation Acc@1: 0.4509, Validation Acc@5: 0.7026, time: 178.92 +INFO:local_logger:----- Epoch[020/300], Validation Loss: 2.5129, Validation Acc@1: 0.4497, Validation Acc@5: 0.6985, time: 178.93 +INFO:local_logger:Now training epoch 21. LR=0.000388 +INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-20-Loss-4.5389416791243615.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-20-Loss-4.5389416791243615.pdopt +INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-20-Loss-4.5389416791243615-EMA.pdparams +INFO:local_logger:Now training epoch 21. LR=0.000388 +INFO:master_logger:Now training epoch 21. LR=0.000388 +INFO:local_logger:Epoch[021/300], Step[0000/1602], Avg Loss: 4.8448, Avg Acc: 0.0500 +INFO:local_logger:Epoch[021/300], Step[0000/1602], Avg Loss: 3.8278, Avg Acc: 0.3400 +INFO:master_logger:Epoch[021/300], Step[0000/1602], Avg Loss: 4.3594, Avg Acc: 0.2150 +INFO:local_logger:Epoch[021/300], Step[0000/1602], Avg Loss: 5.1663, Avg Acc: 0.0900 +INFO:local_logger:Epoch[021/300], Step[0000/1602], Avg Loss: 3.5985, Avg Acc: 0.3800 +INFO:local_logger:Epoch[021/300], Step[0050/1602], Avg Loss: 4.6694, Avg Acc: 0.1420 +INFO:local_logger:Epoch[021/300], Step[0050/1602], Avg Loss: 4.4793, Avg Acc: 0.1669 +INFO:local_logger:Epoch[021/300], Step[0050/1602], Avg Loss: 4.4584, Avg Acc: 0.1782 +INFO:local_logger:Epoch[021/300], Step[0050/1602], Avg Loss: 4.5602, Avg Acc: 0.1436 +INFO:master_logger:Epoch[021/300], Step[0050/1602], Avg Loss: 4.5418, Avg Acc: 0.1577 +INFO:local_logger:Epoch[021/300], Step[0100/1602], Avg Loss: 4.5226, Avg Acc: 0.1730 +INFO:local_logger:Epoch[021/300], Step[0100/1602], Avg Loss: 4.5391, Avg Acc: 0.1446 +INFO:local_logger:Epoch[021/300], Step[0100/1602], Avg Loss: 4.6247, Avg Acc: 0.1497 +INFO:local_logger:Epoch[021/300], Step[0100/1602], Avg Loss: 4.4576, Avg Acc: 0.1687 +INFO:master_logger:Epoch[021/300], Step[0100/1602], Avg Loss: 4.5360, Avg Acc: 0.1590 +INFO:local_logger:Epoch[021/300], Step[0150/1602], Avg Loss: 4.5074, Avg Acc: 0.1528 +INFO:local_logger:Epoch[021/300], Step[0150/1602], Avg Loss: 4.5335, Avg Acc: 0.1629 +INFO:local_logger:Epoch[021/300], Step[0150/1602], Avg Loss: 4.5973, Avg Acc: 0.1402 +INFO:local_logger:Epoch[021/300], Step[0150/1602], Avg Loss: 4.5710, Avg Acc: 0.1509 +INFO:master_logger:Epoch[021/300], Step[0150/1602], Avg Loss: 4.5523, Avg Acc: 0.1517 +INFO:local_logger:Epoch[021/300], Step[0200/1602], Avg Loss: 4.4599, Avg Acc: 0.1574 +INFO:local_logger:Epoch[021/300], Step[0200/1602], Avg Loss: 4.5867, Avg Acc: 0.1506 +INFO:local_logger:Epoch[021/300], Step[0200/1602], Avg Loss: 4.5580, Avg Acc: 0.1590 +INFO:local_logger:Epoch[021/300], Step[0200/1602], Avg Loss: 4.5263, Avg Acc: 0.1427 +INFO:master_logger:Epoch[021/300], Step[0200/1602], Avg Loss: 4.5327, Avg Acc: 0.1524 +INFO:local_logger:Epoch[021/300], Step[0250/1602], Avg Loss: 4.5299, Avg Acc: 0.1486 +INFO:local_logger:Epoch[021/300], Step[0250/1602], Avg Loss: 4.4975, Avg Acc: 0.1538 +INFO:local_logger:Epoch[021/300], Step[0250/1602], Avg Loss: 4.5156, Avg Acc: 0.1610 +INFO:local_logger:Epoch[021/300], Step[0250/1602], Avg Loss: 4.5716, Avg Acc: 0.1526 +INFO:master_logger:Epoch[021/300], Step[0250/1602], Avg Loss: 4.5286, Avg Acc: 0.1540 +INFO:local_logger:Epoch[021/300], 
Step[0300/1602], Avg Loss: 4.4891, Avg Acc: 0.1517 +INFO:local_logger:Epoch[021/300], Step[0300/1602], Avg Loss: 4.5139, Avg Acc: 0.1517 +INFO:local_logger:Epoch[021/300], Step[0300/1602], Avg Loss: 4.5191, Avg Acc: 0.1623 +INFO:local_logger:Epoch[021/300], Step[0300/1602], Avg Loss: 4.5601, Avg Acc: 0.1504 +INFO:master_logger:Epoch[021/300], Step[0300/1602], Avg Loss: 4.5205, Avg Acc: 0.1541 +INFO:local_logger:Epoch[021/300], Step[0350/1602], Avg Loss: 4.4967, Avg Acc: 0.1491 +INFO:master_logger:Epoch[021/300], Step[0350/1602], Avg Loss: 4.5103, Avg Acc: 0.1561 +INFO:local_logger:Epoch[021/300], Step[0350/1602], Avg Loss: 4.5256, Avg Acc: 0.1566 +INFO:local_logger:Epoch[021/300], Step[0350/1602], Avg Loss: 4.5149, Avg Acc: 0.1639 +INFO:local_logger:Epoch[021/300], Step[0350/1602], Avg Loss: 4.5041, Avg Acc: 0.1549 +INFO:local_logger:Epoch[021/300], Step[0400/1602], Avg Loss: 4.4848, Avg Acc: 0.1502 +INFO:local_logger:Epoch[021/300], Step[0400/1602], Avg Loss: 4.5079, Avg Acc: 0.1652 +INFO:local_logger:Epoch[021/300], Step[0400/1602], Avg Loss: 4.5180, Avg Acc: 0.1575 +INFO:master_logger:Epoch[021/300], Step[0400/1602], Avg Loss: 4.5056, Avg Acc: 0.1569 +INFO:local_logger:Epoch[021/300], Step[0400/1602], Avg Loss: 4.5118, Avg Acc: 0.1549 +INFO:local_logger:Epoch[021/300], Step[0450/1602], Avg Loss: 4.4977, Avg Acc: 0.1519 +INFO:local_logger:Epoch[021/300], Step[0450/1602], Avg Loss: 4.4992, Avg Acc: 0.1659 +INFO:local_logger:Epoch[021/300], Step[0450/1602], Avg Loss: 4.5037, Avg Acc: 0.1530 +INFO:local_logger:Epoch[021/300], Step[0450/1602], Avg Loss: 4.5069, Avg Acc: 0.1593 +INFO:master_logger:Epoch[021/300], Step[0450/1602], Avg Loss: 4.5019, Avg Acc: 0.1575 +INFO:local_logger:Epoch[021/300], Step[0500/1602], Avg Loss: 4.4851, Avg Acc: 0.1511 +INFO:master_logger:Epoch[021/300], Step[0500/1602], Avg Loss: 4.4927, Avg Acc: 0.1585 +INFO:local_logger:Epoch[021/300], Step[0500/1602], Avg Loss: 4.5064, Avg Acc: 0.1648 +INFO:local_logger:Epoch[021/300], Step[0500/1602], Avg Loss: 4.4910, Avg Acc: 0.1622 +INFO:local_logger:Epoch[021/300], Step[0500/1602], Avg Loss: 4.4882, Avg Acc: 0.1559 +INFO:local_logger:Epoch[021/300], Step[0550/1602], Avg Loss: 4.4819, Avg Acc: 0.1521 +INFO:local_logger:Epoch[021/300], Step[0550/1602], Avg Loss: 4.4815, Avg Acc: 0.1636 +INFO:local_logger:Epoch[021/300], Step[0550/1602], Avg Loss: 4.5023, Avg Acc: 0.1636 +INFO:local_logger:Epoch[021/300], Step[0550/1602], Avg Loss: 4.4898, Avg Acc: 0.1550 +INFO:master_logger:Epoch[021/300], Step[0550/1602], Avg Loss: 4.4889, Avg Acc: 0.1586 +INFO:local_logger:Epoch[021/300], Step[0600/1602], Avg Loss: 4.5054, Avg Acc: 0.1631 +INFO:local_logger:Epoch[021/300], Step[0600/1602], Avg Loss: 4.4881, Avg Acc: 0.1626 +INFO:local_logger:Epoch[021/300], Step[0600/1602], Avg Loss: 4.4737, Avg Acc: 0.1572 +INFO:local_logger:Epoch[021/300], Step[0600/1602], Avg Loss: 4.4907, Avg Acc: 0.1504 +INFO:master_logger:Epoch[021/300], Step[0600/1602], Avg Loss: 4.4895, Avg Acc: 0.1583 +INFO:local_logger:Epoch[021/300], Step[0650/1602], Avg Loss: 4.4839, Avg Acc: 0.1500 +INFO:local_logger:Epoch[021/300], Step[0650/1602], Avg Loss: 4.4922, Avg Acc: 0.1640 +INFO:local_logger:Epoch[021/300], Step[0650/1602], Avg Loss: 4.5085, Avg Acc: 0.1628 +INFO:local_logger:Epoch[021/300], Step[0650/1602], Avg Loss: 4.4745, Avg Acc: 0.1573 +INFO:master_logger:Epoch[021/300], Step[0650/1602], Avg Loss: 4.4898, Avg Acc: 0.1585 +INFO:local_logger:Epoch[021/300], Step[0700/1602], Avg Loss: 4.4726, Avg Acc: 0.1582 +INFO:local_logger:Epoch[021/300], Step[0700/1602], Avg 
Loss: 4.4808, Avg Acc: 0.1492 +INFO:local_logger:Epoch[021/300], Step[0700/1602], Avg Loss: 4.5068, Avg Acc: 0.1635 +INFO:local_logger:Epoch[021/300], Step[0700/1602], Avg Loss: 4.4865, Avg Acc: 0.1622 +INFO:master_logger:Epoch[021/300], Step[0700/1602], Avg Loss: 4.4867, Avg Acc: 0.1583 +INFO:local_logger:Epoch[021/300], Step[0750/1602], Avg Loss: 4.4870, Avg Acc: 0.1628 +INFO:local_logger:Epoch[021/300], Step[0750/1602], Avg Loss: 4.4793, Avg Acc: 0.1493 +INFO:local_logger:Epoch[021/300], Step[0750/1602], Avg Loss: 4.4686, Avg Acc: 0.1589 +INFO:master_logger:Epoch[021/300], Step[0750/1602], Avg Loss: 4.4843, Avg Acc: 0.1587 +INFO:local_logger:Epoch[021/300], Step[0750/1602], Avg Loss: 4.5022, Avg Acc: 0.1639 +INFO:local_logger:Epoch[021/300], Step[0800/1602], Avg Loss: 4.4785, Avg Acc: 0.1504 +INFO:local_logger:Epoch[021/300], Step[0800/1602], Avg Loss: 4.4776, Avg Acc: 0.1635 +INFO:local_logger:Epoch[021/300], Step[0800/1602], Avg Loss: 4.4959, Avg Acc: 0.1633 +INFO:master_logger:Epoch[021/300], Step[0800/1602], Avg Loss: 4.4797, Avg Acc: 0.1592 +INFO:local_logger:Epoch[021/300], Step[0800/1602], Avg Loss: 4.4669, Avg Acc: 0.1596 +INFO:local_logger:Epoch[021/300], Step[0850/1602], Avg Loss: 4.4788, Avg Acc: 0.1515 +INFO:local_logger:Epoch[021/300], Step[0850/1602], Avg Loss: 4.4870, Avg Acc: 0.1623 +INFO:local_logger:Epoch[021/300], Step[0850/1602], Avg Loss: 4.4973, Avg Acc: 0.1622 +INFO:local_logger:Epoch[021/300], Step[0850/1602], Avg Loss: 4.4744, Avg Acc: 0.1597 +INFO:master_logger:Epoch[021/300], Step[0850/1602], Avg Loss: 4.4844, Avg Acc: 0.1589 +INFO:local_logger:Epoch[021/300], Step[0900/1602], Avg Loss: 4.4748, Avg Acc: 0.1515 +INFO:local_logger:Epoch[021/300], Step[0900/1602], Avg Loss: 4.4868, Avg Acc: 0.1630 +INFO:local_logger:Epoch[021/300], Step[0900/1602], Avg Loss: 4.4797, Avg Acc: 0.1582 +INFO:local_logger:Epoch[021/300], Step[0900/1602], Avg Loss: 4.4985, Avg Acc: 0.1627 +INFO:master_logger:Epoch[021/300], Step[0900/1602], Avg Loss: 4.4849, Avg Acc: 0.1589 +INFO:local_logger:Epoch[021/300], Step[0950/1602], Avg Loss: 4.4787, Avg Acc: 0.1516 +INFO:local_logger:Epoch[021/300], Step[0950/1602], Avg Loss: 4.4839, Avg Acc: 0.1628 +INFO:local_logger:Epoch[021/300], Step[0950/1602], Avg Loss: 4.4956, Avg Acc: 0.1621 +INFO:master_logger:Epoch[021/300], Step[0950/1602], Avg Loss: 4.4862, Avg Acc: 0.1583 +INFO:local_logger:Epoch[021/300], Step[0950/1602], Avg Loss: 4.4867, Avg Acc: 0.1569 +INFO:local_logger:Epoch[021/300], Step[1000/1602], Avg Loss: 4.4974, Avg Acc: 0.1628 +INFO:local_logger:Epoch[021/300], Step[1000/1602], Avg Loss: 4.4815, Avg Acc: 0.1518 +INFO:local_logger:Epoch[021/300], Step[1000/1602], Avg Loss: 4.4845, Avg Acc: 0.1587 +INFO:master_logger:Epoch[021/300], Step[1000/1602], Avg Loss: 4.4859, Avg Acc: 0.1589 +INFO:local_logger:Epoch[021/300], Step[1000/1602], Avg Loss: 4.4804, Avg Acc: 0.1622 +INFO:local_logger:Epoch[021/300], Step[1050/1602], Avg Loss: 4.4840, Avg Acc: 0.1579 +INFO:local_logger:Epoch[021/300], Step[1050/1602], Avg Loss: 4.4839, Avg Acc: 0.1517 +INFO:local_logger:Epoch[021/300], Step[1050/1602], Avg Loss: 4.4814, Avg Acc: 0.1618 +INFO:local_logger:Epoch[021/300], Step[1050/1602], Avg Loss: 4.4905, Avg Acc: 0.1634 +INFO:master_logger:Epoch[021/300], Step[1050/1602], Avg Loss: 4.4850, Avg Acc: 0.1587 +INFO:local_logger:Epoch[021/300], Step[1100/1602], Avg Loss: 4.4884, Avg Acc: 0.1531 +INFO:local_logger:Epoch[021/300], Step[1100/1602], Avg Loss: 4.4815, Avg Acc: 0.1610 +INFO:local_logger:Epoch[021/300], Step[1100/1602], Avg Loss: 4.4915, Avg 
Acc: 0.1620 +INFO:local_logger:Epoch[021/300], Step[1100/1602], Avg Loss: 4.4806, Avg Acc: 0.1586 +INFO:master_logger:Epoch[021/300], Step[1100/1602], Avg Loss: 4.4855, Avg Acc: 0.1587 +INFO:local_logger:Epoch[021/300], Step[1150/1602], Avg Loss: 4.4903, Avg Acc: 0.1530 +INFO:local_logger:Epoch[021/300], Step[1150/1602], Avg Loss: 4.4805, Avg Acc: 0.1605 +INFO:local_logger:Epoch[021/300], Step[1150/1602], Avg Loss: 4.4899, Avg Acc: 0.1615 +INFO:master_logger:Epoch[021/300], Step[1150/1602], Avg Loss: 4.4850, Avg Acc: 0.1583 +INFO:local_logger:Epoch[021/300], Step[1150/1602], Avg Loss: 4.4791, Avg Acc: 0.1584 +INFO:local_logger:Epoch[021/300], Step[1200/1602], Avg Loss: 4.4909, Avg Acc: 0.1528 +INFO:local_logger:Epoch[021/300], Step[1200/1602], Avg Loss: 4.4775, Avg Acc: 0.1595 +INFO:local_logger:Epoch[021/300], Step[1200/1602], Avg Loss: 4.4911, Avg Acc: 0.1606 +INFO:local_logger:Epoch[021/300], Step[1200/1602], Avg Loss: 4.4772, Avg Acc: 0.1617 +INFO:master_logger:Epoch[021/300], Step[1200/1602], Avg Loss: 4.4842, Avg Acc: 0.1586 +INFO:local_logger:Epoch[021/300], Step[1250/1602], Avg Loss: 4.4851, Avg Acc: 0.1539 +INFO:local_logger:Epoch[021/300], Step[1250/1602], Avg Loss: 4.4770, Avg Acc: 0.1605 +INFO:local_logger:Epoch[021/300], Step[1250/1602], Avg Loss: 4.4806, Avg Acc: 0.1591 +INFO:local_logger:Epoch[021/300], Step[1250/1602], Avg Loss: 4.4911, Avg Acc: 0.1604 +INFO:master_logger:Epoch[021/300], Step[1250/1602], Avg Loss: 4.4835, Avg Acc: 0.1585 +INFO:local_logger:Epoch[021/300], Step[1300/1602], Avg Loss: 4.4842, Avg Acc: 0.1537 +INFO:local_logger:Epoch[021/300], Step[1300/1602], Avg Loss: 4.4874, Avg Acc: 0.1611 +INFO:local_logger:Epoch[021/300], Step[1300/1602], Avg Loss: 4.4748, Avg Acc: 0.1610 +INFO:local_logger:Epoch[021/300], Step[1300/1602], Avg Loss: 4.4830, Avg Acc: 0.1585 +INFO:master_logger:Epoch[021/300], Step[1300/1602], Avg Loss: 4.4823, Avg Acc: 0.1586 +INFO:local_logger:Epoch[021/300], Step[1350/1602], Avg Loss: 4.4817, Avg Acc: 0.1549 +INFO:local_logger:Epoch[021/300], Step[1350/1602], Avg Loss: 4.4826, Avg Acc: 0.1581 +INFO:local_logger:Epoch[021/300], Step[1350/1602], Avg Loss: 4.4708, Avg Acc: 0.1619 +INFO:master_logger:Epoch[021/300], Step[1350/1602], Avg Loss: 4.4802, Avg Acc: 0.1592 +INFO:local_logger:Epoch[021/300], Step[1350/1602], Avg Loss: 4.4858, Avg Acc: 0.1618 +INFO:local_logger:Epoch[021/300], Step[1400/1602], Avg Loss: 4.4795, Avg Acc: 0.1547 +INFO:local_logger:Epoch[021/300], Step[1400/1602], Avg Loss: 4.4858, Avg Acc: 0.1614 +INFO:local_logger:Epoch[021/300], Step[1400/1602], Avg Loss: 4.4688, Avg Acc: 0.1625 +INFO:master_logger:Epoch[021/300], Step[1400/1602], Avg Loss: 4.4791, Avg Acc: 0.1593 +INFO:local_logger:Epoch[021/300], Step[1400/1602], Avg Loss: 4.4823, Avg Acc: 0.1584 +INFO:local_logger:Epoch[021/300], Step[1450/1602], Avg Loss: 4.4659, Avg Acc: 0.1634 +INFO:local_logger:Epoch[021/300], Step[1450/1602], Avg Loss: 4.4797, Avg Acc: 0.1545 +INFO:local_logger:Epoch[021/300], Step[1450/1602], Avg Loss: 4.4815, Avg Acc: 0.1576 +INFO:local_logger:Epoch[021/300], Step[1450/1602], Avg Loss: 4.4863, Avg Acc: 0.1613 +INFO:master_logger:Epoch[021/300], Step[1450/1602], Avg Loss: 4.4784, Avg Acc: 0.1592 +INFO:local_logger:Epoch[021/300], Step[1500/1602], Avg Loss: 4.4769, Avg Acc: 0.1541 +INFO:local_logger:Epoch[021/300], Step[1500/1602], Avg Loss: 4.4627, Avg Acc: 0.1641 +INFO:local_logger:Epoch[021/300], Step[1500/1602], Avg Loss: 4.4874, Avg Acc: 0.1606 +INFO:local_logger:Epoch[021/300], Step[1500/1602], Avg Loss: 4.4797, Avg Acc: 0.1580 
+INFO:master_logger:Epoch[021/300], Step[1500/1602], Avg Loss: 4.4767, Avg Acc: 0.1592 +INFO:local_logger:Epoch[021/300], Step[1550/1602], Avg Loss: 4.4798, Avg Acc: 0.1541 +INFO:local_logger:Epoch[021/300], Step[1550/1602], Avg Loss: 4.4756, Avg Acc: 0.1588 +INFO:local_logger:Epoch[021/300], Step[1550/1602], Avg Loss: 4.4642, Avg Acc: 0.1638 +INFO:local_logger:Epoch[021/300], Step[1550/1602], Avg Loss: 4.4866, Avg Acc: 0.1605 +INFO:master_logger:Epoch[021/300], Step[1550/1602], Avg Loss: 4.4765, Avg Acc: 0.1593 +INFO:local_logger:Epoch[021/300], Step[1600/1602], Avg Loss: 4.4661, Avg Acc: 0.1631 +INFO:local_logger:Epoch[021/300], Step[1600/1602], Avg Loss: 4.4736, Avg Acc: 0.1588 +INFO:local_logger:Epoch[021/300], Step[1600/1602], Avg Loss: 4.4808, Avg Acc: 0.1541 +INFO:local_logger:Epoch[021/300], Step[1600/1602], Avg Loss: 4.4855, Avg Acc: 0.1605 +INFO:master_logger:Epoch[021/300], Step[1600/1602], Avg Loss: 4.4765, Avg Acc: 0.1591 +INFO:local_logger:----- Epoch[021/300], Train Loss: 4.4855, Train Acc: 0.1605, time: 3714.26 +INFO:local_logger:----- Epoch[021/300], Train Loss: 4.4662, Train Acc: 0.1632, time: 3714.24 +INFO:local_logger:Now training epoch 22. LR=0.000387 +INFO:local_logger:Now training epoch 22. LR=0.000387 +INFO:local_logger:----- Epoch[021/300], Train Loss: 4.4738, Train Acc: 0.1587, time: 3714.26 +INFO:local_logger:Now training epoch 22. LR=0.000387 +INFO:local_logger:----- Epoch[021/300], Train Loss: 4.4811, Train Acc: 0.1541, time: 3713.98 +INFO:master_logger:----- Epoch[021/300], Train Loss: 4.4767, Train Acc: 0.1591, time: 3713.98 +INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-21-Loss-4.481053700254795.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-21-Loss-4.481053700254795.pdopt +INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-21-Loss-4.481053700254795-EMA.pdparams +INFO:local_logger:Now training epoch 22. LR=0.000387 +INFO:master_logger:Now training epoch 22. 
LR=0.000387 +INFO:local_logger:Epoch[022/300], Step[0000/1602], Avg Loss: 4.7784, Avg Acc: 0.1700 +INFO:local_logger:Epoch[022/300], Step[0000/1602], Avg Loss: 3.8094, Avg Acc: 0.2650 +INFO:local_logger:Epoch[022/300], Step[0000/1602], Avg Loss: 4.9758, Avg Acc: 0.0150 +INFO:local_logger:Epoch[022/300], Step[0000/1602], Avg Loss: 5.0183, Avg Acc: 0.0400 +INFO:master_logger:Epoch[022/300], Step[0000/1602], Avg Loss: 4.6455, Avg Acc: 0.1225 +INFO:local_logger:Epoch[022/300], Step[0050/1602], Avg Loss: 4.3815, Avg Acc: 0.1694 +INFO:local_logger:Epoch[022/300], Step[0050/1602], Avg Loss: 4.3834, Avg Acc: 0.2024 +INFO:local_logger:Epoch[022/300], Step[0050/1602], Avg Loss: 4.4705, Avg Acc: 0.1720 +INFO:master_logger:Epoch[022/300], Step[0050/1602], Avg Loss: 4.4348, Avg Acc: 0.1772 +INFO:local_logger:Epoch[022/300], Step[0050/1602], Avg Loss: 4.5035, Avg Acc: 0.1650 +INFO:local_logger:Epoch[022/300], Step[0100/1602], Avg Loss: 4.4340, Avg Acc: 0.1592 +INFO:local_logger:Epoch[022/300], Step[0100/1602], Avg Loss: 4.3983, Avg Acc: 0.1588 +INFO:local_logger:Epoch[022/300], Step[0100/1602], Avg Loss: 4.4090, Avg Acc: 0.1857 +INFO:local_logger:Epoch[022/300], Step[0100/1602], Avg Loss: 4.4566, Avg Acc: 0.1610 +INFO:master_logger:Epoch[022/300], Step[0100/1602], Avg Loss: 4.4245, Avg Acc: 0.1662 +INFO:local_logger:Epoch[022/300], Step[0150/1602], Avg Loss: 4.4653, Avg Acc: 0.1638 +INFO:local_logger:Epoch[022/300], Step[0150/1602], Avg Loss: 4.4269, Avg Acc: 0.1743 +INFO:local_logger:Epoch[022/300], Step[0150/1602], Avg Loss: 4.4221, Avg Acc: 0.1541 +INFO:local_logger:Epoch[022/300], Step[0150/1602], Avg Loss: 4.4758, Avg Acc: 0.1663 +INFO:master_logger:Epoch[022/300], Step[0150/1602], Avg Loss: 4.4475, Avg Acc: 0.1646 +INFO:local_logger:Epoch[022/300], Step[0200/1602], Avg Loss: 4.4211, Avg Acc: 0.1551 +INFO:local_logger:Epoch[022/300], Step[0200/1602], Avg Loss: 4.4283, Avg Acc: 0.1618 +INFO:local_logger:Epoch[022/300], Step[0200/1602], Avg Loss: 4.4805, Avg Acc: 0.1636 +INFO:master_logger:Epoch[022/300], Step[0200/1602], Avg Loss: 4.4557, Avg Acc: 0.1608 +INFO:local_logger:Epoch[022/300], Step[0200/1602], Avg Loss: 4.4931, Avg Acc: 0.1628 +INFO:local_logger:Epoch[022/300], Step[0250/1602], Avg Loss: 4.4361, Avg Acc: 0.1593 +INFO:local_logger:Epoch[022/300], Step[0250/1602], Avg Loss: 4.4359, Avg Acc: 0.1588 +INFO:local_logger:Epoch[022/300], Step[0250/1602], Avg Loss: 4.4566, Avg Acc: 0.1636 +INFO:local_logger:Epoch[022/300], Step[0250/1602], Avg Loss: 4.4631, Avg Acc: 0.1649 +INFO:master_logger:Epoch[022/300], Step[0250/1602], Avg Loss: 4.4479, Avg Acc: 0.1616 +INFO:local_logger:Epoch[022/300], Step[0300/1602], Avg Loss: 4.4498, Avg Acc: 0.1685 +INFO:local_logger:Epoch[022/300], Step[0300/1602], Avg Loss: 4.4165, Avg Acc: 0.1560 +INFO:local_logger:Epoch[022/300], Step[0300/1602], Avg Loss: 4.4519, Avg Acc: 0.1595 +INFO:local_logger:Epoch[022/300], Step[0300/1602], Avg Loss: 4.4334, Avg Acc: 0.1582 +INFO:master_logger:Epoch[022/300], Step[0300/1602], Avg Loss: 4.4379, Avg Acc: 0.1606 +INFO:local_logger:Epoch[022/300], Step[0350/1602], Avg Loss: 4.4195, Avg Acc: 0.1565 +INFO:local_logger:Epoch[022/300], Step[0350/1602], Avg Loss: 4.4619, Avg Acc: 0.1661 +INFO:local_logger:Epoch[022/300], Step[0350/1602], Avg Loss: 4.4533, Avg Acc: 0.1632 +INFO:local_logger:Epoch[022/300], Step[0350/1602], Avg Loss: 4.4424, Avg Acc: 0.1568 +INFO:master_logger:Epoch[022/300], Step[0350/1602], Avg Loss: 4.4443, Avg Acc: 0.1607 +INFO:local_logger:Epoch[022/300], Step[0400/1602], Avg Loss: 4.4533, Avg Acc: 0.1653 
+INFO:local_logger:Epoch[022/300], Step[0400/1602], Avg Loss: 4.4118, Avg Acc: 0.1573 +INFO:local_logger:Epoch[022/300], Step[0400/1602], Avg Loss: 4.4430, Avg Acc: 0.1590 +INFO:local_logger:Epoch[022/300], Step[0400/1602], Avg Loss: 4.4575, Avg Acc: 0.1693 +INFO:master_logger:Epoch[022/300], Step[0400/1602], Avg Loss: 4.4414, Avg Acc: 0.1627 +INFO:local_logger:Epoch[022/300], Step[0450/1602], Avg Loss: 4.4443, Avg Acc: 0.1595 +INFO:local_logger:Epoch[022/300], Step[0450/1602], Avg Loss: 4.3977, Avg Acc: 0.1592 +INFO:local_logger:Epoch[022/300], Step[0450/1602], Avg Loss: 4.4598, Avg Acc: 0.1637 +INFO:master_logger:Epoch[022/300], Step[0450/1602], Avg Loss: 4.4387, Avg Acc: 0.1632 +INFO:local_logger:Epoch[022/300], Step[0450/1602], Avg Loss: 4.4528, Avg Acc: 0.1704 +INFO:local_logger:Epoch[022/300], Step[0500/1602], Avg Loss: 4.4078, Avg Acc: 0.1605 +INFO:local_logger:Epoch[022/300], Step[0500/1602], Avg Loss: 4.4394, Avg Acc: 0.1620 +INFO:local_logger:Epoch[022/300], Step[0500/1602], Avg Loss: 4.4573, Avg Acc: 0.1665 +INFO:local_logger:Epoch[022/300], Step[0500/1602], Avg Loss: 4.4695, Avg Acc: 0.1643 +INFO:master_logger:Epoch[022/300], Step[0500/1602], Avg Loss: 4.4435, Avg Acc: 0.1633 +INFO:local_logger:Epoch[022/300], Step[0550/1602], Avg Loss: 4.4112, Avg Acc: 0.1637 +INFO:local_logger:Epoch[022/300], Step[0550/1602], Avg Loss: 4.4462, Avg Acc: 0.1623 +INFO:local_logger:Epoch[022/300], Step[0550/1602], Avg Loss: 4.4593, Avg Acc: 0.1648 +INFO:master_logger:Epoch[022/300], Step[0550/1602], Avg Loss: 4.4453, Avg Acc: 0.1639 +INFO:local_logger:Epoch[022/300], Step[0550/1602], Avg Loss: 4.4646, Avg Acc: 0.1646 +INFO:local_logger:Epoch[022/300], Step[0600/1602], Avg Loss: 4.4067, Avg Acc: 0.1610 +INFO:local_logger:Epoch[022/300], Step[0600/1602], Avg Loss: 4.4620, Avg Acc: 0.1642 +INFO:local_logger:Epoch[022/300], Step[0600/1602], Avg Loss: 4.4769, Avg Acc: 0.1623 +INFO:local_logger:Epoch[022/300], Step[0600/1602], Avg Loss: 4.4457, Avg Acc: 0.1634 +INFO:master_logger:Epoch[022/300], Step[0600/1602], Avg Loss: 4.4478, Avg Acc: 0.1627 +INFO:local_logger:Epoch[022/300], Step[0650/1602], Avg Loss: 4.4140, Avg Acc: 0.1610 +INFO:master_logger:Epoch[022/300], Step[0650/1602], Avg Loss: 4.4491, Avg Acc: 0.1634 +INFO:local_logger:Epoch[022/300], Step[0650/1602], Avg Loss: 4.4509, Avg Acc: 0.1638 +INFO:local_logger:Epoch[022/300], Step[0650/1602], Avg Loss: 4.4623, Avg Acc: 0.1644 +INFO:local_logger:Epoch[022/300], Step[0650/1602], Avg Loss: 4.4692, Avg Acc: 0.1642 +INFO:local_logger:Epoch[022/300], Step[0700/1602], Avg Loss: 4.4131, Avg Acc: 0.1611 +INFO:local_logger:Epoch[022/300], Step[0700/1602], Avg Loss: 4.4672, Avg Acc: 0.1641 +INFO:master_logger:Epoch[022/300], Step[0700/1602], Avg Loss: 4.4513, Avg Acc: 0.1637 +INFO:local_logger:Epoch[022/300], Step[0700/1602], Avg Loss: 4.4541, Avg Acc: 0.1644 +INFO:local_logger:Epoch[022/300], Step[0700/1602], Avg Loss: 4.4710, Avg Acc: 0.1653 +INFO:local_logger:Epoch[022/300], Step[0750/1602], Avg Loss: 4.4187, Avg Acc: 0.1586 +INFO:local_logger:Epoch[022/300], Step[0750/1602], Avg Loss: 4.4745, Avg Acc: 0.1645 +INFO:local_logger:Epoch[022/300], Step[0750/1602], Avg Loss: 4.4586, Avg Acc: 0.1662 +INFO:local_logger:Epoch[022/300], Step[0750/1602], Avg Loss: 4.4488, Avg Acc: 0.1645 +INFO:master_logger:Epoch[022/300], Step[0750/1602], Avg Loss: 4.4502, Avg Acc: 0.1634 +INFO:local_logger:Epoch[022/300], Step[0800/1602], Avg Loss: 4.4725, Avg Acc: 0.1655 +INFO:local_logger:Epoch[022/300], Step[0800/1602], Avg Loss: 4.4169, Avg Acc: 0.1575 
+INFO:local_logger:Epoch[022/300], Step[0800/1602], Avg Loss: 4.4584, Avg Acc: 0.1653 +INFO:local_logger:Epoch[022/300], Step[0800/1602], Avg Loss: 4.4470, Avg Acc: 0.1651 +INFO:master_logger:Epoch[022/300], Step[0800/1602], Avg Loss: 4.4487, Avg Acc: 0.1634 +INFO:local_logger:Epoch[022/300], Step[0850/1602], Avg Loss: 4.4544, Avg Acc: 0.1669 +INFO:local_logger:Epoch[022/300], Step[0850/1602], Avg Loss: 4.4481, Avg Acc: 0.1650 +INFO:local_logger:Epoch[022/300], Step[0850/1602], Avg Loss: 4.4189, Avg Acc: 0.1588 +INFO:local_logger:Epoch[022/300], Step[0850/1602], Avg Loss: 4.4708, Avg Acc: 0.1656 +INFO:master_logger:Epoch[022/300], Step[0850/1602], Avg Loss: 4.4480, Avg Acc: 0.1641 +INFO:local_logger:Epoch[022/300], Step[0900/1602], Avg Loss: 4.4187, Avg Acc: 0.1590 +INFO:local_logger:Epoch[022/300], Step[0900/1602], Avg Loss: 4.4531, Avg Acc: 0.1664 +INFO:local_logger:Epoch[022/300], Step[0900/1602], Avg Loss: 4.4482, Avg Acc: 0.1641 +INFO:master_logger:Epoch[022/300], Step[0900/1602], Avg Loss: 4.4483, Avg Acc: 0.1639 +INFO:local_logger:Epoch[022/300], Step[0900/1602], Avg Loss: 4.4731, Avg Acc: 0.1659 +INFO:local_logger:Epoch[022/300], Step[0950/1602], Avg Loss: 4.4197, Avg Acc: 0.1595 +INFO:local_logger:Epoch[022/300], Step[0950/1602], Avg Loss: 4.4430, Avg Acc: 0.1667 +INFO:local_logger:Epoch[022/300], Step[0950/1602], Avg Loss: 4.4471, Avg Acc: 0.1633 +INFO:local_logger:Epoch[022/300], Step[0950/1602], Avg Loss: 4.4695, Avg Acc: 0.1661 +INFO:master_logger:Epoch[022/300], Step[0950/1602], Avg Loss: 4.4448, Avg Acc: 0.1639 +INFO:local_logger:Epoch[022/300], Step[1000/1602], Avg Loss: 4.4183, Avg Acc: 0.1604 +INFO:local_logger:Epoch[022/300], Step[1000/1602], Avg Loss: 4.4724, Avg Acc: 0.1655 +INFO:local_logger:Epoch[022/300], Step[1000/1602], Avg Loss: 4.4389, Avg Acc: 0.1682 +INFO:local_logger:Epoch[022/300], Step[1000/1602], Avg Loss: 4.4444, Avg Acc: 0.1632 +INFO:master_logger:Epoch[022/300], Step[1000/1602], Avg Loss: 4.4435, Avg Acc: 0.1643 +INFO:local_logger:Epoch[022/300], Step[1050/1602], Avg Loss: 4.4199, Avg Acc: 0.1613 +INFO:local_logger:Epoch[022/300], Step[1050/1602], Avg Loss: 4.4410, Avg Acc: 0.1632 +INFO:local_logger:Epoch[022/300], Step[1050/1602], Avg Loss: 4.4377, Avg Acc: 0.1686 +INFO:local_logger:Epoch[022/300], Step[1050/1602], Avg Loss: 4.4632, Avg Acc: 0.1653 +INFO:master_logger:Epoch[022/300], Step[1050/1602], Avg Loss: 4.4405, Avg Acc: 0.1646 +INFO:local_logger:Epoch[022/300], Step[1100/1602], Avg Loss: 4.4241, Avg Acc: 0.1615 +INFO:local_logger:Epoch[022/300], Step[1100/1602], Avg Loss: 4.4366, Avg Acc: 0.1645 +INFO:local_logger:Epoch[022/300], Step[1100/1602], Avg Loss: 4.4396, Avg Acc: 0.1671 +INFO:local_logger:Epoch[022/300], Step[1100/1602], Avg Loss: 4.4615, Avg Acc: 0.1647 +INFO:master_logger:Epoch[022/300], Step[1100/1602], Avg Loss: 4.4404, Avg Acc: 0.1645 +INFO:local_logger:Epoch[022/300], Step[1150/1602], Avg Loss: 4.4556, Avg Acc: 0.1665 +INFO:local_logger:Epoch[022/300], Step[1150/1602], Avg Loss: 4.4253, Avg Acc: 0.1613 +INFO:local_logger:Epoch[022/300], Step[1150/1602], Avg Loss: 4.4462, Avg Acc: 0.1669 +INFO:local_logger:Epoch[022/300], Step[1150/1602], Avg Loss: 4.4384, Avg Acc: 0.1636 +INFO:master_logger:Epoch[022/300], Step[1150/1602], Avg Loss: 4.4414, Avg Acc: 0.1646 +INFO:local_logger:Epoch[022/300], Step[1200/1602], Avg Loss: 4.4241, Avg Acc: 0.1624 +INFO:local_logger:Epoch[022/300], Step[1200/1602], Avg Loss: 4.4472, Avg Acc: 0.1673 +INFO:local_logger:Epoch[022/300], Step[1200/1602], Avg Loss: 4.4382, Avg Acc: 0.1627 
+INFO:local_logger:Epoch[022/300], Step[1200/1602], Avg Loss: 4.4513, Avg Acc: 0.1656 +INFO:master_logger:Epoch[022/300], Step[1200/1602], Avg Loss: 4.4402, Avg Acc: 0.1645 +INFO:local_logger:Epoch[022/300], Step[1250/1602], Avg Loss: 4.4241, Avg Acc: 0.1639 +INFO:local_logger:Epoch[022/300], Step[1250/1602], Avg Loss: 4.4496, Avg Acc: 0.1671 +INFO:local_logger:Epoch[022/300], Step[1250/1602], Avg Loss: 4.4483, Avg Acc: 0.1660 +INFO:master_logger:Epoch[022/300], Step[1250/1602], Avg Loss: 4.4400, Avg Acc: 0.1649 +INFO:local_logger:Epoch[022/300], Step[1250/1602], Avg Loss: 4.4378, Avg Acc: 0.1625 +INFO:local_logger:Epoch[022/300], Step[1300/1602], Avg Loss: 4.4210, Avg Acc: 0.1633 +INFO:local_logger:Epoch[022/300], Step[1300/1602], Avg Loss: 4.4515, Avg Acc: 0.1660 +INFO:local_logger:Epoch[022/300], Step[1300/1602], Avg Loss: 4.4394, Avg Acc: 0.1629 +INFO:master_logger:Epoch[022/300], Step[1300/1602], Avg Loss: 4.4406, Avg Acc: 0.1647 +INFO:local_logger:Epoch[022/300], Step[1300/1602], Avg Loss: 4.4505, Avg Acc: 0.1664 +INFO:local_logger:Epoch[022/300], Step[1350/1602], Avg Loss: 4.4209, Avg Acc: 0.1632 +INFO:local_logger:Epoch[022/300], Step[1350/1602], Avg Loss: 4.4519, Avg Acc: 0.1658 +INFO:local_logger:Epoch[022/300], Step[1350/1602], Avg Loss: 4.4399, Avg Acc: 0.1630 +INFO:master_logger:Epoch[022/300], Step[1350/1602], Avg Loss: 4.4409, Avg Acc: 0.1646 +INFO:local_logger:Epoch[022/300], Step[1350/1602], Avg Loss: 4.4508, Avg Acc: 0.1663 +INFO:local_logger:Epoch[022/300], Step[1400/1602], Avg Loss: 4.4526, Avg Acc: 0.1662 +INFO:local_logger:Epoch[022/300], Step[1400/1602], Avg Loss: 4.4183, Avg Acc: 0.1627 +INFO:local_logger:Epoch[022/300], Step[1400/1602], Avg Loss: 4.4550, Avg Acc: 0.1658 +INFO:local_logger:Epoch[022/300], Step[1400/1602], Avg Loss: 4.4424, Avg Acc: 0.1637 +INFO:master_logger:Epoch[022/300], Step[1400/1602], Avg Loss: 4.4421, Avg Acc: 0.1646 +INFO:local_logger:Epoch[022/300], Step[1450/1602], Avg Loss: 4.4150, Avg Acc: 0.1634 +INFO:local_logger:Epoch[022/300], Step[1450/1602], Avg Loss: 4.4511, Avg Acc: 0.1660 +INFO:local_logger:Epoch[022/300], Step[1450/1602], Avg Loss: 4.4424, Avg Acc: 0.1634 +INFO:master_logger:Epoch[022/300], Step[1450/1602], Avg Loss: 4.4394, Avg Acc: 0.1643 +INFO:local_logger:Epoch[022/300], Step[1450/1602], Avg Loss: 4.4492, Avg Acc: 0.1642 +INFO:local_logger:Epoch[022/300], Step[1500/1602], Avg Loss: 4.4497, Avg Acc: 0.1664 +INFO:local_logger:Epoch[022/300], Step[1500/1602], Avg Loss: 4.4194, Avg Acc: 0.1639 +INFO:local_logger:Epoch[022/300], Step[1500/1602], Avg Loss: 4.4463, Avg Acc: 0.1647 +INFO:local_logger:Epoch[022/300], Step[1500/1602], Avg Loss: 4.4417, Avg Acc: 0.1640 +INFO:master_logger:Epoch[022/300], Step[1500/1602], Avg Loss: 4.4393, Avg Acc: 0.1648 +INFO:local_logger:Epoch[022/300], Step[1550/1602], Avg Loss: 4.4190, Avg Acc: 0.1638 +INFO:local_logger:Epoch[022/300], Step[1550/1602], Avg Loss: 4.4441, Avg Acc: 0.1643 +INFO:master_logger:Epoch[022/300], Step[1550/1602], Avg Loss: 4.4374, Avg Acc: 0.1647 +INFO:local_logger:Epoch[022/300], Step[1550/1602], Avg Loss: 4.4398, Avg Acc: 0.1644 +INFO:local_logger:Epoch[022/300], Step[1550/1602], Avg Loss: 4.4466, Avg Acc: 0.1661 +INFO:local_logger:Epoch[022/300], Step[1600/1602], Avg Loss: 4.4437, Avg Acc: 0.1647 +INFO:local_logger:Epoch[022/300], Step[1600/1602], Avg Loss: 4.4186, Avg Acc: 0.1640 +INFO:local_logger:Epoch[022/300], Step[1600/1602], Avg Loss: 4.4380, Avg Acc: 0.1647 +INFO:local_logger:Epoch[022/300], Step[1600/1602], Avg Loss: 4.4483, Avg Acc: 0.1662 
+INFO:master_logger:Epoch[022/300], Step[1600/1602], Avg Loss: 4.4372, Avg Acc: 0.1649 +INFO:local_logger:----- Epoch[022/300], Train Loss: 4.4380, Train Acc: 0.1647, time: 3693.52 +INFO:local_logger:Now training epoch 23. LR=0.000387 +INFO:local_logger:----- Epoch[022/300], Train Loss: 4.4484, Train Acc: 0.1662, time: 3693.59 +INFO:local_logger:Now training epoch 23. LR=0.000387 +INFO:local_logger:----- Epoch[022/300], Train Loss: 4.4439, Train Acc: 0.1647, time: 3693.63 +INFO:local_logger:Now training epoch 23. LR=0.000387 +INFO:local_logger:----- Epoch[022/300], Train Loss: 4.4187, Train Acc: 0.1640, time: 3693.40 +INFO:master_logger:----- Epoch[022/300], Train Loss: 4.4373, Train Acc: 0.1649, time: 3693.40 +INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-22-Loss-4.418695037252678.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-22-Loss-4.418695037252678.pdopt +INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-22-Loss-4.418695037252678-EMA.pdparams +INFO:local_logger:Now training epoch 23. LR=0.000387 +INFO:master_logger:Now training epoch 23. LR=0.000387 +INFO:local_logger:Epoch[023/300], Step[0000/1602], Avg Loss: 3.9936, Avg Acc: 0.2900 +INFO:local_logger:Epoch[023/300], Step[0000/1602], Avg Loss: 4.0048, Avg Acc: 0.2850 +INFO:local_logger:Epoch[023/300], Step[0000/1602], Avg Loss: 4.8466, Avg Acc: 0.1400 +INFO:master_logger:Epoch[023/300], Step[0000/1602], Avg Loss: 4.4522, Avg Acc: 0.2250 +INFO:local_logger:Epoch[023/300], Step[0000/1602], Avg Loss: 4.9637, Avg Acc: 0.1850 +INFO:local_logger:Epoch[023/300], Step[0050/1602], Avg Loss: 4.4620, Avg Acc: 0.1525 +INFO:local_logger:Epoch[023/300], Step[0050/1602], Avg Loss: 4.4540, Avg Acc: 0.1790 +INFO:local_logger:Epoch[023/300], Step[0050/1602], Avg Loss: 4.3275, Avg Acc: 0.1848 +INFO:local_logger:Epoch[023/300], Step[0050/1602], Avg Loss: 4.3526, Avg Acc: 0.1762 +INFO:master_logger:Epoch[023/300], Step[0050/1602], Avg Loss: 4.3990, Avg Acc: 0.1731 +INFO:local_logger:Epoch[023/300], Step[0100/1602], Avg Loss: 4.4313, Avg Acc: 0.1799 +INFO:local_logger:Epoch[023/300], Step[0100/1602], Avg Loss: 4.4171, Avg Acc: 0.1675 +INFO:local_logger:Epoch[023/300], Step[0100/1602], Avg Loss: 4.4181, Avg Acc: 0.1642 +INFO:local_logger:Epoch[023/300], Step[0100/1602], Avg Loss: 4.4082, Avg Acc: 0.1651 +INFO:master_logger:Epoch[023/300], Step[0100/1602], Avg Loss: 4.4187, Avg Acc: 0.1692 +INFO:local_logger:Epoch[023/300], Step[0150/1602], Avg Loss: 4.3747, Avg Acc: 0.1805 +INFO:local_logger:Epoch[023/300], Step[0150/1602], Avg Loss: 4.4388, Avg Acc: 0.1679 +INFO:local_logger:Epoch[023/300], Step[0150/1602], Avg Loss: 4.3997, Avg Acc: 0.1656 +INFO:local_logger:Epoch[023/300], Step[0150/1602], Avg Loss: 4.4134, Avg Acc: 0.1704 +INFO:master_logger:Epoch[023/300], Step[0150/1602], Avg Loss: 4.4066, Avg Acc: 0.1711 +INFO:local_logger:Epoch[023/300], Step[0200/1602], Avg Loss: 4.3924, Avg Acc: 0.1789 +INFO:local_logger:Epoch[023/300], Step[0200/1602], Avg Loss: 4.3802, Avg Acc: 0.1727 +INFO:master_logger:Epoch[023/300], Step[0200/1602], Avg Loss: 4.4122, Avg Acc: 0.1734 +INFO:local_logger:Epoch[023/300], Step[0200/1602], Avg Loss: 4.4341, Avg Acc: 0.1656 +INFO:local_logger:Epoch[023/300], Step[0200/1602], Avg Loss: 4.4420, Avg Acc: 0.1762 +INFO:local_logger:Epoch[023/300], Step[0250/1602], Avg Loss: 4.4173, Avg Acc: 0.1764 +INFO:local_logger:Epoch[023/300], Step[0250/1602], Avg Loss: 4.4116, Avg Acc: 0.1687 +INFO:local_logger:Epoch[023/300], 
Step[0250/1602], Avg Loss: 4.4302, Avg Acc: 0.1660 +INFO:local_logger:Epoch[023/300], Step[0250/1602], Avg Loss: 4.4402, Avg Acc: 0.1749 +INFO:master_logger:Epoch[023/300], Step[0250/1602], Avg Loss: 4.4248, Avg Acc: 0.1715 +INFO:local_logger:Epoch[023/300], Step[0300/1602], Avg Loss: 4.4029, Avg Acc: 0.1754 +INFO:local_logger:Epoch[023/300], Step[0300/1602], Avg Loss: 4.4107, Avg Acc: 0.1784 +INFO:local_logger:Epoch[023/300], Step[0300/1602], Avg Loss: 4.4295, Avg Acc: 0.1650 +INFO:local_logger:Epoch[023/300], Step[0300/1602], Avg Loss: 4.4133, Avg Acc: 0.1747 +INFO:master_logger:Epoch[023/300], Step[0300/1602], Avg Loss: 4.4141, Avg Acc: 0.1734 +INFO:local_logger:Epoch[023/300], Step[0350/1602], Avg Loss: 4.4025, Avg Acc: 0.1753 +INFO:local_logger:Epoch[023/300], Step[0350/1602], Avg Loss: 4.3958, Avg Acc: 0.1751 +INFO:local_logger:Epoch[023/300], Step[0350/1602], Avg Loss: 4.4303, Avg Acc: 0.1613 +INFO:local_logger:Epoch[023/300], Step[0350/1602], Avg Loss: 4.4149, Avg Acc: 0.1753 +INFO:master_logger:Epoch[023/300], Step[0350/1602], Avg Loss: 4.4108, Avg Acc: 0.1718 +INFO:local_logger:Epoch[023/300], Step[0400/1602], Avg Loss: 4.4122, Avg Acc: 0.1733 +INFO:local_logger:Epoch[023/300], Step[0400/1602], Avg Loss: 4.4037, Avg Acc: 0.1704 +INFO:local_logger:Epoch[023/300], Step[0400/1602], Avg Loss: 4.4204, Avg Acc: 0.1619 +INFO:local_logger:Epoch[023/300], Step[0400/1602], Avg Loss: 4.4206, Avg Acc: 0.1759 +INFO:master_logger:Epoch[023/300], Step[0400/1602], Avg Loss: 4.4142, Avg Acc: 0.1704 +INFO:local_logger:Epoch[023/300], Step[0450/1602], Avg Loss: 4.4195, Avg Acc: 0.1761 +INFO:local_logger:Epoch[023/300], Step[0450/1602], Avg Loss: 4.4140, Avg Acc: 0.1745 +INFO:local_logger:Epoch[023/300], Step[0450/1602], Avg Loss: 4.4112, Avg Acc: 0.1695 +INFO:local_logger:Epoch[023/300], Step[0450/1602], Avg Loss: 4.4128, Avg Acc: 0.1629 +INFO:master_logger:Epoch[023/300], Step[0450/1602], Avg Loss: 4.4144, Avg Acc: 0.1708 +INFO:local_logger:Epoch[023/300], Step[0500/1602], Avg Loss: 4.4208, Avg Acc: 0.1739 +INFO:local_logger:Epoch[023/300], Step[0500/1602], Avg Loss: 4.4158, Avg Acc: 0.1763 +INFO:local_logger:Epoch[023/300], Step[0500/1602], Avg Loss: 4.3970, Avg Acc: 0.1640 +INFO:local_logger:Epoch[023/300], Step[0500/1602], Avg Loss: 4.4198, Avg Acc: 0.1686 +INFO:master_logger:Epoch[023/300], Step[0500/1602], Avg Loss: 4.4133, Avg Acc: 0.1707 +INFO:local_logger:Epoch[023/300], Step[0550/1602], Avg Loss: 4.4175, Avg Acc: 0.1737 +INFO:local_logger:Epoch[023/300], Step[0550/1602], Avg Loss: 4.3903, Avg Acc: 0.1624 +INFO:local_logger:Epoch[023/300], Step[0550/1602], Avg Loss: 4.4166, Avg Acc: 0.1683 +INFO:local_logger:Epoch[023/300], Step[0550/1602], Avg Loss: 4.4168, Avg Acc: 0.1738 +INFO:master_logger:Epoch[023/300], Step[0550/1602], Avg Loss: 4.4103, Avg Acc: 0.1696 +INFO:local_logger:Epoch[023/300], Step[0600/1602], Avg Loss: 4.4147, Avg Acc: 0.1736 +INFO:local_logger:Epoch[023/300], Step[0600/1602], Avg Loss: 4.4126, Avg Acc: 0.1681 +INFO:local_logger:Epoch[023/300], Step[0600/1602], Avg Loss: 4.3903, Avg Acc: 0.1615 +INFO:master_logger:Epoch[023/300], Step[0600/1602], Avg Loss: 4.4090, Avg Acc: 0.1693 +INFO:local_logger:Epoch[023/300], Step[0600/1602], Avg Loss: 4.4186, Avg Acc: 0.1740 +INFO:local_logger:Epoch[023/300], Step[0650/1602], Avg Loss: 4.4084, Avg Acc: 0.1739 +INFO:local_logger:Epoch[023/300], Step[0650/1602], Avg Loss: 4.3876, Avg Acc: 0.1647 +INFO:local_logger:Epoch[023/300], Step[0650/1602], Avg Loss: 4.4080, Avg Acc: 0.1733 +INFO:master_logger:Epoch[023/300], Step[0650/1602], 
Avg Loss: 4.4034, Avg Acc: 0.1699 +INFO:local_logger:Epoch[023/300], Step[0650/1602], Avg Loss: 4.4097, Avg Acc: 0.1676 +INFO:local_logger:Epoch[023/300], Step[0700/1602], Avg Loss: 4.4109, Avg Acc: 0.1729 +INFO:local_logger:Epoch[023/300], Step[0700/1602], Avg Loss: 4.4048, Avg Acc: 0.1728 +INFO:local_logger:Epoch[023/300], Step[0700/1602], Avg Loss: 4.3873, Avg Acc: 0.1642 +INFO:local_logger:Epoch[023/300], Step[0700/1602], Avg Loss: 4.4150, Avg Acc: 0.1694 +INFO:master_logger:Epoch[023/300], Step[0700/1602], Avg Loss: 4.4045, Avg Acc: 0.1698 +INFO:local_logger:Epoch[023/300], Step[0750/1602], Avg Loss: 4.3855, Avg Acc: 0.1640 +INFO:local_logger:Epoch[023/300], Step[0750/1602], Avg Loss: 4.4149, Avg Acc: 0.1721 +INFO:local_logger:Epoch[023/300], Step[0750/1602], Avg Loss: 4.3984, Avg Acc: 0.1728 +INFO:local_logger:Epoch[023/300], Step[0750/1602], Avg Loss: 4.4112, Avg Acc: 0.1689 +INFO:master_logger:Epoch[023/300], Step[0750/1602], Avg Loss: 4.4025, Avg Acc: 0.1695 +INFO:local_logger:Epoch[023/300], Step[0800/1602], Avg Loss: 4.4060, Avg Acc: 0.1725 +INFO:local_logger:Epoch[023/300], Step[0800/1602], Avg Loss: 4.4121, Avg Acc: 0.1726 +INFO:master_logger:Epoch[023/300], Step[0800/1602], Avg Loss: 4.4030, Avg Acc: 0.1693 +INFO:local_logger:Epoch[023/300], Step[0800/1602], Avg Loss: 4.3873, Avg Acc: 0.1642 +INFO:local_logger:Epoch[023/300], Step[0800/1602], Avg Loss: 4.4065, Avg Acc: 0.1681 +INFO:local_logger:Epoch[023/300], Step[0850/1602], Avg Loss: 4.4085, Avg Acc: 0.1716 +INFO:local_logger:Epoch[023/300], Step[0850/1602], Avg Loss: 4.4111, Avg Acc: 0.1685 +INFO:local_logger:Epoch[023/300], Step[0850/1602], Avg Loss: 4.3981, Avg Acc: 0.1636 +INFO:local_logger:Epoch[023/300], Step[0850/1602], Avg Loss: 4.4119, Avg Acc: 0.1716 +INFO:master_logger:Epoch[023/300], Step[0850/1602], Avg Loss: 4.4074, Avg Acc: 0.1688 +INFO:local_logger:Epoch[023/300], Step[0900/1602], Avg Loss: 4.4057, Avg Acc: 0.1721 +INFO:local_logger:Epoch[023/300], Step[0900/1602], Avg Loss: 4.3983, Avg Acc: 0.1645 +INFO:local_logger:Epoch[023/300], Step[0900/1602], Avg Loss: 4.4048, Avg Acc: 0.1672 +INFO:local_logger:Epoch[023/300], Step[0900/1602], Avg Loss: 4.4127, Avg Acc: 0.1712 +INFO:master_logger:Epoch[023/300], Step[0900/1602], Avg Loss: 4.4054, Avg Acc: 0.1688 +INFO:local_logger:Epoch[023/300], Step[0950/1602], Avg Loss: 4.4066, Avg Acc: 0.1724 +INFO:local_logger:Epoch[023/300], Step[0950/1602], Avg Loss: 4.3983, Avg Acc: 0.1659 +INFO:local_logger:Epoch[023/300], Step[0950/1602], Avg Loss: 4.4098, Avg Acc: 0.1720 +INFO:local_logger:Epoch[023/300], Step[0950/1602], Avg Loss: 4.4037, Avg Acc: 0.1656 +INFO:master_logger:Epoch[023/300], Step[0950/1602], Avg Loss: 4.4046, Avg Acc: 0.1690 +INFO:local_logger:Epoch[023/300], Step[1000/1602], Avg Loss: 4.4036, Avg Acc: 0.1661 +INFO:local_logger:Epoch[023/300], Step[1000/1602], Avg Loss: 4.4048, Avg Acc: 0.1717 +INFO:local_logger:Epoch[023/300], Step[1000/1602], Avg Loss: 4.3896, Avg Acc: 0.1670 +INFO:local_logger:Epoch[023/300], Step[1000/1602], Avg Loss: 4.4087, Avg Acc: 0.1709 +INFO:master_logger:Epoch[023/300], Step[1000/1602], Avg Loss: 4.4016, Avg Acc: 0.1689 +INFO:local_logger:Epoch[023/300], Step[1050/1602], Avg Loss: 4.4047, Avg Acc: 0.1698 +INFO:local_logger:Epoch[023/300], Step[1050/1602], Avg Loss: 4.4003, Avg Acc: 0.1666 +INFO:local_logger:Epoch[023/300], Step[1050/1602], Avg Loss: 4.3881, Avg Acc: 0.1677 +INFO:local_logger:Epoch[023/300], Step[1050/1602], Avg Loss: 4.4126, Avg Acc: 0.1699 +INFO:master_logger:Epoch[023/300], Step[1050/1602], Avg Loss: 4.4014, 
Avg Acc: 0.1685 +INFO:local_logger:Epoch[023/300], Step[1100/1602], Avg Loss: 4.4046, Avg Acc: 0.1691 +INFO:local_logger:Epoch[023/300], Step[1100/1602], Avg Loss: 4.4113, Avg Acc: 0.1697 +INFO:local_logger:Epoch[023/300], Step[1100/1602], Avg Loss: 4.3961, Avg Acc: 0.1668 +INFO:local_logger:Epoch[023/300], Step[1100/1602], Avg Loss: 4.3881, Avg Acc: 0.1681 +INFO:master_logger:Epoch[023/300], Step[1100/1602], Avg Loss: 4.4000, Avg Acc: 0.1684 +INFO:local_logger:Epoch[023/300], Step[1150/1602], Avg Loss: 4.4026, Avg Acc: 0.1693 +INFO:local_logger:Epoch[023/300], Step[1150/1602], Avg Loss: 4.3910, Avg Acc: 0.1683 +INFO:local_logger:Epoch[023/300], Step[1150/1602], Avg Loss: 4.3961, Avg Acc: 0.1677 +INFO:master_logger:Epoch[023/300], Step[1150/1602], Avg Loss: 4.4008, Avg Acc: 0.1687 +INFO:local_logger:Epoch[023/300], Step[1150/1602], Avg Loss: 4.4136, Avg Acc: 0.1697 +INFO:local_logger:Epoch[023/300], Step[1200/1602], Avg Loss: 4.3975, Avg Acc: 0.1673 +INFO:local_logger:Epoch[023/300], Step[1200/1602], Avg Loss: 4.3973, Avg Acc: 0.1690 +INFO:local_logger:Epoch[023/300], Step[1200/1602], Avg Loss: 4.3929, Avg Acc: 0.1681 +INFO:master_logger:Epoch[023/300], Step[1200/1602], Avg Loss: 4.4001, Avg Acc: 0.1683 +INFO:local_logger:Epoch[023/300], Step[1200/1602], Avg Loss: 4.4126, Avg Acc: 0.1686 +INFO:local_logger:Epoch[023/300], Step[1250/1602], Avg Loss: 4.3945, Avg Acc: 0.1692 +INFO:local_logger:Epoch[023/300], Step[1250/1602], Avg Loss: 4.4125, Avg Acc: 0.1688 +INFO:local_logger:Epoch[023/300], Step[1250/1602], Avg Loss: 4.3942, Avg Acc: 0.1682 +INFO:local_logger:Epoch[023/300], Step[1250/1602], Avg Loss: 4.3923, Avg Acc: 0.1687 +INFO:master_logger:Epoch[023/300], Step[1250/1602], Avg Loss: 4.3983, Avg Acc: 0.1687 +INFO:local_logger:Epoch[023/300], Step[1300/1602], Avg Loss: 4.3913, Avg Acc: 0.1694 +INFO:local_logger:Epoch[023/300], Step[1300/1602], Avg Loss: 4.3919, Avg Acc: 0.1686 +INFO:master_logger:Epoch[023/300], Step[1300/1602], Avg Loss: 4.3967, Avg Acc: 0.1690 +INFO:local_logger:Epoch[023/300], Step[1300/1602], Avg Loss: 4.3937, Avg Acc: 0.1692 +INFO:local_logger:Epoch[023/300], Step[1300/1602], Avg Loss: 4.4098, Avg Acc: 0.1689 +INFO:local_logger:Epoch[023/300], Step[1350/1602], Avg Loss: 4.3917, Avg Acc: 0.1701 +INFO:local_logger:Epoch[023/300], Step[1350/1602], Avg Loss: 4.3949, Avg Acc: 0.1688 +INFO:local_logger:Epoch[023/300], Step[1350/1602], Avg Loss: 4.4098, Avg Acc: 0.1691 +INFO:local_logger:Epoch[023/300], Step[1350/1602], Avg Loss: 4.3914, Avg Acc: 0.1681 +INFO:master_logger:Epoch[023/300], Step[1350/1602], Avg Loss: 4.3970, Avg Acc: 0.1690 +INFO:local_logger:Epoch[023/300], Step[1400/1602], Avg Loss: 4.3938, Avg Acc: 0.1700 +INFO:local_logger:Epoch[023/300], Step[1400/1602], Avg Loss: 4.4081, Avg Acc: 0.1691 +INFO:local_logger:Epoch[023/300], Step[1400/1602], Avg Loss: 4.3890, Avg Acc: 0.1680 +INFO:local_logger:Epoch[023/300], Step[1400/1602], Avg Loss: 4.3966, Avg Acc: 0.1682 +INFO:master_logger:Epoch[023/300], Step[1400/1602], Avg Loss: 4.3969, Avg Acc: 0.1688 +INFO:local_logger:Epoch[023/300], Step[1450/1602], Avg Loss: 4.3948, Avg Acc: 0.1706 +INFO:local_logger:Epoch[023/300], Step[1450/1602], Avg Loss: 4.3935, Avg Acc: 0.1683 +INFO:local_logger:Epoch[023/300], Step[1450/1602], Avg Loss: 4.4118, Avg Acc: 0.1681 +INFO:local_logger:Epoch[023/300], Step[1450/1602], Avg Loss: 4.3881, Avg Acc: 0.1690 +INFO:master_logger:Epoch[023/300], Step[1450/1602], Avg Loss: 4.3971, Avg Acc: 0.1690 +INFO:local_logger:Epoch[023/300], Step[1500/1602], Avg Loss: 4.3985, Avg Acc: 0.1705 
+INFO:local_logger:Epoch[023/300], Step[1500/1602], Avg Loss: 4.3838, Avg Acc: 0.1699 +INFO:local_logger:Epoch[023/300], Step[1500/1602], Avg Loss: 4.3980, Avg Acc: 0.1674 +INFO:master_logger:Epoch[023/300], Step[1500/1602], Avg Loss: 4.3976, Avg Acc: 0.1688 +INFO:local_logger:Epoch[023/300], Step[1500/1602], Avg Loss: 4.4102, Avg Acc: 0.1676 +INFO:local_logger:Epoch[023/300], Step[1550/1602], Avg Loss: 4.3999, Avg Acc: 0.1701 +INFO:local_logger:Epoch[023/300], Step[1550/1602], Avg Loss: 4.3858, Avg Acc: 0.1698 +INFO:local_logger:Epoch[023/300], Step[1550/1602], Avg Loss: 4.3986, Avg Acc: 0.1683 +INFO:local_logger:Epoch[023/300], Step[1550/1602], Avg Loss: 4.4098, Avg Acc: 0.1680 +INFO:master_logger:Epoch[023/300], Step[1550/1602], Avg Loss: 4.3985, Avg Acc: 0.1691 +INFO:local_logger:Epoch[023/300], Step[1600/1602], Avg Loss: 4.3822, Avg Acc: 0.1701 +INFO:local_logger:Epoch[023/300], Step[1600/1602], Avg Loss: 4.3969, Avg Acc: 0.1708 +INFO:local_logger:Epoch[023/300], Step[1600/1602], Avg Loss: 4.3995, Avg Acc: 0.1680 +INFO:master_logger:Epoch[023/300], Step[1600/1602], Avg Loss: 4.3966, Avg Acc: 0.1693 +INFO:local_logger:Epoch[023/300], Step[1600/1602], Avg Loss: 4.4081, Avg Acc: 0.1683 +INFO:local_logger:----- Epoch[023/300], Train Loss: 4.3820, Train Acc: 0.1701, time: 3709.51 +INFO:local_logger:Now training epoch 24. LR=0.000387 +INFO:local_logger:----- Epoch[023/300], Train Loss: 4.4079, Train Acc: 0.1682, time: 3709.54 +INFO:local_logger:----- Epoch[023/300], Train Loss: 4.3969, Train Acc: 0.1709, time: 3709.23 +INFO:local_logger:Now training epoch 24. LR=0.000387 +INFO:master_logger:----- Epoch[023/300], Train Loss: 4.3966, Train Acc: 0.1693, time: 3709.23 +INFO:local_logger:----- Epoch[023/300], Train Loss: 4.3997, Train Acc: 0.1679, time: 3709.62 +INFO:local_logger:Now training epoch 24. LR=0.000387 +INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-23-Loss-4.396864708679437.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-23-Loss-4.396864708679437.pdopt +INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-23-Loss-4.396864708679437-EMA.pdparams +INFO:local_logger:Now training epoch 24. LR=0.000387 +INFO:master_logger:Now training epoch 24. 
LR=0.000387 +INFO:local_logger:Epoch[024/300], Step[0000/1602], Avg Loss: 4.7974, Avg Acc: 0.1600 +INFO:local_logger:Epoch[024/300], Step[0000/1602], Avg Loss: 4.3962, Avg Acc: 0.2950 +INFO:master_logger:Epoch[024/300], Step[0000/1602], Avg Loss: 4.2890, Avg Acc: 0.2012 +INFO:local_logger:Epoch[024/300], Step[0000/1602], Avg Loss: 3.7205, Avg Acc: 0.3450 +INFO:local_logger:Epoch[024/300], Step[0000/1602], Avg Loss: 4.2418, Avg Acc: 0.0050 +INFO:local_logger:Epoch[024/300], Step[0050/1602], Avg Loss: 4.4177, Avg Acc: 0.1635 +INFO:local_logger:Epoch[024/300], Step[0050/1602], Avg Loss: 4.3808, Avg Acc: 0.2030 +INFO:local_logger:Epoch[024/300], Step[0050/1602], Avg Loss: 4.3258, Avg Acc: 0.1588 +INFO:local_logger:Epoch[024/300], Step[0050/1602], Avg Loss: 4.4023, Avg Acc: 0.1836 +INFO:master_logger:Epoch[024/300], Step[0050/1602], Avg Loss: 4.3817, Avg Acc: 0.1773 +INFO:local_logger:Epoch[024/300], Step[0100/1602], Avg Loss: 4.3879, Avg Acc: 0.1624 +INFO:local_logger:Epoch[024/300], Step[0100/1602], Avg Loss: 4.3705, Avg Acc: 0.1900 +INFO:local_logger:Epoch[024/300], Step[0100/1602], Avg Loss: 4.3719, Avg Acc: 0.1832 +INFO:local_logger:Epoch[024/300], Step[0100/1602], Avg Loss: 4.3693, Avg Acc: 0.1593 +INFO:master_logger:Epoch[024/300], Step[0100/1602], Avg Loss: 4.3749, Avg Acc: 0.1737 +INFO:local_logger:Epoch[024/300], Step[0150/1602], Avg Loss: 4.3951, Avg Acc: 0.1693 +INFO:local_logger:Epoch[024/300], Step[0150/1602], Avg Loss: 4.3782, Avg Acc: 0.1651 +INFO:local_logger:Epoch[024/300], Step[0150/1602], Avg Loss: 4.4123, Avg Acc: 0.1837 +INFO:local_logger:Epoch[024/300], Step[0150/1602], Avg Loss: 4.3334, Avg Acc: 0.1779 +INFO:master_logger:Epoch[024/300], Step[0150/1602], Avg Loss: 4.3798, Avg Acc: 0.1740 +INFO:local_logger:Epoch[024/300], Step[0200/1602], Avg Loss: 4.4316, Avg Acc: 0.1793 +INFO:local_logger:Epoch[024/300], Step[0200/1602], Avg Loss: 4.3900, Avg Acc: 0.1725 +INFO:local_logger:Epoch[024/300], Step[0200/1602], Avg Loss: 4.3752, Avg Acc: 0.1745 +INFO:local_logger:Epoch[024/300], Step[0200/1602], Avg Loss: 4.3853, Avg Acc: 0.1653 +INFO:master_logger:Epoch[024/300], Step[0200/1602], Avg Loss: 4.3955, Avg Acc: 0.1729 +INFO:local_logger:Epoch[024/300], Step[0250/1602], Avg Loss: 4.3927, Avg Acc: 0.1735 +INFO:local_logger:Epoch[024/300], Step[0250/1602], Avg Loss: 4.3939, Avg Acc: 0.1649 +INFO:local_logger:Epoch[024/300], Step[0250/1602], Avg Loss: 4.4067, Avg Acc: 0.1790 +INFO:local_logger:Epoch[024/300], Step[0250/1602], Avg Loss: 4.3682, Avg Acc: 0.1705 +INFO:master_logger:Epoch[024/300], Step[0250/1602], Avg Loss: 4.3904, Avg Acc: 0.1720 +INFO:local_logger:Epoch[024/300], Step[0300/1602], Avg Loss: 4.3944, Avg Acc: 0.1783 +INFO:local_logger:Epoch[024/300], Step[0300/1602], Avg Loss: 4.3610, Avg Acc: 0.1735 +INFO:local_logger:Epoch[024/300], Step[0300/1602], Avg Loss: 4.3827, Avg Acc: 0.1649 +INFO:local_logger:Epoch[024/300], Step[0300/1602], Avg Loss: 4.3772, Avg Acc: 0.1702 +INFO:master_logger:Epoch[024/300], Step[0300/1602], Avg Loss: 4.3788, Avg Acc: 0.1717 +INFO:local_logger:Epoch[024/300], Step[0350/1602], Avg Loss: 4.3774, Avg Acc: 0.1725 +INFO:local_logger:Epoch[024/300], Step[0350/1602], Avg Loss: 4.3897, Avg Acc: 0.1689 +INFO:local_logger:Epoch[024/300], Step[0350/1602], Avg Loss: 4.4065, Avg Acc: 0.1761 +INFO:local_logger:Epoch[024/300], Step[0350/1602], Avg Loss: 4.3614, Avg Acc: 0.1734 +INFO:master_logger:Epoch[024/300], Step[0350/1602], Avg Loss: 4.3837, Avg Acc: 0.1727 +INFO:local_logger:Epoch[024/300], Step[0400/1602], Avg Loss: 4.4202, Avg Acc: 0.1739 
+INFO:local_logger:Epoch[024/300], Step[0400/1602], Avg Loss: 4.3680, Avg Acc: 0.1767 +INFO:local_logger:Epoch[024/300], Step[0400/1602], Avg Loss: 4.3759, Avg Acc: 0.1727 +INFO:master_logger:Epoch[024/300], Step[0400/1602], Avg Loss: 4.3842, Avg Acc: 0.1735 +INFO:local_logger:Epoch[024/300], Step[0400/1602], Avg Loss: 4.3726, Avg Acc: 0.1708 +INFO:local_logger:Epoch[024/300], Step[0450/1602], Avg Loss: 4.3729, Avg Acc: 0.1715 +INFO:local_logger:Epoch[024/300], Step[0450/1602], Avg Loss: 4.4117, Avg Acc: 0.1752 +INFO:local_logger:Epoch[024/300], Step[0450/1602], Avg Loss: 4.3700, Avg Acc: 0.1770 +INFO:local_logger:Epoch[024/300], Step[0450/1602], Avg Loss: 4.3779, Avg Acc: 0.1718 +INFO:master_logger:Epoch[024/300], Step[0450/1602], Avg Loss: 4.3831, Avg Acc: 0.1739 +INFO:local_logger:Epoch[024/300], Step[0500/1602], Avg Loss: 4.3854, Avg Acc: 0.1743 +INFO:local_logger:Epoch[024/300], Step[0500/1602], Avg Loss: 4.3843, Avg Acc: 0.1702 +INFO:local_logger:Epoch[024/300], Step[0500/1602], Avg Loss: 4.4003, Avg Acc: 0.1760 +INFO:master_logger:Epoch[024/300], Step[0500/1602], Avg Loss: 4.3875, Avg Acc: 0.1723 +INFO:local_logger:Epoch[024/300], Step[0500/1602], Avg Loss: 4.3800, Avg Acc: 0.1686 +INFO:local_logger:Epoch[024/300], Step[0550/1602], Avg Loss: 4.3935, Avg Acc: 0.1685 +INFO:local_logger:Epoch[024/300], Step[0550/1602], Avg Loss: 4.3703, Avg Acc: 0.1683 +INFO:local_logger:Epoch[024/300], Step[0550/1602], Avg Loss: 4.3771, Avg Acc: 0.1703 +INFO:local_logger:Epoch[024/300], Step[0550/1602], Avg Loss: 4.3888, Avg Acc: 0.1757 +INFO:master_logger:Epoch[024/300], Step[0550/1602], Avg Loss: 4.3824, Avg Acc: 0.1707 +INFO:local_logger:Epoch[024/300], Step[0600/1602], Avg Loss: 4.3610, Avg Acc: 0.1699 +INFO:local_logger:Epoch[024/300], Step[0600/1602], Avg Loss: 4.3829, Avg Acc: 0.1675 +INFO:local_logger:Epoch[024/300], Step[0600/1602], Avg Loss: 4.3823, Avg Acc: 0.1686 +INFO:local_logger:Epoch[024/300], Step[0600/1602], Avg Loss: 4.3949, Avg Acc: 0.1755 +INFO:master_logger:Epoch[024/300], Step[0600/1602], Avg Loss: 4.3803, Avg Acc: 0.1704 +INFO:local_logger:Epoch[024/300], Step[0650/1602], Avg Loss: 4.3878, Avg Acc: 0.1694 +INFO:local_logger:Epoch[024/300], Step[0650/1602], Avg Loss: 4.3819, Avg Acc: 0.1690 +INFO:local_logger:Epoch[024/300], Step[0650/1602], Avg Loss: 4.3924, Avg Acc: 0.1764 +INFO:master_logger:Epoch[024/300], Step[0650/1602], Avg Loss: 4.3782, Avg Acc: 0.1714 +INFO:local_logger:Epoch[024/300], Step[0650/1602], Avg Loss: 4.3507, Avg Acc: 0.1708 +INFO:local_logger:Epoch[024/300], Step[0700/1602], Avg Loss: 4.3992, Avg Acc: 0.1776 +INFO:local_logger:Epoch[024/300], Step[0700/1602], Avg Loss: 4.3892, Avg Acc: 0.1668 +INFO:local_logger:Epoch[024/300], Step[0700/1602], Avg Loss: 4.3917, Avg Acc: 0.1672 +INFO:local_logger:Epoch[024/300], Step[0700/1602], Avg Loss: 4.3459, Avg Acc: 0.1710 +INFO:master_logger:Epoch[024/300], Step[0700/1602], Avg Loss: 4.3815, Avg Acc: 0.1707 +INFO:local_logger:Epoch[024/300], Step[0750/1602], Avg Loss: 4.3876, Avg Acc: 0.1655 +INFO:local_logger:Epoch[024/300], Step[0750/1602], Avg Loss: 4.3857, Avg Acc: 0.1673 +INFO:master_logger:Epoch[024/300], Step[0750/1602], Avg Loss: 4.3811, Avg Acc: 0.1700 +INFO:local_logger:Epoch[024/300], Step[0750/1602], Avg Loss: 4.3974, Avg Acc: 0.1757 +INFO:local_logger:Epoch[024/300], Step[0750/1602], Avg Loss: 4.3538, Avg Acc: 0.1714 +INFO:local_logger:Epoch[024/300], Step[0800/1602], Avg Loss: 4.3615, Avg Acc: 0.1712 +INFO:local_logger:Epoch[024/300], Step[0800/1602], Avg Loss: 4.3832, Avg Acc: 0.1658 
+INFO:local_logger:Epoch[024/300], Step[0800/1602], Avg Loss: 4.3838, Avg Acc: 0.1677 +INFO:local_logger:Epoch[024/300], Step[0800/1602], Avg Loss: 4.3955, Avg Acc: 0.1739 +INFO:master_logger:Epoch[024/300], Step[0800/1602], Avg Loss: 4.3810, Avg Acc: 0.1697 +INFO:local_logger:Epoch[024/300], Step[0850/1602], Avg Loss: 4.3828, Avg Acc: 0.1656 +INFO:local_logger:Epoch[024/300], Step[0850/1602], Avg Loss: 4.3806, Avg Acc: 0.1681 +INFO:local_logger:Epoch[024/300], Step[0850/1602], Avg Loss: 4.3955, Avg Acc: 0.1732 +INFO:local_logger:Epoch[024/300], Step[0850/1602], Avg Loss: 4.3609, Avg Acc: 0.1716 +INFO:master_logger:Epoch[024/300], Step[0850/1602], Avg Loss: 4.3800, Avg Acc: 0.1696 +INFO:local_logger:Epoch[024/300], Step[0900/1602], Avg Loss: 4.3589, Avg Acc: 0.1730 +INFO:local_logger:Epoch[024/300], Step[0900/1602], Avg Loss: 4.3750, Avg Acc: 0.1656 +INFO:local_logger:Epoch[024/300], Step[0900/1602], Avg Loss: 4.3919, Avg Acc: 0.1728 +INFO:local_logger:Epoch[024/300], Step[0900/1602], Avg Loss: 4.3726, Avg Acc: 0.1683 +INFO:master_logger:Epoch[024/300], Step[0900/1602], Avg Loss: 4.3746, Avg Acc: 0.1699 +INFO:local_logger:Epoch[024/300], Step[0950/1602], Avg Loss: 4.3711, Avg Acc: 0.1678 +INFO:local_logger:Epoch[024/300], Step[0950/1602], Avg Loss: 4.3910, Avg Acc: 0.1735 +INFO:master_logger:Epoch[024/300], Step[0950/1602], Avg Loss: 4.3720, Avg Acc: 0.1708 +INFO:local_logger:Epoch[024/300], Step[0950/1602], Avg Loss: 4.3661, Avg Acc: 0.1682 +INFO:local_logger:Epoch[024/300], Step[0950/1602], Avg Loss: 4.3598, Avg Acc: 0.1735 +INFO:local_logger:Epoch[024/300], Step[1000/1602], Avg Loss: 4.3757, Avg Acc: 0.1673 +INFO:local_logger:Epoch[024/300], Step[1000/1602], Avg Loss: 4.3686, Avg Acc: 0.1684 +INFO:local_logger:Epoch[024/300], Step[1000/1602], Avg Loss: 4.3567, Avg Acc: 0.1739 +INFO:local_logger:Epoch[024/300], Step[1000/1602], Avg Loss: 4.3888, Avg Acc: 0.1741 +INFO:master_logger:Epoch[024/300], Step[1000/1602], Avg Loss: 4.3724, Avg Acc: 0.1709 +INFO:local_logger:Epoch[024/300], Step[1050/1602], Avg Loss: 4.3737, Avg Acc: 0.1671 +INFO:local_logger:Epoch[024/300], Step[1050/1602], Avg Loss: 4.3691, Avg Acc: 0.1671 +INFO:local_logger:Epoch[024/300], Step[1050/1602], Avg Loss: 4.3582, Avg Acc: 0.1735 +INFO:local_logger:Epoch[024/300], Step[1050/1602], Avg Loss: 4.3843, Avg Acc: 0.1744 +INFO:master_logger:Epoch[024/300], Step[1050/1602], Avg Loss: 4.3713, Avg Acc: 0.1705 +INFO:local_logger:Epoch[024/300], Step[1100/1602], Avg Loss: 4.3654, Avg Acc: 0.1672 +INFO:local_logger:Epoch[024/300], Step[1100/1602], Avg Loss: 4.3801, Avg Acc: 0.1664 +INFO:local_logger:Epoch[024/300], Step[1100/1602], Avg Loss: 4.3607, Avg Acc: 0.1740 +INFO:master_logger:Epoch[024/300], Step[1100/1602], Avg Loss: 4.3727, Avg Acc: 0.1704 +INFO:local_logger:Epoch[024/300], Step[1100/1602], Avg Loss: 4.3846, Avg Acc: 0.1740 +INFO:local_logger:Epoch[024/300], Step[1150/1602], Avg Loss: 4.3805, Avg Acc: 0.1660 +INFO:local_logger:Epoch[024/300], Step[1150/1602], Avg Loss: 4.3746, Avg Acc: 0.1669 +INFO:local_logger:Epoch[024/300], Step[1150/1602], Avg Loss: 4.3589, Avg Acc: 0.1743 +INFO:local_logger:Epoch[024/300], Step[1150/1602], Avg Loss: 4.3806, Avg Acc: 0.1733 +INFO:master_logger:Epoch[024/300], Step[1150/1602], Avg Loss: 4.3737, Avg Acc: 0.1701 +INFO:local_logger:Epoch[024/300], Step[1200/1602], Avg Loss: 4.3596, Avg Acc: 0.1744 +INFO:local_logger:Epoch[024/300], Step[1200/1602], Avg Loss: 4.3801, Avg Acc: 0.1663 +INFO:local_logger:Epoch[024/300], Step[1200/1602], Avg Loss: 4.3717, Avg Acc: 0.1680 
+INFO:local_logger:Epoch[024/300], Step[1200/1602], Avg Loss: 4.3749, Avg Acc: 0.1739 +INFO:master_logger:Epoch[024/300], Step[1200/1602], Avg Loss: 4.3716, Avg Acc: 0.1706 +INFO:local_logger:Epoch[024/300], Step[1250/1602], Avg Loss: 4.3810, Avg Acc: 0.1664 +INFO:local_logger:Epoch[024/300], Step[1250/1602], Avg Loss: 4.3589, Avg Acc: 0.1745 +INFO:local_logger:Epoch[024/300], Step[1250/1602], Avg Loss: 4.3694, Avg Acc: 0.1685 +INFO:local_logger:Epoch[024/300], Step[1250/1602], Avg Loss: 4.3772, Avg Acc: 0.1725 +INFO:master_logger:Epoch[024/300], Step[1250/1602], Avg Loss: 4.3716, Avg Acc: 0.1705 +INFO:local_logger:Epoch[024/300], Step[1300/1602], Avg Loss: 4.3595, Avg Acc: 0.1739 +INFO:local_logger:Epoch[024/300], Step[1300/1602], Avg Loss: 4.3693, Avg Acc: 0.1685 +INFO:local_logger:Epoch[024/300], Step[1300/1602], Avg Loss: 4.3751, Avg Acc: 0.1669 +INFO:local_logger:Epoch[024/300], Step[1300/1602], Avg Loss: 4.3797, Avg Acc: 0.1715 +INFO:master_logger:Epoch[024/300], Step[1300/1602], Avg Loss: 4.3709, Avg Acc: 0.1702 +INFO:local_logger:Epoch[024/300], Step[1350/1602], Avg Loss: 4.3593, Avg Acc: 0.1749 +INFO:local_logger:Epoch[024/300], Step[1350/1602], Avg Loss: 4.3723, Avg Acc: 0.1679 +INFO:local_logger:Epoch[024/300], Step[1350/1602], Avg Loss: 4.3706, Avg Acc: 0.1684 +INFO:local_logger:Epoch[024/300], Step[1350/1602], Avg Loss: 4.3835, Avg Acc: 0.1709 +INFO:master_logger:Epoch[024/300], Step[1350/1602], Avg Loss: 4.3714, Avg Acc: 0.1705 +INFO:local_logger:Epoch[024/300], Step[1400/1602], Avg Loss: 4.3613, Avg Acc: 0.1747 +INFO:local_logger:Epoch[024/300], Step[1400/1602], Avg Loss: 4.3837, Avg Acc: 0.1709 +INFO:local_logger:Epoch[024/300], Step[1400/1602], Avg Loss: 4.3717, Avg Acc: 0.1684 +INFO:local_logger:Epoch[024/300], Step[1400/1602], Avg Loss: 4.3672, Avg Acc: 0.1686 +INFO:master_logger:Epoch[024/300], Step[1400/1602], Avg Loss: 4.3710, Avg Acc: 0.1707 +INFO:local_logger:Epoch[024/300], Step[1450/1602], Avg Loss: 4.3714, Avg Acc: 0.1682 +INFO:local_logger:Epoch[024/300], Step[1450/1602], Avg Loss: 4.3851, Avg Acc: 0.1707 +INFO:master_logger:Epoch[024/300], Step[1450/1602], Avg Loss: 4.3726, Avg Acc: 0.1708 +INFO:local_logger:Epoch[024/300], Step[1450/1602], Avg Loss: 4.3629, Avg Acc: 0.1751 +INFO:local_logger:Epoch[024/300], Step[1450/1602], Avg Loss: 4.3713, Avg Acc: 0.1692 +INFO:local_logger:Epoch[024/300], Step[1500/1602], Avg Loss: 4.3712, Avg Acc: 0.1688 +INFO:local_logger:Epoch[024/300], Step[1500/1602], Avg Loss: 4.3696, Avg Acc: 0.1696 +INFO:local_logger:Epoch[024/300], Step[1500/1602], Avg Loss: 4.3608, Avg Acc: 0.1753 +INFO:master_logger:Epoch[024/300], Step[1500/1602], Avg Loss: 4.3705, Avg Acc: 0.1715 +INFO:local_logger:Epoch[024/300], Step[1500/1602], Avg Loss: 4.3805, Avg Acc: 0.1721 +INFO:local_logger:Epoch[024/300], Step[1550/1602], Avg Loss: 4.3749, Avg Acc: 0.1687 +INFO:local_logger:Epoch[024/300], Step[1550/1602], Avg Loss: 4.3822, Avg Acc: 0.1721 +INFO:local_logger:Epoch[024/300], Step[1550/1602], Avg Loss: 4.3663, Avg Acc: 0.1699 +INFO:local_logger:Epoch[024/300], Step[1550/1602], Avg Loss: 4.3604, Avg Acc: 0.1752 +INFO:master_logger:Epoch[024/300], Step[1550/1602], Avg Loss: 4.3709, Avg Acc: 0.1715 +INFO:local_logger:Epoch[024/300], Step[1600/1602], Avg Loss: 4.3599, Avg Acc: 0.1746 +INFO:local_logger:Epoch[024/300], Step[1600/1602], Avg Loss: 4.3664, Avg Acc: 0.1702 +INFO:local_logger:Epoch[024/300], Step[1600/1602], Avg Loss: 4.3754, Avg Acc: 0.1730 +INFO:local_logger:Epoch[024/300], Step[1600/1602], Avg Loss: 4.3752, Avg Acc: 0.1689 
+INFO:master_logger:Epoch[024/300], Step[1600/1602], Avg Loss: 4.3692, Avg Acc: 0.1717 +INFO:local_logger:----- Epoch[024/300], Train Loss: 4.3750, Train Acc: 0.1731, time: 3686.98 +INFO:local_logger:Now training epoch 25. LR=0.000386 +INFO:local_logger:----- Epoch[024/300], Train Loss: 4.3665, Train Acc: 0.1702, time: 3686.99 +INFO:local_logger:Now training epoch 25. LR=0.000386 +INFO:local_logger:----- Epoch[024/300], Train Loss: 4.3750, Train Acc: 0.1690, time: 3686.80 +INFO:master_logger:----- Epoch[024/300], Train Loss: 4.3692, Train Acc: 0.1717, time: 3686.80 +INFO:local_logger:----- Epoch[024/300], Train Loss: 4.3601, Train Acc: 0.1746, time: 3687.08 +INFO:local_logger:Now training epoch 25. LR=0.000386 +INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-24-Loss-4.374995217249997.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-24-Loss-4.374995217249997.pdopt +INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-24-Loss-4.374995217249997-EMA.pdparams +INFO:local_logger:Now training epoch 25. LR=0.000386 +INFO:master_logger:Now training epoch 25. LR=0.000386 +INFO:local_logger:Epoch[025/300], Step[0000/1602], Avg Loss: 4.6618, Avg Acc: 0.2450 +INFO:local_logger:Epoch[025/300], Step[0000/1602], Avg Loss: 3.8914, Avg Acc: 0.0100 +INFO:local_logger:Epoch[025/300], Step[0000/1602], Avg Loss: 4.5971, Avg Acc: 0.0150 +INFO:local_logger:Epoch[025/300], Step[0000/1602], Avg Loss: 3.3092, Avg Acc: 0.0000 +INFO:master_logger:Epoch[025/300], Step[0000/1602], Avg Loss: 4.1149, Avg Acc: 0.0675 +INFO:local_logger:Epoch[025/300], Step[0050/1602], Avg Loss: 4.3306, Avg Acc: 0.1607 +INFO:local_logger:Epoch[025/300], Step[0050/1602], Avg Loss: 4.3190, Avg Acc: 0.1923 +INFO:local_logger:Epoch[025/300], Step[0050/1602], Avg Loss: 4.3610, Avg Acc: 0.1147 +INFO:local_logger:Epoch[025/300], Step[0050/1602], Avg Loss: 4.3599, Avg Acc: 0.1908 +INFO:master_logger:Epoch[025/300], Step[0050/1602], Avg Loss: 4.3426, Avg Acc: 0.1646 +INFO:local_logger:Epoch[025/300], Step[0100/1602], Avg Loss: 4.2951, Avg Acc: 0.1559 +INFO:local_logger:Epoch[025/300], Step[0100/1602], Avg Loss: 4.3654, Avg Acc: 0.1571 +INFO:local_logger:Epoch[025/300], Step[0100/1602], Avg Loss: 4.2753, Avg Acc: 0.1973 +INFO:local_logger:Epoch[025/300], Step[0100/1602], Avg Loss: 4.3723, Avg Acc: 0.1762 +INFO:master_logger:Epoch[025/300], Step[0100/1602], Avg Loss: 4.3270, Avg Acc: 0.1716 +INFO:local_logger:Epoch[025/300], Step[0150/1602], Avg Loss: 4.2767, Avg Acc: 0.1714 +INFO:local_logger:Epoch[025/300], Step[0150/1602], Avg Loss: 4.3629, Avg Acc: 0.1621 +INFO:local_logger:Epoch[025/300], Step[0150/1602], Avg Loss: 4.3486, Avg Acc: 0.1712 +INFO:master_logger:Epoch[025/300], Step[0150/1602], Avg Loss: 4.3167, Avg Acc: 0.1758 +INFO:local_logger:Epoch[025/300], Step[0150/1602], Avg Loss: 4.2784, Avg Acc: 0.1987 +INFO:local_logger:Epoch[025/300], Step[0200/1602], Avg Loss: 4.2798, Avg Acc: 0.1750 +INFO:local_logger:Epoch[025/300], Step[0200/1602], Avg Loss: 4.3794, Avg Acc: 0.1663 +INFO:local_logger:Epoch[025/300], Step[0200/1602], Avg Loss: 4.3459, Avg Acc: 0.1731 +INFO:local_logger:Epoch[025/300], Step[0200/1602], Avg Loss: 4.2883, Avg Acc: 0.1903 +INFO:master_logger:Epoch[025/300], Step[0200/1602], Avg Loss: 4.3233, Avg Acc: 0.1762 +INFO:local_logger:Epoch[025/300], Step[0250/1602], Avg Loss: 4.3025, Avg Acc: 0.1742 +INFO:local_logger:Epoch[025/300], Step[0250/1602], Avg Loss: 4.3412, Avg Acc: 0.1736 +INFO:local_logger:Epoch[025/300], 
Step[0250/1602], Avg Loss: 4.3757, Avg Acc: 0.1663 +INFO:local_logger:Epoch[025/300], Step[0250/1602], Avg Loss: 4.2996, Avg Acc: 0.1867 +INFO:master_logger:Epoch[025/300], Step[0250/1602], Avg Loss: 4.3298, Avg Acc: 0.1752 +INFO:local_logger:Epoch[025/300], Step[0300/1602], Avg Loss: 4.3344, Avg Acc: 0.1692 +INFO:local_logger:Epoch[025/300], Step[0300/1602], Avg Loss: 4.3754, Avg Acc: 0.1641 +INFO:local_logger:Epoch[025/300], Step[0300/1602], Avg Loss: 4.3580, Avg Acc: 0.1724 +INFO:local_logger:Epoch[025/300], Step[0300/1602], Avg Loss: 4.2877, Avg Acc: 0.1863 +INFO:master_logger:Epoch[025/300], Step[0300/1602], Avg Loss: 4.3389, Avg Acc: 0.1730 +INFO:local_logger:Epoch[025/300], Step[0350/1602], Avg Loss: 4.3358, Avg Acc: 0.1686 +INFO:local_logger:Epoch[025/300], Step[0350/1602], Avg Loss: 4.3601, Avg Acc: 0.1722 +INFO:local_logger:Epoch[025/300], Step[0350/1602], Avg Loss: 4.3648, Avg Acc: 0.1650 +INFO:local_logger:Epoch[025/300], Step[0350/1602], Avg Loss: 4.2911, Avg Acc: 0.1820 +INFO:master_logger:Epoch[025/300], Step[0350/1602], Avg Loss: 4.3380, Avg Acc: 0.1719 +INFO:local_logger:Epoch[025/300], Step[0400/1602], Avg Loss: 4.3007, Avg Acc: 0.1818 +INFO:local_logger:Epoch[025/300], Step[0400/1602], Avg Loss: 4.3341, Avg Acc: 0.1671 +INFO:local_logger:Epoch[025/300], Step[0400/1602], Avg Loss: 4.3474, Avg Acc: 0.1725 +INFO:local_logger:Epoch[025/300], Step[0400/1602], Avg Loss: 4.3593, Avg Acc: 0.1633 +INFO:master_logger:Epoch[025/300], Step[0400/1602], Avg Loss: 4.3354, Avg Acc: 0.1712 +INFO:local_logger:Epoch[025/300], Step[0450/1602], Avg Loss: 4.3284, Avg Acc: 0.1660 +INFO:master_logger:Epoch[025/300], Step[0450/1602], Avg Loss: 4.3382, Avg Acc: 0.1707 +INFO:local_logger:Epoch[025/300], Step[0450/1602], Avg Loss: 4.3059, Avg Acc: 0.1824 +INFO:local_logger:Epoch[025/300], Step[0450/1602], Avg Loss: 4.3538, Avg Acc: 0.1720 +INFO:local_logger:Epoch[025/300], Step[0450/1602], Avg Loss: 4.3649, Avg Acc: 0.1622 +INFO:local_logger:Epoch[025/300], Step[0500/1602], Avg Loss: 4.3195, Avg Acc: 0.1715 +INFO:master_logger:Epoch[025/300], Step[0500/1602], Avg Loss: 4.3319, Avg Acc: 0.1732 +INFO:local_logger:Epoch[025/300], Step[0500/1602], Avg Loss: 4.2940, Avg Acc: 0.1833 +INFO:local_logger:Epoch[025/300], Step[0500/1602], Avg Loss: 4.3528, Avg Acc: 0.1755 +INFO:local_logger:Epoch[025/300], Step[0500/1602], Avg Loss: 4.3611, Avg Acc: 0.1625 +INFO:local_logger:Epoch[025/300], Step[0550/1602], Avg Loss: 4.3108, Avg Acc: 0.1679 +INFO:local_logger:Epoch[025/300], Step[0550/1602], Avg Loss: 4.3066, Avg Acc: 0.1840 +INFO:local_logger:Epoch[025/300], Step[0550/1602], Avg Loss: 4.3608, Avg Acc: 0.1637 +INFO:local_logger:Epoch[025/300], Step[0550/1602], Avg Loss: 4.3544, Avg Acc: 0.1759 +INFO:master_logger:Epoch[025/300], Step[0550/1602], Avg Loss: 4.3331, Avg Acc: 0.1729 +INFO:local_logger:Epoch[025/300], Step[0600/1602], Avg Loss: 4.3197, Avg Acc: 0.1666 +INFO:local_logger:Epoch[025/300], Step[0600/1602], Avg Loss: 4.3117, Avg Acc: 0.1849 +INFO:local_logger:Epoch[025/300], Step[0600/1602], Avg Loss: 4.3567, Avg Acc: 0.1739 +INFO:local_logger:Epoch[025/300], Step[0600/1602], Avg Loss: 4.3524, Avg Acc: 0.1668 +INFO:master_logger:Epoch[025/300], Step[0600/1602], Avg Loss: 4.3351, Avg Acc: 0.1731 +INFO:local_logger:Epoch[025/300], Step[0650/1602], Avg Loss: 4.3120, Avg Acc: 0.1665 +INFO:local_logger:Epoch[025/300], Step[0650/1602], Avg Loss: 4.3465, Avg Acc: 0.1655 +INFO:local_logger:Epoch[025/300], Step[0650/1602], Avg Loss: 4.3597, Avg Acc: 0.1754 +INFO:local_logger:Epoch[025/300], Step[0650/1602], Avg 
Loss: 4.3103, Avg Acc: 0.1839 +INFO:master_logger:Epoch[025/300], Step[0650/1602], Avg Loss: 4.3321, Avg Acc: 0.1728 +INFO:local_logger:Epoch[025/300], Step[0700/1602], Avg Loss: 4.3429, Avg Acc: 0.1676 +INFO:local_logger:Epoch[025/300], Step[0700/1602], Avg Loss: 4.3099, Avg Acc: 0.1840 +INFO:local_logger:Epoch[025/300], Step[0700/1602], Avg Loss: 4.3160, Avg Acc: 0.1689 +INFO:local_logger:Epoch[025/300], Step[0700/1602], Avg Loss: 4.3716, Avg Acc: 0.1754 +INFO:master_logger:Epoch[025/300], Step[0700/1602], Avg Loss: 4.3351, Avg Acc: 0.1740 +INFO:local_logger:Epoch[025/300], Step[0750/1602], Avg Loss: 4.3039, Avg Acc: 0.1706 +INFO:local_logger:Epoch[025/300], Step[0750/1602], Avg Loss: 4.3717, Avg Acc: 0.1767 +INFO:local_logger:Epoch[025/300], Step[0750/1602], Avg Loss: 4.3562, Avg Acc: 0.1680 +INFO:local_logger:Epoch[025/300], Step[0750/1602], Avg Loss: 4.3074, Avg Acc: 0.1855 +INFO:master_logger:Epoch[025/300], Step[0750/1602], Avg Loss: 4.3348, Avg Acc: 0.1752 +INFO:local_logger:Epoch[025/300], Step[0800/1602], Avg Loss: 4.3061, Avg Acc: 0.1701 +INFO:local_logger:Epoch[025/300], Step[0800/1602], Avg Loss: 4.3568, Avg Acc: 0.1693 +INFO:local_logger:Epoch[025/300], Step[0800/1602], Avg Loss: 4.3689, Avg Acc: 0.1770 +INFO:master_logger:Epoch[025/300], Step[0800/1602], Avg Loss: 4.3355, Avg Acc: 0.1754 +INFO:local_logger:Epoch[025/300], Step[0800/1602], Avg Loss: 4.3102, Avg Acc: 0.1854 +INFO:local_logger:Epoch[025/300], Step[0850/1602], Avg Loss: 4.3566, Avg Acc: 0.1702 +INFO:local_logger:Epoch[025/300], Step[0850/1602], Avg Loss: 4.3073, Avg Acc: 0.1708 +INFO:master_logger:Epoch[025/300], Step[0850/1602], Avg Loss: 4.3365, Avg Acc: 0.1755 +INFO:local_logger:Epoch[025/300], Step[0850/1602], Avg Loss: 4.3162, Avg Acc: 0.1846 +INFO:local_logger:Epoch[025/300], Step[0850/1602], Avg Loss: 4.3660, Avg Acc: 0.1766 +INFO:local_logger:Epoch[025/300], Step[0900/1602], Avg Loss: 4.3706, Avg Acc: 0.1763 +INFO:local_logger:Epoch[025/300], Step[0900/1602], Avg Loss: 4.3485, Avg Acc: 0.1713 +INFO:local_logger:Epoch[025/300], Step[0900/1602], Avg Loss: 4.3044, Avg Acc: 0.1718 +INFO:local_logger:Epoch[025/300], Step[0900/1602], Avg Loss: 4.3278, Avg Acc: 0.1825 +INFO:master_logger:Epoch[025/300], Step[0900/1602], Avg Loss: 4.3378, Avg Acc: 0.1755 +INFO:local_logger:Epoch[025/300], Step[0950/1602], Avg Loss: 4.3143, Avg Acc: 0.1825 +INFO:local_logger:Epoch[025/300], Step[0950/1602], Avg Loss: 4.3026, Avg Acc: 0.1721 +INFO:local_logger:Epoch[025/300], Step[0950/1602], Avg Loss: 4.3723, Avg Acc: 0.1745 +INFO:local_logger:Epoch[025/300], Step[0950/1602], Avg Loss: 4.3495, Avg Acc: 0.1708 +INFO:master_logger:Epoch[025/300], Step[0950/1602], Avg Loss: 4.3347, Avg Acc: 0.1750 +INFO:local_logger:Epoch[025/300], Step[1000/1602], Avg Loss: 4.2985, Avg Acc: 0.1717 +INFO:local_logger:Epoch[025/300], Step[1000/1602], Avg Loss: 4.3713, Avg Acc: 0.1740 +INFO:local_logger:Epoch[025/300], Step[1000/1602], Avg Loss: 4.3516, Avg Acc: 0.1691 +INFO:local_logger:Epoch[025/300], Step[1000/1602], Avg Loss: 4.3190, Avg Acc: 0.1828 +INFO:master_logger:Epoch[025/300], Step[1000/1602], Avg Loss: 4.3351, Avg Acc: 0.1744 +INFO:local_logger:Epoch[025/300], Step[1050/1602], Avg Loss: 4.3006, Avg Acc: 0.1726 +INFO:local_logger:Epoch[025/300], Step[1050/1602], Avg Loss: 4.3736, Avg Acc: 0.1734 +INFO:local_logger:Epoch[025/300], Step[1050/1602], Avg Loss: 4.3209, Avg Acc: 0.1806 +INFO:local_logger:Epoch[025/300], Step[1050/1602], Avg Loss: 4.3416, Avg Acc: 0.1711 +INFO:master_logger:Epoch[025/300], Step[1050/1602], Avg Loss: 4.3342, Avg 
Acc: 0.1744 +INFO:local_logger:Epoch[025/300], Step[1100/1602], Avg Loss: 4.2983, Avg Acc: 0.1724 +INFO:local_logger:Epoch[025/300], Step[1100/1602], Avg Loss: 4.3725, Avg Acc: 0.1733 +INFO:local_logger:Epoch[025/300], Step[1100/1602], Avg Loss: 4.3466, Avg Acc: 0.1705 +INFO:local_logger:Epoch[025/300], Step[1100/1602], Avg Loss: 4.3239, Avg Acc: 0.1799 +INFO:master_logger:Epoch[025/300], Step[1100/1602], Avg Loss: 4.3353, Avg Acc: 0.1740 +INFO:local_logger:Epoch[025/300], Step[1150/1602], Avg Loss: 4.3251, Avg Acc: 0.1797 +INFO:local_logger:Epoch[025/300], Step[1150/1602], Avg Loss: 4.3002, Avg Acc: 0.1734 +INFO:local_logger:Epoch[025/300], Step[1150/1602], Avg Loss: 4.3433, Avg Acc: 0.1702 +INFO:master_logger:Epoch[025/300], Step[1150/1602], Avg Loss: 4.3350, Avg Acc: 0.1739 +INFO:local_logger:Epoch[025/300], Step[1150/1602], Avg Loss: 4.3716, Avg Acc: 0.1725 +INFO:local_logger:Epoch[025/300], Step[1200/1602], Avg Loss: 4.3059, Avg Acc: 0.1720 +INFO:local_logger:Epoch[025/300], Step[1200/1602], Avg Loss: 4.3292, Avg Acc: 0.1796 +INFO:local_logger:Epoch[025/300], Step[1200/1602], Avg Loss: 4.3394, Avg Acc: 0.1713 +INFO:master_logger:Epoch[025/300], Step[1200/1602], Avg Loss: 4.3363, Avg Acc: 0.1738 +INFO:local_logger:Epoch[025/300], Step[1200/1602], Avg Loss: 4.3706, Avg Acc: 0.1724 +INFO:local_logger:Epoch[025/300], Step[1250/1602], Avg Loss: 4.3304, Avg Acc: 0.1803 +INFO:local_logger:Epoch[025/300], Step[1250/1602], Avg Loss: 4.3025, Avg Acc: 0.1721 +INFO:local_logger:Epoch[025/300], Step[1250/1602], Avg Loss: 4.3677, Avg Acc: 0.1718 +INFO:master_logger:Epoch[025/300], Step[1250/1602], Avg Loss: 4.3352, Avg Acc: 0.1738 +INFO:local_logger:Epoch[025/300], Step[1250/1602], Avg Loss: 4.3404, Avg Acc: 0.1709 +INFO:local_logger:Epoch[025/300], Step[1300/1602], Avg Loss: 4.3004, Avg Acc: 0.1716 +INFO:local_logger:Epoch[025/300], Step[1300/1602], Avg Loss: 4.3664, Avg Acc: 0.1732 +INFO:local_logger:Epoch[025/300], Step[1300/1602], Avg Loss: 4.3301, Avg Acc: 0.1805 +INFO:local_logger:Epoch[025/300], Step[1300/1602], Avg Loss: 4.3416, Avg Acc: 0.1703 +INFO:master_logger:Epoch[025/300], Step[1300/1602], Avg Loss: 4.3346, Avg Acc: 0.1739 +INFO:local_logger:Epoch[025/300], Step[1350/1602], Avg Loss: 4.3259, Avg Acc: 0.1814 +INFO:local_logger:Epoch[025/300], Step[1350/1602], Avg Loss: 4.3087, Avg Acc: 0.1716 +INFO:local_logger:Epoch[025/300], Step[1350/1602], Avg Loss: 4.3628, Avg Acc: 0.1729 +INFO:master_logger:Epoch[025/300], Step[1350/1602], Avg Loss: 4.3359, Avg Acc: 0.1738 +INFO:local_logger:Epoch[025/300], Step[1350/1602], Avg Loss: 4.3460, Avg Acc: 0.1694 +INFO:local_logger:Epoch[025/300], Step[1400/1602], Avg Loss: 4.3164, Avg Acc: 0.1713 +INFO:local_logger:Epoch[025/300], Step[1400/1602], Avg Loss: 4.3206, Avg Acc: 0.1811 +INFO:master_logger:Epoch[025/300], Step[1400/1602], Avg Loss: 4.3357, Avg Acc: 0.1736 +INFO:local_logger:Epoch[025/300], Step[1400/1602], Avg Loss: 4.3453, Avg Acc: 0.1691 +INFO:local_logger:Epoch[025/300], Step[1400/1602], Avg Loss: 4.3606, Avg Acc: 0.1730 +INFO:local_logger:Epoch[025/300], Step[1450/1602], Avg Loss: 4.3135, Avg Acc: 0.1723 +INFO:local_logger:Epoch[025/300], Step[1450/1602], Avg Loss: 4.3440, Avg Acc: 0.1696 +INFO:local_logger:Epoch[025/300], Step[1450/1602], Avg Loss: 4.3618, Avg Acc: 0.1718 +INFO:local_logger:Epoch[025/300], Step[1450/1602], Avg Loss: 4.3188, Avg Acc: 0.1813 +INFO:master_logger:Epoch[025/300], Step[1450/1602], Avg Loss: 4.3345, Avg Acc: 0.1737 +INFO:local_logger:Epoch[025/300], Step[1500/1602], Avg Loss: 4.3145, Avg Acc: 0.1719 
+INFO:local_logger:Epoch[025/300], Step[1500/1602], Avg Loss: 4.3620, Avg Acc: 0.1717 +INFO:master_logger:Epoch[025/300], Step[1500/1602], Avg Loss: 4.3360, Avg Acc: 0.1734 +INFO:local_logger:Epoch[025/300], Step[1500/1602], Avg Loss: 4.3440, Avg Acc: 0.1695 +INFO:local_logger:Epoch[025/300], Step[1500/1602], Avg Loss: 4.3235, Avg Acc: 0.1806 +INFO:local_logger:Epoch[025/300], Step[1550/1602], Avg Loss: 4.3276, Avg Acc: 0.1800 +INFO:local_logger:Epoch[025/300], Step[1550/1602], Avg Loss: 4.3131, Avg Acc: 0.1731 +INFO:local_logger:Epoch[025/300], Step[1550/1602], Avg Loss: 4.3404, Avg Acc: 0.1699 +INFO:local_logger:Epoch[025/300], Step[1550/1602], Avg Loss: 4.3612, Avg Acc: 0.1713 +INFO:master_logger:Epoch[025/300], Step[1550/1602], Avg Loss: 4.3356, Avg Acc: 0.1736 +INFO:local_logger:Epoch[025/300], Step[1600/1602], Avg Loss: 4.3165, Avg Acc: 0.1734 +INFO:master_logger:Epoch[025/300], Step[1600/1602], Avg Loss: 4.3362, Avg Acc: 0.1737 +INFO:local_logger:Epoch[025/300], Step[1600/1602], Avg Loss: 4.3597, Avg Acc: 0.1714 +INFO:local_logger:Epoch[025/300], Step[1600/1602], Avg Loss: 4.3379, Avg Acc: 0.1701 +INFO:local_logger:Epoch[025/300], Step[1600/1602], Avg Loss: 4.3309, Avg Acc: 0.1799 +INFO:local_logger:----- Epoch[025/300], Train Loss: 4.3596, Train Acc: 0.1714, time: 3673.48 +INFO:local_logger:Now training epoch 26. LR=0.000386 +INFO:local_logger:----- Epoch[025/300], Train Loss: 4.3165, Train Acc: 0.1734, time: 3673.15 +INFO:master_logger:----- Epoch[025/300], Train Loss: 4.3363, Train Acc: 0.1737, time: 3673.15 +INFO:local_logger:----- Epoch[025/300], Train Loss: 4.3381, Train Acc: 0.1701, time: 3673.46 +INFO:local_logger:----- Epoch[025/300], Train Loss: 4.3310, Train Acc: 0.1798, time: 3673.38 +INFO:local_logger:Now training epoch 26. LR=0.000386 +INFO:local_logger:Now training epoch 26. LR=0.000386 +INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-25-Loss-4.316548676268244.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-25-Loss-4.316548676268244.pdopt +INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-25-Loss-4.316548676268244-EMA.pdparams +INFO:local_logger:Now training epoch 26. LR=0.000386 +INFO:master_logger:Now training epoch 26. 
LR=0.000386 +INFO:local_logger:Epoch[026/300], Step[0000/1602], Avg Loss: 4.1128, Avg Acc: 0.2950 +INFO:local_logger:Epoch[026/300], Step[0000/1602], Avg Loss: 3.8363, Avg Acc: 0.2950 +INFO:local_logger:Epoch[026/300], Step[0000/1602], Avg Loss: 3.9183, Avg Acc: 0.3000 +INFO:master_logger:Epoch[026/300], Step[0000/1602], Avg Loss: 4.1408, Avg Acc: 0.2675 +INFO:local_logger:Epoch[026/300], Step[0000/1602], Avg Loss: 4.6957, Avg Acc: 0.1800 +INFO:local_logger:Epoch[026/300], Step[0050/1602], Avg Loss: 4.4733, Avg Acc: 0.1560 +INFO:local_logger:Epoch[026/300], Step[0050/1602], Avg Loss: 4.2911, Avg Acc: 0.1845 +INFO:local_logger:Epoch[026/300], Step[0050/1602], Avg Loss: 4.3826, Avg Acc: 0.1857 +INFO:local_logger:Epoch[026/300], Step[0050/1602], Avg Loss: 4.3325, Avg Acc: 0.1881 +INFO:master_logger:Epoch[026/300], Step[0050/1602], Avg Loss: 4.3699, Avg Acc: 0.1786 +INFO:local_logger:Epoch[026/300], Step[0100/1602], Avg Loss: 4.3265, Avg Acc: 0.1820 +INFO:local_logger:Epoch[026/300], Step[0100/1602], Avg Loss: 4.3915, Avg Acc: 0.1757 +INFO:local_logger:Epoch[026/300], Step[0100/1602], Avg Loss: 4.3398, Avg Acc: 0.1827 +INFO:local_logger:Epoch[026/300], Step[0100/1602], Avg Loss: 4.3594, Avg Acc: 0.1846 +INFO:master_logger:Epoch[026/300], Step[0100/1602], Avg Loss: 4.3543, Avg Acc: 0.1813 +INFO:local_logger:Epoch[026/300], Step[0150/1602], Avg Loss: 4.3696, Avg Acc: 0.1794 +INFO:local_logger:Epoch[026/300], Step[0150/1602], Avg Loss: 4.3183, Avg Acc: 0.1839 +INFO:local_logger:Epoch[026/300], Step[0150/1602], Avg Loss: 4.3220, Avg Acc: 0.1863 +INFO:master_logger:Epoch[026/300], Step[0150/1602], Avg Loss: 4.3375, Avg Acc: 0.1839 +INFO:local_logger:Epoch[026/300], Step[0150/1602], Avg Loss: 4.3402, Avg Acc: 0.1860 +INFO:local_logger:Epoch[026/300], Step[0200/1602], Avg Loss: 4.3151, Avg Acc: 0.1828 +INFO:local_logger:Epoch[026/300], Step[0200/1602], Avg Loss: 4.3586, Avg Acc: 0.1842 +INFO:local_logger:Epoch[026/300], Step[0200/1602], Avg Loss: 4.3204, Avg Acc: 0.1743 +INFO:master_logger:Epoch[026/300], Step[0200/1602], Avg Loss: 4.3419, Avg Acc: 0.1798 +INFO:local_logger:Epoch[026/300], Step[0200/1602], Avg Loss: 4.3736, Avg Acc: 0.1779 +INFO:local_logger:Epoch[026/300], Step[0250/1602], Avg Loss: 4.2903, Avg Acc: 0.1852 +INFO:local_logger:Epoch[026/300], Step[0250/1602], Avg Loss: 4.3132, Avg Acc: 0.1818 +INFO:local_logger:Epoch[026/300], Step[0250/1602], Avg Loss: 4.3373, Avg Acc: 0.1774 +INFO:master_logger:Epoch[026/300], Step[0250/1602], Avg Loss: 4.3255, Avg Acc: 0.1795 +INFO:local_logger:Epoch[026/300], Step[0250/1602], Avg Loss: 4.3614, Avg Acc: 0.1734 +INFO:local_logger:Epoch[026/300], Step[0300/1602], Avg Loss: 4.2844, Avg Acc: 0.1906 +INFO:local_logger:Epoch[026/300], Step[0300/1602], Avg Loss: 4.3333, Avg Acc: 0.1807 +INFO:local_logger:Epoch[026/300], Step[0300/1602], Avg Loss: 4.3059, Avg Acc: 0.1785 +INFO:local_logger:Epoch[026/300], Step[0300/1602], Avg Loss: 4.3567, Avg Acc: 0.1736 +INFO:master_logger:Epoch[026/300], Step[0300/1602], Avg Loss: 4.3201, Avg Acc: 0.1808 +INFO:local_logger:Epoch[026/300], Step[0350/1602], Avg Loss: 4.2843, Avg Acc: 0.1887 +INFO:local_logger:Epoch[026/300], Step[0350/1602], Avg Loss: 4.3003, Avg Acc: 0.1839 +INFO:local_logger:Epoch[026/300], Step[0350/1602], Avg Loss: 4.3312, Avg Acc: 0.1775 +INFO:local_logger:Epoch[026/300], Step[0350/1602], Avg Loss: 4.3628, Avg Acc: 0.1723 +INFO:master_logger:Epoch[026/300], Step[0350/1602], Avg Loss: 4.3197, Avg Acc: 0.1806 +INFO:local_logger:Epoch[026/300], Step[0400/1602], Avg Loss: 4.2768, Avg Acc: 0.1861 
+INFO:local_logger:Epoch[026/300], Step[0400/1602], Avg Loss: 4.3423, Avg Acc: 0.1737 +INFO:local_logger:Epoch[026/300], Step[0400/1602], Avg Loss: 4.3256, Avg Acc: 0.1776 +INFO:local_logger:Epoch[026/300], Step[0400/1602], Avg Loss: 4.3019, Avg Acc: 0.1822 +INFO:master_logger:Epoch[026/300], Step[0400/1602], Avg Loss: 4.3117, Avg Acc: 0.1799 +INFO:local_logger:Epoch[026/300], Step[0450/1602], Avg Loss: 4.2862, Avg Acc: 0.1838 +INFO:local_logger:Epoch[026/300], Step[0450/1602], Avg Loss: 4.2945, Avg Acc: 0.1829 +INFO:local_logger:Epoch[026/300], Step[0450/1602], Avg Loss: 4.3254, Avg Acc: 0.1756 +INFO:local_logger:Epoch[026/300], Step[0450/1602], Avg Loss: 4.3256, Avg Acc: 0.1782 +INFO:master_logger:Epoch[026/300], Step[0450/1602], Avg Loss: 4.3079, Avg Acc: 0.1801 +INFO:local_logger:Epoch[026/300], Step[0500/1602], Avg Loss: 4.2876, Avg Acc: 0.1824 +INFO:local_logger:Epoch[026/300], Step[0500/1602], Avg Loss: 4.3101, Avg Acc: 0.1791 +INFO:local_logger:Epoch[026/300], Step[0500/1602], Avg Loss: 4.2872, Avg Acc: 0.1848 +INFO:local_logger:Epoch[026/300], Step[0500/1602], Avg Loss: 4.3212, Avg Acc: 0.1743 +INFO:master_logger:Epoch[026/300], Step[0500/1602], Avg Loss: 4.3015, Avg Acc: 0.1801 +INFO:local_logger:Epoch[026/300], Step[0550/1602], Avg Loss: 4.3244, Avg Acc: 0.1750 +INFO:local_logger:Epoch[026/300], Step[0550/1602], Avg Loss: 4.2919, Avg Acc: 0.1832 +INFO:local_logger:Epoch[026/300], Step[0550/1602], Avg Loss: 4.3013, Avg Acc: 0.1802 +INFO:local_logger:Epoch[026/300], Step[0550/1602], Avg Loss: 4.3004, Avg Acc: 0.1815 +INFO:master_logger:Epoch[026/300], Step[0550/1602], Avg Loss: 4.3045, Avg Acc: 0.1800 +INFO:local_logger:Epoch[026/300], Step[0600/1602], Avg Loss: 4.3040, Avg Acc: 0.1831 +INFO:local_logger:Epoch[026/300], Step[0600/1602], Avg Loss: 4.3239, Avg Acc: 0.1740 +INFO:local_logger:Epoch[026/300], Step[0600/1602], Avg Loss: 4.3187, Avg Acc: 0.1792 +INFO:local_logger:Epoch[026/300], Step[0600/1602], Avg Loss: 4.3021, Avg Acc: 0.1824 +INFO:master_logger:Epoch[026/300], Step[0600/1602], Avg Loss: 4.3122, Avg Acc: 0.1797 +INFO:local_logger:Epoch[026/300], Step[0650/1602], Avg Loss: 4.3125, Avg Acc: 0.1839 +INFO:local_logger:Epoch[026/300], Step[0650/1602], Avg Loss: 4.3106, Avg Acc: 0.1796 +INFO:local_logger:Epoch[026/300], Step[0650/1602], Avg Loss: 4.3038, Avg Acc: 0.1830 +INFO:local_logger:Epoch[026/300], Step[0650/1602], Avg Loss: 4.3205, Avg Acc: 0.1751 +INFO:master_logger:Epoch[026/300], Step[0650/1602], Avg Loss: 4.3119, Avg Acc: 0.1804 +INFO:local_logger:Epoch[026/300], Step[0700/1602], Avg Loss: 4.3165, Avg Acc: 0.1754 +INFO:local_logger:Epoch[026/300], Step[0700/1602], Avg Loss: 4.3053, Avg Acc: 0.1838 +INFO:local_logger:Epoch[026/300], Step[0700/1602], Avg Loss: 4.3165, Avg Acc: 0.1781 +INFO:local_logger:Epoch[026/300], Step[0700/1602], Avg Loss: 4.2970, Avg Acc: 0.1819 +INFO:master_logger:Epoch[026/300], Step[0700/1602], Avg Loss: 4.3088, Avg Acc: 0.1798 +INFO:local_logger:Epoch[026/300], Step[0750/1602], Avg Loss: 4.2986, Avg Acc: 0.1797 +INFO:local_logger:Epoch[026/300], Step[0750/1602], Avg Loss: 4.3007, Avg Acc: 0.1834 +INFO:local_logger:Epoch[026/300], Step[0750/1602], Avg Loss: 4.3173, Avg Acc: 0.1763 +INFO:local_logger:Epoch[026/300], Step[0750/1602], Avg Loss: 4.3131, Avg Acc: 0.1763 +INFO:master_logger:Epoch[026/300], Step[0750/1602], Avg Loss: 4.3074, Avg Acc: 0.1789 +INFO:local_logger:Epoch[026/300], Step[0800/1602], Avg Loss: 4.2964, Avg Acc: 0.1825 +INFO:local_logger:Epoch[026/300], Step[0800/1602], Avg Loss: 4.2929, Avg Acc: 0.1798 
+INFO:local_logger:Epoch[026/300], Step[0800/1602], Avg Loss: 4.3229, Avg Acc: 0.1759 +INFO:local_logger:Epoch[026/300], Step[0800/1602], Avg Loss: 4.3069, Avg Acc: 0.1765 +INFO:master_logger:Epoch[026/300], Step[0800/1602], Avg Loss: 4.3048, Avg Acc: 0.1787 +INFO:local_logger:Epoch[026/300], Step[0850/1602], Avg Loss: 4.2923, Avg Acc: 0.1837 +INFO:local_logger:Epoch[026/300], Step[0850/1602], Avg Loss: 4.3021, Avg Acc: 0.1814 +INFO:local_logger:Epoch[026/300], Step[0850/1602], Avg Loss: 4.3240, Avg Acc: 0.1744 +INFO:local_logger:Epoch[026/300], Step[0850/1602], Avg Loss: 4.3003, Avg Acc: 0.1795 +INFO:master_logger:Epoch[026/300], Step[0850/1602], Avg Loss: 4.3047, Avg Acc: 0.1798 +INFO:local_logger:Epoch[026/300], Step[0900/1602], Avg Loss: 4.2953, Avg Acc: 0.1826 +INFO:local_logger:Epoch[026/300], Step[0900/1602], Avg Loss: 4.3225, Avg Acc: 0.1753 +INFO:local_logger:Epoch[026/300], Step[0900/1602], Avg Loss: 4.3046, Avg Acc: 0.1822 +INFO:local_logger:Epoch[026/300], Step[0900/1602], Avg Loss: 4.3019, Avg Acc: 0.1789 +INFO:master_logger:Epoch[026/300], Step[0900/1602], Avg Loss: 4.3061, Avg Acc: 0.1797 +INFO:local_logger:Epoch[026/300], Step[0950/1602], Avg Loss: 4.3036, Avg Acc: 0.1818 +INFO:local_logger:Epoch[026/300], Step[0950/1602], Avg Loss: 4.3218, Avg Acc: 0.1746 +INFO:local_logger:Epoch[026/300], Step[0950/1602], Avg Loss: 4.3072, Avg Acc: 0.1835 +INFO:master_logger:Epoch[026/300], Step[0950/1602], Avg Loss: 4.3098, Avg Acc: 0.1798 +INFO:local_logger:Epoch[026/300], Step[0950/1602], Avg Loss: 4.3066, Avg Acc: 0.1792 +INFO:local_logger:Epoch[026/300], Step[1000/1602], Avg Loss: 4.3032, Avg Acc: 0.1809 +INFO:local_logger:Epoch[026/300], Step[1000/1602], Avg Loss: 4.3090, Avg Acc: 0.1797 +INFO:local_logger:Epoch[026/300], Step[1000/1602], Avg Loss: 4.3298, Avg Acc: 0.1753 +INFO:local_logger:Epoch[026/300], Step[1000/1602], Avg Loss: 4.3076, Avg Acc: 0.1820 +INFO:master_logger:Epoch[026/300], Step[1000/1602], Avg Loss: 4.3124, Avg Acc: 0.1795 +INFO:local_logger:Epoch[026/300], Step[1050/1602], Avg Loss: 4.3045, Avg Acc: 0.1813 +INFO:local_logger:Epoch[026/300], Step[1050/1602], Avg Loss: 4.3327, Avg Acc: 0.1756 +INFO:local_logger:Epoch[026/300], Step[1050/1602], Avg Loss: 4.3030, Avg Acc: 0.1805 +INFO:local_logger:Epoch[026/300], Step[1050/1602], Avg Loss: 4.3106, Avg Acc: 0.1791 +INFO:master_logger:Epoch[026/300], Step[1050/1602], Avg Loss: 4.3127, Avg Acc: 0.1791 +INFO:local_logger:Epoch[026/300], Step[1100/1602], Avg Loss: 4.3098, Avg Acc: 0.1802 +INFO:local_logger:Epoch[026/300], Step[1100/1602], Avg Loss: 4.3003, Avg Acc: 0.1820 +INFO:local_logger:Epoch[026/300], Step[1100/1602], Avg Loss: 4.3046, Avg Acc: 0.1805 +INFO:local_logger:Epoch[026/300], Step[1100/1602], Avg Loss: 4.3340, Avg Acc: 0.1745 +INFO:master_logger:Epoch[026/300], Step[1100/1602], Avg Loss: 4.3122, Avg Acc: 0.1793 +INFO:local_logger:Epoch[026/300], Step[1150/1602], Avg Loss: 4.3167, Avg Acc: 0.1798 +INFO:local_logger:Epoch[026/300], Step[1150/1602], Avg Loss: 4.2982, Avg Acc: 0.1821 +INFO:local_logger:Epoch[026/300], Step[1150/1602], Avg Loss: 4.3027, Avg Acc: 0.1819 +INFO:master_logger:Epoch[026/300], Step[1150/1602], Avg Loss: 4.3113, Avg Acc: 0.1793 +INFO:local_logger:Epoch[026/300], Step[1150/1602], Avg Loss: 4.3277, Avg Acc: 0.1733 +INFO:local_logger:Epoch[026/300], Step[1200/1602], Avg Loss: 4.3160, Avg Acc: 0.1808 +INFO:local_logger:Epoch[026/300], Step[1200/1602], Avg Loss: 4.3000, Avg Acc: 0.1822 +INFO:local_logger:Epoch[026/300], Step[1200/1602], Avg Loss: 4.2941, Avg Acc: 0.1828 
+INFO:local_logger:Epoch[026/300], Step[1200/1602], Avg Loss: 4.3273, Avg Acc: 0.1742 +INFO:master_logger:Epoch[026/300], Step[1200/1602], Avg Loss: 4.3093, Avg Acc: 0.1800 +INFO:local_logger:Epoch[026/300], Step[1250/1602], Avg Loss: 4.3185, Avg Acc: 0.1806 +INFO:master_logger:Epoch[026/300], Step[1250/1602], Avg Loss: 4.3121, Avg Acc: 0.1796 +INFO:local_logger:Epoch[026/300], Step[1250/1602], Avg Loss: 4.3278, Avg Acc: 0.1740 +INFO:local_logger:Epoch[026/300], Step[1250/1602], Avg Loss: 4.2975, Avg Acc: 0.1815 +INFO:local_logger:Epoch[026/300], Step[1250/1602], Avg Loss: 4.3048, Avg Acc: 0.1822 +INFO:local_logger:Epoch[026/300], Step[1300/1602], Avg Loss: 4.3168, Avg Acc: 0.1813 +INFO:local_logger:Epoch[026/300], Step[1300/1602], Avg Loss: 4.3302, Avg Acc: 0.1734 +INFO:local_logger:Epoch[026/300], Step[1300/1602], Avg Loss: 4.2975, Avg Acc: 0.1798 +INFO:local_logger:Epoch[026/300], Step[1300/1602], Avg Loss: 4.3036, Avg Acc: 0.1811 +INFO:master_logger:Epoch[026/300], Step[1300/1602], Avg Loss: 4.3120, Avg Acc: 0.1789 +INFO:local_logger:Epoch[026/300], Step[1350/1602], Avg Loss: 4.3113, Avg Acc: 0.1808 +INFO:local_logger:Epoch[026/300], Step[1350/1602], Avg Loss: 4.3019, Avg Acc: 0.1801 +INFO:local_logger:Epoch[026/300], Step[1350/1602], Avg Loss: 4.2937, Avg Acc: 0.1798 +INFO:local_logger:Epoch[026/300], Step[1350/1602], Avg Loss: 4.3312, Avg Acc: 0.1742 +INFO:master_logger:Epoch[026/300], Step[1350/1602], Avg Loss: 4.3095, Avg Acc: 0.1787 +INFO:local_logger:Epoch[026/300], Step[1400/1602], Avg Loss: 4.3311, Avg Acc: 0.1744 +INFO:local_logger:Epoch[026/300], Step[1400/1602], Avg Loss: 4.2970, Avg Acc: 0.1802 +INFO:local_logger:Epoch[026/300], Step[1400/1602], Avg Loss: 4.3008, Avg Acc: 0.1802 +INFO:local_logger:Epoch[026/300], Step[1400/1602], Avg Loss: 4.3138, Avg Acc: 0.1804 +INFO:master_logger:Epoch[026/300], Step[1400/1602], Avg Loss: 4.3107, Avg Acc: 0.1788 +INFO:local_logger:Epoch[026/300], Step[1450/1602], Avg Loss: 4.3009, Avg Acc: 0.1799 +INFO:local_logger:Epoch[026/300], Step[1450/1602], Avg Loss: 4.3163, Avg Acc: 0.1787 +INFO:local_logger:Epoch[026/300], Step[1450/1602], Avg Loss: 4.3009, Avg Acc: 0.1800 +INFO:local_logger:Epoch[026/300], Step[1450/1602], Avg Loss: 4.3300, Avg Acc: 0.1740 +INFO:master_logger:Epoch[026/300], Step[1450/1602], Avg Loss: 4.3120, Avg Acc: 0.1781 +INFO:local_logger:Epoch[026/300], Step[1500/1602], Avg Loss: 4.3075, Avg Acc: 0.1798 +INFO:local_logger:Epoch[026/300], Step[1500/1602], Avg Loss: 4.3150, Avg Acc: 0.1791 +INFO:local_logger:Epoch[026/300], Step[1500/1602], Avg Loss: 4.2979, Avg Acc: 0.1798 +INFO:master_logger:Epoch[026/300], Step[1500/1602], Avg Loss: 4.3134, Avg Acc: 0.1781 +INFO:local_logger:Epoch[026/300], Step[1500/1602], Avg Loss: 4.3331, Avg Acc: 0.1738 +INFO:local_logger:Epoch[026/300], Step[1550/1602], Avg Loss: 4.3179, Avg Acc: 0.1792 +INFO:master_logger:Epoch[026/300], Step[1550/1602], Avg Loss: 4.3137, Avg Acc: 0.1786 +INFO:local_logger:Epoch[026/300], Step[1550/1602], Avg Loss: 4.2977, Avg Acc: 0.1804 +INFO:local_logger:Epoch[026/300], Step[1550/1602], Avg Loss: 4.3071, Avg Acc: 0.1807 +INFO:local_logger:Epoch[026/300], Step[1550/1602], Avg Loss: 4.3320, Avg Acc: 0.1741 +INFO:local_logger:Epoch[026/300], Step[1600/1602], Avg Loss: 4.3021, Avg Acc: 0.1802 +INFO:local_logger:Epoch[026/300], Step[1600/1602], Avg Loss: 4.3310, Avg Acc: 0.1739 +INFO:local_logger:Epoch[026/300], Step[1600/1602], Avg Loss: 4.3120, Avg Acc: 0.1786 +INFO:local_logger:Epoch[026/300], Step[1600/1602], Avg Loss: 4.3062, Avg Acc: 0.1810 
+INFO:master_logger:Epoch[026/300], Step[1600/1602], Avg Loss: 4.3128, Avg Acc: 0.1784 +INFO:local_logger:----- Epoch[026/300], Train Loss: 4.3021, Train Acc: 0.1801, time: 3696.54 +INFO:local_logger:Now training epoch 27. LR=0.000385 +INFO:local_logger:----- Epoch[026/300], Train Loss: 4.3060, Train Acc: 0.1811, time: 3696.79 +INFO:local_logger:Now training epoch 27. LR=0.000385 +INFO:local_logger:----- Epoch[026/300], Train Loss: 4.3313, Train Acc: 0.1739, time: 3696.80 +INFO:local_logger:Now training epoch 27. LR=0.000385 +INFO:local_logger:----- Epoch[026/300], Train Loss: 4.3119, Train Acc: 0.1787, time: 3696.55 +INFO:master_logger:----- Epoch[026/300], Train Loss: 4.3128, Train Acc: 0.1784, time: 3696.55 +INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-26-Loss-4.3119477811863165.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-26-Loss-4.3119477811863165.pdopt +INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-26-Loss-4.3119477811863165-EMA.pdparams +INFO:local_logger:Now training epoch 27. LR=0.000385 +INFO:master_logger:Now training epoch 27. LR=0.000385 +INFO:local_logger:Epoch[027/300], Step[0000/1602], Avg Loss: 4.9183, Avg Acc: 0.0800 +INFO:local_logger:Epoch[027/300], Step[0000/1602], Avg Loss: 4.2903, Avg Acc: 0.3200 +INFO:local_logger:Epoch[027/300], Step[0000/1602], Avg Loss: 4.2550, Avg Acc: 0.2750 +INFO:local_logger:Epoch[027/300], Step[0000/1602], Avg Loss: 3.4972, Avg Acc: 0.3900 +INFO:master_logger:Epoch[027/300], Step[0000/1602], Avg Loss: 4.2402, Avg Acc: 0.2662 +INFO:local_logger:Epoch[027/300], Step[0050/1602], Avg Loss: 4.4763, Avg Acc: 0.1726 +INFO:local_logger:Epoch[027/300], Step[0050/1602], Avg Loss: 4.0896, Avg Acc: 0.2088 +INFO:local_logger:Epoch[027/300], Step[0050/1602], Avg Loss: 4.2551, Avg Acc: 0.1859 +INFO:master_logger:Epoch[027/300], Step[0050/1602], Avg Loss: 4.2572, Avg Acc: 0.1919 +INFO:local_logger:Epoch[027/300], Step[0050/1602], Avg Loss: 4.2079, Avg Acc: 0.2003 +INFO:local_logger:Epoch[027/300], Step[0100/1602], Avg Loss: 4.3765, Avg Acc: 0.1818 +INFO:local_logger:Epoch[027/300], Step[0100/1602], Avg Loss: 4.1599, Avg Acc: 0.1919 +INFO:local_logger:Epoch[027/300], Step[0100/1602], Avg Loss: 4.2514, Avg Acc: 0.1835 +INFO:local_logger:Epoch[027/300], Step[0100/1602], Avg Loss: 4.2763, Avg Acc: 0.1754 +INFO:master_logger:Epoch[027/300], Step[0100/1602], Avg Loss: 4.2660, Avg Acc: 0.1832 +INFO:local_logger:Epoch[027/300], Step[0150/1602], Avg Loss: 4.3635, Avg Acc: 0.1860 +INFO:local_logger:Epoch[027/300], Step[0150/1602], Avg Loss: 4.2743, Avg Acc: 0.1841 +INFO:local_logger:Epoch[027/300], Step[0150/1602], Avg Loss: 4.2207, Avg Acc: 0.1913 +INFO:master_logger:Epoch[027/300], Step[0150/1602], Avg Loss: 4.2761, Avg Acc: 0.1877 +INFO:local_logger:Epoch[027/300], Step[0150/1602], Avg Loss: 4.2458, Avg Acc: 0.1894 +INFO:local_logger:Epoch[027/300], Step[0200/1602], Avg Loss: 4.3484, Avg Acc: 0.1896 +INFO:local_logger:Epoch[027/300], Step[0200/1602], Avg Loss: 4.2476, Avg Acc: 0.1851 +INFO:local_logger:Epoch[027/300], Step[0200/1602], Avg Loss: 4.2577, Avg Acc: 0.1865 +INFO:local_logger:Epoch[027/300], Step[0200/1602], Avg Loss: 4.2959, Avg Acc: 0.1802 +INFO:master_logger:Epoch[027/300], Step[0200/1602], Avg Loss: 4.2874, Avg Acc: 0.1854 +INFO:local_logger:Epoch[027/300], Step[0250/1602], Avg Loss: 4.3192, Avg Acc: 0.1928 +INFO:local_logger:Epoch[027/300], Step[0250/1602], Avg Loss: 4.2822, Avg Acc: 0.1803 +INFO:local_logger:Epoch[027/300], 
Step[0250/1602], Avg Loss: 4.2897, Avg Acc: 0.1809 +INFO:master_logger:Epoch[027/300], Step[0250/1602], Avg Loss: 4.2920, Avg Acc: 0.1847 +INFO:local_logger:Epoch[027/300], Step[0250/1602], Avg Loss: 4.2768, Avg Acc: 0.1847 +INFO:local_logger:Epoch[027/300], Step[0300/1602], Avg Loss: 4.3044, Avg Acc: 0.1909 +INFO:local_logger:Epoch[027/300], Step[0300/1602], Avg Loss: 4.2965, Avg Acc: 0.1816 +INFO:local_logger:Epoch[027/300], Step[0300/1602], Avg Loss: 4.2876, Avg Acc: 0.1802 +INFO:local_logger:Epoch[027/300], Step[0300/1602], Avg Loss: 4.2551, Avg Acc: 0.1822 +INFO:master_logger:Epoch[027/300], Step[0300/1602], Avg Loss: 4.2859, Avg Acc: 0.1837 +INFO:local_logger:Epoch[027/300], Step[0350/1602], Avg Loss: 4.3079, Avg Acc: 0.1925 +INFO:local_logger:Epoch[027/300], Step[0350/1602], Avg Loss: 4.3006, Avg Acc: 0.1786 +INFO:local_logger:Epoch[027/300], Step[0350/1602], Avg Loss: 4.3113, Avg Acc: 0.1799 +INFO:master_logger:Epoch[027/300], Step[0350/1602], Avg Loss: 4.2937, Avg Acc: 0.1826 +INFO:local_logger:Epoch[027/300], Step[0350/1602], Avg Loss: 4.2549, Avg Acc: 0.1794 +INFO:local_logger:Epoch[027/300], Step[0400/1602], Avg Loss: 4.3023, Avg Acc: 0.1927 +INFO:local_logger:Epoch[027/300], Step[0400/1602], Avg Loss: 4.3106, Avg Acc: 0.1798 +INFO:local_logger:Epoch[027/300], Step[0400/1602], Avg Loss: 4.3105, Avg Acc: 0.1743 +INFO:master_logger:Epoch[027/300], Step[0400/1602], Avg Loss: 4.3018, Avg Acc: 0.1801 +INFO:local_logger:Epoch[027/300], Step[0400/1602], Avg Loss: 4.2838, Avg Acc: 0.1735 +INFO:local_logger:Epoch[027/300], Step[0450/1602], Avg Loss: 4.3099, Avg Acc: 0.1923 +INFO:local_logger:Epoch[027/300], Step[0450/1602], Avg Loss: 4.3118, Avg Acc: 0.1755 +INFO:local_logger:Epoch[027/300], Step[0450/1602], Avg Loss: 4.3054, Avg Acc: 0.1812 +INFO:local_logger:Epoch[027/300], Step[0450/1602], Avg Loss: 4.2637, Avg Acc: 0.1756 +INFO:master_logger:Epoch[027/300], Step[0450/1602], Avg Loss: 4.2977, Avg Acc: 0.1811 +INFO:local_logger:Epoch[027/300], Step[0500/1602], Avg Loss: 4.2687, Avg Acc: 0.1730 +INFO:local_logger:Epoch[027/300], Step[0500/1602], Avg Loss: 4.3117, Avg Acc: 0.1886 +INFO:local_logger:Epoch[027/300], Step[0500/1602], Avg Loss: 4.3014, Avg Acc: 0.1815 +INFO:local_logger:Epoch[027/300], Step[0500/1602], Avg Loss: 4.3047, Avg Acc: 0.1787 +INFO:master_logger:Epoch[027/300], Step[0500/1602], Avg Loss: 4.2966, Avg Acc: 0.1805 +INFO:local_logger:Epoch[027/300], Step[0550/1602], Avg Loss: 4.3244, Avg Acc: 0.1863 +INFO:local_logger:Epoch[027/300], Step[0550/1602], Avg Loss: 4.3102, Avg Acc: 0.1821 +INFO:local_logger:Epoch[027/300], Step[0550/1602], Avg Loss: 4.2992, Avg Acc: 0.1797 +INFO:master_logger:Epoch[027/300], Step[0550/1602], Avg Loss: 4.3003, Avg Acc: 0.1808 +INFO:local_logger:Epoch[027/300], Step[0550/1602], Avg Loss: 4.2675, Avg Acc: 0.1751 +INFO:local_logger:Epoch[027/300], Step[0600/1602], Avg Loss: 4.3174, Avg Acc: 0.1877 +INFO:local_logger:Epoch[027/300], Step[0600/1602], Avg Loss: 4.2641, Avg Acc: 0.1752 +INFO:local_logger:Epoch[027/300], Step[0600/1602], Avg Loss: 4.3071, Avg Acc: 0.1812 +INFO:local_logger:Epoch[027/300], Step[0600/1602], Avg Loss: 4.2937, Avg Acc: 0.1779 +INFO:master_logger:Epoch[027/300], Step[0600/1602], Avg Loss: 4.2956, Avg Acc: 0.1805 +INFO:local_logger:Epoch[027/300], Step[0650/1602], Avg Loss: 4.3170, Avg Acc: 0.1880 +INFO:local_logger:Epoch[027/300], Step[0650/1602], Avg Loss: 4.3086, Avg Acc: 0.1811 +INFO:local_logger:Epoch[027/300], Step[0650/1602], Avg Loss: 4.2921, Avg Acc: 0.1783 +INFO:local_logger:Epoch[027/300], Step[0650/1602], Avg 
Loss: 4.2760, Avg Acc: 0.1731 +INFO:master_logger:Epoch[027/300], Step[0650/1602], Avg Loss: 4.2984, Avg Acc: 0.1801 +INFO:local_logger:Epoch[027/300], Step[0700/1602], Avg Loss: 4.3037, Avg Acc: 0.1812 +INFO:local_logger:Epoch[027/300], Step[0700/1602], Avg Loss: 4.2820, Avg Acc: 0.1802 +INFO:local_logger:Epoch[027/300], Step[0700/1602], Avg Loss: 4.3097, Avg Acc: 0.1892 +INFO:local_logger:Epoch[027/300], Step[0700/1602], Avg Loss: 4.2790, Avg Acc: 0.1747 +INFO:master_logger:Epoch[027/300], Step[0700/1602], Avg Loss: 4.2936, Avg Acc: 0.1813 +INFO:local_logger:Epoch[027/300], Step[0750/1602], Avg Loss: 4.3000, Avg Acc: 0.1890 +INFO:local_logger:Epoch[027/300], Step[0750/1602], Avg Loss: 4.2872, Avg Acc: 0.1792 +INFO:local_logger:Epoch[027/300], Step[0750/1602], Avg Loss: 4.2967, Avg Acc: 0.1815 +INFO:local_logger:Epoch[027/300], Step[0750/1602], Avg Loss: 4.2777, Avg Acc: 0.1750 +INFO:master_logger:Epoch[027/300], Step[0750/1602], Avg Loss: 4.2904, Avg Acc: 0.1812 +INFO:local_logger:Epoch[027/300], Step[0800/1602], Avg Loss: 4.2979, Avg Acc: 0.1887 +INFO:local_logger:Epoch[027/300], Step[0800/1602], Avg Loss: 4.2806, Avg Acc: 0.1749 +INFO:local_logger:Epoch[027/300], Step[0800/1602], Avg Loss: 4.3019, Avg Acc: 0.1789 +INFO:local_logger:Epoch[027/300], Step[0800/1602], Avg Loss: 4.2929, Avg Acc: 0.1793 +INFO:master_logger:Epoch[027/300], Step[0800/1602], Avg Loss: 4.2933, Avg Acc: 0.1804 +INFO:local_logger:Epoch[027/300], Step[0850/1602], Avg Loss: 4.2994, Avg Acc: 0.1867 +INFO:local_logger:Epoch[027/300], Step[0850/1602], Avg Loss: 4.3056, Avg Acc: 0.1792 +INFO:local_logger:Epoch[027/300], Step[0850/1602], Avg Loss: 4.2960, Avg Acc: 0.1782 +INFO:local_logger:Epoch[027/300], Step[0850/1602], Avg Loss: 4.2864, Avg Acc: 0.1742 +INFO:master_logger:Epoch[027/300], Step[0850/1602], Avg Loss: 4.2968, Avg Acc: 0.1796 +INFO:local_logger:Epoch[027/300], Step[0900/1602], Avg Loss: 4.2967, Avg Acc: 0.1872 +INFO:local_logger:Epoch[027/300], Step[0900/1602], Avg Loss: 4.2965, Avg Acc: 0.1769 +INFO:local_logger:Epoch[027/300], Step[0900/1602], Avg Loss: 4.2905, Avg Acc: 0.1752 +INFO:local_logger:Epoch[027/300], Step[0900/1602], Avg Loss: 4.3157, Avg Acc: 0.1779 +INFO:master_logger:Epoch[027/300], Step[0900/1602], Avg Loss: 4.2998, Avg Acc: 0.1793 +INFO:local_logger:Epoch[027/300], Step[0950/1602], Avg Loss: 4.2934, Avg Acc: 0.1869 +INFO:local_logger:Epoch[027/300], Step[0950/1602], Avg Loss: 4.2986, Avg Acc: 0.1777 +INFO:local_logger:Epoch[027/300], Step[0950/1602], Avg Loss: 4.3141, Avg Acc: 0.1789 +INFO:master_logger:Epoch[027/300], Step[0950/1602], Avg Loss: 4.2997, Avg Acc: 0.1795 +INFO:local_logger:Epoch[027/300], Step[0950/1602], Avg Loss: 4.2925, Avg Acc: 0.1748 +INFO:local_logger:Epoch[027/300], Step[1000/1602], Avg Loss: 4.2998, Avg Acc: 0.1853 +INFO:local_logger:Epoch[027/300], Step[1000/1602], Avg Loss: 4.2949, Avg Acc: 0.1746 +INFO:local_logger:Epoch[027/300], Step[1000/1602], Avg Loss: 4.2935, Avg Acc: 0.1786 +INFO:local_logger:Epoch[027/300], Step[1000/1602], Avg Loss: 4.3107, Avg Acc: 0.1789 +INFO:master_logger:Epoch[027/300], Step[1000/1602], Avg Loss: 4.2997, Avg Acc: 0.1794 +INFO:local_logger:Epoch[027/300], Step[1050/1602], Avg Loss: 4.2997, Avg Acc: 0.1850 +INFO:local_logger:Epoch[027/300], Step[1050/1602], Avg Loss: 4.2891, Avg Acc: 0.1799 +INFO:local_logger:Epoch[027/300], Step[1050/1602], Avg Loss: 4.2893, Avg Acc: 0.1754 +INFO:local_logger:Epoch[027/300], Step[1050/1602], Avg Loss: 4.3091, Avg Acc: 0.1789 +INFO:master_logger:Epoch[027/300], Step[1050/1602], Avg Loss: 4.2968, Avg 
Acc: 0.1798 +INFO:local_logger:Epoch[027/300], Step[1100/1602], Avg Loss: 4.3038, Avg Acc: 0.1837 +INFO:local_logger:Epoch[027/300], Step[1100/1602], Avg Loss: 4.2889, Avg Acc: 0.1757 +INFO:local_logger:Epoch[027/300], Step[1100/1602], Avg Loss: 4.3150, Avg Acc: 0.1781 +INFO:local_logger:Epoch[027/300], Step[1100/1602], Avg Loss: 4.2848, Avg Acc: 0.1805 +INFO:master_logger:Epoch[027/300], Step[1100/1602], Avg Loss: 4.2981, Avg Acc: 0.1795 +INFO:local_logger:Epoch[027/300], Step[1150/1602], Avg Loss: 4.2850, Avg Acc: 0.1757 +INFO:local_logger:Epoch[027/300], Step[1150/1602], Avg Loss: 4.2981, Avg Acc: 0.1848 +INFO:local_logger:Epoch[027/300], Step[1150/1602], Avg Loss: 4.3084, Avg Acc: 0.1773 +INFO:local_logger:Epoch[027/300], Step[1150/1602], Avg Loss: 4.2881, Avg Acc: 0.1794 +INFO:master_logger:Epoch[027/300], Step[1150/1602], Avg Loss: 4.2949, Avg Acc: 0.1793 +INFO:local_logger:Epoch[027/300], Step[1200/1602], Avg Loss: 4.2974, Avg Acc: 0.1847 +INFO:local_logger:Epoch[027/300], Step[1200/1602], Avg Loss: 4.2786, Avg Acc: 0.1810 +INFO:local_logger:Epoch[027/300], Step[1200/1602], Avg Loss: 4.2856, Avg Acc: 0.1763 +INFO:local_logger:Epoch[027/300], Step[1200/1602], Avg Loss: 4.3039, Avg Acc: 0.1776 +INFO:master_logger:Epoch[027/300], Step[1200/1602], Avg Loss: 4.2914, Avg Acc: 0.1799 +INFO:local_logger:Epoch[027/300], Step[1250/1602], Avg Loss: 4.2957, Avg Acc: 0.1846 +INFO:local_logger:Epoch[027/300], Step[1250/1602], Avg Loss: 4.3018, Avg Acc: 0.1790 +INFO:local_logger:Epoch[027/300], Step[1250/1602], Avg Loss: 4.2808, Avg Acc: 0.1756 +INFO:local_logger:Epoch[027/300], Step[1250/1602], Avg Loss: 4.2789, Avg Acc: 0.1807 +INFO:master_logger:Epoch[027/300], Step[1250/1602], Avg Loss: 4.2893, Avg Acc: 0.1800 +INFO:local_logger:Epoch[027/300], Step[1300/1602], Avg Loss: 4.2934, Avg Acc: 0.1853 +INFO:local_logger:Epoch[027/300], Step[1300/1602], Avg Loss: 4.2993, Avg Acc: 0.1783 +INFO:local_logger:Epoch[027/300], Step[1300/1602], Avg Loss: 4.2785, Avg Acc: 0.1810 +INFO:local_logger:Epoch[027/300], Step[1300/1602], Avg Loss: 4.2797, Avg Acc: 0.1766 +INFO:master_logger:Epoch[027/300], Step[1300/1602], Avg Loss: 4.2877, Avg Acc: 0.1803 +INFO:local_logger:Epoch[027/300], Step[1350/1602], Avg Loss: 4.2934, Avg Acc: 0.1844 +INFO:local_logger:Epoch[027/300], Step[1350/1602], Avg Loss: 4.2831, Avg Acc: 0.1770 +INFO:local_logger:Epoch[027/300], Step[1350/1602], Avg Loss: 4.2818, Avg Acc: 0.1813 +INFO:local_logger:Epoch[027/300], Step[1350/1602], Avg Loss: 4.3010, Avg Acc: 0.1783 +INFO:master_logger:Epoch[027/300], Step[1350/1602], Avg Loss: 4.2898, Avg Acc: 0.1802 +INFO:local_logger:Epoch[027/300], Step[1400/1602], Avg Loss: 4.2897, Avg Acc: 0.1855 +INFO:local_logger:Epoch[027/300], Step[1400/1602], Avg Loss: 4.2844, Avg Acc: 0.1803 +INFO:local_logger:Epoch[027/300], Step[1400/1602], Avg Loss: 4.2964, Avg Acc: 0.1780 +INFO:local_logger:Epoch[027/300], Step[1400/1602], Avg Loss: 4.2834, Avg Acc: 0.1780 +INFO:master_logger:Epoch[027/300], Step[1400/1602], Avg Loss: 4.2885, Avg Acc: 0.1805 +INFO:local_logger:Epoch[027/300], Step[1450/1602], Avg Loss: 4.2849, Avg Acc: 0.1789 +INFO:local_logger:Epoch[027/300], Step[1450/1602], Avg Loss: 4.2896, Avg Acc: 0.1859 +INFO:local_logger:Epoch[027/300], Step[1450/1602], Avg Loss: 4.2970, Avg Acc: 0.1792 +INFO:local_logger:Epoch[027/300], Step[1450/1602], Avg Loss: 4.2844, Avg Acc: 0.1803 +INFO:master_logger:Epoch[027/300], Step[1450/1602], Avg Loss: 4.2890, Avg Acc: 0.1811 +INFO:local_logger:Epoch[027/300], Step[1500/1602], Avg Loss: 4.2950, Avg Acc: 0.1851 
+INFO:local_logger:Epoch[027/300], Step[1500/1602], Avg Loss: 4.2870, Avg Acc: 0.1785 +INFO:local_logger:Epoch[027/300], Step[1500/1602], Avg Loss: 4.2932, Avg Acc: 0.1806 +INFO:local_logger:Epoch[027/300], Step[1500/1602], Avg Loss: 4.2879, Avg Acc: 0.1795 +INFO:master_logger:Epoch[027/300], Step[1500/1602], Avg Loss: 4.2908, Avg Acc: 0.1809 +INFO:local_logger:Epoch[027/300], Step[1550/1602], Avg Loss: 4.2869, Avg Acc: 0.1775 +INFO:local_logger:Epoch[027/300], Step[1550/1602], Avg Loss: 4.2968, Avg Acc: 0.1851 +INFO:local_logger:Epoch[027/300], Step[1550/1602], Avg Loss: 4.2940, Avg Acc: 0.1801 +INFO:master_logger:Epoch[027/300], Step[1550/1602], Avg Loss: 4.2907, Avg Acc: 0.1805 +INFO:local_logger:Epoch[027/300], Step[1550/1602], Avg Loss: 4.2852, Avg Acc: 0.1794 +INFO:local_logger:Epoch[027/300], Step[1600/1602], Avg Loss: 4.2859, Avg Acc: 0.1781 +INFO:local_logger:Epoch[027/300], Step[1600/1602], Avg Loss: 4.2944, Avg Acc: 0.1857 +INFO:local_logger:Epoch[027/300], Step[1600/1602], Avg Loss: 4.2975, Avg Acc: 0.1800 +INFO:local_logger:Epoch[027/300], Step[1600/1602], Avg Loss: 4.2908, Avg Acc: 0.1788 +INFO:master_logger:Epoch[027/300], Step[1600/1602], Avg Loss: 4.2922, Avg Acc: 0.1807 +INFO:local_logger:----- Epoch[027/300], Train Loss: 4.2910, Train Acc: 0.1788, time: 3688.46 +INFO:local_logger:Now training epoch 28. LR=0.000385 +INFO:local_logger:----- Epoch[027/300], Train Loss: 4.2857, Train Acc: 0.1781, time: 3688.72 +INFO:local_logger:Now training epoch 28. LR=0.000385 +INFO:local_logger:----- Epoch[027/300], Train Loss: 4.2975, Train Acc: 0.1800, time: 3688.53 +INFO:local_logger:Now training epoch 28. LR=0.000385 +INFO:local_logger:----- Epoch[027/300], Train Loss: 4.2945, Train Acc: 0.1857, time: 3688.33 +INFO:master_logger:----- Epoch[027/300], Train Loss: 4.2922, Train Acc: 0.1806, time: 3688.33 +INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-27-Loss-4.294540662733227.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-27-Loss-4.294540662733227.pdopt +INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-27-Loss-4.294540662733227-EMA.pdparams +INFO:local_logger:Now training epoch 28. LR=0.000385 +INFO:master_logger:Now training epoch 28. 
LR=0.000385 +INFO:local_logger:Epoch[028/300], Step[0000/1602], Avg Loss: 4.3527, Avg Acc: 0.2850 +INFO:local_logger:Epoch[028/300], Step[0000/1602], Avg Loss: 4.4214, Avg Acc: 0.0100 +INFO:local_logger:Epoch[028/300], Step[0000/1602], Avg Loss: 3.6639, Avg Acc: 0.3600 +INFO:local_logger:Epoch[028/300], Step[0000/1602], Avg Loss: 4.6011, Avg Acc: 0.2200 +INFO:master_logger:Epoch[028/300], Step[0000/1602], Avg Loss: 4.2598, Avg Acc: 0.2188 +INFO:local_logger:Epoch[028/300], Step[0050/1602], Avg Loss: 4.3151, Avg Acc: 0.1684 +INFO:local_logger:Epoch[028/300], Step[0050/1602], Avg Loss: 4.3127, Avg Acc: 0.1581 +INFO:local_logger:Epoch[028/300], Step[0050/1602], Avg Loss: 4.1800, Avg Acc: 0.1854 +INFO:local_logger:Epoch[028/300], Step[0050/1602], Avg Loss: 4.0502, Avg Acc: 0.1794 +INFO:master_logger:Epoch[028/300], Step[0050/1602], Avg Loss: 4.2145, Avg Acc: 0.1728 +INFO:local_logger:Epoch[028/300], Step[0100/1602], Avg Loss: 4.2979, Avg Acc: 0.1582 +INFO:local_logger:Epoch[028/300], Step[0100/1602], Avg Loss: 4.3362, Avg Acc: 0.1721 +INFO:local_logger:Epoch[028/300], Step[0100/1602], Avg Loss: 4.2615, Avg Acc: 0.1830 +INFO:local_logger:Epoch[028/300], Step[0100/1602], Avg Loss: 4.1929, Avg Acc: 0.1675 +INFO:master_logger:Epoch[028/300], Step[0100/1602], Avg Loss: 4.2721, Avg Acc: 0.1702 +INFO:local_logger:Epoch[028/300], Step[0150/1602], Avg Loss: 4.3330, Avg Acc: 0.1779 +INFO:local_logger:Epoch[028/300], Step[0150/1602], Avg Loss: 4.2771, Avg Acc: 0.1680 +INFO:local_logger:Epoch[028/300], Step[0150/1602], Avg Loss: 4.2143, Avg Acc: 0.1714 +INFO:local_logger:Epoch[028/300], Step[0150/1602], Avg Loss: 4.2462, Avg Acc: 0.1807 +INFO:master_logger:Epoch[028/300], Step[0150/1602], Avg Loss: 4.2677, Avg Acc: 0.1745 +INFO:local_logger:Epoch[028/300], Step[0200/1602], Avg Loss: 4.2121, Avg Acc: 0.1750 +INFO:local_logger:Epoch[028/300], Step[0200/1602], Avg Loss: 4.2718, Avg Acc: 0.1770 +INFO:local_logger:Epoch[028/300], Step[0200/1602], Avg Loss: 4.3323, Avg Acc: 0.1821 +INFO:local_logger:Epoch[028/300], Step[0200/1602], Avg Loss: 4.2176, Avg Acc: 0.1832 +INFO:master_logger:Epoch[028/300], Step[0200/1602], Avg Loss: 4.2585, Avg Acc: 0.1793 +INFO:local_logger:Epoch[028/300], Step[0250/1602], Avg Loss: 4.2810, Avg Acc: 0.1756 +INFO:local_logger:Epoch[028/300], Step[0250/1602], Avg Loss: 4.3396, Avg Acc: 0.1794 +INFO:local_logger:Epoch[028/300], Step[0250/1602], Avg Loss: 4.1927, Avg Acc: 0.1851 +INFO:local_logger:Epoch[028/300], Step[0250/1602], Avg Loss: 4.2372, Avg Acc: 0.1799 +INFO:master_logger:Epoch[028/300], Step[0250/1602], Avg Loss: 4.2626, Avg Acc: 0.1800 +INFO:local_logger:Epoch[028/300], Step[0300/1602], Avg Loss: 4.2714, Avg Acc: 0.1766 +INFO:local_logger:Epoch[028/300], Step[0300/1602], Avg Loss: 4.3307, Avg Acc: 0.1777 +INFO:master_logger:Epoch[028/300], Step[0300/1602], Avg Loss: 4.2678, Avg Acc: 0.1792 +INFO:local_logger:Epoch[028/300], Step[0300/1602], Avg Loss: 4.2556, Avg Acc: 0.1781 +INFO:local_logger:Epoch[028/300], Step[0300/1602], Avg Loss: 4.2135, Avg Acc: 0.1844 +INFO:local_logger:Epoch[028/300], Step[0350/1602], Avg Loss: 4.2621, Avg Acc: 0.1777 +INFO:local_logger:Epoch[028/300], Step[0350/1602], Avg Loss: 4.2373, Avg Acc: 0.1812 +INFO:local_logger:Epoch[028/300], Step[0350/1602], Avg Loss: 4.2493, Avg Acc: 0.1793 +INFO:master_logger:Epoch[028/300], Step[0350/1602], Avg Loss: 4.2680, Avg Acc: 0.1787 +INFO:local_logger:Epoch[028/300], Step[0350/1602], Avg Loss: 4.3234, Avg Acc: 0.1766 +INFO:local_logger:Epoch[028/300], Step[0400/1602], Avg Loss: 4.2698, Avg Acc: 0.1763 
+INFO:local_logger:Epoch[028/300], Step[0400/1602], Avg Loss: 4.2393, Avg Acc: 0.1836 +INFO:local_logger:Epoch[028/300], Step[0400/1602], Avg Loss: 4.2529, Avg Acc: 0.1779 +INFO:master_logger:Epoch[028/300], Step[0400/1602], Avg Loss: 4.2682, Avg Acc: 0.1798 +INFO:local_logger:Epoch[028/300], Step[0400/1602], Avg Loss: 4.3109, Avg Acc: 0.1816 +INFO:local_logger:Epoch[028/300], Step[0450/1602], Avg Loss: 4.2688, Avg Acc: 0.1743 +INFO:local_logger:Epoch[028/300], Step[0450/1602], Avg Loss: 4.2547, Avg Acc: 0.1846 +INFO:local_logger:Epoch[028/300], Step[0450/1602], Avg Loss: 4.3052, Avg Acc: 0.1788 +INFO:local_logger:Epoch[028/300], Step[0450/1602], Avg Loss: 4.2528, Avg Acc: 0.1785 +INFO:master_logger:Epoch[028/300], Step[0450/1602], Avg Loss: 4.2704, Avg Acc: 0.1791 +INFO:local_logger:Epoch[028/300], Step[0500/1602], Avg Loss: 4.2589, Avg Acc: 0.1784 +INFO:local_logger:Epoch[028/300], Step[0500/1602], Avg Loss: 4.2587, Avg Acc: 0.1790 +INFO:local_logger:Epoch[028/300], Step[0500/1602], Avg Loss: 4.2628, Avg Acc: 0.1861 +INFO:local_logger:Epoch[028/300], Step[0500/1602], Avg Loss: 4.3082, Avg Acc: 0.1779 +INFO:master_logger:Epoch[028/300], Step[0500/1602], Avg Loss: 4.2722, Avg Acc: 0.1804 +INFO:local_logger:Epoch[028/300], Step[0550/1602], Avg Loss: 4.2555, Avg Acc: 0.1776 +INFO:local_logger:Epoch[028/300], Step[0550/1602], Avg Loss: 4.3087, Avg Acc: 0.1793 +INFO:local_logger:Epoch[028/300], Step[0550/1602], Avg Loss: 4.2726, Avg Acc: 0.1843 +INFO:local_logger:Epoch[028/300], Step[0550/1602], Avg Loss: 4.2603, Avg Acc: 0.1785 +INFO:master_logger:Epoch[028/300], Step[0550/1602], Avg Loss: 4.2743, Avg Acc: 0.1799 +INFO:local_logger:Epoch[028/300], Step[0600/1602], Avg Loss: 4.2570, Avg Acc: 0.1774 +INFO:local_logger:Epoch[028/300], Step[0600/1602], Avg Loss: 4.3094, Avg Acc: 0.1793 +INFO:local_logger:Epoch[028/300], Step[0600/1602], Avg Loss: 4.2526, Avg Acc: 0.1794 +INFO:master_logger:Epoch[028/300], Step[0600/1602], Avg Loss: 4.2736, Avg Acc: 0.1805 +INFO:local_logger:Epoch[028/300], Step[0600/1602], Avg Loss: 4.2754, Avg Acc: 0.1858 +INFO:local_logger:Epoch[028/300], Step[0650/1602], Avg Loss: 4.2565, Avg Acc: 0.1778 +INFO:local_logger:Epoch[028/300], Step[0650/1602], Avg Loss: 4.2969, Avg Acc: 0.1803 +INFO:local_logger:Epoch[028/300], Step[0650/1602], Avg Loss: 4.2467, Avg Acc: 0.1796 +INFO:local_logger:Epoch[028/300], Step[0650/1602], Avg Loss: 4.2684, Avg Acc: 0.1876 +INFO:master_logger:Epoch[028/300], Step[0650/1602], Avg Loss: 4.2671, Avg Acc: 0.1813 +INFO:local_logger:Epoch[028/300], Step[0700/1602], Avg Loss: 4.2572, Avg Acc: 0.1791 +INFO:local_logger:Epoch[028/300], Step[0700/1602], Avg Loss: 4.2962, Avg Acc: 0.1807 +INFO:local_logger:Epoch[028/300], Step[0700/1602], Avg Loss: 4.2484, Avg Acc: 0.1789 +INFO:local_logger:Epoch[028/300], Step[0700/1602], Avg Loss: 4.2597, Avg Acc: 0.1852 +INFO:master_logger:Epoch[028/300], Step[0700/1602], Avg Loss: 4.2654, Avg Acc: 0.1810 +INFO:local_logger:Epoch[028/300], Step[0750/1602], Avg Loss: 4.2873, Avg Acc: 0.1821 +INFO:local_logger:Epoch[028/300], Step[0750/1602], Avg Loss: 4.2515, Avg Acc: 0.1791 +INFO:local_logger:Epoch[028/300], Step[0750/1602], Avg Loss: 4.2562, Avg Acc: 0.1789 +INFO:local_logger:Epoch[028/300], Step[0750/1602], Avg Loss: 4.2597, Avg Acc: 0.1849 +INFO:master_logger:Epoch[028/300], Step[0750/1602], Avg Loss: 4.2637, Avg Acc: 0.1813 +INFO:local_logger:Epoch[028/300], Step[0800/1602], Avg Loss: 4.2832, Avg Acc: 0.1813 +INFO:local_logger:Epoch[028/300], Step[0800/1602], Avg Loss: 4.2555, Avg Acc: 0.1795 
+INFO:local_logger:Epoch[028/300], Step[0800/1602], Avg Loss: 4.2619, Avg Acc: 0.1842 +INFO:local_logger:Epoch[028/300], Step[0800/1602], Avg Loss: 4.2491, Avg Acc: 0.1794 +INFO:master_logger:Epoch[028/300], Step[0800/1602], Avg Loss: 4.2624, Avg Acc: 0.1811 +INFO:local_logger:Epoch[028/300], Step[0850/1602], Avg Loss: 4.2789, Avg Acc: 0.1835 +INFO:local_logger:Epoch[028/300], Step[0850/1602], Avg Loss: 4.2523, Avg Acc: 0.1805 +INFO:local_logger:Epoch[028/300], Step[0850/1602], Avg Loss: 4.2585, Avg Acc: 0.1843 +INFO:local_logger:Epoch[028/300], Step[0850/1602], Avg Loss: 4.2467, Avg Acc: 0.1790 +INFO:master_logger:Epoch[028/300], Step[0850/1602], Avg Loss: 4.2591, Avg Acc: 0.1818 +INFO:local_logger:Epoch[028/300], Step[0900/1602], Avg Loss: 4.2476, Avg Acc: 0.1818 +INFO:local_logger:Epoch[028/300], Step[0900/1602], Avg Loss: 4.2826, Avg Acc: 0.1837 +INFO:local_logger:Epoch[028/300], Step[0900/1602], Avg Loss: 4.2592, Avg Acc: 0.1835 +INFO:local_logger:Epoch[028/300], Step[0900/1602], Avg Loss: 4.2454, Avg Acc: 0.1787 +INFO:master_logger:Epoch[028/300], Step[0900/1602], Avg Loss: 4.2587, Avg Acc: 0.1819 +INFO:local_logger:Epoch[028/300], Step[0950/1602], Avg Loss: 4.2576, Avg Acc: 0.1805 +INFO:local_logger:Epoch[028/300], Step[0950/1602], Avg Loss: 4.2558, Avg Acc: 0.1821 +INFO:local_logger:Epoch[028/300], Step[0950/1602], Avg Loss: 4.2813, Avg Acc: 0.1826 +INFO:local_logger:Epoch[028/300], Step[0950/1602], Avg Loss: 4.2504, Avg Acc: 0.1797 +INFO:master_logger:Epoch[028/300], Step[0950/1602], Avg Loss: 4.2613, Avg Acc: 0.1812 +INFO:local_logger:Epoch[028/300], Step[1000/1602], Avg Loss: 4.2528, Avg Acc: 0.1822 +INFO:local_logger:Epoch[028/300], Step[1000/1602], Avg Loss: 4.2762, Avg Acc: 0.1832 +INFO:local_logger:Epoch[028/300], Step[1000/1602], Avg Loss: 4.2544, Avg Acc: 0.1820 +INFO:local_logger:Epoch[028/300], Step[1000/1602], Avg Loss: 4.2463, Avg Acc: 0.1791 +INFO:master_logger:Epoch[028/300], Step[1000/1602], Avg Loss: 4.2574, Avg Acc: 0.1816 +INFO:local_logger:Epoch[028/300], Step[1050/1602], Avg Loss: 4.2451, Avg Acc: 0.1841 +INFO:local_logger:Epoch[028/300], Step[1050/1602], Avg Loss: 4.2715, Avg Acc: 0.1846 +INFO:local_logger:Epoch[028/300], Step[1050/1602], Avg Loss: 4.2445, Avg Acc: 0.1792 +INFO:local_logger:Epoch[028/300], Step[1050/1602], Avg Loss: 4.2484, Avg Acc: 0.1833 +INFO:master_logger:Epoch[028/300], Step[1050/1602], Avg Loss: 4.2524, Avg Acc: 0.1828 +INFO:local_logger:Epoch[028/300], Step[1100/1602], Avg Loss: 4.2427, Avg Acc: 0.1843 +INFO:local_logger:Epoch[028/300], Step[1100/1602], Avg Loss: 4.2440, Avg Acc: 0.1789 +INFO:local_logger:Epoch[028/300], Step[1100/1602], Avg Loss: 4.2459, Avg Acc: 0.1817 +INFO:local_logger:Epoch[028/300], Step[1100/1602], Avg Loss: 4.2687, Avg Acc: 0.1865 +INFO:master_logger:Epoch[028/300], Step[1100/1602], Avg Loss: 4.2503, Avg Acc: 0.1828 +INFO:local_logger:Epoch[028/300], Step[1150/1602], Avg Loss: 4.2385, Avg Acc: 0.1860 +INFO:local_logger:Epoch[028/300], Step[1150/1602], Avg Loss: 4.2415, Avg Acc: 0.1791 +INFO:local_logger:Epoch[028/300], Step[1150/1602], Avg Loss: 4.2653, Avg Acc: 0.1871 +INFO:master_logger:Epoch[028/300], Step[1150/1602], Avg Loss: 4.2470, Avg Acc: 0.1836 +INFO:local_logger:Epoch[028/300], Step[1150/1602], Avg Loss: 4.2427, Avg Acc: 0.1823 +INFO:local_logger:Epoch[028/300], Step[1200/1602], Avg Loss: 4.2466, Avg Acc: 0.1844 +INFO:local_logger:Epoch[028/300], Step[1200/1602], Avg Loss: 4.2363, Avg Acc: 0.1795 +INFO:local_logger:Epoch[028/300], Step[1200/1602], Avg Loss: 4.2444, Avg Acc: 0.1814 
+INFO:local_logger:Epoch[028/300], Step[1200/1602], Avg Loss: 4.2658, Avg Acc: 0.1875 +INFO:master_logger:Epoch[028/300], Step[1200/1602], Avg Loss: 4.2483, Avg Acc: 0.1832 +INFO:local_logger:Epoch[028/300], Step[1250/1602], Avg Loss: 4.2462, Avg Acc: 0.1843 +INFO:local_logger:Epoch[028/300], Step[1250/1602], Avg Loss: 4.2368, Avg Acc: 0.1807 +INFO:local_logger:Epoch[028/300], Step[1250/1602], Avg Loss: 4.2456, Avg Acc: 0.1816 +INFO:local_logger:Epoch[028/300], Step[1250/1602], Avg Loss: 4.2667, Avg Acc: 0.1867 +INFO:master_logger:Epoch[028/300], Step[1250/1602], Avg Loss: 4.2488, Avg Acc: 0.1833 +INFO:local_logger:Epoch[028/300], Step[1300/1602], Avg Loss: 4.2683, Avg Acc: 0.1864 +INFO:local_logger:Epoch[028/300], Step[1300/1602], Avg Loss: 4.2409, Avg Acc: 0.1812 +INFO:local_logger:Epoch[028/300], Step[1300/1602], Avg Loss: 4.2466, Avg Acc: 0.1829 +INFO:local_logger:Epoch[028/300], Step[1300/1602], Avg Loss: 4.2467, Avg Acc: 0.1819 +INFO:master_logger:Epoch[028/300], Step[1300/1602], Avg Loss: 4.2506, Avg Acc: 0.1831 +INFO:local_logger:Epoch[028/300], Step[1350/1602], Avg Loss: 4.2484, Avg Acc: 0.1826 +INFO:master_logger:Epoch[028/300], Step[1350/1602], Avg Loss: 4.2515, Avg Acc: 0.1832 +INFO:local_logger:Epoch[028/300], Step[1350/1602], Avg Loss: 4.2689, Avg Acc: 0.1865 +INFO:local_logger:Epoch[028/300], Step[1350/1602], Avg Loss: 4.2413, Avg Acc: 0.1817 +INFO:local_logger:Epoch[028/300], Step[1350/1602], Avg Loss: 4.2474, Avg Acc: 0.1818 +INFO:local_logger:Epoch[028/300], Step[1400/1602], Avg Loss: 4.2493, Avg Acc: 0.1818 +INFO:local_logger:Epoch[028/300], Step[1400/1602], Avg Loss: 4.2430, Avg Acc: 0.1827 +INFO:local_logger:Epoch[028/300], Step[1400/1602], Avg Loss: 4.2428, Avg Acc: 0.1823 +INFO:master_logger:Epoch[028/300], Step[1400/1602], Avg Loss: 4.2503, Avg Acc: 0.1834 +INFO:local_logger:Epoch[028/300], Step[1400/1602], Avg Loss: 4.2662, Avg Acc: 0.1868 +INFO:local_logger:Epoch[028/300], Step[1450/1602], Avg Loss: 4.2446, Avg Acc: 0.1827 +INFO:local_logger:Epoch[028/300], Step[1450/1602], Avg Loss: 4.2421, Avg Acc: 0.1826 +INFO:local_logger:Epoch[028/300], Step[1450/1602], Avg Loss: 4.2412, Avg Acc: 0.1830 +INFO:local_logger:Epoch[028/300], Step[1450/1602], Avg Loss: 4.2646, Avg Acc: 0.1871 +INFO:master_logger:Epoch[028/300], Step[1450/1602], Avg Loss: 4.2482, Avg Acc: 0.1839 +INFO:local_logger:Epoch[028/300], Step[1500/1602], Avg Loss: 4.2415, Avg Acc: 0.1829 +INFO:local_logger:Epoch[028/300], Step[1500/1602], Avg Loss: 4.2422, Avg Acc: 0.1833 +INFO:local_logger:Epoch[028/300], Step[1500/1602], Avg Loss: 4.2380, Avg Acc: 0.1828 +INFO:local_logger:Epoch[028/300], Step[1500/1602], Avg Loss: 4.2644, Avg Acc: 0.1873 +INFO:master_logger:Epoch[028/300], Step[1500/1602], Avg Loss: 4.2465, Avg Acc: 0.1841 +INFO:local_logger:Epoch[028/300], Step[1550/1602], Avg Loss: 4.2398, Avg Acc: 0.1831 +INFO:local_logger:Epoch[028/300], Step[1550/1602], Avg Loss: 4.2611, Avg Acc: 0.1879 +INFO:local_logger:Epoch[028/300], Step[1550/1602], Avg Loss: 4.2390, Avg Acc: 0.1831 +INFO:local_logger:Epoch[028/300], Step[1550/1602], Avg Loss: 4.2422, Avg Acc: 0.1823 +INFO:master_logger:Epoch[028/300], Step[1550/1602], Avg Loss: 4.2455, Avg Acc: 0.1841 +INFO:local_logger:Epoch[028/300], Step[1600/1602], Avg Loss: 4.2403, Avg Acc: 0.1829 +INFO:local_logger:Epoch[028/300], Step[1600/1602], Avg Loss: 4.2418, Avg Acc: 0.1827 +INFO:local_logger:Epoch[028/300], Step[1600/1602], Avg Loss: 4.2419, Avg Acc: 0.1830 +INFO:master_logger:Epoch[028/300], Step[1600/1602], Avg Loss: 4.2462, Avg Acc: 0.1842 
+INFO:local_logger:Epoch[028/300], Step[1600/1602], Avg Loss: 4.2609, Avg Acc: 0.1881 +INFO:local_logger:----- Epoch[028/300], Train Loss: 4.2420, Train Acc: 0.1827, time: 3713.03 +INFO:master_logger:----- Epoch[028/300], Train Loss: 4.2464, Train Acc: 0.1842, time: 3713.03 +INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-28-Loss-4.242009024799719.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-28-Loss-4.242009024799719.pdopt +INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-28-Loss-4.242009024799719-EMA.pdparams +INFO:local_logger:Now training epoch 29. LR=0.000384 +INFO:master_logger:Now training epoch 29. LR=0.000384 +INFO:local_logger:----- Epoch[028/300], Train Loss: 4.2611, Train Acc: 0.1881, time: 3713.70 +INFO:local_logger:Now training epoch 29. LR=0.000384 +INFO:local_logger:----- Epoch[028/300], Train Loss: 4.2404, Train Acc: 0.1829, time: 3713.64 +INFO:local_logger:Now training epoch 29. LR=0.000384 +INFO:local_logger:----- Epoch[028/300], Train Loss: 4.2421, Train Acc: 0.1830, time: 3713.71 +INFO:local_logger:Now training epoch 29. LR=0.000384 +INFO:local_logger:Epoch[029/300], Step[0000/1602], Avg Loss: 4.4729, Avg Acc: 0.2300 +INFO:local_logger:Epoch[029/300], Step[0000/1602], Avg Loss: 4.7785, Avg Acc: 0.0450 +INFO:master_logger:Epoch[029/300], Step[0000/1602], Avg Loss: 3.9986, Avg Acc: 0.2612 +INFO:local_logger:Epoch[029/300], Step[0000/1602], Avg Loss: 3.2048, Avg Acc: 0.4200 +INFO:local_logger:Epoch[029/300], Step[0000/1602], Avg Loss: 3.5380, Avg Acc: 0.3500 +INFO:local_logger:Epoch[029/300], Step[0050/1602], Avg Loss: 4.3053, Avg Acc: 0.1983 +INFO:local_logger:Epoch[029/300], Step[0050/1602], Avg Loss: 4.2121, Avg Acc: 0.1379 +INFO:local_logger:Epoch[029/300], Step[0050/1602], Avg Loss: 4.2555, Avg Acc: 0.2055 +INFO:local_logger:Epoch[029/300], Step[0050/1602], Avg Loss: 4.2585, Avg Acc: 0.2108 +INFO:master_logger:Epoch[029/300], Step[0050/1602], Avg Loss: 4.2578, Avg Acc: 0.1881 +INFO:local_logger:Epoch[029/300], Step[0100/1602], Avg Loss: 4.1999, Avg Acc: 0.1661 +INFO:local_logger:Epoch[029/300], Step[0100/1602], Avg Loss: 4.3060, Avg Acc: 0.1884 +INFO:local_logger:Epoch[029/300], Step[0100/1602], Avg Loss: 4.2467, Avg Acc: 0.1759 +INFO:local_logger:Epoch[029/300], Step[0100/1602], Avg Loss: 4.2877, Avg Acc: 0.2019 +INFO:master_logger:Epoch[029/300], Step[0100/1602], Avg Loss: 4.2601, Avg Acc: 0.1831 +INFO:local_logger:Epoch[029/300], Step[0150/1602], Avg Loss: 4.2566, Avg Acc: 0.1742 +INFO:local_logger:Epoch[029/300], Step[0150/1602], Avg Loss: 4.2950, Avg Acc: 0.1857 +INFO:local_logger:Epoch[029/300], Step[0150/1602], Avg Loss: 4.3158, Avg Acc: 0.1818 +INFO:local_logger:Epoch[029/300], Step[0150/1602], Avg Loss: 4.2050, Avg Acc: 0.1714 +INFO:master_logger:Epoch[029/300], Step[0150/1602], Avg Loss: 4.2681, Avg Acc: 0.1783 +INFO:local_logger:Epoch[029/300], Step[0200/1602], Avg Loss: 4.2306, Avg Acc: 0.1824 +INFO:local_logger:Epoch[029/300], Step[0200/1602], Avg Loss: 4.2896, Avg Acc: 0.1821 +INFO:local_logger:Epoch[029/300], Step[0200/1602], Avg Loss: 4.3119, Avg Acc: 0.1858 +INFO:master_logger:Epoch[029/300], Step[0200/1602], Avg Loss: 4.2674, Avg Acc: 0.1809 +INFO:local_logger:Epoch[029/300], Step[0200/1602], Avg Loss: 4.2377, Avg Acc: 0.1731 +INFO:local_logger:Epoch[029/300], Step[0250/1602], Avg Loss: 4.2175, Avg Acc: 0.1798 +INFO:local_logger:Epoch[029/300], Step[0250/1602], Avg Loss: 4.2172, Avg Acc: 0.1842 +INFO:local_logger:Epoch[029/300], 
Step[0250/1602], Avg Loss: 4.2839, Avg Acc: 0.1806 +INFO:local_logger:Epoch[029/300], Step[0250/1602], Avg Loss: 4.2681, Avg Acc: 0.1818 +INFO:master_logger:Epoch[029/300], Step[0250/1602], Avg Loss: 4.2467, Avg Acc: 0.1816 +INFO:local_logger:Epoch[029/300], Step[0300/1602], Avg Loss: 4.2318, Avg Acc: 0.1795 +INFO:local_logger:Epoch[029/300], Step[0300/1602], Avg Loss: 4.2806, Avg Acc: 0.1806 +INFO:local_logger:Epoch[029/300], Step[0300/1602], Avg Loss: 4.2590, Avg Acc: 0.1841 +INFO:local_logger:Epoch[029/300], Step[0300/1602], Avg Loss: 4.2394, Avg Acc: 0.1838 +INFO:master_logger:Epoch[029/300], Step[0300/1602], Avg Loss: 4.2527, Avg Acc: 0.1820 +INFO:local_logger:Epoch[029/300], Step[0350/1602], Avg Loss: 4.2215, Avg Acc: 0.1809 +INFO:local_logger:Epoch[029/300], Step[0350/1602], Avg Loss: 4.2508, Avg Acc: 0.1849 +INFO:local_logger:Epoch[029/300], Step[0350/1602], Avg Loss: 4.2770, Avg Acc: 0.1756 +INFO:master_logger:Epoch[029/300], Step[0350/1602], Avg Loss: 4.2432, Avg Acc: 0.1823 +INFO:local_logger:Epoch[029/300], Step[0350/1602], Avg Loss: 4.2234, Avg Acc: 0.1877 +INFO:local_logger:Epoch[029/300], Step[0400/1602], Avg Loss: 4.2138, Avg Acc: 0.1766 +INFO:local_logger:Epoch[029/300], Step[0400/1602], Avg Loss: 4.2484, Avg Acc: 0.1818 +INFO:local_logger:Epoch[029/300], Step[0400/1602], Avg Loss: 4.2321, Avg Acc: 0.1871 +INFO:local_logger:Epoch[029/300], Step[0400/1602], Avg Loss: 4.2523, Avg Acc: 0.1796 +INFO:master_logger:Epoch[029/300], Step[0400/1602], Avg Loss: 4.2366, Avg Acc: 0.1813 +INFO:local_logger:Epoch[029/300], Step[0450/1602], Avg Loss: 4.1979, Avg Acc: 0.1830 +INFO:local_logger:Epoch[029/300], Step[0450/1602], Avg Loss: 4.2200, Avg Acc: 0.1862 +INFO:local_logger:Epoch[029/300], Step[0450/1602], Avg Loss: 4.2611, Avg Acc: 0.1801 +INFO:local_logger:Epoch[029/300], Step[0450/1602], Avg Loss: 4.2507, Avg Acc: 0.1824 +INFO:master_logger:Epoch[029/300], Step[0450/1602], Avg Loss: 4.2324, Avg Acc: 0.1829 +INFO:local_logger:Epoch[029/300], Step[0500/1602], Avg Loss: 4.2025, Avg Acc: 0.1831 +INFO:local_logger:Epoch[029/300], Step[0500/1602], Avg Loss: 4.2169, Avg Acc: 0.1881 +INFO:local_logger:Epoch[029/300], Step[0500/1602], Avg Loss: 4.2540, Avg Acc: 0.1776 +INFO:local_logger:Epoch[029/300], Step[0500/1602], Avg Loss: 4.2533, Avg Acc: 0.1850 +INFO:master_logger:Epoch[029/300], Step[0500/1602], Avg Loss: 4.2317, Avg Acc: 0.1835 +INFO:local_logger:Epoch[029/300], Step[0550/1602], Avg Loss: 4.2571, Avg Acc: 0.1762 +INFO:local_logger:Epoch[029/300], Step[0550/1602], Avg Loss: 4.2520, Avg Acc: 0.1833 +INFO:local_logger:Epoch[029/300], Step[0550/1602], Avg Loss: 4.2171, Avg Acc: 0.1843 +INFO:local_logger:Epoch[029/300], Step[0550/1602], Avg Loss: 4.2272, Avg Acc: 0.1863 +INFO:master_logger:Epoch[029/300], Step[0550/1602], Avg Loss: 4.2383, Avg Acc: 0.1825 +INFO:local_logger:Epoch[029/300], Step[0600/1602], Avg Loss: 4.2331, Avg Acc: 0.1862 +INFO:local_logger:Epoch[029/300], Step[0600/1602], Avg Loss: 4.2018, Avg Acc: 0.1849 +INFO:local_logger:Epoch[029/300], Step[0600/1602], Avg Loss: 4.2427, Avg Acc: 0.1792 +INFO:local_logger:Epoch[029/300], Step[0600/1602], Avg Loss: 4.2615, Avg Acc: 0.1832 +INFO:master_logger:Epoch[029/300], Step[0600/1602], Avg Loss: 4.2348, Avg Acc: 0.1834 +INFO:local_logger:Epoch[029/300], Step[0650/1602], Avg Loss: 4.2019, Avg Acc: 0.1867 +INFO:local_logger:Epoch[029/300], Step[0650/1602], Avg Loss: 4.2351, Avg Acc: 0.1800 +INFO:local_logger:Epoch[029/300], Step[0650/1602], Avg Loss: 4.2550, Avg Acc: 0.1832 +INFO:master_logger:Epoch[029/300], Step[0650/1602], 
Avg Loss: 4.2299, Avg Acc: 0.1841 +INFO:local_logger:Epoch[029/300], Step[0650/1602], Avg Loss: 4.2278, Avg Acc: 0.1866 +INFO:local_logger:Epoch[029/300], Step[0700/1602], Avg Loss: 4.2245, Avg Acc: 0.1866 +INFO:local_logger:Epoch[029/300], Step[0700/1602], Avg Loss: 4.2003, Avg Acc: 0.1880 +INFO:local_logger:Epoch[029/300], Step[0700/1602], Avg Loss: 4.2456, Avg Acc: 0.1785 +INFO:local_logger:Epoch[029/300], Step[0700/1602], Avg Loss: 4.2422, Avg Acc: 0.1849 +INFO:master_logger:Epoch[029/300], Step[0700/1602], Avg Loss: 4.2282, Avg Acc: 0.1845 +INFO:local_logger:Epoch[029/300], Step[0750/1602], Avg Loss: 4.2238, Avg Acc: 0.1876 +INFO:local_logger:Epoch[029/300], Step[0750/1602], Avg Loss: 4.2067, Avg Acc: 0.1895 +INFO:local_logger:Epoch[029/300], Step[0750/1602], Avg Loss: 4.2431, Avg Acc: 0.1850 +INFO:local_logger:Epoch[029/300], Step[0750/1602], Avg Loss: 4.2462, Avg Acc: 0.1789 +INFO:master_logger:Epoch[029/300], Step[0750/1602], Avg Loss: 4.2300, Avg Acc: 0.1853 +INFO:local_logger:Epoch[029/300], Step[0800/1602], Avg Loss: 4.2188, Avg Acc: 0.1883 +INFO:local_logger:Epoch[029/300], Step[0800/1602], Avg Loss: 4.2257, Avg Acc: 0.1875 +INFO:local_logger:Epoch[029/300], Step[0800/1602], Avg Loss: 4.2419, Avg Acc: 0.1797 +INFO:master_logger:Epoch[029/300], Step[0800/1602], Avg Loss: 4.2325, Avg Acc: 0.1849 +INFO:local_logger:Epoch[029/300], Step[0800/1602], Avg Loss: 4.2437, Avg Acc: 0.1841 +INFO:local_logger:Epoch[029/300], Step[0850/1602], Avg Loss: 4.2118, Avg Acc: 0.1881 +INFO:local_logger:Epoch[029/300], Step[0850/1602], Avg Loss: 4.2304, Avg Acc: 0.1882 +INFO:local_logger:Epoch[029/300], Step[0850/1602], Avg Loss: 4.2415, Avg Acc: 0.1861 +INFO:local_logger:Epoch[029/300], Step[0850/1602], Avg Loss: 4.2413, Avg Acc: 0.1797 +INFO:master_logger:Epoch[029/300], Step[0850/1602], Avg Loss: 4.2313, Avg Acc: 0.1856 +INFO:local_logger:Epoch[029/300], Step[0900/1602], Avg Loss: 4.2150, Avg Acc: 0.1872 +INFO:local_logger:Epoch[029/300], Step[0900/1602], Avg Loss: 4.2384, Avg Acc: 0.1812 +INFO:local_logger:Epoch[029/300], Step[0900/1602], Avg Loss: 4.2421, Avg Acc: 0.1856 +INFO:local_logger:Epoch[029/300], Step[0900/1602], Avg Loss: 4.2224, Avg Acc: 0.1885 +INFO:master_logger:Epoch[029/300], Step[0900/1602], Avg Loss: 4.2295, Avg Acc: 0.1856 +INFO:local_logger:Epoch[029/300], Step[0950/1602], Avg Loss: 4.2137, Avg Acc: 0.1874 +INFO:local_logger:Epoch[029/300], Step[0950/1602], Avg Loss: 4.2390, Avg Acc: 0.1797 +INFO:local_logger:Epoch[029/300], Step[0950/1602], Avg Loss: 4.2413, Avg Acc: 0.1856 +INFO:master_logger:Epoch[029/300], Step[0950/1602], Avg Loss: 4.2287, Avg Acc: 0.1857 +INFO:local_logger:Epoch[029/300], Step[0950/1602], Avg Loss: 4.2208, Avg Acc: 0.1900 +INFO:local_logger:Epoch[029/300], Step[1000/1602], Avg Loss: 4.2219, Avg Acc: 0.1870 +INFO:local_logger:Epoch[029/300], Step[1000/1602], Avg Loss: 4.2367, Avg Acc: 0.1788 +INFO:local_logger:Epoch[029/300], Step[1000/1602], Avg Loss: 4.2412, Avg Acc: 0.1857 +INFO:local_logger:Epoch[029/300], Step[1000/1602], Avg Loss: 4.2173, Avg Acc: 0.1915 +INFO:master_logger:Epoch[029/300], Step[1000/1602], Avg Loss: 4.2293, Avg Acc: 0.1858 +INFO:local_logger:Epoch[029/300], Step[1050/1602], Avg Loss: 4.2224, Avg Acc: 0.1903 +INFO:local_logger:Epoch[029/300], Step[1050/1602], Avg Loss: 4.2376, Avg Acc: 0.1852 +INFO:local_logger:Epoch[029/300], Step[1050/1602], Avg Loss: 4.2352, Avg Acc: 0.1809 +INFO:local_logger:Epoch[029/300], Step[1050/1602], Avg Loss: 4.2189, Avg Acc: 0.1874 +INFO:master_logger:Epoch[029/300], Step[1050/1602], Avg Loss: 4.2285, 
Avg Acc: 0.1860 +INFO:local_logger:Epoch[029/300], Step[1100/1602], Avg Loss: 4.2173, Avg Acc: 0.1908 +INFO:local_logger:Epoch[029/300], Step[1100/1602], Avg Loss: 4.2206, Avg Acc: 0.1871 +INFO:local_logger:Epoch[029/300], Step[1100/1602], Avg Loss: 4.2360, Avg Acc: 0.1860 +INFO:local_logger:Epoch[029/300], Step[1100/1602], Avg Loss: 4.2292, Avg Acc: 0.1815 +INFO:master_logger:Epoch[029/300], Step[1100/1602], Avg Loss: 4.2257, Avg Acc: 0.1863 +INFO:local_logger:Epoch[029/300], Step[1150/1602], Avg Loss: 4.2291, Avg Acc: 0.1809 +INFO:local_logger:Epoch[029/300], Step[1150/1602], Avg Loss: 4.2204, Avg Acc: 0.1873 +INFO:local_logger:Epoch[029/300], Step[1150/1602], Avg Loss: 4.2392, Avg Acc: 0.1858 +INFO:local_logger:Epoch[029/300], Step[1150/1602], Avg Loss: 4.2196, Avg Acc: 0.1903 +INFO:master_logger:Epoch[029/300], Step[1150/1602], Avg Loss: 4.2271, Avg Acc: 0.1861 +INFO:local_logger:Epoch[029/300], Step[1200/1602], Avg Loss: 4.2236, Avg Acc: 0.1876 +INFO:local_logger:Epoch[029/300], Step[1200/1602], Avg Loss: 4.2405, Avg Acc: 0.1859 +INFO:local_logger:Epoch[029/300], Step[1200/1602], Avg Loss: 4.2276, Avg Acc: 0.1818 +INFO:local_logger:Epoch[029/300], Step[1200/1602], Avg Loss: 4.2195, Avg Acc: 0.1902 +INFO:master_logger:Epoch[029/300], Step[1200/1602], Avg Loss: 4.2278, Avg Acc: 0.1863 +INFO:local_logger:Epoch[029/300], Step[1250/1602], Avg Loss: 4.2235, Avg Acc: 0.1871 +INFO:local_logger:Epoch[029/300], Step[1250/1602], Avg Loss: 4.2285, Avg Acc: 0.1825 +INFO:local_logger:Epoch[029/300], Step[1250/1602], Avg Loss: 4.2187, Avg Acc: 0.1889 +INFO:master_logger:Epoch[029/300], Step[1250/1602], Avg Loss: 4.2272, Avg Acc: 0.1862 +INFO:local_logger:Epoch[029/300], Step[1250/1602], Avg Loss: 4.2380, Avg Acc: 0.1863 +INFO:local_logger:Epoch[029/300], Step[1300/1602], Avg Loss: 4.2165, Avg Acc: 0.1865 +INFO:local_logger:Epoch[029/300], Step[1300/1602], Avg Loss: 4.2282, Avg Acc: 0.1830 +INFO:local_logger:Epoch[029/300], Step[1300/1602], Avg Loss: 4.2217, Avg Acc: 0.1888 +INFO:local_logger:Epoch[029/300], Step[1300/1602], Avg Loss: 4.2380, Avg Acc: 0.1854 +INFO:master_logger:Epoch[029/300], Step[1300/1602], Avg Loss: 4.2261, Avg Acc: 0.1860 +INFO:local_logger:Epoch[029/300], Step[1350/1602], Avg Loss: 4.2202, Avg Acc: 0.1853 +INFO:local_logger:Epoch[029/300], Step[1350/1602], Avg Loss: 4.2349, Avg Acc: 0.1830 +INFO:local_logger:Epoch[029/300], Step[1350/1602], Avg Loss: 4.2321, Avg Acc: 0.1861 +INFO:local_logger:Epoch[029/300], Step[1350/1602], Avg Loss: 4.2223, Avg Acc: 0.1883 +INFO:master_logger:Epoch[029/300], Step[1350/1602], Avg Loss: 4.2274, Avg Acc: 0.1857 +INFO:local_logger:Epoch[029/300], Step[1400/1602], Avg Loss: 4.2230, Avg Acc: 0.1849 +INFO:master_logger:Epoch[029/300], Step[1400/1602], Avg Loss: 4.2277, Avg Acc: 0.1862 +INFO:local_logger:Epoch[029/300], Step[1400/1602], Avg Loss: 4.2348, Avg Acc: 0.1835 +INFO:local_logger:Epoch[029/300], Step[1400/1602], Avg Loss: 4.2375, Avg Acc: 0.1868 +INFO:local_logger:Epoch[029/300], Step[1400/1602], Avg Loss: 4.2157, Avg Acc: 0.1897 +INFO:local_logger:Epoch[029/300], Step[1450/1602], Avg Loss: 4.2197, Avg Acc: 0.1856 +INFO:local_logger:Epoch[029/300], Step[1450/1602], Avg Loss: 4.2387, Avg Acc: 0.1865 +INFO:local_logger:Epoch[029/300], Step[1450/1602], Avg Loss: 4.2315, Avg Acc: 0.1835 +INFO:local_logger:Epoch[029/300], Step[1450/1602], Avg Loss: 4.2155, Avg Acc: 0.1906 +INFO:master_logger:Epoch[029/300], Step[1450/1602], Avg Loss: 4.2263, Avg Acc: 0.1866 +INFO:local_logger:Epoch[029/300], Step[1500/1602], Avg Loss: 4.2180, Avg Acc: 0.1857 
+INFO:local_logger:Epoch[029/300], Step[1500/1602], Avg Loss: 4.2356, Avg Acc: 0.1858 +INFO:local_logger:Epoch[029/300], Step[1500/1602], Avg Loss: 4.2315, Avg Acc: 0.1834 +INFO:local_logger:Epoch[029/300], Step[1500/1602], Avg Loss: 4.2121, Avg Acc: 0.1907 +INFO:master_logger:Epoch[029/300], Step[1500/1602], Avg Loss: 4.2243, Avg Acc: 0.1864 +INFO:local_logger:Epoch[029/300], Step[1550/1602], Avg Loss: 4.2195, Avg Acc: 0.1854 +INFO:local_logger:Epoch[029/300], Step[1550/1602], Avg Loss: 4.2113, Avg Acc: 0.1913 +INFO:local_logger:Epoch[029/300], Step[1550/1602], Avg Loss: 4.2406, Avg Acc: 0.1859 +INFO:local_logger:Epoch[029/300], Step[1550/1602], Avg Loss: 4.2342, Avg Acc: 0.1827 +INFO:master_logger:Epoch[029/300], Step[1550/1602], Avg Loss: 4.2264, Avg Acc: 0.1863 +INFO:local_logger:Epoch[029/300], Step[1600/1602], Avg Loss: 4.2099, Avg Acc: 0.1907 +INFO:local_logger:Epoch[029/300], Step[1600/1602], Avg Loss: 4.2344, Avg Acc: 0.1834 +INFO:local_logger:Epoch[029/300], Step[1600/1602], Avg Loss: 4.2395, Avg Acc: 0.1867 +INFO:local_logger:Epoch[029/300], Step[1600/1602], Avg Loss: 4.2180, Avg Acc: 0.1859 +INFO:master_logger:Epoch[029/300], Step[1600/1602], Avg Loss: 4.2254, Avg Acc: 0.1867 +INFO:local_logger:----- Epoch[029/300], Train Loss: 4.2345, Train Acc: 0.1834, time: 3714.53 +INFO:local_logger:Now training epoch 30. LR=0.000384 +INFO:local_logger:----- Epoch[029/300], Train Loss: 4.2097, Train Acc: 0.1907, time: 3714.55 +INFO:local_logger:Now training epoch 30. LR=0.000384 +INFO:local_logger:----- Epoch[029/300], Train Loss: 4.2394, Train Acc: 0.1868, time: 3714.55 +INFO:local_logger:Now training epoch 30. LR=0.000384 +INFO:local_logger:----- Epoch[029/300], Train Loss: 4.2177, Train Acc: 0.1858, time: 3714.62 +INFO:master_logger:----- Epoch[029/300], Train Loss: 4.2253, Train Acc: 0.1867, time: 3714.62 +INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-29-Loss-4.217677223495267.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-29-Loss-4.217677223495267.pdopt +INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-29-Loss-4.217677223495267-EMA.pdparams +INFO:local_logger:Now training epoch 30. LR=0.000384 +INFO:master_logger:Now training epoch 30. 
LR=0.000384 +INFO:local_logger:Epoch[030/300], Step[0000/1602], Avg Loss: 4.7425, Avg Acc: 0.2150 +INFO:local_logger:Epoch[030/300], Step[0000/1602], Avg Loss: 4.6852, Avg Acc: 0.1500 +INFO:local_logger:Epoch[030/300], Step[0000/1602], Avg Loss: 4.1387, Avg Acc: 0.3100 +INFO:local_logger:Epoch[030/300], Step[0000/1602], Avg Loss: 3.3675, Avg Acc: 0.0100 +INFO:master_logger:Epoch[030/300], Step[0000/1602], Avg Loss: 4.2335, Avg Acc: 0.1713 +INFO:local_logger:Epoch[030/300], Step[0050/1602], Avg Loss: 4.2591, Avg Acc: 0.1915 +INFO:local_logger:Epoch[030/300], Step[0050/1602], Avg Loss: 4.1327, Avg Acc: 0.1769 +INFO:local_logger:Epoch[030/300], Step[0050/1602], Avg Loss: 4.2368, Avg Acc: 0.1728 +INFO:local_logger:Epoch[030/300], Step[0050/1602], Avg Loss: 4.1202, Avg Acc: 0.2080 +INFO:master_logger:Epoch[030/300], Step[0050/1602], Avg Loss: 4.1872, Avg Acc: 0.1873 +INFO:local_logger:Epoch[030/300], Step[0100/1602], Avg Loss: 4.2166, Avg Acc: 0.2133 +INFO:local_logger:Epoch[030/300], Step[0100/1602], Avg Loss: 4.2152, Avg Acc: 0.1687 +INFO:local_logger:Epoch[030/300], Step[0100/1602], Avg Loss: 4.1316, Avg Acc: 0.1666 +INFO:local_logger:Epoch[030/300], Step[0100/1602], Avg Loss: 4.1968, Avg Acc: 0.1837 +INFO:master_logger:Epoch[030/300], Step[0100/1602], Avg Loss: 4.1900, Avg Acc: 0.1831 +INFO:local_logger:Epoch[030/300], Step[0150/1602], Avg Loss: 4.1981, Avg Acc: 0.1838 +INFO:local_logger:Epoch[030/300], Step[0150/1602], Avg Loss: 4.2876, Avg Acc: 0.2148 +INFO:local_logger:Epoch[030/300], Step[0150/1602], Avg Loss: 4.2147, Avg Acc: 0.1650 +INFO:local_logger:Epoch[030/300], Step[0150/1602], Avg Loss: 4.1227, Avg Acc: 0.1801 +INFO:master_logger:Epoch[030/300], Step[0150/1602], Avg Loss: 4.2057, Avg Acc: 0.1859 +INFO:local_logger:Epoch[030/300], Step[0200/1602], Avg Loss: 4.2805, Avg Acc: 0.2096 +INFO:local_logger:Epoch[030/300], Step[0200/1602], Avg Loss: 4.2161, Avg Acc: 0.1726 +INFO:local_logger:Epoch[030/300], Step[0200/1602], Avg Loss: 4.1446, Avg Acc: 0.1868 +INFO:local_logger:Epoch[030/300], Step[0200/1602], Avg Loss: 4.2039, Avg Acc: 0.1805 +INFO:master_logger:Epoch[030/300], Step[0200/1602], Avg Loss: 4.2113, Avg Acc: 0.1874 +INFO:local_logger:Epoch[030/300], Step[0250/1602], Avg Loss: 4.2518, Avg Acc: 0.2103 +INFO:local_logger:Epoch[030/300], Step[0250/1602], Avg Loss: 4.2356, Avg Acc: 0.1739 +INFO:local_logger:Epoch[030/300], Step[0250/1602], Avg Loss: 4.1658, Avg Acc: 0.1867 +INFO:master_logger:Epoch[030/300], Step[0250/1602], Avg Loss: 4.2160, Avg Acc: 0.1883 +INFO:local_logger:Epoch[030/300], Step[0250/1602], Avg Loss: 4.2108, Avg Acc: 0.1824 +INFO:local_logger:Epoch[030/300], Step[0300/1602], Avg Loss: 4.2362, Avg Acc: 0.2068 +INFO:local_logger:Epoch[030/300], Step[0300/1602], Avg Loss: 4.1844, Avg Acc: 0.1800 +INFO:local_logger:Epoch[030/300], Step[0300/1602], Avg Loss: 4.2216, Avg Acc: 0.1788 +INFO:master_logger:Epoch[030/300], Step[0300/1602], Avg Loss: 4.2170, Avg Acc: 0.1862 +INFO:local_logger:Epoch[030/300], Step[0300/1602], Avg Loss: 4.2259, Avg Acc: 0.1791 +INFO:local_logger:Epoch[030/300], Step[0350/1602], Avg Loss: 4.2054, Avg Acc: 0.1799 +INFO:local_logger:Epoch[030/300], Step[0350/1602], Avg Loss: 4.2338, Avg Acc: 0.2032 +INFO:local_logger:Epoch[030/300], Step[0350/1602], Avg Loss: 4.2039, Avg Acc: 0.1783 +INFO:local_logger:Epoch[030/300], Step[0350/1602], Avg Loss: 4.2325, Avg Acc: 0.1762 +INFO:master_logger:Epoch[030/300], Step[0350/1602], Avg Loss: 4.2189, Avg Acc: 0.1844 +INFO:local_logger:Epoch[030/300], Step[0400/1602], Avg Loss: 4.2094, Avg Acc: 0.2087 
+INFO:local_logger:Epoch[030/300], Step[0400/1602], Avg Loss: 4.2353, Avg Acc: 0.1798 +INFO:local_logger:Epoch[030/300], Step[0400/1602], Avg Loss: 4.1931, Avg Acc: 0.1798 +INFO:master_logger:Epoch[030/300], Step[0400/1602], Avg Loss: 4.2101, Avg Acc: 0.1886 +INFO:local_logger:Epoch[030/300], Step[0400/1602], Avg Loss: 4.2027, Avg Acc: 0.1861 +INFO:local_logger:Epoch[030/300], Step[0450/1602], Avg Loss: 4.2129, Avg Acc: 0.2053 +INFO:local_logger:Epoch[030/300], Step[0450/1602], Avg Loss: 4.1990, Avg Acc: 0.1895 +INFO:master_logger:Epoch[030/300], Step[0450/1602], Avg Loss: 4.2103, Avg Acc: 0.1886 +INFO:local_logger:Epoch[030/300], Step[0450/1602], Avg Loss: 4.2002, Avg Acc: 0.1809 +INFO:local_logger:Epoch[030/300], Step[0450/1602], Avg Loss: 4.2292, Avg Acc: 0.1789 +INFO:local_logger:Epoch[030/300], Step[0500/1602], Avg Loss: 4.2349, Avg Acc: 0.1766 +INFO:local_logger:Epoch[030/300], Step[0500/1602], Avg Loss: 4.2037, Avg Acc: 0.1821 +INFO:local_logger:Epoch[030/300], Step[0500/1602], Avg Loss: 4.2276, Avg Acc: 0.2050 +INFO:local_logger:Epoch[030/300], Step[0500/1602], Avg Loss: 4.2009, Avg Acc: 0.1880 +INFO:master_logger:Epoch[030/300], Step[0500/1602], Avg Loss: 4.2168, Avg Acc: 0.1879 +INFO:local_logger:Epoch[030/300], Step[0550/1602], Avg Loss: 4.2197, Avg Acc: 0.2040 +INFO:local_logger:Epoch[030/300], Step[0550/1602], Avg Loss: 4.2048, Avg Acc: 0.1917 +INFO:local_logger:Epoch[030/300], Step[0550/1602], Avg Loss: 4.1938, Avg Acc: 0.1825 +INFO:local_logger:Epoch[030/300], Step[0550/1602], Avg Loss: 4.2376, Avg Acc: 0.1762 +INFO:master_logger:Epoch[030/300], Step[0550/1602], Avg Loss: 4.2140, Avg Acc: 0.1886 +INFO:local_logger:Epoch[030/300], Step[0600/1602], Avg Loss: 4.2267, Avg Acc: 0.2017 +INFO:local_logger:Epoch[030/300], Step[0600/1602], Avg Loss: 4.2367, Avg Acc: 0.1761 +INFO:local_logger:Epoch[030/300], Step[0600/1602], Avg Loss: 4.1973, Avg Acc: 0.1834 +INFO:local_logger:Epoch[030/300], Step[0600/1602], Avg Loss: 4.1988, Avg Acc: 0.1941 +INFO:master_logger:Epoch[030/300], Step[0600/1602], Avg Loss: 4.2149, Avg Acc: 0.1888 +INFO:local_logger:Epoch[030/300], Step[0650/1602], Avg Loss: 4.2008, Avg Acc: 0.1956 +INFO:local_logger:Epoch[030/300], Step[0650/1602], Avg Loss: 4.2324, Avg Acc: 0.1992 +INFO:local_logger:Epoch[030/300], Step[0650/1602], Avg Loss: 4.2432, Avg Acc: 0.1761 +INFO:local_logger:Epoch[030/300], Step[0650/1602], Avg Loss: 4.1964, Avg Acc: 0.1866 +INFO:master_logger:Epoch[030/300], Step[0650/1602], Avg Loss: 4.2182, Avg Acc: 0.1894 +INFO:local_logger:Epoch[030/300], Step[0700/1602], Avg Loss: 4.2327, Avg Acc: 0.1959 +INFO:local_logger:Epoch[030/300], Step[0700/1602], Avg Loss: 4.2385, Avg Acc: 0.1771 +INFO:local_logger:Epoch[030/300], Step[0700/1602], Avg Loss: 4.2003, Avg Acc: 0.1875 +INFO:local_logger:Epoch[030/300], Step[0700/1602], Avg Loss: 4.1973, Avg Acc: 0.1963 +INFO:master_logger:Epoch[030/300], Step[0700/1602], Avg Loss: 4.2172, Avg Acc: 0.1892 +INFO:local_logger:Epoch[030/300], Step[0750/1602], Avg Loss: 4.2297, Avg Acc: 0.1962 +INFO:local_logger:Epoch[030/300], Step[0750/1602], Avg Loss: 4.2373, Avg Acc: 0.1776 +INFO:local_logger:Epoch[030/300], Step[0750/1602], Avg Loss: 4.1951, Avg Acc: 0.1889 +INFO:master_logger:Epoch[030/300], Step[0750/1602], Avg Loss: 4.2133, Avg Acc: 0.1897 +INFO:local_logger:Epoch[030/300], Step[0750/1602], Avg Loss: 4.1911, Avg Acc: 0.1961 +INFO:local_logger:Epoch[030/300], Step[0800/1602], Avg Loss: 4.1967, Avg Acc: 0.1950 +INFO:local_logger:Epoch[030/300], Step[0800/1602], Avg Loss: 4.2257, Avg Acc: 0.1959 
+INFO:local_logger:Epoch[030/300], Step[0800/1602], Avg Loss: 4.2396, Avg Acc: 0.1769 +INFO:master_logger:Epoch[030/300], Step[0800/1602], Avg Loss: 4.2164, Avg Acc: 0.1890 +INFO:local_logger:Epoch[030/300], Step[0800/1602], Avg Loss: 4.2038, Avg Acc: 0.1882 +INFO:local_logger:Epoch[030/300], Step[0850/1602], Avg Loss: 4.2228, Avg Acc: 0.1961 +INFO:local_logger:Epoch[030/300], Step[0850/1602], Avg Loss: 4.2107, Avg Acc: 0.1888 +INFO:local_logger:Epoch[030/300], Step[0850/1602], Avg Loss: 4.2329, Avg Acc: 0.1781 +INFO:local_logger:Epoch[030/300], Step[0850/1602], Avg Loss: 4.1970, Avg Acc: 0.1926 +INFO:master_logger:Epoch[030/300], Step[0850/1602], Avg Loss: 4.2158, Avg Acc: 0.1889 +INFO:local_logger:Epoch[030/300], Step[0900/1602], Avg Loss: 4.2268, Avg Acc: 0.1952 +INFO:local_logger:Epoch[030/300], Step[0900/1602], Avg Loss: 4.2105, Avg Acc: 0.1899 +INFO:local_logger:Epoch[030/300], Step[0900/1602], Avg Loss: 4.1948, Avg Acc: 0.1927 +INFO:local_logger:Epoch[030/300], Step[0900/1602], Avg Loss: 4.2235, Avg Acc: 0.1800 +INFO:master_logger:Epoch[030/300], Step[0900/1602], Avg Loss: 4.2139, Avg Acc: 0.1894 +INFO:local_logger:Epoch[030/300], Step[0950/1602], Avg Loss: 4.2221, Avg Acc: 0.1963 +INFO:local_logger:Epoch[030/300], Step[0950/1602], Avg Loss: 4.2064, Avg Acc: 0.1897 +INFO:local_logger:Epoch[030/300], Step[0950/1602], Avg Loss: 4.2303, Avg Acc: 0.1792 +INFO:local_logger:Epoch[030/300], Step[0950/1602], Avg Loss: 4.1937, Avg Acc: 0.1930 +INFO:master_logger:Epoch[030/300], Step[0950/1602], Avg Loss: 4.2131, Avg Acc: 0.1895 +INFO:local_logger:Epoch[030/300], Step[1000/1602], Avg Loss: 4.2195, Avg Acc: 0.1959 +INFO:local_logger:Epoch[030/300], Step[1000/1602], Avg Loss: 4.1966, Avg Acc: 0.1931 +INFO:local_logger:Epoch[030/300], Step[1000/1602], Avg Loss: 4.2271, Avg Acc: 0.1788 +INFO:local_logger:Epoch[030/300], Step[1000/1602], Avg Loss: 4.2066, Avg Acc: 0.1891 +INFO:master_logger:Epoch[030/300], Step[1000/1602], Avg Loss: 4.2124, Avg Acc: 0.1892 +INFO:local_logger:Epoch[030/300], Step[1050/1602], Avg Loss: 4.2152, Avg Acc: 0.1962 +INFO:master_logger:Epoch[030/300], Step[1050/1602], Avg Loss: 4.2108, Avg Acc: 0.1899 +INFO:local_logger:Epoch[030/300], Step[1050/1602], Avg Loss: 4.1940, Avg Acc: 0.1944 +INFO:local_logger:Epoch[030/300], Step[1050/1602], Avg Loss: 4.2108, Avg Acc: 0.1890 +INFO:local_logger:Epoch[030/300], Step[1050/1602], Avg Loss: 4.2234, Avg Acc: 0.1800 +INFO:local_logger:Epoch[030/300], Step[1100/1602], Avg Loss: 4.1985, Avg Acc: 0.1924 +INFO:local_logger:Epoch[030/300], Step[1100/1602], Avg Loss: 4.2234, Avg Acc: 0.1807 +INFO:local_logger:Epoch[030/300], Step[1100/1602], Avg Loss: 4.2133, Avg Acc: 0.1967 +INFO:local_logger:Epoch[030/300], Step[1100/1602], Avg Loss: 4.2121, Avg Acc: 0.1892 +INFO:master_logger:Epoch[030/300], Step[1100/1602], Avg Loss: 4.2118, Avg Acc: 0.1898 +INFO:local_logger:Epoch[030/300], Step[1150/1602], Avg Loss: 4.1965, Avg Acc: 0.1925 +INFO:local_logger:Epoch[030/300], Step[1150/1602], Avg Loss: 4.2136, Avg Acc: 0.1955 +INFO:local_logger:Epoch[030/300], Step[1150/1602], Avg Loss: 4.2146, Avg Acc: 0.1896 +INFO:local_logger:Epoch[030/300], Step[1150/1602], Avg Loss: 4.2273, Avg Acc: 0.1809 +INFO:master_logger:Epoch[030/300], Step[1150/1602], Avg Loss: 4.2130, Avg Acc: 0.1896 +INFO:local_logger:Epoch[030/300], Step[1200/1602], Avg Loss: 4.2164, Avg Acc: 0.1955 +INFO:local_logger:Epoch[030/300], Step[1200/1602], Avg Loss: 4.1996, Avg Acc: 0.1916 +INFO:local_logger:Epoch[030/300], Step[1200/1602], Avg Loss: 4.2267, Avg Acc: 0.1820 
+INFO:master_logger:Epoch[030/300], Step[1200/1602], Avg Loss: 4.2144, Avg Acc: 0.1895 +INFO:local_logger:Epoch[030/300], Step[1200/1602], Avg Loss: 4.2150, Avg Acc: 0.1891 +INFO:local_logger:Epoch[030/300], Step[1250/1602], Avg Loss: 4.2117, Avg Acc: 0.1968 +INFO:local_logger:Epoch[030/300], Step[1250/1602], Avg Loss: 4.2008, Avg Acc: 0.1908 +INFO:master_logger:Epoch[030/300], Step[1250/1602], Avg Loss: 4.2160, Avg Acc: 0.1900 +INFO:local_logger:Epoch[030/300], Step[1250/1602], Avg Loss: 4.2217, Avg Acc: 0.1898 +INFO:local_logger:Epoch[030/300], Step[1250/1602], Avg Loss: 4.2297, Avg Acc: 0.1825 +INFO:local_logger:Epoch[030/300], Step[1300/1602], Avg Loss: 4.2077, Avg Acc: 0.1967 +INFO:local_logger:Epoch[030/300], Step[1300/1602], Avg Loss: 4.2315, Avg Acc: 0.1823 +INFO:local_logger:Epoch[030/300], Step[1300/1602], Avg Loss: 4.2173, Avg Acc: 0.1897 +INFO:master_logger:Epoch[030/300], Step[1300/1602], Avg Loss: 4.2151, Avg Acc: 0.1900 +INFO:local_logger:Epoch[030/300], Step[1300/1602], Avg Loss: 4.2042, Avg Acc: 0.1914 +INFO:local_logger:Epoch[030/300], Step[1350/1602], Avg Loss: 4.2093, Avg Acc: 0.1959 +INFO:local_logger:Epoch[030/300], Step[1350/1602], Avg Loss: 4.2198, Avg Acc: 0.1895 +INFO:local_logger:Epoch[030/300], Step[1350/1602], Avg Loss: 4.2249, Avg Acc: 0.1831 +INFO:local_logger:Epoch[030/300], Step[1350/1602], Avg Loss: 4.2047, Avg Acc: 0.1911 +INFO:master_logger:Epoch[030/300], Step[1350/1602], Avg Loss: 4.2147, Avg Acc: 0.1899 +INFO:local_logger:Epoch[030/300], Step[1400/1602], Avg Loss: 4.2064, Avg Acc: 0.1913 +INFO:local_logger:Epoch[030/300], Step[1400/1602], Avg Loss: 4.2053, Avg Acc: 0.1949 +INFO:local_logger:Epoch[030/300], Step[1400/1602], Avg Loss: 4.2256, Avg Acc: 0.1825 +INFO:local_logger:Epoch[030/300], Step[1400/1602], Avg Loss: 4.2190, Avg Acc: 0.1897 +INFO:master_logger:Epoch[030/300], Step[1400/1602], Avg Loss: 4.2141, Avg Acc: 0.1896 +INFO:local_logger:Epoch[030/300], Step[1450/1602], Avg Loss: 4.2084, Avg Acc: 0.1941 +INFO:local_logger:Epoch[030/300], Step[1450/1602], Avg Loss: 4.2024, Avg Acc: 0.1924 +INFO:local_logger:Epoch[030/300], Step[1450/1602], Avg Loss: 4.2252, Avg Acc: 0.1821 +INFO:master_logger:Epoch[030/300], Step[1450/1602], Avg Loss: 4.2136, Avg Acc: 0.1896 +INFO:local_logger:Epoch[030/300], Step[1450/1602], Avg Loss: 4.2184, Avg Acc: 0.1899 +INFO:local_logger:Epoch[030/300], Step[1500/1602], Avg Loss: 4.2078, Avg Acc: 0.1948 +INFO:local_logger:Epoch[030/300], Step[1500/1602], Avg Loss: 4.2215, Avg Acc: 0.1891 +INFO:master_logger:Epoch[030/300], Step[1500/1602], Avg Loss: 4.2152, Avg Acc: 0.1896 +INFO:local_logger:Epoch[030/300], Step[1500/1602], Avg Loss: 4.2253, Avg Acc: 0.1822 +INFO:local_logger:Epoch[030/300], Step[1500/1602], Avg Loss: 4.2064, Avg Acc: 0.1923 +INFO:local_logger:Epoch[030/300], Step[1550/1602], Avg Loss: 4.2067, Avg Acc: 0.1950 +INFO:local_logger:Epoch[030/300], Step[1550/1602], Avg Loss: 4.2264, Avg Acc: 0.1834 +INFO:local_logger:Epoch[030/300], Step[1550/1602], Avg Loss: 4.2196, Avg Acc: 0.1904 +INFO:local_logger:Epoch[030/300], Step[1550/1602], Avg Loss: 4.2055, Avg Acc: 0.1924 +INFO:master_logger:Epoch[030/300], Step[1550/1602], Avg Loss: 4.2146, Avg Acc: 0.1903 +INFO:local_logger:Epoch[030/300], Step[1600/1602], Avg Loss: 4.2239, Avg Acc: 0.1835 +INFO:local_logger:Epoch[030/300], Step[1600/1602], Avg Loss: 4.2083, Avg Acc: 0.1951 +INFO:local_logger:Epoch[030/300], Step[1600/1602], Avg Loss: 4.2138, Avg Acc: 0.1907 +INFO:local_logger:Epoch[030/300], Step[1600/1602], Avg Loss: 4.2061, Avg Acc: 0.1916 
+INFO:master_logger:Epoch[030/300], Step[1600/1602], Avg Loss: 4.2130, Avg Acc: 0.1902 +INFO:local_logger:----- Epoch[030/300], Train Loss: 4.2061, Train Acc: 0.1915, time: 3710.19 +INFO:local_logger:----- Validation after Epoch: 30 +INFO:local_logger:----- Epoch[030/300], Train Loss: 4.2237, Train Acc: 0.1835, time: 3710.25 +INFO:local_logger:----- Validation after Epoch: 30 +INFO:local_logger:----- Epoch[030/300], Train Loss: 4.2138, Train Acc: 0.1907, time: 3710.23 +INFO:local_logger:----- Validation after Epoch: 30 +INFO:local_logger:----- Epoch[030/300], Train Loss: 4.2085, Train Acc: 0.1951, time: 3709.98 +INFO:master_logger:----- Epoch[030/300], Train Loss: 4.2130, Train Acc: 0.1902, time: 3709.98 +INFO:local_logger:----- Validation after Epoch: 30 +INFO:master_logger:----- Validation after Epoch: 30 +INFO:local_logger:Val Step[0000/1563], Avg Loss: 0.2170, Avg Acc@1: 1.0000, Avg Acc@5: 1.0000 +INFO:local_logger:Val Step[0000/1563], Avg Loss: 0.2581, Avg Acc@1: 1.0000, Avg Acc@5: 1.0000 +INFO:local_logger:Val Step[0000/1563], Avg Loss: 1.3611, Avg Acc@1: 0.8750, Avg Acc@5: 0.8750 +INFO:local_logger:Val Step[0000/1563], Avg Loss: 0.8590, Avg Acc@1: 0.8750, Avg Acc@5: 1.0000 +INFO:master_logger:Val Step[0000/1563], Avg Loss: 0.6738, Avg Acc@1: 0.9375, Avg Acc@5: 0.9688 +INFO:local_logger:Val Step[0050/1563], Avg Loss: 1.2271, Avg Acc@1: 0.7230, Avg Acc@5: 0.9093 +INFO:local_logger:Val Step[0050/1563], Avg Loss: 1.1581, Avg Acc@1: 0.7255, Avg Acc@5: 0.8995 +INFO:local_logger:Val Step[0050/1563], Avg Loss: 1.3720, Avg Acc@1: 0.6765, Avg Acc@5: 0.8652 +INFO:local_logger:Val Step[0050/1563], Avg Loss: 1.3619, Avg Acc@1: 0.6985, Avg Acc@5: 0.8676 +INFO:master_logger:Val Step[0050/1563], Avg Loss: 1.2798, Avg Acc@1: 0.7059, Avg Acc@5: 0.8854 +INFO:local_logger:Val Step[0100/1563], Avg Loss: 1.7099, Avg Acc@1: 0.6225, Avg Acc@5: 0.8267 +INFO:local_logger:Val Step[0100/1563], Avg Loss: 1.7638, Avg Acc@1: 0.5928, Avg Acc@5: 0.8094 +INFO:local_logger:Val Step[0100/1563], Avg Loss: 1.7850, Avg Acc@1: 0.5879, Avg Acc@5: 0.8094 +INFO:master_logger:Val Step[0100/1563], Avg Loss: 1.7230, Avg Acc@1: 0.6058, Avg Acc@5: 0.8205 +INFO:local_logger:Val Step[0100/1563], Avg Loss: 1.6333, Avg Acc@1: 0.6200, Avg Acc@5: 0.8366 +INFO:local_logger:Val Step[0150/1563], Avg Loss: 1.6270, Avg Acc@1: 0.6250, Avg Acc@5: 0.8320 +INFO:local_logger:Val Step[0150/1563], Avg Loss: 1.5231, Avg Acc@1: 0.6465, Avg Acc@5: 0.8460 +INFO:local_logger:Val Step[0150/1563], Avg Loss: 1.6435, Avg Acc@1: 0.6242, Avg Acc@5: 0.8295 +INFO:master_logger:Val Step[0150/1563], Avg Loss: 1.5869, Avg Acc@1: 0.6372, Avg Acc@5: 0.8382 +INFO:local_logger:Val Step[0150/1563], Avg Loss: 1.5539, Avg Acc@1: 0.6531, Avg Acc@5: 0.8452 +INFO:local_logger:Val Step[0200/1563], Avg Loss: 1.6859, Avg Acc@1: 0.6138, Avg Acc@5: 0.8246 +INFO:local_logger:Val Step[0200/1563], Avg Loss: 1.5803, Avg Acc@1: 0.6325, Avg Acc@5: 0.8433 +INFO:master_logger:Val Step[0200/1563], Avg Loss: 1.6237, Avg Acc@1: 0.6292, Avg Acc@5: 0.8358 +INFO:local_logger:Val Step[0200/1563], Avg Loss: 1.5916, Avg Acc@1: 0.6524, Avg Acc@5: 0.8427 +INFO:local_logger:Val Step[0200/1563], Avg Loss: 1.6371, Avg Acc@1: 0.6182, Avg Acc@5: 0.8327 +INFO:local_logger:Val Step[0250/1563], Avg Loss: 1.5651, Avg Acc@1: 0.6375, Avg Acc@5: 0.8406 +INFO:local_logger:Val Step[0250/1563], Avg Loss: 1.5472, Avg Acc@1: 0.6439, Avg Acc@5: 0.8456 +INFO:local_logger:Val Step[0250/1563], Avg Loss: 1.5270, Avg Acc@1: 0.6584, Avg Acc@5: 0.8516 +INFO:local_logger:Val Step[0250/1563], Avg Loss: 1.6322, Avg Acc@1: 
0.6180, Avg Acc@5: 0.8337 +INFO:master_logger:Val Step[0250/1563], Avg Loss: 1.5679, Avg Acc@1: 0.6394, Avg Acc@5: 0.8429 +INFO:local_logger:Val Step[0300/1563], Avg Loss: 1.6609, Avg Acc@1: 0.6071, Avg Acc@5: 0.8314 +INFO:local_logger:Val Step[0300/1563], Avg Loss: 1.6473, Avg Acc@1: 0.6113, Avg Acc@5: 0.8322 +INFO:local_logger:Val Step[0300/1563], Avg Loss: 1.6407, Avg Acc@1: 0.6238, Avg Acc@5: 0.8364 +INFO:local_logger:Val Step[0300/1563], Avg Loss: 1.7111, Avg Acc@1: 0.5939, Avg Acc@5: 0.8248 +INFO:master_logger:Val Step[0300/1563], Avg Loss: 1.6650, Avg Acc@1: 0.6090, Avg Acc@5: 0.8312 +INFO:local_logger:Val Step[0350/1563], Avg Loss: 1.6925, Avg Acc@1: 0.5965, Avg Acc@5: 0.8294 +INFO:local_logger:Val Step[0350/1563], Avg Loss: 1.6467, Avg Acc@1: 0.6129, Avg Acc@5: 0.8383 +INFO:local_logger:Val Step[0350/1563], Avg Loss: 1.6872, Avg Acc@1: 0.5969, Avg Acc@5: 0.8291 +INFO:local_logger:Val Step[0350/1563], Avg Loss: 1.7209, Avg Acc@1: 0.5869, Avg Acc@5: 0.8305 +INFO:master_logger:Val Step[0350/1563], Avg Loss: 1.6868, Avg Acc@1: 0.5983, Avg Acc@5: 0.8318 +INFO:local_logger:Val Step[0400/1563], Avg Loss: 1.6918, Avg Acc@1: 0.5876, Avg Acc@5: 0.8298 +INFO:local_logger:Val Step[0400/1563], Avg Loss: 1.7317, Avg Acc@1: 0.5810, Avg Acc@5: 0.8307 +INFO:local_logger:Val Step[0400/1563], Avg Loss: 1.6546, Avg Acc@1: 0.6075, Avg Acc@5: 0.8392 +INFO:master_logger:Val Step[0400/1563], Avg Loss: 1.6978, Avg Acc@1: 0.5896, Avg Acc@5: 0.8323 +INFO:local_logger:Val Step[0400/1563], Avg Loss: 1.7132, Avg Acc@1: 0.5823, Avg Acc@5: 0.8295 +INFO:local_logger:Val Step[0450/1563], Avg Loss: 1.7334, Avg Acc@1: 0.5784, Avg Acc@5: 0.8298 +INFO:local_logger:Val Step[0450/1563], Avg Loss: 1.6971, Avg Acc@1: 0.5837, Avg Acc@5: 0.8315 +INFO:local_logger:Val Step[0450/1563], Avg Loss: 1.7272, Avg Acc@1: 0.5784, Avg Acc@5: 0.8279 +INFO:master_logger:Val Step[0450/1563], Avg Loss: 1.7089, Avg Acc@1: 0.5847, Avg Acc@5: 0.8318 +INFO:local_logger:Val Step[0450/1563], Avg Loss: 1.6779, Avg Acc@1: 0.5984, Avg Acc@5: 0.8381 +INFO:local_logger:Val Step[0500/1563], Avg Loss: 1.7128, Avg Acc@1: 0.5843, Avg Acc@5: 0.8338 +INFO:local_logger:Val Step[0500/1563], Avg Loss: 1.6932, Avg Acc@1: 0.5831, Avg Acc@5: 0.8336 +INFO:local_logger:Val Step[0500/1563], Avg Loss: 1.6629, Avg Acc@1: 0.5988, Avg Acc@5: 0.8423 +INFO:master_logger:Val Step[0500/1563], Avg Loss: 1.6968, Avg Acc@1: 0.5863, Avg Acc@5: 0.8351 +INFO:local_logger:Val Step[0500/1563], Avg Loss: 1.7183, Avg Acc@1: 0.5791, Avg Acc@5: 0.8306 +INFO:local_logger:Val Step[0550/1563], Avg Loss: 1.6874, Avg Acc@1: 0.5873, Avg Acc@5: 0.8337 +INFO:local_logger:Val Step[0550/1563], Avg Loss: 1.6572, Avg Acc@1: 0.5937, Avg Acc@5: 0.8382 +INFO:local_logger:Val Step[0550/1563], Avg Loss: 1.6201, Avg Acc@1: 0.6069, Avg Acc@5: 0.8480 +INFO:local_logger:Val Step[0550/1563], Avg Loss: 1.6818, Avg Acc@1: 0.5923, Avg Acc@5: 0.8371 +INFO:master_logger:Val Step[0550/1563], Avg Loss: 1.6616, Avg Acc@1: 0.5951, Avg Acc@5: 0.8393 +INFO:local_logger:Val Step[0600/1563], Avg Loss: 1.6952, Avg Acc@1: 0.5859, Avg Acc@5: 0.8322 +INFO:local_logger:Val Step[0600/1563], Avg Loss: 1.6338, Avg Acc@1: 0.6054, Avg Acc@5: 0.8455 +INFO:local_logger:Val Step[0600/1563], Avg Loss: 1.6946, Avg Acc@1: 0.5913, Avg Acc@5: 0.8332 +INFO:local_logger:Val Step[0600/1563], Avg Loss: 1.6655, Avg Acc@1: 0.5944, Avg Acc@5: 0.8384 +INFO:master_logger:Val Step[0600/1563], Avg Loss: 1.6723, Avg Acc@1: 0.5943, Avg Acc@5: 0.8373 +INFO:local_logger:Val Step[0650/1563], Avg Loss: 1.7007, Avg Acc@1: 0.5912, Avg Acc@5: 0.8337 
+INFO:local_logger:Val Step[0650/1563], Avg Loss: 1.6635, Avg Acc@1: 0.5997, Avg Acc@5: 0.8424 +INFO:local_logger:Val Step[0650/1563], Avg Loss: 1.7127, Avg Acc@1: 0.5889, Avg Acc@5: 0.8306 +INFO:local_logger:Val Step[0650/1563], Avg Loss: 1.7244, Avg Acc@1: 0.5808, Avg Acc@5: 0.8278 +INFO:master_logger:Val Step[0650/1563], Avg Loss: 1.7003, Avg Acc@1: 0.5901, Avg Acc@5: 0.8336 +INFO:local_logger:Val Step[0700/1563], Avg Loss: 1.7533, Avg Acc@1: 0.5815, Avg Acc@5: 0.8244 +INFO:local_logger:Val Step[0700/1563], Avg Loss: 1.7772, Avg Acc@1: 0.5724, Avg Acc@5: 0.8201 +INFO:local_logger:Val Step[0700/1563], Avg Loss: 1.7183, Avg Acc@1: 0.5918, Avg Acc@5: 0.8333 +INFO:local_logger:Val Step[0700/1563], Avg Loss: 1.7620, Avg Acc@1: 0.5794, Avg Acc@5: 0.8231 +INFO:master_logger:Val Step[0700/1563], Avg Loss: 1.7527, Avg Acc@1: 0.5813, Avg Acc@5: 0.8252 +INFO:local_logger:Val Step[0750/1563], Avg Loss: 1.8098, Avg Acc@1: 0.5692, Avg Acc@5: 0.8154 +INFO:local_logger:Val Step[0750/1563], Avg Loss: 1.8126, Avg Acc@1: 0.5696, Avg Acc@5: 0.8156 +INFO:local_logger:Val Step[0750/1563], Avg Loss: 1.8322, Avg Acc@1: 0.5626, Avg Acc@5: 0.8086 +INFO:master_logger:Val Step[0750/1563], Avg Loss: 1.8096, Avg Acc@1: 0.5704, Avg Acc@5: 0.8152 +INFO:local_logger:Val Step[0750/1563], Avg Loss: 1.7836, Avg Acc@1: 0.5804, Avg Acc@5: 0.8214 +INFO:local_logger:Val Step[0800/1563], Avg Loss: 1.8671, Avg Acc@1: 0.5591, Avg Acc@5: 0.8065 +INFO:local_logger:Val Step[0800/1563], Avg Loss: 1.8737, Avg Acc@1: 0.5570, Avg Acc@5: 0.8035 +INFO:local_logger:Val Step[0800/1563], Avg Loss: 1.8413, Avg Acc@1: 0.5674, Avg Acc@5: 0.8121 +INFO:local_logger:Val Step[0800/1563], Avg Loss: 1.8852, Avg Acc@1: 0.5532, Avg Acc@5: 0.8009 +INFO:master_logger:Val Step[0800/1563], Avg Loss: 1.8668, Avg Acc@1: 0.5592, Avg Acc@5: 0.8058 +INFO:local_logger:Val Step[0850/1563], Avg Loss: 1.9064, Avg Acc@1: 0.5514, Avg Acc@5: 0.7979 +INFO:master_logger:Val Step[0850/1563], Avg Loss: 1.9090, Avg Acc@1: 0.5518, Avg Acc@5: 0.7985 +INFO:local_logger:Val Step[0850/1563], Avg Loss: 1.9057, Avg Acc@1: 0.5526, Avg Acc@5: 0.8002 +INFO:local_logger:Val Step[0850/1563], Avg Loss: 1.8878, Avg Acc@1: 0.5598, Avg Acc@5: 0.8038 +INFO:local_logger:Val Step[0850/1563], Avg Loss: 1.9359, Avg Acc@1: 0.5433, Avg Acc@5: 0.7923 +INFO:local_logger:Val Step[0900/1563], Avg Loss: 1.9164, Avg Acc@1: 0.5509, Avg Acc@5: 0.7969 +INFO:local_logger:Val Step[0900/1563], Avg Loss: 1.9161, Avg Acc@1: 0.5530, Avg Acc@5: 0.7976 +INFO:master_logger:Val Step[0900/1563], Avg Loss: 1.9197, Avg Acc@1: 0.5512, Avg Acc@5: 0.7964 +INFO:local_logger:Val Step[0900/1563], Avg Loss: 1.8964, Avg Acc@1: 0.5595, Avg Acc@5: 0.8017 +INFO:local_logger:Val Step[0900/1563], Avg Loss: 1.9497, Avg Acc@1: 0.5415, Avg Acc@5: 0.7895 +INFO:local_logger:Val Step[0950/1563], Avg Loss: 1.9477, Avg Acc@1: 0.5480, Avg Acc@5: 0.7928 +INFO:local_logger:Val Step[0950/1563], Avg Loss: 1.9944, Avg Acc@1: 0.5334, Avg Acc@5: 0.7814 +INFO:local_logger:Val Step[0950/1563], Avg Loss: 1.9533, Avg Acc@1: 0.5455, Avg Acc@5: 0.7884 +INFO:master_logger:Val Step[0950/1563], Avg Loss: 1.9569, Avg Acc@1: 0.5449, Avg Acc@5: 0.7897 +INFO:local_logger:Val Step[0950/1563], Avg Loss: 1.9320, Avg Acc@1: 0.5527, Avg Acc@5: 0.7960 +INFO:local_logger:Val Step[1000/1563], Avg Loss: 1.9788, Avg Acc@1: 0.5420, Avg Acc@5: 0.7878 +INFO:local_logger:Val Step[1000/1563], Avg Loss: 2.0249, Avg Acc@1: 0.5275, Avg Acc@5: 0.7762 +INFO:local_logger:Val Step[1000/1563], Avg Loss: 1.9700, Avg Acc@1: 0.5466, Avg Acc@5: 0.7897 +INFO:local_logger:Val 
Step[1000/1563], Avg Loss: 1.9976, Avg Acc@1: 0.5373, Avg Acc@5: 0.7817 +INFO:master_logger:Val Step[1000/1563], Avg Loss: 1.9928, Avg Acc@1: 0.5383, Avg Acc@5: 0.7839 +INFO:local_logger:Val Step[1050/1563], Avg Loss: 1.9963, Avg Acc@1: 0.5383, Avg Acc@5: 0.7852 +INFO:local_logger:Val Step[1050/1563], Avg Loss: 2.0221, Avg Acc@1: 0.5318, Avg Acc@5: 0.7790 +INFO:local_logger:Val Step[1050/1563], Avg Loss: 1.9931, Avg Acc@1: 0.5419, Avg Acc@5: 0.7860 +INFO:local_logger:Val Step[1050/1563], Avg Loss: 2.0480, Avg Acc@1: 0.5233, Avg Acc@5: 0.7724 +INFO:master_logger:Val Step[1050/1563], Avg Loss: 2.0149, Avg Acc@1: 0.5338, Avg Acc@5: 0.7807 +INFO:local_logger:Val Step[1100/1563], Avg Loss: 2.0769, Avg Acc@1: 0.5176, Avg Acc@5: 0.7674 +INFO:local_logger:Val Step[1100/1563], Avg Loss: 2.0224, Avg Acc@1: 0.5342, Avg Acc@5: 0.7800 +INFO:local_logger:Val Step[1100/1563], Avg Loss: 2.0482, Avg Acc@1: 0.5276, Avg Acc@5: 0.7741 +INFO:local_logger:Val Step[1100/1563], Avg Loss: 2.0279, Avg Acc@1: 0.5359, Avg Acc@5: 0.7788 +INFO:master_logger:Val Step[1100/1563], Avg Loss: 2.0438, Avg Acc@1: 0.5288, Avg Acc@5: 0.7751 +INFO:local_logger:Val Step[1150/1563], Avg Loss: 2.0497, Avg Acc@1: 0.5301, Avg Acc@5: 0.7753 +INFO:local_logger:Val Step[1150/1563], Avg Loss: 2.0707, Avg Acc@1: 0.5245, Avg Acc@5: 0.7703 +INFO:local_logger:Val Step[1150/1563], Avg Loss: 2.0544, Avg Acc@1: 0.5308, Avg Acc@5: 0.7742 +INFO:master_logger:Val Step[1150/1563], Avg Loss: 2.0697, Avg Acc@1: 0.5249, Avg Acc@5: 0.7705 +INFO:local_logger:Val Step[1150/1563], Avg Loss: 2.1041, Avg Acc@1: 0.5141, Avg Acc@5: 0.7621 +INFO:local_logger:Val Step[1200/1563], Avg Loss: 2.0956, Avg Acc@1: 0.5215, Avg Acc@5: 0.7653 +INFO:local_logger:Val Step[1200/1563], Avg Loss: 2.1291, Avg Acc@1: 0.5110, Avg Acc@5: 0.7581 +INFO:local_logger:Val Step[1200/1563], Avg Loss: 2.0806, Avg Acc@1: 0.5269, Avg Acc@5: 0.7705 +INFO:local_logger:Val Step[1200/1563], Avg Loss: 2.0725, Avg Acc@1: 0.5264, Avg Acc@5: 0.7704 +INFO:master_logger:Val Step[1200/1563], Avg Loss: 2.0945, Avg Acc@1: 0.5215, Avg Acc@5: 0.7661 +INFO:local_logger:Val Step[1250/1563], Avg Loss: 2.1160, Avg Acc@1: 0.5186, Avg Acc@5: 0.7623 +INFO:local_logger:Val Step[1250/1563], Avg Loss: 2.1461, Avg Acc@1: 0.5093, Avg Acc@5: 0.7565 +INFO:master_logger:Val Step[1250/1563], Avg Loss: 2.1155, Avg Acc@1: 0.5188, Avg Acc@5: 0.7630 +INFO:local_logger:Val Step[1250/1563], Avg Loss: 2.0916, Avg Acc@1: 0.5239, Avg Acc@5: 0.7673 +INFO:local_logger:Val Step[1250/1563], Avg Loss: 2.1084, Avg Acc@1: 0.5234, Avg Acc@5: 0.7660 +INFO:local_logger:Val Step[1300/1563], Avg Loss: 2.1380, Avg Acc@1: 0.5170, Avg Acc@5: 0.7604 +INFO:local_logger:Val Step[1300/1563], Avg Loss: 2.1140, Avg Acc@1: 0.5194, Avg Acc@5: 0.7635 +INFO:local_logger:Val Step[1300/1563], Avg Loss: 2.1620, Avg Acc@1: 0.5065, Avg Acc@5: 0.7537 +INFO:local_logger:Val Step[1300/1563], Avg Loss: 2.1312, Avg Acc@1: 0.5155, Avg Acc@5: 0.7609 +INFO:master_logger:Val Step[1300/1563], Avg Loss: 2.1363, Avg Acc@1: 0.5146, Avg Acc@5: 0.7596 +INFO:local_logger:Val Step[1350/1563], Avg Loss: 2.1594, Avg Acc@1: 0.5096, Avg Acc@5: 0.7568 +INFO:master_logger:Val Step[1350/1563], Avg Loss: 2.1624, Avg Acc@1: 0.5095, Avg Acc@5: 0.7552 +INFO:local_logger:Val Step[1350/1563], Avg Loss: 2.1401, Avg Acc@1: 0.5144, Avg Acc@5: 0.7589 +INFO:local_logger:Val Step[1350/1563], Avg Loss: 2.1581, Avg Acc@1: 0.5133, Avg Acc@5: 0.7566 +INFO:local_logger:Val Step[1350/1563], Avg Loss: 2.1920, Avg Acc@1: 0.5006, Avg Acc@5: 0.7486 +INFO:local_logger:Val Step[1400/1563], Avg Loss: 
2.1720, Avg Acc@1: 0.5055, Avg Acc@5: 0.7543 +INFO:master_logger:Val Step[1400/1563], Avg Loss: 2.1717, Avg Acc@1: 0.5071, Avg Acc@5: 0.7537 +INFO:local_logger:Val Step[1400/1563], Avg Loss: 2.1645, Avg Acc@1: 0.5121, Avg Acc@5: 0.7555 +INFO:local_logger:Val Step[1400/1563], Avg Loss: 2.1509, Avg Acc@1: 0.5125, Avg Acc@5: 0.7573 +INFO:local_logger:Val Step[1400/1563], Avg Loss: 2.1994, Avg Acc@1: 0.4983, Avg Acc@5: 0.7476 +INFO:local_logger:Val Step[1450/1563], Avg Loss: 2.1695, Avg Acc@1: 0.5117, Avg Acc@5: 0.7548 +INFO:local_logger:Val Step[1450/1563], Avg Loss: 2.2117, Avg Acc@1: 0.4963, Avg Acc@5: 0.7457 +INFO:local_logger:Val Step[1450/1563], Avg Loss: 2.1741, Avg Acc@1: 0.5053, Avg Acc@5: 0.7537 +INFO:local_logger:Val Step[1450/1563], Avg Loss: 2.1572, Avg Acc@1: 0.5109, Avg Acc@5: 0.7559 +INFO:master_logger:Val Step[1450/1563], Avg Loss: 2.1781, Avg Acc@1: 0.5061, Avg Acc@5: 0.7525 +INFO:local_logger:Val Step[1500/1563], Avg Loss: 2.1556, Avg Acc@1: 0.5081, Avg Acc@5: 0.7567 +INFO:local_logger:Val Step[1500/1563], Avg Loss: 2.1491, Avg Acc@1: 0.5151, Avg Acc@5: 0.7584 +INFO:master_logger:Val Step[1500/1563], Avg Loss: 2.1571, Avg Acc@1: 0.5101, Avg Acc@5: 0.7559 +INFO:local_logger:Val Step[1500/1563], Avg Loss: 2.1934, Avg Acc@1: 0.5004, Avg Acc@5: 0.7488 +INFO:local_logger:Val Step[1500/1563], Avg Loss: 2.1305, Avg Acc@1: 0.5169, Avg Acc@5: 0.7597 +INFO:local_logger:Val Step[1550/1563], Avg Loss: 2.1413, Avg Acc@1: 0.5110, Avg Acc@5: 0.7587 +INFO:master_logger:Val Step[1550/1563], Avg Loss: 2.1482, Avg Acc@1: 0.5122, Avg Acc@5: 0.7575 +INFO:local_logger:Val Step[1550/1563], Avg Loss: 2.1245, Avg Acc@1: 0.5185, Avg Acc@5: 0.7610 +INFO:local_logger:Val Step[1550/1563], Avg Loss: 2.1810, Avg Acc@1: 0.5038, Avg Acc@5: 0.7510 +INFO:local_logger:Val Step[1550/1563], Avg Loss: 2.1460, Avg Acc@1: 0.5156, Avg Acc@5: 0.7593 +INFO:local_logger:----- Epoch[030/300], Validation Loss: 2.1432, Validation Acc@1: 0.5162, Validation Acc@5: 0.7593, time: 181.94 +INFO:local_logger:Now training epoch 31. LR=0.000383 +INFO:local_logger:----- Epoch[030/300], Validation Loss: 2.1366, Validation Acc@1: 0.5120, Validation Acc@5: 0.7593, time: 182.04 +INFO:master_logger:----- Epoch[030/300], Validation Loss: 2.1443, Validation Acc@1: 0.5132, Validation Acc@5: 0.7580, time: 182.04 +INFO:local_logger:----- Epoch[030/300], Validation Loss: 2.1210, Validation Acc@1: 0.5197, Validation Acc@5: 0.7618, time: 182.06 +INFO:local_logger:Now training epoch 31. LR=0.000383 +INFO:local_logger:----- Epoch[030/300], Validation Loss: 2.1765, Validation Acc@1: 0.5049, Validation Acc@5: 0.7516, time: 182.06 +INFO:local_logger:Now training epoch 31. LR=0.000383 +INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-30-Loss-4.208452598332195.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-30-Loss-4.208452598332195.pdopt +INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-30-Loss-4.208452598332195-EMA.pdparams +INFO:local_logger:Now training epoch 31. LR=0.000383 +INFO:master_logger:Now training epoch 31. 
LR=0.000383 +INFO:local_logger:Epoch[031/300], Step[0000/1602], Avg Loss: 4.6664, Avg Acc: 0.0250 +INFO:local_logger:Epoch[031/300], Step[0000/1602], Avg Loss: 4.2319, Avg Acc: 0.2800 +INFO:master_logger:Epoch[031/300], Step[0000/1602], Avg Loss: 4.2688, Avg Acc: 0.1725 +INFO:local_logger:Epoch[031/300], Step[0000/1602], Avg Loss: 4.7380, Avg Acc: 0.0550 +INFO:local_logger:Epoch[031/300], Step[0000/1602], Avg Loss: 3.4387, Avg Acc: 0.3300 +INFO:local_logger:Epoch[031/300], Step[0050/1602], Avg Loss: 4.1119, Avg Acc: 0.1866 +INFO:local_logger:Epoch[031/300], Step[0050/1602], Avg Loss: 4.1781, Avg Acc: 0.1906 +INFO:local_logger:Epoch[031/300], Step[0050/1602], Avg Loss: 4.2040, Avg Acc: 0.2076 +INFO:local_logger:Epoch[031/300], Step[0050/1602], Avg Loss: 4.3306, Avg Acc: 0.1495 +INFO:master_logger:Epoch[031/300], Step[0050/1602], Avg Loss: 4.2062, Avg Acc: 0.1836 +INFO:local_logger:Epoch[031/300], Step[0100/1602], Avg Loss: 4.0964, Avg Acc: 0.2004 +INFO:local_logger:Epoch[031/300], Step[0100/1602], Avg Loss: 4.1199, Avg Acc: 0.2096 +INFO:local_logger:Epoch[031/300], Step[0100/1602], Avg Loss: 4.1615, Avg Acc: 0.2100 +INFO:local_logger:Epoch[031/300], Step[0100/1602], Avg Loss: 4.2704, Avg Acc: 0.1607 +INFO:master_logger:Epoch[031/300], Step[0100/1602], Avg Loss: 4.1621, Avg Acc: 0.1952 +INFO:local_logger:Epoch[031/300], Step[0150/1602], Avg Loss: 4.1561, Avg Acc: 0.2018 +INFO:local_logger:Epoch[031/300], Step[0150/1602], Avg Loss: 4.2802, Avg Acc: 0.1726 +INFO:local_logger:Epoch[031/300], Step[0150/1602], Avg Loss: 4.1250, Avg Acc: 0.2047 +INFO:local_logger:Epoch[031/300], Step[0150/1602], Avg Loss: 4.1800, Avg Acc: 0.1996 +INFO:master_logger:Epoch[031/300], Step[0150/1602], Avg Loss: 4.1853, Avg Acc: 0.1947 +INFO:local_logger:Epoch[031/300], Step[0200/1602], Avg Loss: 4.1616, Avg Acc: 0.1940 +INFO:local_logger:Epoch[031/300], Step[0200/1602], Avg Loss: 4.1604, Avg Acc: 0.1985 +INFO:local_logger:Epoch[031/300], Step[0200/1602], Avg Loss: 4.1630, Avg Acc: 0.2005 +INFO:local_logger:Epoch[031/300], Step[0200/1602], Avg Loss: 4.2307, Avg Acc: 0.1810 +INFO:master_logger:Epoch[031/300], Step[0200/1602], Avg Loss: 4.1790, Avg Acc: 0.1935 +INFO:local_logger:Epoch[031/300], Step[0250/1602], Avg Loss: 4.1440, Avg Acc: 0.1951 +INFO:local_logger:Epoch[031/300], Step[0250/1602], Avg Loss: 4.1699, Avg Acc: 0.1943 +INFO:local_logger:Epoch[031/300], Step[0250/1602], Avg Loss: 4.1409, Avg Acc: 0.1958 +INFO:master_logger:Epoch[031/300], Step[0250/1602], Avg Loss: 4.1652, Avg Acc: 0.1923 +INFO:local_logger:Epoch[031/300], Step[0250/1602], Avg Loss: 4.2058, Avg Acc: 0.1840 +INFO:local_logger:Epoch[031/300], Step[0300/1602], Avg Loss: 4.1763, Avg Acc: 0.1898 +INFO:local_logger:Epoch[031/300], Step[0300/1602], Avg Loss: 4.1510, Avg Acc: 0.1961 +INFO:local_logger:Epoch[031/300], Step[0300/1602], Avg Loss: 4.1642, Avg Acc: 0.1917 +INFO:master_logger:Epoch[031/300], Step[0300/1602], Avg Loss: 4.1774, Avg Acc: 0.1900 +INFO:local_logger:Epoch[031/300], Step[0300/1602], Avg Loss: 4.2183, Avg Acc: 0.1826 +INFO:local_logger:Epoch[031/300], Step[0350/1602], Avg Loss: 4.1571, Avg Acc: 0.1976 +INFO:local_logger:Epoch[031/300], Step[0350/1602], Avg Loss: 4.2128, Avg Acc: 0.1795 +INFO:local_logger:Epoch[031/300], Step[0350/1602], Avg Loss: 4.1548, Avg Acc: 0.1931 +INFO:master_logger:Epoch[031/300], Step[0350/1602], Avg Loss: 4.1750, Avg Acc: 0.1899 +INFO:local_logger:Epoch[031/300], Step[0350/1602], Avg Loss: 4.1754, Avg Acc: 0.1895 +INFO:local_logger:Epoch[031/300], Step[0400/1602], Avg Loss: 4.1672, Avg Acc: 0.1963 
+INFO:local_logger:Epoch[031/300], Step[0400/1602], Avg Loss: 4.1643, Avg Acc: 0.1919 +INFO:local_logger:Epoch[031/300], Step[0400/1602], Avg Loss: 4.1933, Avg Acc: 0.1864 +INFO:local_logger:Epoch[031/300], Step[0400/1602], Avg Loss: 4.1803, Avg Acc: 0.1920 +INFO:master_logger:Epoch[031/300], Step[0400/1602], Avg Loss: 4.1763, Avg Acc: 0.1916 +INFO:local_logger:Epoch[031/300], Step[0450/1602], Avg Loss: 4.1665, Avg Acc: 0.1956 +INFO:local_logger:Epoch[031/300], Step[0450/1602], Avg Loss: 4.1712, Avg Acc: 0.1962 +INFO:local_logger:Epoch[031/300], Step[0450/1602], Avg Loss: 4.1755, Avg Acc: 0.1910 +INFO:local_logger:Epoch[031/300], Step[0450/1602], Avg Loss: 4.1907, Avg Acc: 0.1865 +INFO:master_logger:Epoch[031/300], Step[0450/1602], Avg Loss: 4.1760, Avg Acc: 0.1923 +INFO:local_logger:Epoch[031/300], Step[0500/1602], Avg Loss: 4.1629, Avg Acc: 0.1984 +INFO:local_logger:Epoch[031/300], Step[0500/1602], Avg Loss: 4.1672, Avg Acc: 0.1968 +INFO:local_logger:Epoch[031/300], Step[0500/1602], Avg Loss: 4.1718, Avg Acc: 0.1905 +INFO:master_logger:Epoch[031/300], Step[0500/1602], Avg Loss: 4.1706, Avg Acc: 0.1928 +INFO:local_logger:Epoch[031/300], Step[0500/1602], Avg Loss: 4.1805, Avg Acc: 0.1857 +INFO:local_logger:Epoch[031/300], Step[0550/1602], Avg Loss: 4.1759, Avg Acc: 0.1898 +INFO:local_logger:Epoch[031/300], Step[0550/1602], Avg Loss: 4.1539, Avg Acc: 0.2030 +INFO:local_logger:Epoch[031/300], Step[0550/1602], Avg Loss: 4.1782, Avg Acc: 0.1964 +INFO:local_logger:Epoch[031/300], Step[0550/1602], Avg Loss: 4.1736, Avg Acc: 0.1912 +INFO:master_logger:Epoch[031/300], Step[0550/1602], Avg Loss: 4.1704, Avg Acc: 0.1951 +INFO:local_logger:Epoch[031/300], Step[0600/1602], Avg Loss: 4.1537, Avg Acc: 0.2045 +INFO:local_logger:Epoch[031/300], Step[0600/1602], Avg Loss: 4.1717, Avg Acc: 0.1907 +INFO:local_logger:Epoch[031/300], Step[0600/1602], Avg Loss: 4.1742, Avg Acc: 0.1962 +INFO:master_logger:Epoch[031/300], Step[0600/1602], Avg Loss: 4.1698, Avg Acc: 0.1952 +INFO:local_logger:Epoch[031/300], Step[0600/1602], Avg Loss: 4.1796, Avg Acc: 0.1893 +INFO:local_logger:Epoch[031/300], Step[0650/1602], Avg Loss: 4.1596, Avg Acc: 0.2041 +INFO:local_logger:Epoch[031/300], Step[0650/1602], Avg Loss: 4.1746, Avg Acc: 0.1958 +INFO:local_logger:Epoch[031/300], Step[0650/1602], Avg Loss: 4.1840, Avg Acc: 0.1904 +INFO:local_logger:Epoch[031/300], Step[0650/1602], Avg Loss: 4.1679, Avg Acc: 0.1941 +INFO:master_logger:Epoch[031/300], Step[0650/1602], Avg Loss: 4.1715, Avg Acc: 0.1961 +INFO:local_logger:Epoch[031/300], Step[0700/1602], Avg Loss: 4.1601, Avg Acc: 0.2024 +INFO:master_logger:Epoch[031/300], Step[0700/1602], Avg Loss: 4.1697, Avg Acc: 0.1964 +INFO:local_logger:Epoch[031/300], Step[0700/1602], Avg Loss: 4.1724, Avg Acc: 0.1953 +INFO:local_logger:Epoch[031/300], Step[0700/1602], Avg Loss: 4.1651, Avg Acc: 0.1955 +INFO:local_logger:Epoch[031/300], Step[0700/1602], Avg Loss: 4.1813, Avg Acc: 0.1925 +INFO:local_logger:Epoch[031/300], Step[0750/1602], Avg Loss: 4.1554, Avg Acc: 0.2020 +INFO:local_logger:Epoch[031/300], Step[0750/1602], Avg Loss: 4.1695, Avg Acc: 0.1956 +INFO:local_logger:Epoch[031/300], Step[0750/1602], Avg Loss: 4.1644, Avg Acc: 0.1963 +INFO:master_logger:Epoch[031/300], Step[0750/1602], Avg Loss: 4.1680, Avg Acc: 0.1962 +INFO:local_logger:Epoch[031/300], Step[0750/1602], Avg Loss: 4.1828, Avg Acc: 0.1911 +INFO:local_logger:Epoch[031/300], Step[0800/1602], Avg Loss: 4.1571, Avg Acc: 0.2036 +INFO:local_logger:Epoch[031/300], Step[0800/1602], Avg Loss: 4.1788, Avg Acc: 0.1926 
+INFO:local_logger:Epoch[031/300], Step[0800/1602], Avg Loss: 4.1611, Avg Acc: 0.1964 +INFO:local_logger:Epoch[031/300], Step[0800/1602], Avg Loss: 4.1580, Avg Acc: 0.1955 +INFO:master_logger:Epoch[031/300], Step[0800/1602], Avg Loss: 4.1637, Avg Acc: 0.1970 +INFO:local_logger:Epoch[031/300], Step[0850/1602], Avg Loss: 4.1587, Avg Acc: 0.2023 +INFO:local_logger:Epoch[031/300], Step[0850/1602], Avg Loss: 4.1874, Avg Acc: 0.1919 +INFO:local_logger:Epoch[031/300], Step[0850/1602], Avg Loss: 4.1625, Avg Acc: 0.1949 +INFO:local_logger:Epoch[031/300], Step[0850/1602], Avg Loss: 4.1682, Avg Acc: 0.1936 +INFO:master_logger:Epoch[031/300], Step[0850/1602], Avg Loss: 4.1692, Avg Acc: 0.1957 +INFO:local_logger:Epoch[031/300], Step[0900/1602], Avg Loss: 4.1560, Avg Acc: 0.1936 +INFO:local_logger:Epoch[031/300], Step[0900/1602], Avg Loss: 4.1608, Avg Acc: 0.2000 +INFO:local_logger:Epoch[031/300], Step[0900/1602], Avg Loss: 4.1959, Avg Acc: 0.1906 +INFO:master_logger:Epoch[031/300], Step[0900/1602], Avg Loss: 4.1681, Avg Acc: 0.1947 +INFO:local_logger:Epoch[031/300], Step[0900/1602], Avg Loss: 4.1595, Avg Acc: 0.1947 +INFO:local_logger:Epoch[031/300], Step[0950/1602], Avg Loss: 4.1559, Avg Acc: 0.2019 +INFO:local_logger:Epoch[031/300], Step[0950/1602], Avg Loss: 4.1914, Avg Acc: 0.1888 +INFO:local_logger:Epoch[031/300], Step[0950/1602], Avg Loss: 4.1570, Avg Acc: 0.1924 +INFO:local_logger:Epoch[031/300], Step[0950/1602], Avg Loss: 4.1584, Avg Acc: 0.1952 +INFO:master_logger:Epoch[031/300], Step[0950/1602], Avg Loss: 4.1657, Avg Acc: 0.1946 +INFO:local_logger:Epoch[031/300], Step[1000/1602], Avg Loss: 4.1593, Avg Acc: 0.1999 +INFO:local_logger:Epoch[031/300], Step[1000/1602], Avg Loss: 4.1531, Avg Acc: 0.1934 +INFO:local_logger:Epoch[031/300], Step[1000/1602], Avg Loss: 4.1610, Avg Acc: 0.1938 +INFO:local_logger:Epoch[031/300], Step[1000/1602], Avg Loss: 4.1882, Avg Acc: 0.1901 +INFO:master_logger:Epoch[031/300], Step[1000/1602], Avg Loss: 4.1654, Avg Acc: 0.1943 +INFO:local_logger:Epoch[031/300], Step[1050/1602], Avg Loss: 4.1811, Avg Acc: 0.1892 +INFO:local_logger:Epoch[031/300], Step[1050/1602], Avg Loss: 4.1612, Avg Acc: 0.1994 +INFO:local_logger:Epoch[031/300], Step[1050/1602], Avg Loss: 4.1634, Avg Acc: 0.1921 +INFO:local_logger:Epoch[031/300], Step[1050/1602], Avg Loss: 4.1591, Avg Acc: 0.1933 +INFO:master_logger:Epoch[031/300], Step[1050/1602], Avg Loss: 4.1662, Avg Acc: 0.1935 +INFO:local_logger:Epoch[031/300], Step[1100/1602], Avg Loss: 4.1597, Avg Acc: 0.1944 +INFO:local_logger:Epoch[031/300], Step[1100/1602], Avg Loss: 4.1652, Avg Acc: 0.1920 +INFO:local_logger:Epoch[031/300], Step[1100/1602], Avg Loss: 4.1539, Avg Acc: 0.1987 +INFO:local_logger:Epoch[031/300], Step[1100/1602], Avg Loss: 4.1819, Avg Acc: 0.1896 +INFO:master_logger:Epoch[031/300], Step[1100/1602], Avg Loss: 4.1652, Avg Acc: 0.1937 +INFO:local_logger:Epoch[031/300], Step[1150/1602], Avg Loss: 4.1535, Avg Acc: 0.1973 +INFO:local_logger:Epoch[031/300], Step[1150/1602], Avg Loss: 4.1581, Avg Acc: 0.1949 +INFO:local_logger:Epoch[031/300], Step[1150/1602], Avg Loss: 4.1709, Avg Acc: 0.1908 +INFO:local_logger:Epoch[031/300], Step[1150/1602], Avg Loss: 4.1792, Avg Acc: 0.1886 +INFO:master_logger:Epoch[031/300], Step[1150/1602], Avg Loss: 4.1654, Avg Acc: 0.1929 +INFO:local_logger:Epoch[031/300], Step[1200/1602], Avg Loss: 4.1542, Avg Acc: 0.1964 +INFO:local_logger:Epoch[031/300], Step[1200/1602], Avg Loss: 4.1635, Avg Acc: 0.1946 +INFO:local_logger:Epoch[031/300], Step[1200/1602], Avg Loss: 4.1754, Avg Acc: 0.1897 
+INFO:master_logger:Epoch[031/300], Step[1200/1602], Avg Loss: 4.1674, Avg Acc: 0.1925 +INFO:local_logger:Epoch[031/300], Step[1200/1602], Avg Loss: 4.1764, Avg Acc: 0.1892 +INFO:local_logger:Epoch[031/300], Step[1250/1602], Avg Loss: 4.1602, Avg Acc: 0.1951 +INFO:local_logger:Epoch[031/300], Step[1250/1602], Avg Loss: 4.1642, Avg Acc: 0.1942 +INFO:local_logger:Epoch[031/300], Step[1250/1602], Avg Loss: 4.1728, Avg Acc: 0.1910 +INFO:local_logger:Epoch[031/300], Step[1250/1602], Avg Loss: 4.1739, Avg Acc: 0.1903 +INFO:master_logger:Epoch[031/300], Step[1250/1602], Avg Loss: 4.1678, Avg Acc: 0.1926 +INFO:local_logger:Epoch[031/300], Step[1300/1602], Avg Loss: 4.1641, Avg Acc: 0.1941 +INFO:local_logger:Epoch[031/300], Step[1300/1602], Avg Loss: 4.1768, Avg Acc: 0.1912 +INFO:local_logger:Epoch[031/300], Step[1300/1602], Avg Loss: 4.1669, Avg Acc: 0.1936 +INFO:master_logger:Epoch[031/300], Step[1300/1602], Avg Loss: 4.1690, Avg Acc: 0.1924 +INFO:local_logger:Epoch[031/300], Step[1300/1602], Avg Loss: 4.1680, Avg Acc: 0.1907 +INFO:local_logger:Epoch[031/300], Step[1350/1602], Avg Loss: 4.1629, Avg Acc: 0.1937 +INFO:local_logger:Epoch[031/300], Step[1350/1602], Avg Loss: 4.1795, Avg Acc: 0.1894 +INFO:local_logger:Epoch[031/300], Step[1350/1602], Avg Loss: 4.1667, Avg Acc: 0.1908 +INFO:master_logger:Epoch[031/300], Step[1350/1602], Avg Loss: 4.1689, Avg Acc: 0.1920 +INFO:local_logger:Epoch[031/300], Step[1350/1602], Avg Loss: 4.1665, Avg Acc: 0.1941 +INFO:local_logger:Epoch[031/300], Step[1400/1602], Avg Loss: 4.1673, Avg Acc: 0.1935 +INFO:local_logger:Epoch[031/300], Step[1400/1602], Avg Loss: 4.1654, Avg Acc: 0.1931 +INFO:local_logger:Epoch[031/300], Step[1400/1602], Avg Loss: 4.1760, Avg Acc: 0.1902 +INFO:local_logger:Epoch[031/300], Step[1400/1602], Avg Loss: 4.1668, Avg Acc: 0.1912 +INFO:master_logger:Epoch[031/300], Step[1400/1602], Avg Loss: 4.1689, Avg Acc: 0.1920 +INFO:local_logger:Epoch[031/300], Step[1450/1602], Avg Loss: 4.1688, Avg Acc: 0.1931 +INFO:local_logger:Epoch[031/300], Step[1450/1602], Avg Loss: 4.1676, Avg Acc: 0.1942 +INFO:local_logger:Epoch[031/300], Step[1450/1602], Avg Loss: 4.1695, Avg Acc: 0.1924 +INFO:local_logger:Epoch[031/300], Step[1450/1602], Avg Loss: 4.1782, Avg Acc: 0.1907 +INFO:master_logger:Epoch[031/300], Step[1450/1602], Avg Loss: 4.1710, Avg Acc: 0.1926 +INFO:local_logger:Epoch[031/300], Step[1500/1602], Avg Loss: 4.1694, Avg Acc: 0.1918 +INFO:local_logger:Epoch[031/300], Step[1500/1602], Avg Loss: 4.1727, Avg Acc: 0.1929 +INFO:local_logger:Epoch[031/300], Step[1500/1602], Avg Loss: 4.1742, Avg Acc: 0.1910 +INFO:local_logger:Epoch[031/300], Step[1500/1602], Avg Loss: 4.1674, Avg Acc: 0.1930 +INFO:master_logger:Epoch[031/300], Step[1500/1602], Avg Loss: 4.1709, Avg Acc: 0.1922 +INFO:local_logger:Epoch[031/300], Step[1550/1602], Avg Loss: 4.1733, Avg Acc: 0.1926 +INFO:local_logger:Epoch[031/300], Step[1550/1602], Avg Loss: 4.1698, Avg Acc: 0.1926 +INFO:local_logger:Epoch[031/300], Step[1550/1602], Avg Loss: 4.1726, Avg Acc: 0.1907 +INFO:local_logger:Epoch[031/300], Step[1550/1602], Avg Loss: 4.1668, Avg Acc: 0.1927 +INFO:master_logger:Epoch[031/300], Step[1550/1602], Avg Loss: 4.1706, Avg Acc: 0.1921 +INFO:local_logger:Epoch[031/300], Step[1600/1602], Avg Loss: 4.1746, Avg Acc: 0.1930 +INFO:local_logger:Epoch[031/300], Step[1600/1602], Avg Loss: 4.1687, Avg Acc: 0.1919 +INFO:local_logger:Epoch[031/300], Step[1600/1602], Avg Loss: 4.1635, Avg Acc: 0.1934 +INFO:local_logger:Epoch[031/300], Step[1600/1602], Avg Loss: 4.1715, Avg Acc: 0.1924 
+INFO:master_logger:Epoch[031/300], Step[1600/1602], Avg Loss: 4.1696, Avg Acc: 0.1927 +INFO:local_logger:----- Epoch[031/300], Train Loss: 4.1746, Train Acc: 0.1930, time: 3718.22 +INFO:local_logger:Now training epoch 32. LR=0.000383 +INFO:local_logger:----- Epoch[031/300], Train Loss: 4.1636, Train Acc: 0.1934, time: 3718.17 +INFO:local_logger:Now training epoch 32. LR=0.000383 +INFO:local_logger:----- Epoch[031/300], Train Loss: 4.1689, Train Acc: 0.1919, time: 3718.56 +INFO:local_logger:----- Epoch[031/300], Train Loss: 4.1713, Train Acc: 0.1925, time: 3718.31 +INFO:local_logger:Now training epoch 32. LR=0.000383 +INFO:master_logger:----- Epoch[031/300], Train Loss: 4.1696, Train Acc: 0.1927, time: 3718.31 +INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-31-Loss-4.171322355231916.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-31-Loss-4.171322355231916.pdopt +INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-31-Loss-4.171322355231916-EMA.pdparams +INFO:local_logger:Now training epoch 32. LR=0.000383 +INFO:master_logger:Now training epoch 32. LR=0.000383 +INFO:local_logger:Epoch[032/300], Step[0000/1602], Avg Loss: 4.6777, Avg Acc: 0.1250 +INFO:local_logger:Epoch[032/300], Step[0000/1602], Avg Loss: 3.5856, Avg Acc: 0.0150 +INFO:local_logger:Epoch[032/300], Step[0000/1602], Avg Loss: 3.2732, Avg Acc: 0.0000 +INFO:local_logger:Epoch[032/300], Step[0000/1602], Avg Loss: 4.4387, Avg Acc: 0.2700 +INFO:master_logger:Epoch[032/300], Step[0000/1602], Avg Loss: 3.9938, Avg Acc: 0.1025 +INFO:local_logger:Epoch[032/300], Step[0050/1602], Avg Loss: 4.0993, Avg Acc: 0.1818 +INFO:local_logger:Epoch[032/300], Step[0050/1602], Avg Loss: 4.1302, Avg Acc: 0.2114 +INFO:local_logger:Epoch[032/300], Step[0050/1602], Avg Loss: 4.1539, Avg Acc: 0.1533 +INFO:local_logger:Epoch[032/300], Step[0050/1602], Avg Loss: 4.1802, Avg Acc: 0.1958 +INFO:master_logger:Epoch[032/300], Step[0050/1602], Avg Loss: 4.1409, Avg Acc: 0.1856 +INFO:local_logger:Epoch[032/300], Step[0100/1602], Avg Loss: 4.2667, Avg Acc: 0.1652 +INFO:local_logger:Epoch[032/300], Step[0100/1602], Avg Loss: 4.1054, Avg Acc: 0.1968 +INFO:local_logger:Epoch[032/300], Step[0100/1602], Avg Loss: 4.1158, Avg Acc: 0.1908 +INFO:local_logger:Epoch[032/300], Step[0100/1602], Avg Loss: 4.2272, Avg Acc: 0.1681 +INFO:master_logger:Epoch[032/300], Step[0100/1602], Avg Loss: 4.1788, Avg Acc: 0.1802 +INFO:local_logger:Epoch[032/300], Step[0150/1602], Avg Loss: 4.1180, Avg Acc: 0.1930 +INFO:local_logger:Epoch[032/300], Step[0150/1602], Avg Loss: 4.1404, Avg Acc: 0.1891 +INFO:local_logger:Epoch[032/300], Step[0150/1602], Avg Loss: 4.2144, Avg Acc: 0.1814 +INFO:local_logger:Epoch[032/300], Step[0150/1602], Avg Loss: 4.2376, Avg Acc: 0.1794 +INFO:master_logger:Epoch[032/300], Step[0150/1602], Avg Loss: 4.1776, Avg Acc: 0.1857 +INFO:local_logger:Epoch[032/300], Step[0200/1602], Avg Loss: 4.1399, Avg Acc: 0.1900 +INFO:local_logger:Epoch[032/300], Step[0200/1602], Avg Loss: 4.1986, Avg Acc: 0.1837 +INFO:local_logger:Epoch[032/300], Step[0200/1602], Avg Loss: 4.1528, Avg Acc: 0.1915 +INFO:local_logger:Epoch[032/300], Step[0200/1602], Avg Loss: 4.2355, Avg Acc: 0.1850 +INFO:master_logger:Epoch[032/300], Step[0200/1602], Avg Loss: 4.1817, Avg Acc: 0.1876 +INFO:local_logger:Epoch[032/300], Step[0250/1602], Avg Loss: 4.1449, Avg Acc: 0.1958 +INFO:local_logger:Epoch[032/300], Step[0250/1602], Avg Loss: 4.1711, Avg Acc: 0.1904 +INFO:local_logger:Epoch[032/300], 
Step[0250/1602], Avg Loss: 4.1767, Avg Acc: 0.1874 +INFO:local_logger:Epoch[032/300], Step[0250/1602], Avg Loss: 4.2065, Avg Acc: 0.1935 +INFO:master_logger:Epoch[032/300], Step[0250/1602], Avg Loss: 4.1748, Avg Acc: 0.1918 +INFO:local_logger:Epoch[032/300], Step[0300/1602], Avg Loss: 4.1820, Avg Acc: 0.1882 +INFO:local_logger:Epoch[032/300], Step[0300/1602], Avg Loss: 4.1534, Avg Acc: 0.1926 +INFO:local_logger:Epoch[032/300], Step[0300/1602], Avg Loss: 4.2057, Avg Acc: 0.1906 +INFO:local_logger:Epoch[032/300], Step[0300/1602], Avg Loss: 4.1649, Avg Acc: 0.1859 +INFO:master_logger:Epoch[032/300], Step[0300/1602], Avg Loss: 4.1765, Avg Acc: 0.1893 +INFO:local_logger:Epoch[032/300], Step[0350/1602], Avg Loss: 4.1470, Avg Acc: 0.1920 +INFO:master_logger:Epoch[032/300], Step[0350/1602], Avg Loss: 4.1730, Avg Acc: 0.1886 +INFO:local_logger:Epoch[032/300], Step[0350/1602], Avg Loss: 4.1747, Avg Acc: 0.1858 +INFO:local_logger:Epoch[032/300], Step[0350/1602], Avg Loss: 4.2049, Avg Acc: 0.1880 +INFO:local_logger:Epoch[032/300], Step[0350/1602], Avg Loss: 4.1656, Avg Acc: 0.1885 +INFO:local_logger:Epoch[032/300], Step[0400/1602], Avg Loss: 4.1746, Avg Acc: 0.1844 +INFO:local_logger:Epoch[032/300], Step[0400/1602], Avg Loss: 4.1417, Avg Acc: 0.1988 +INFO:local_logger:Epoch[032/300], Step[0400/1602], Avg Loss: 4.1956, Avg Acc: 0.1901 +INFO:master_logger:Epoch[032/300], Step[0400/1602], Avg Loss: 4.1695, Avg Acc: 0.1895 +INFO:local_logger:Epoch[032/300], Step[0400/1602], Avg Loss: 4.1662, Avg Acc: 0.1848 +INFO:local_logger:Epoch[032/300], Step[0450/1602], Avg Loss: 4.1723, Avg Acc: 0.1828 +INFO:local_logger:Epoch[032/300], Step[0450/1602], Avg Loss: 4.1933, Avg Acc: 0.1878 +INFO:local_logger:Epoch[032/300], Step[0450/1602], Avg Loss: 4.1335, Avg Acc: 0.2002 +INFO:local_logger:Epoch[032/300], Step[0450/1602], Avg Loss: 4.1496, Avg Acc: 0.1846 +INFO:master_logger:Epoch[032/300], Step[0450/1602], Avg Loss: 4.1622, Avg Acc: 0.1888 +INFO:local_logger:Epoch[032/300], Step[0500/1602], Avg Loss: 4.1421, Avg Acc: 0.2005 +INFO:local_logger:Epoch[032/300], Step[0500/1602], Avg Loss: 4.1591, Avg Acc: 0.1817 +INFO:local_logger:Epoch[032/300], Step[0500/1602], Avg Loss: 4.1584, Avg Acc: 0.1844 +INFO:local_logger:Epoch[032/300], Step[0500/1602], Avg Loss: 4.1859, Avg Acc: 0.1883 +INFO:master_logger:Epoch[032/300], Step[0500/1602], Avg Loss: 4.1614, Avg Acc: 0.1887 +INFO:local_logger:Epoch[032/300], Step[0550/1602], Avg Loss: 4.1723, Avg Acc: 0.1832 +INFO:local_logger:Epoch[032/300], Step[0550/1602], Avg Loss: 4.1552, Avg Acc: 0.1858 +INFO:local_logger:Epoch[032/300], Step[0550/1602], Avg Loss: 4.1393, Avg Acc: 0.2016 +INFO:local_logger:Epoch[032/300], Step[0550/1602], Avg Loss: 4.1851, Avg Acc: 0.1864 +INFO:master_logger:Epoch[032/300], Step[0550/1602], Avg Loss: 4.1630, Avg Acc: 0.1892 +INFO:local_logger:Epoch[032/300], Step[0600/1602], Avg Loss: 4.1606, Avg Acc: 0.1877 +INFO:local_logger:Epoch[032/300], Step[0600/1602], Avg Loss: 4.1518, Avg Acc: 0.1995 +INFO:local_logger:Epoch[032/300], Step[0600/1602], Avg Loss: 4.1511, Avg Acc: 0.1873 +INFO:local_logger:Epoch[032/300], Step[0600/1602], Avg Loss: 4.1821, Avg Acc: 0.1886 +INFO:master_logger:Epoch[032/300], Step[0600/1602], Avg Loss: 4.1614, Avg Acc: 0.1908 +INFO:local_logger:Epoch[032/300], Step[0650/1602], Avg Loss: 4.1523, Avg Acc: 0.1995 +INFO:local_logger:Epoch[032/300], Step[0650/1602], Avg Loss: 4.1604, Avg Acc: 0.1869 +INFO:local_logger:Epoch[032/300], Step[0650/1602], Avg Loss: 4.1911, Avg Acc: 0.1897 +INFO:master_logger:Epoch[032/300], Step[0650/1602], 
Avg Loss: 4.1674, Avg Acc: 0.1909 +INFO:local_logger:Epoch[032/300], Step[0650/1602], Avg Loss: 4.1656, Avg Acc: 0.1877 +INFO:local_logger:Epoch[032/300], Step[0700/1602], Avg Loss: 4.1484, Avg Acc: 0.1989 +INFO:local_logger:Epoch[032/300], Step[0700/1602], Avg Loss: 4.1604, Avg Acc: 0.1888 +INFO:local_logger:Epoch[032/300], Step[0700/1602], Avg Loss: 4.1900, Avg Acc: 0.1875 +INFO:master_logger:Epoch[032/300], Step[0700/1602], Avg Loss: 4.1685, Avg Acc: 0.1906 +INFO:local_logger:Epoch[032/300], Step[0700/1602], Avg Loss: 4.1753, Avg Acc: 0.1872 +INFO:local_logger:Epoch[032/300], Step[0750/1602], Avg Loss: 4.1859, Avg Acc: 0.1875 +INFO:local_logger:Epoch[032/300], Step[0750/1602], Avg Loss: 4.1544, Avg Acc: 0.1966 +INFO:local_logger:Epoch[032/300], Step[0750/1602], Avg Loss: 4.1572, Avg Acc: 0.1907 +INFO:master_logger:Epoch[032/300], Step[0750/1602], Avg Loss: 4.1677, Avg Acc: 0.1907 +INFO:local_logger:Epoch[032/300], Step[0750/1602], Avg Loss: 4.1732, Avg Acc: 0.1880 +INFO:local_logger:Epoch[032/300], Step[0800/1602], Avg Loss: 4.1531, Avg Acc: 0.1950 +INFO:local_logger:Epoch[032/300], Step[0800/1602], Avg Loss: 4.1679, Avg Acc: 0.1893 +INFO:local_logger:Epoch[032/300], Step[0800/1602], Avg Loss: 4.1649, Avg Acc: 0.1891 +INFO:local_logger:Epoch[032/300], Step[0800/1602], Avg Loss: 4.1851, Avg Acc: 0.1868 +INFO:master_logger:Epoch[032/300], Step[0800/1602], Avg Loss: 4.1678, Avg Acc: 0.1900 +INFO:local_logger:Epoch[032/300], Step[0850/1602], Avg Loss: 4.1561, Avg Acc: 0.1958 +INFO:local_logger:Epoch[032/300], Step[0850/1602], Avg Loss: 4.1696, Avg Acc: 0.1891 +INFO:local_logger:Epoch[032/300], Step[0850/1602], Avg Loss: 4.1918, Avg Acc: 0.1868 +INFO:local_logger:Epoch[032/300], Step[0850/1602], Avg Loss: 4.1689, Avg Acc: 0.1899 +INFO:master_logger:Epoch[032/300], Step[0850/1602], Avg Loss: 4.1716, Avg Acc: 0.1904 +INFO:local_logger:Epoch[032/300], Step[0900/1602], Avg Loss: 4.1545, Avg Acc: 0.1947 +INFO:local_logger:Epoch[032/300], Step[0900/1602], Avg Loss: 4.1643, Avg Acc: 0.1914 +INFO:local_logger:Epoch[032/300], Step[0900/1602], Avg Loss: 4.1943, Avg Acc: 0.1856 +INFO:local_logger:Epoch[032/300], Step[0900/1602], Avg Loss: 4.1664, Avg Acc: 0.1910 +INFO:master_logger:Epoch[032/300], Step[0900/1602], Avg Loss: 4.1699, Avg Acc: 0.1907 +INFO:local_logger:Epoch[032/300], Step[0950/1602], Avg Loss: 4.1836, Avg Acc: 0.1875 +INFO:local_logger:Epoch[032/300], Step[0950/1602], Avg Loss: 4.1509, Avg Acc: 0.1934 +INFO:local_logger:Epoch[032/300], Step[0950/1602], Avg Loss: 4.1581, Avg Acc: 0.1931 +INFO:local_logger:Epoch[032/300], Step[0950/1602], Avg Loss: 4.1644, Avg Acc: 0.1915 +INFO:master_logger:Epoch[032/300], Step[0950/1602], Avg Loss: 4.1642, Avg Acc: 0.1914 +INFO:local_logger:Epoch[032/300], Step[1000/1602], Avg Loss: 4.1610, Avg Acc: 0.1919 +INFO:local_logger:Epoch[032/300], Step[1000/1602], Avg Loss: 4.1425, Avg Acc: 0.1963 +INFO:local_logger:Epoch[032/300], Step[1000/1602], Avg Loss: 4.1828, Avg Acc: 0.1860 +INFO:local_logger:Epoch[032/300], Step[1000/1602], Avg Loss: 4.1593, Avg Acc: 0.1932 +INFO:master_logger:Epoch[032/300], Step[1000/1602], Avg Loss: 4.1614, Avg Acc: 0.1919 +INFO:local_logger:Epoch[032/300], Step[1050/1602], Avg Loss: 4.1486, Avg Acc: 0.1959 +INFO:local_logger:Epoch[032/300], Step[1050/1602], Avg Loss: 4.1813, Avg Acc: 0.1855 +INFO:local_logger:Epoch[032/300], Step[1050/1602], Avg Loss: 4.1601, Avg Acc: 0.1928 +INFO:master_logger:Epoch[032/300], Step[1050/1602], Avg Loss: 4.1620, Avg Acc: 0.1915 +INFO:local_logger:Epoch[032/300], Step[1050/1602], Avg Loss: 4.1581, 
Avg Acc: 0.1918 +INFO:local_logger:Epoch[032/300], Step[1100/1602], Avg Loss: 4.1622, Avg Acc: 0.1924 +INFO:local_logger:Epoch[032/300], Step[1100/1602], Avg Loss: 4.1560, Avg Acc: 0.1932 +INFO:local_logger:Epoch[032/300], Step[1100/1602], Avg Loss: 4.1503, Avg Acc: 0.1967 +INFO:local_logger:Epoch[032/300], Step[1100/1602], Avg Loss: 4.1818, Avg Acc: 0.1853 +INFO:master_logger:Epoch[032/300], Step[1100/1602], Avg Loss: 4.1626, Avg Acc: 0.1919 +INFO:local_logger:Epoch[032/300], Step[1150/1602], Avg Loss: 4.1501, Avg Acc: 0.1959 +INFO:local_logger:Epoch[032/300], Step[1150/1602], Avg Loss: 4.1555, Avg Acc: 0.1932 +INFO:local_logger:Epoch[032/300], Step[1150/1602], Avg Loss: 4.1749, Avg Acc: 0.1879 +INFO:local_logger:Epoch[032/300], Step[1150/1602], Avg Loss: 4.1591, Avg Acc: 0.1927 +INFO:master_logger:Epoch[032/300], Step[1150/1602], Avg Loss: 4.1599, Avg Acc: 0.1924 +INFO:local_logger:Epoch[032/300], Step[1200/1602], Avg Loss: 4.1527, Avg Acc: 0.1962 +INFO:local_logger:Epoch[032/300], Step[1200/1602], Avg Loss: 4.1571, Avg Acc: 0.1944 +INFO:local_logger:Epoch[032/300], Step[1200/1602], Avg Loss: 4.1698, Avg Acc: 0.1896 +INFO:local_logger:Epoch[032/300], Step[1200/1602], Avg Loss: 4.1514, Avg Acc: 0.1937 +INFO:master_logger:Epoch[032/300], Step[1200/1602], Avg Loss: 4.1577, Avg Acc: 0.1935 +INFO:local_logger:Epoch[032/300], Step[1250/1602], Avg Loss: 4.1590, Avg Acc: 0.1950 +INFO:local_logger:Epoch[032/300], Step[1250/1602], Avg Loss: 4.1546, Avg Acc: 0.1965 +INFO:local_logger:Epoch[032/300], Step[1250/1602], Avg Loss: 4.1557, Avg Acc: 0.1944 +INFO:local_logger:Epoch[032/300], Step[1250/1602], Avg Loss: 4.1682, Avg Acc: 0.1891 +INFO:master_logger:Epoch[032/300], Step[1250/1602], Avg Loss: 4.1594, Avg Acc: 0.1937 +INFO:local_logger:Epoch[032/300], Step[1300/1602], Avg Loss: 4.1538, Avg Acc: 0.1969 +INFO:local_logger:Epoch[032/300], Step[1300/1602], Avg Loss: 4.1545, Avg Acc: 0.1958 +INFO:local_logger:Epoch[032/300], Step[1300/1602], Avg Loss: 4.1564, Avg Acc: 0.1941 +INFO:local_logger:Epoch[032/300], Step[1300/1602], Avg Loss: 4.1688, Avg Acc: 0.1887 +INFO:master_logger:Epoch[032/300], Step[1300/1602], Avg Loss: 4.1584, Avg Acc: 0.1939 +INFO:local_logger:Epoch[032/300], Step[1350/1602], Avg Loss: 4.1504, Avg Acc: 0.1980 +INFO:local_logger:Epoch[032/300], Step[1350/1602], Avg Loss: 4.1548, Avg Acc: 0.1947 +INFO:local_logger:Epoch[032/300], Step[1350/1602], Avg Loss: 4.1725, Avg Acc: 0.1885 +INFO:local_logger:Epoch[032/300], Step[1350/1602], Avg Loss: 4.1543, Avg Acc: 0.1942 +INFO:master_logger:Epoch[032/300], Step[1350/1602], Avg Loss: 4.1580, Avg Acc: 0.1939 +INFO:local_logger:Epoch[032/300], Step[1400/1602], Avg Loss: 4.1522, Avg Acc: 0.1943 +INFO:local_logger:Epoch[032/300], Step[1400/1602], Avg Loss: 4.1473, Avg Acc: 0.1982 +INFO:local_logger:Epoch[032/300], Step[1400/1602], Avg Loss: 4.1525, Avg Acc: 0.1941 +INFO:local_logger:Epoch[032/300], Step[1400/1602], Avg Loss: 4.1727, Avg Acc: 0.1877 +INFO:master_logger:Epoch[032/300], Step[1400/1602], Avg Loss: 4.1562, Avg Acc: 0.1936 +INFO:local_logger:Epoch[032/300], Step[1450/1602], Avg Loss: 4.1479, Avg Acc: 0.1977 +INFO:local_logger:Epoch[032/300], Step[1450/1602], Avg Loss: 4.1774, Avg Acc: 0.1876 +INFO:local_logger:Epoch[032/300], Step[1450/1602], Avg Loss: 4.1505, Avg Acc: 0.1951 +INFO:local_logger:Epoch[032/300], Step[1450/1602], Avg Loss: 4.1487, Avg Acc: 0.1944 +INFO:master_logger:Epoch[032/300], Step[1450/1602], Avg Loss: 4.1561, Avg Acc: 0.1937 +INFO:local_logger:Epoch[032/300], Step[1500/1602], Avg Loss: 4.1520, Avg Acc: 0.1975 
+INFO:local_logger:----- Epoch[032/300], Train Loss: 4.1455, Train Acc: 0.1943, time: 3701.97
+INFO:local_logger:Now training epoch 33. LR=0.000382
+INFO:local_logger:----- Epoch[032/300], Train Loss: 4.1761, Train Acc: 0.1874, time: 3701.92
+INFO:local_logger:Now training epoch 33. LR=0.000382
+INFO:local_logger:----- Epoch[032/300], Train Loss: 4.1523, Train Acc: 0.1990, time: 3701.32
+INFO:master_logger:----- Epoch[032/300], Train Loss: 4.1568, Train Acc: 0.1941, time: 3701.32
+INFO:local_logger:----- Epoch[032/300], Train Loss: 4.1534, Train Acc: 0.1958, time: 3701.57
+INFO:local_logger:Now training epoch 33. LR=0.000382
+INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-32-Loss-4.152295494702486.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-32-Loss-4.152295494702486.pdopt
+INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-32-Loss-4.152295494702486-EMA.pdparams
+INFO:local_logger:Now training epoch 33. LR=0.000382
+INFO:master_logger:Now training epoch 33. LR=0.000382
+INFO:local_logger:----- Epoch[033/300], Train Loss: 4.1230, Train Acc: 0.1970, time: 3698.37
+INFO:local_logger:Now training epoch 34. LR=0.000382
+INFO:local_logger:----- Epoch[033/300], Train Loss: 4.1175, Train Acc: 0.1941, time: 3698.50
+INFO:local_logger:Now training epoch 34. LR=0.000382
+INFO:local_logger:----- Epoch[033/300], Train Loss: 4.1153, Train Acc: 0.1938, time: 3698.55
+INFO:local_logger:Now training epoch 34. LR=0.000382
+INFO:local_logger:----- Epoch[033/300], Train Loss: 4.1381, Train Acc: 0.1980, time: 3698.24
+INFO:master_logger:----- Epoch[033/300], Train Loss: 4.1235, Train Acc: 0.1957, time: 3698.24
+INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-33-Loss-4.138145907940862.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-33-Loss-4.138145907940862.pdopt
+INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-33-Loss-4.138145907940862-EMA.pdparams
+INFO:local_logger:Now training epoch 34. LR=0.000382
+INFO:master_logger:Now training epoch 34. LR=0.000382
+INFO:local_logger:----- Epoch[034/300], Train Loss: 4.0996, Train Acc: 0.2078, time: 3692.07
+INFO:master_logger:----- Epoch[034/300], Train Loss: 4.1141, Train Acc: 0.2015, time: 3692.07
+INFO:local_logger:----- Epoch[034/300], Train Loss: 4.1279, Train Acc: 0.1975, time: 3692.41
+INFO:local_logger:Now training epoch 35. LR=0.000381
+INFO:local_logger:----- Epoch[034/300], Train Loss: 4.1090, Train Acc: 0.2005, time: 3692.44
+INFO:local_logger:Now training epoch 35. LR=0.000381
+INFO:local_logger:----- Epoch[034/300], Train Loss: 4.1200, Train Acc: 0.2003, time: 3692.55
+INFO:local_logger:Now training epoch 35. LR=0.000381
+INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-34-Loss-4.0996046095225624.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-34-Loss-4.0996046095225624.pdopt
+INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-34-Loss-4.0996046095225624-EMA.pdparams
+INFO:local_logger:Now training epoch 35. LR=0.000381
+INFO:master_logger:Now training epoch 35. LR=0.000381
+INFO:local_logger:----- Epoch[035/300], Train Loss: 4.1006, Train Acc: 0.2019, time: 3701.30
+INFO:local_logger:Now training epoch 36. LR=0.000380
+INFO:local_logger:----- Epoch[035/300], Train Loss: 4.1057, Train Acc: 0.1961, time: 3701.10
+INFO:local_logger:----- Epoch[035/300], Train Loss: 4.0764, Train Acc: 0.2023, time: 3701.26
+INFO:master_logger:----- Epoch[035/300], Train Loss: 4.0981, Train Acc: 0.1996, time: 3701.10
+INFO:local_logger:Now training epoch 36. LR=0.000380
+INFO:local_logger:----- Epoch[035/300], Train Loss: 4.1098, Train Acc: 0.1982, time: 3701.27
+INFO:local_logger:Now training epoch 36. LR=0.000380
+INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-35-Loss-4.105662516944482.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-35-Loss-4.105662516944482.pdopt
+INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-35-Loss-4.105662516944482-EMA.pdparams
+INFO:local_logger:Now training epoch 36. LR=0.000380
+INFO:master_logger:Now training epoch 36. LR=0.000380
+INFO:local_logger:----- Epoch[036/300], Train Loss: 4.0770, Train Acc: 0.2075, time: 3706.39
+INFO:master_logger:----- Epoch[036/300], Train Loss: 4.0764, Train Acc: 0.2040, time: 3706.39
+INFO:local_logger:----- Epoch[036/300], Train Loss: 4.0883, Train Acc: 0.1949, time: 3706.82
+INFO:local_logger:----- Epoch[036/300], Train Loss: 4.0767, Train Acc: 0.2044, time: 3706.82
+INFO:local_logger:Now training epoch 37. LR=0.000380
+INFO:local_logger:Now training epoch 37. LR=0.000380
+INFO:local_logger:----- Epoch[036/300], Train Loss: 4.0636, Train Acc: 0.2092, time: 3706.82
+INFO:local_logger:Now training epoch 37. LR=0.000380
+INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-36-Loss-4.07702053028395.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-36-Loss-4.07702053028395.pdopt
+INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-36-Loss-4.07702053028395-EMA.pdparams
+INFO:local_logger:Now training epoch 37. LR=0.000380
+INFO:master_logger:Now training epoch 37. LR=0.000380
+INFO:local_logger:Epoch[037/300], Step[0800/1602], Avg Loss: 4.0485, Avg Acc: 0.2077 +INFO:master_logger:Epoch[037/300], Step[0800/1602], Avg Loss: 4.0743, Avg Acc: 0.2003 +INFO:local_logger:Epoch[037/300], Step[0800/1602], Avg Loss: 4.0799, Avg Acc: 0.1930 +INFO:local_logger:Epoch[037/300], Step[0850/1602], Avg Loss: 4.1145, Avg Acc: 0.1950 +INFO:local_logger:Epoch[037/300], Step[0850/1602], Avg Loss: 4.0524, Avg Acc: 0.2060 +INFO:local_logger:Epoch[037/300], Step[0850/1602], Avg Loss: 4.0772, Avg Acc: 0.1937 +INFO:master_logger:Epoch[037/300], Step[0850/1602], Avg Loss: 4.0724, Avg Acc: 0.2005 +INFO:local_logger:Epoch[037/300], Step[0850/1602], Avg Loss: 4.0457, Avg Acc: 0.2075 +INFO:local_logger:Epoch[037/300], Step[0900/1602], Avg Loss: 4.1088, Avg Acc: 0.1949 +INFO:local_logger:Epoch[037/300], Step[0900/1602], Avg Loss: 4.0483, Avg Acc: 0.2071 +INFO:local_logger:Epoch[037/300], Step[0900/1602], Avg Loss: 4.0755, Avg Acc: 0.1943 +INFO:master_logger:Epoch[037/300], Step[0900/1602], Avg Loss: 4.0686, Avg Acc: 0.2014 +INFO:local_logger:Epoch[037/300], Step[0900/1602], Avg Loss: 4.0418, Avg Acc: 0.2094 +INFO:local_logger:Epoch[037/300], Step[0950/1602], Avg Loss: 4.0456, Avg Acc: 0.2061 +INFO:local_logger:Epoch[037/300], Step[0950/1602], Avg Loss: 4.0486, Avg Acc: 0.2084 +INFO:local_logger:Epoch[037/300], Step[0950/1602], Avg Loss: 4.1038, Avg Acc: 0.1958 +INFO:local_logger:Epoch[037/300], Step[0950/1602], Avg Loss: 4.0719, Avg Acc: 0.1950 +INFO:master_logger:Epoch[037/300], Step[0950/1602], Avg Loss: 4.0675, Avg Acc: 0.2013 +INFO:local_logger:Epoch[037/300], Step[1000/1602], Avg Loss: 4.1027, Avg Acc: 0.1978 +INFO:local_logger:Epoch[037/300], Step[1000/1602], Avg Loss: 4.0533, Avg Acc: 0.2082 +INFO:local_logger:Epoch[037/300], Step[1000/1602], Avg Loss: 4.0523, Avg Acc: 0.2071 +INFO:master_logger:Epoch[037/300], Step[1000/1602], Avg Loss: 4.0717, Avg Acc: 0.2023 +INFO:local_logger:Epoch[037/300], Step[1000/1602], Avg Loss: 4.0785, Avg Acc: 0.1962 +INFO:local_logger:Epoch[037/300], Step[1050/1602], Avg Loss: 4.0793, Avg Acc: 0.1958 +INFO:local_logger:Epoch[037/300], Step[1050/1602], Avg Loss: 4.1076, Avg Acc: 0.1976 +INFO:local_logger:Epoch[037/300], Step[1050/1602], Avg Loss: 4.0484, Avg Acc: 0.2093 +INFO:local_logger:Epoch[037/300], Step[1050/1602], Avg Loss: 4.0610, Avg Acc: 0.2057 +INFO:master_logger:Epoch[037/300], Step[1050/1602], Avg Loss: 4.0741, Avg Acc: 0.2021 +INFO:local_logger:Epoch[037/300], Step[1100/1602], Avg Loss: 4.1043, Avg Acc: 0.1965 +INFO:master_logger:Epoch[037/300], Step[1100/1602], Avg Loss: 4.0709, Avg Acc: 0.2024 +INFO:local_logger:Epoch[037/300], Step[1100/1602], Avg Loss: 4.0470, Avg Acc: 0.2101 +INFO:local_logger:Epoch[037/300], Step[1100/1602], Avg Loss: 4.0549, Avg Acc: 0.2071 +INFO:local_logger:Epoch[037/300], Step[1100/1602], Avg Loss: 4.0774, Avg Acc: 0.1958 +INFO:local_logger:Epoch[037/300], Step[1150/1602], Avg Loss: 4.0965, Avg Acc: 0.1976 +INFO:local_logger:Epoch[037/300], Step[1150/1602], Avg Loss: 4.0778, Avg Acc: 0.1944 +INFO:local_logger:Epoch[037/300], Step[1150/1602], Avg Loss: 4.0474, Avg Acc: 0.2092 +INFO:local_logger:Epoch[037/300], Step[1150/1602], Avg Loss: 4.0530, Avg Acc: 0.2063 +INFO:master_logger:Epoch[037/300], Step[1150/1602], Avg Loss: 4.0687, Avg Acc: 0.2019 +INFO:local_logger:Epoch[037/300], Step[1200/1602], Avg Loss: 4.0507, Avg Acc: 0.2080 +INFO:local_logger:Epoch[037/300], Step[1200/1602], Avg Loss: 4.0939, Avg Acc: 0.1976 +INFO:local_logger:Epoch[037/300], Step[1200/1602], Avg Loss: 4.0510, Avg Acc: 0.2073 
+INFO:local_logger:Epoch[037/300], Step[1200/1602], Avg Loss: 4.0829, Avg Acc: 0.1952 +INFO:master_logger:Epoch[037/300], Step[1200/1602], Avg Loss: 4.0696, Avg Acc: 0.2020 +INFO:local_logger:Epoch[037/300], Step[1250/1602], Avg Loss: 4.0932, Avg Acc: 0.1973 +INFO:local_logger:Epoch[037/300], Step[1250/1602], Avg Loss: 4.0525, Avg Acc: 0.2069 +INFO:local_logger:Epoch[037/300], Step[1250/1602], Avg Loss: 4.0513, Avg Acc: 0.2083 +INFO:local_logger:Epoch[037/300], Step[1250/1602], Avg Loss: 4.0814, Avg Acc: 0.1956 +INFO:master_logger:Epoch[037/300], Step[1250/1602], Avg Loss: 4.0696, Avg Acc: 0.2020 +INFO:local_logger:Epoch[037/300], Step[1300/1602], Avg Loss: 4.0906, Avg Acc: 0.1981 +INFO:local_logger:Epoch[037/300], Step[1300/1602], Avg Loss: 4.0501, Avg Acc: 0.2084 +INFO:local_logger:Epoch[037/300], Step[1300/1602], Avg Loss: 4.0814, Avg Acc: 0.1947 +INFO:local_logger:Epoch[037/300], Step[1300/1602], Avg Loss: 4.0539, Avg Acc: 0.2076 +INFO:master_logger:Epoch[037/300], Step[1300/1602], Avg Loss: 4.0690, Avg Acc: 0.2022 +INFO:local_logger:Epoch[037/300], Step[1350/1602], Avg Loss: 4.0906, Avg Acc: 0.1981 +INFO:local_logger:Epoch[037/300], Step[1350/1602], Avg Loss: 4.0812, Avg Acc: 0.1940 +INFO:local_logger:Epoch[037/300], Step[1350/1602], Avg Loss: 4.0515, Avg Acc: 0.2080 +INFO:local_logger:Epoch[037/300], Step[1350/1602], Avg Loss: 4.0566, Avg Acc: 0.2078 +INFO:master_logger:Epoch[037/300], Step[1350/1602], Avg Loss: 4.0700, Avg Acc: 0.2020 +INFO:local_logger:Epoch[037/300], Step[1400/1602], Avg Loss: 4.0916, Avg Acc: 0.1987 +INFO:local_logger:Epoch[037/300], Step[1400/1602], Avg Loss: 4.0830, Avg Acc: 0.1938 +INFO:local_logger:Epoch[037/300], Step[1400/1602], Avg Loss: 4.0563, Avg Acc: 0.2073 +INFO:master_logger:Epoch[037/300], Step[1400/1602], Avg Loss: 4.0699, Avg Acc: 0.2020 +INFO:local_logger:Epoch[037/300], Step[1400/1602], Avg Loss: 4.0489, Avg Acc: 0.2085 +INFO:local_logger:Epoch[037/300], Step[1450/1602], Avg Loss: 4.0859, Avg Acc: 0.1939 +INFO:local_logger:Epoch[037/300], Step[1450/1602], Avg Loss: 4.0883, Avg Acc: 0.1987 +INFO:local_logger:Epoch[037/300], Step[1450/1602], Avg Loss: 4.0478, Avg Acc: 0.2091 +INFO:local_logger:Epoch[037/300], Step[1450/1602], Avg Loss: 4.0508, Avg Acc: 0.2081 +INFO:master_logger:Epoch[037/300], Step[1450/1602], Avg Loss: 4.0682, Avg Acc: 0.2024 +INFO:local_logger:Epoch[037/300], Step[1500/1602], Avg Loss: 4.0863, Avg Acc: 0.1984 +INFO:local_logger:Epoch[037/300], Step[1500/1602], Avg Loss: 4.0466, Avg Acc: 0.2086 +INFO:local_logger:Epoch[037/300], Step[1500/1602], Avg Loss: 4.0468, Avg Acc: 0.2094 +INFO:master_logger:Epoch[037/300], Step[1500/1602], Avg Loss: 4.0661, Avg Acc: 0.2030 +INFO:local_logger:Epoch[037/300], Step[1500/1602], Avg Loss: 4.0846, Avg Acc: 0.1956 +INFO:local_logger:Epoch[037/300], Step[1550/1602], Avg Loss: 4.0843, Avg Acc: 0.1976 +INFO:local_logger:Epoch[037/300], Step[1550/1602], Avg Loss: 4.0882, Avg Acc: 0.1951 +INFO:local_logger:Epoch[037/300], Step[1550/1602], Avg Loss: 4.0487, Avg Acc: 0.2082 +INFO:local_logger:Epoch[037/300], Step[1550/1602], Avg Loss: 4.0431, Avg Acc: 0.2089 +INFO:master_logger:Epoch[037/300], Step[1550/1602], Avg Loss: 4.0661, Avg Acc: 0.2025 +INFO:local_logger:Epoch[037/300], Step[1600/1602], Avg Loss: 4.0482, Avg Acc: 0.2083 +INFO:local_logger:Epoch[037/300], Step[1600/1602], Avg Loss: 4.0826, Avg Acc: 0.1977 +INFO:local_logger:Epoch[037/300], Step[1600/1602], Avg Loss: 4.0413, Avg Acc: 0.2094 +INFO:master_logger:Epoch[037/300], Step[1600/1602], Avg Loss: 4.0647, Avg Acc: 0.2027 
+INFO:local_logger:Epoch[037/300], Step[1600/1602], Avg Loss: 4.0868, Avg Acc: 0.1955 +INFO:local_logger:----- Epoch[037/300], Train Loss: 4.0482, Train Acc: 0.2084, time: 3715.08 +INFO:local_logger:Now training epoch 38. LR=0.000379 +INFO:local_logger:----- Epoch[037/300], Train Loss: 4.0827, Train Acc: 0.1977, time: 3715.36 +INFO:local_logger:----- Epoch[037/300], Train Loss: 4.0415, Train Acc: 0.2094, time: 3715.43 +INFO:local_logger:Now training epoch 38. LR=0.000379 +INFO:master_logger:----- Epoch[037/300], Train Loss: 4.0648, Train Acc: 0.2028, time: 3715.36 +INFO:local_logger:----- Epoch[037/300], Train Loss: 4.0869, Train Acc: 0.1956, time: 3715.43 +INFO:local_logger:Now training epoch 38. LR=0.000379 +INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-37-Loss-4.082746218156971.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-37-Loss-4.082746218156971.pdopt +INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-37-Loss-4.082746218156971-EMA.pdparams +INFO:local_logger:Now training epoch 38. LR=0.000379 +INFO:master_logger:Now training epoch 38. LR=0.000379 +INFO:local_logger:Epoch[038/300], Step[0000/1602], Avg Loss: 4.4291, Avg Acc: 0.2400 +INFO:local_logger:Epoch[038/300], Step[0000/1602], Avg Loss: 4.1491, Avg Acc: 0.2850 +INFO:local_logger:Epoch[038/300], Step[0000/1602], Avg Loss: 3.6355, Avg Acc: 0.3700 +INFO:master_logger:Epoch[038/300], Step[0000/1602], Avg Loss: 3.8337, Avg Acc: 0.3150 +INFO:local_logger:Epoch[038/300], Step[0000/1602], Avg Loss: 3.1212, Avg Acc: 0.3650 +INFO:local_logger:Epoch[038/300], Step[0050/1602], Avg Loss: 4.0804, Avg Acc: 0.1918 +INFO:local_logger:Epoch[038/300], Step[0050/1602], Avg Loss: 4.1606, Avg Acc: 0.2002 +INFO:local_logger:Epoch[038/300], Step[0050/1602], Avg Loss: 4.0127, Avg Acc: 0.2463 +INFO:master_logger:Epoch[038/300], Step[0050/1602], Avg Loss: 4.0734, Avg Acc: 0.2145 +INFO:local_logger:Epoch[038/300], Step[0050/1602], Avg Loss: 4.0398, Avg Acc: 0.2197 +INFO:local_logger:Epoch[038/300], Step[0100/1602], Avg Loss: 4.1132, Avg Acc: 0.1907 +INFO:local_logger:Epoch[038/300], Step[0100/1602], Avg Loss: 4.1389, Avg Acc: 0.1876 +INFO:local_logger:Epoch[038/300], Step[0100/1602], Avg Loss: 4.0265, Avg Acc: 0.2210 +INFO:master_logger:Epoch[038/300], Step[0100/1602], Avg Loss: 4.0972, Avg Acc: 0.2007 +INFO:local_logger:Epoch[038/300], Step[0100/1602], Avg Loss: 4.1100, Avg Acc: 0.2034 +INFO:local_logger:Epoch[038/300], Step[0150/1602], Avg Loss: 4.1146, Avg Acc: 0.1869 +INFO:local_logger:Epoch[038/300], Step[0150/1602], Avg Loss: 4.1112, Avg Acc: 0.1883 +INFO:local_logger:Epoch[038/300], Step[0150/1602], Avg Loss: 4.0803, Avg Acc: 0.2139 +INFO:local_logger:Epoch[038/300], Step[0150/1602], Avg Loss: 4.0668, Avg Acc: 0.2209 +INFO:master_logger:Epoch[038/300], Step[0150/1602], Avg Loss: 4.0932, Avg Acc: 0.2025 +INFO:local_logger:Epoch[038/300], Step[0200/1602], Avg Loss: 4.0977, Avg Acc: 0.1988 +INFO:local_logger:Epoch[038/300], Step[0200/1602], Avg Loss: 4.0383, Avg Acc: 0.2165 +INFO:local_logger:Epoch[038/300], Step[0200/1602], Avg Loss: 4.0615, Avg Acc: 0.2198 +INFO:local_logger:Epoch[038/300], Step[0200/1602], Avg Loss: 4.0994, Avg Acc: 0.1944 +INFO:master_logger:Epoch[038/300], Step[0200/1602], Avg Loss: 4.0742, Avg Acc: 0.2074 +INFO:local_logger:Epoch[038/300], Step[0250/1602], Avg Loss: 4.0922, Avg Acc: 0.2029 +INFO:local_logger:Epoch[038/300], Step[0250/1602], Avg Loss: 4.0861, Avg Acc: 0.1936 +INFO:local_logger:Epoch[038/300], 
Step[0250/1602], Avg Loss: 4.0971, Avg Acc: 0.2128 +INFO:master_logger:Epoch[038/300], Step[0250/1602], Avg Loss: 4.0766, Avg Acc: 0.2064 +INFO:local_logger:Epoch[038/300], Step[0250/1602], Avg Loss: 4.0311, Avg Acc: 0.2164 +INFO:local_logger:Epoch[038/300], Step[0300/1602], Avg Loss: 4.0818, Avg Acc: 0.2073 +INFO:local_logger:Epoch[038/300], Step[0300/1602], Avg Loss: 4.1105, Avg Acc: 0.2129 +INFO:local_logger:Epoch[038/300], Step[0300/1602], Avg Loss: 4.0772, Avg Acc: 0.1982 +INFO:master_logger:Epoch[038/300], Step[0300/1602], Avg Loss: 4.0724, Avg Acc: 0.2097 +INFO:local_logger:Epoch[038/300], Step[0300/1602], Avg Loss: 4.0201, Avg Acc: 0.2204 +INFO:local_logger:Epoch[038/300], Step[0350/1602], Avg Loss: 4.0847, Avg Acc: 0.2078 +INFO:local_logger:Epoch[038/300], Step[0350/1602], Avg Loss: 4.1132, Avg Acc: 0.2120 +INFO:local_logger:Epoch[038/300], Step[0350/1602], Avg Loss: 4.0135, Avg Acc: 0.2266 +INFO:local_logger:Epoch[038/300], Step[0350/1602], Avg Loss: 4.0760, Avg Acc: 0.2029 +INFO:master_logger:Epoch[038/300], Step[0350/1602], Avg Loss: 4.0719, Avg Acc: 0.2123 +INFO:local_logger:Epoch[038/300], Step[0400/1602], Avg Loss: 4.0811, Avg Acc: 0.2073 +INFO:master_logger:Epoch[038/300], Step[0400/1602], Avg Loss: 4.0795, Avg Acc: 0.2112 +INFO:local_logger:Epoch[038/300], Step[0400/1602], Avg Loss: 4.0912, Avg Acc: 0.1989 +INFO:local_logger:Epoch[038/300], Step[0400/1602], Avg Loss: 4.0158, Avg Acc: 0.2284 +INFO:local_logger:Epoch[038/300], Step[0400/1602], Avg Loss: 4.1299, Avg Acc: 0.2102 +INFO:local_logger:Epoch[038/300], Step[0450/1602], Avg Loss: 4.0672, Avg Acc: 0.2070 +INFO:local_logger:Epoch[038/300], Step[0450/1602], Avg Loss: 4.1176, Avg Acc: 0.2049 +INFO:master_logger:Epoch[038/300], Step[0450/1602], Avg Loss: 4.0738, Avg Acc: 0.2093 +INFO:local_logger:Epoch[038/300], Step[0450/1602], Avg Loss: 4.0859, Avg Acc: 0.1991 +INFO:local_logger:Epoch[038/300], Step[0450/1602], Avg Loss: 4.0246, Avg Acc: 0.2262 +INFO:local_logger:Epoch[038/300], Step[0500/1602], Avg Loss: 4.0656, Avg Acc: 0.2059 +INFO:local_logger:Epoch[038/300], Step[0500/1602], Avg Loss: 4.0759, Avg Acc: 0.1996 +INFO:local_logger:Epoch[038/300], Step[0500/1602], Avg Loss: 4.0215, Avg Acc: 0.2249 +INFO:local_logger:Epoch[038/300], Step[0500/1602], Avg Loss: 4.1282, Avg Acc: 0.2031 +INFO:master_logger:Epoch[038/300], Step[0500/1602], Avg Loss: 4.0728, Avg Acc: 0.2084 +INFO:local_logger:Epoch[038/300], Step[0550/1602], Avg Loss: 4.0668, Avg Acc: 0.2047 +INFO:local_logger:Epoch[038/300], Step[0550/1602], Avg Loss: 4.0303, Avg Acc: 0.2212 +INFO:local_logger:Epoch[038/300], Step[0550/1602], Avg Loss: 4.0655, Avg Acc: 0.2029 +INFO:master_logger:Epoch[038/300], Step[0550/1602], Avg Loss: 4.0713, Avg Acc: 0.2076 +INFO:local_logger:Epoch[038/300], Step[0550/1602], Avg Loss: 4.1227, Avg Acc: 0.2017 +INFO:local_logger:Epoch[038/300], Step[0600/1602], Avg Loss: 4.0726, Avg Acc: 0.2036 +INFO:local_logger:Epoch[038/300], Step[0600/1602], Avg Loss: 4.0663, Avg Acc: 0.2037 +INFO:local_logger:Epoch[038/300], Step[0600/1602], Avg Loss: 4.0320, Avg Acc: 0.2198 +INFO:local_logger:Epoch[038/300], Step[0600/1602], Avg Loss: 4.1114, Avg Acc: 0.2049 +INFO:master_logger:Epoch[038/300], Step[0600/1602], Avg Loss: 4.0706, Avg Acc: 0.2080 +INFO:local_logger:Epoch[038/300], Step[0650/1602], Avg Loss: 4.0623, Avg Acc: 0.2062 +INFO:local_logger:Epoch[038/300], Step[0650/1602], Avg Loss: 4.0630, Avg Acc: 0.2050 +INFO:local_logger:Epoch[038/300], Step[0650/1602], Avg Loss: 4.0390, Avg Acc: 0.2184 +INFO:master_logger:Epoch[038/300], Step[0650/1602], 
Avg Loss: 4.0667, Avg Acc: 0.2087 +INFO:local_logger:Epoch[038/300], Step[0650/1602], Avg Loss: 4.1025, Avg Acc: 0.2053 +INFO:local_logger:Epoch[038/300], Step[0700/1602], Avg Loss: 4.0663, Avg Acc: 0.2043 +INFO:local_logger:Epoch[038/300], Step[0700/1602], Avg Loss: 4.0459, Avg Acc: 0.2184 +INFO:local_logger:Epoch[038/300], Step[0700/1602], Avg Loss: 4.0559, Avg Acc: 0.2048 +INFO:local_logger:Epoch[038/300], Step[0700/1602], Avg Loss: 4.1015, Avg Acc: 0.2068 +INFO:master_logger:Epoch[038/300], Step[0700/1602], Avg Loss: 4.0674, Avg Acc: 0.2086 +INFO:local_logger:Epoch[038/300], Step[0750/1602], Avg Loss: 4.0653, Avg Acc: 0.2017 +INFO:local_logger:Epoch[038/300], Step[0750/1602], Avg Loss: 4.0417, Avg Acc: 0.2055 +INFO:local_logger:Epoch[038/300], Step[0750/1602], Avg Loss: 4.0450, Avg Acc: 0.2163 +INFO:local_logger:Epoch[038/300], Step[0750/1602], Avg Loss: 4.1026, Avg Acc: 0.2076 +INFO:master_logger:Epoch[038/300], Step[0750/1602], Avg Loss: 4.0637, Avg Acc: 0.2078 +INFO:local_logger:Epoch[038/300], Step[0800/1602], Avg Loss: 4.0721, Avg Acc: 0.2012 +INFO:local_logger:Epoch[038/300], Step[0800/1602], Avg Loss: 4.0427, Avg Acc: 0.2068 +INFO:master_logger:Epoch[038/300], Step[0800/1602], Avg Loss: 4.0673, Avg Acc: 0.2071 +INFO:local_logger:Epoch[038/300], Step[0800/1602], Avg Loss: 4.0444, Avg Acc: 0.2155 +INFO:local_logger:Epoch[038/300], Step[0800/1602], Avg Loss: 4.1101, Avg Acc: 0.2048 +INFO:local_logger:Epoch[038/300], Step[0850/1602], Avg Loss: 4.0715, Avg Acc: 0.2009 +INFO:local_logger:Epoch[038/300], Step[0850/1602], Avg Loss: 4.0480, Avg Acc: 0.2063 +INFO:local_logger:Epoch[038/300], Step[0850/1602], Avg Loss: 4.1108, Avg Acc: 0.2055 +INFO:master_logger:Epoch[038/300], Step[0850/1602], Avg Loss: 4.0700, Avg Acc: 0.2069 +INFO:local_logger:Epoch[038/300], Step[0850/1602], Avg Loss: 4.0497, Avg Acc: 0.2150 +INFO:local_logger:Epoch[038/300], Step[0900/1602], Avg Loss: 4.0642, Avg Acc: 0.2021 +INFO:local_logger:Epoch[038/300], Step[0900/1602], Avg Loss: 4.1068, Avg Acc: 0.2063 +INFO:local_logger:Epoch[038/300], Step[0900/1602], Avg Loss: 4.0474, Avg Acc: 0.2066 +INFO:master_logger:Epoch[038/300], Step[0900/1602], Avg Loss: 4.0698, Avg Acc: 0.2069 +INFO:local_logger:Epoch[038/300], Step[0900/1602], Avg Loss: 4.0607, Avg Acc: 0.2127 +INFO:local_logger:Epoch[038/300], Step[0950/1602], Avg Loss: 4.0672, Avg Acc: 0.2014 +INFO:local_logger:Epoch[038/300], Step[0950/1602], Avg Loss: 4.0523, Avg Acc: 0.2063 +INFO:master_logger:Epoch[038/300], Step[0950/1602], Avg Loss: 4.0685, Avg Acc: 0.2059 +INFO:local_logger:Epoch[038/300], Step[0950/1602], Avg Loss: 4.0538, Avg Acc: 0.2100 +INFO:local_logger:Epoch[038/300], Step[0950/1602], Avg Loss: 4.1007, Avg Acc: 0.2058 +INFO:local_logger:Epoch[038/300], Step[1000/1602], Avg Loss: 4.0647, Avg Acc: 0.2030 +INFO:local_logger:Epoch[038/300], Step[1000/1602], Avg Loss: 4.0508, Avg Acc: 0.2061 +INFO:local_logger:Epoch[038/300], Step[1000/1602], Avg Loss: 4.0542, Avg Acc: 0.2079 +INFO:master_logger:Epoch[038/300], Step[1000/1602], Avg Loss: 4.0682, Avg Acc: 0.2059 +INFO:local_logger:Epoch[038/300], Step[1000/1602], Avg Loss: 4.1029, Avg Acc: 0.2064 +INFO:local_logger:Epoch[038/300], Step[1050/1602], Avg Loss: 4.0614, Avg Acc: 0.2027 +INFO:local_logger:Epoch[038/300], Step[1050/1602], Avg Loss: 4.0488, Avg Acc: 0.2067 +INFO:local_logger:Epoch[038/300], Step[1050/1602], Avg Loss: 4.0489, Avg Acc: 0.2093 +INFO:local_logger:Epoch[038/300], Step[1050/1602], Avg Loss: 4.1034, Avg Acc: 0.2073 +INFO:master_logger:Epoch[038/300], Step[1050/1602], Avg Loss: 4.0656, 
Avg Acc: 0.2065 +INFO:local_logger:Epoch[038/300], Step[1100/1602], Avg Loss: 4.0555, Avg Acc: 0.2045 +INFO:local_logger:Epoch[038/300], Step[1100/1602], Avg Loss: 4.0478, Avg Acc: 0.2073 +INFO:master_logger:Epoch[038/300], Step[1100/1602], Avg Loss: 4.0607, Avg Acc: 0.2076 +INFO:local_logger:Epoch[038/300], Step[1100/1602], Avg Loss: 4.0944, Avg Acc: 0.2086 +INFO:local_logger:Epoch[038/300], Step[1100/1602], Avg Loss: 4.0450, Avg Acc: 0.2098 +INFO:local_logger:Epoch[038/300], Step[1150/1602], Avg Loss: 4.0551, Avg Acc: 0.2030 +INFO:local_logger:Epoch[038/300], Step[1150/1602], Avg Loss: 4.0861, Avg Acc: 0.2075 +INFO:local_logger:Epoch[038/300], Step[1150/1602], Avg Loss: 4.0483, Avg Acc: 0.2094 +INFO:local_logger:Epoch[038/300], Step[1150/1602], Avg Loss: 4.0521, Avg Acc: 0.2078 +INFO:master_logger:Epoch[038/300], Step[1150/1602], Avg Loss: 4.0604, Avg Acc: 0.2069 +INFO:local_logger:Epoch[038/300], Step[1200/1602], Avg Loss: 4.0537, Avg Acc: 0.2040 +INFO:local_logger:Epoch[038/300], Step[1200/1602], Avg Loss: 4.0566, Avg Acc: 0.2081 +INFO:local_logger:Epoch[038/300], Step[1200/1602], Avg Loss: 4.0394, Avg Acc: 0.2093 +INFO:local_logger:Epoch[038/300], Step[1200/1602], Avg Loss: 4.0823, Avg Acc: 0.2069 +INFO:master_logger:Epoch[038/300], Step[1200/1602], Avg Loss: 4.0580, Avg Acc: 0.2071 +INFO:local_logger:Epoch[038/300], Step[1250/1602], Avg Loss: 4.0540, Avg Acc: 0.2055 +INFO:local_logger:Epoch[038/300], Step[1250/1602], Avg Loss: 4.0793, Avg Acc: 0.2068 +INFO:local_logger:Epoch[038/300], Step[1250/1602], Avg Loss: 4.0512, Avg Acc: 0.2094 +INFO:local_logger:Epoch[038/300], Step[1250/1602], Avg Loss: 4.0357, Avg Acc: 0.2099 +INFO:master_logger:Epoch[038/300], Step[1250/1602], Avg Loss: 4.0550, Avg Acc: 0.2079 +INFO:local_logger:Epoch[038/300], Step[1300/1602], Avg Loss: 4.0522, Avg Acc: 0.2060 +INFO:local_logger:Epoch[038/300], Step[1300/1602], Avg Loss: 4.0360, Avg Acc: 0.2108 +INFO:local_logger:Epoch[038/300], Step[1300/1602], Avg Loss: 4.0742, Avg Acc: 0.2070 +INFO:local_logger:Epoch[038/300], Step[1300/1602], Avg Loss: 4.0453, Avg Acc: 0.2088 +INFO:master_logger:Epoch[038/300], Step[1300/1602], Avg Loss: 4.0519, Avg Acc: 0.2081 +INFO:local_logger:Epoch[038/300], Step[1350/1602], Avg Loss: 4.0435, Avg Acc: 0.2089 +INFO:local_logger:Epoch[038/300], Step[1350/1602], Avg Loss: 4.0517, Avg Acc: 0.2059 +INFO:master_logger:Epoch[038/300], Step[1350/1602], Avg Loss: 4.0493, Avg Acc: 0.2076 +INFO:local_logger:Epoch[038/300], Step[1350/1602], Avg Loss: 4.0678, Avg Acc: 0.2064 +INFO:local_logger:Epoch[038/300], Step[1350/1602], Avg Loss: 4.0341, Avg Acc: 0.2092 +INFO:local_logger:Epoch[038/300], Step[1400/1602], Avg Loss: 4.0523, Avg Acc: 0.2050 +INFO:local_logger:Epoch[038/300], Step[1400/1602], Avg Loss: 4.0325, Avg Acc: 0.2090 +INFO:local_logger:Epoch[038/300], Step[1400/1602], Avg Loss: 4.0713, Avg Acc: 0.2050 +INFO:local_logger:Epoch[038/300], Step[1400/1602], Avg Loss: 4.0391, Avg Acc: 0.2085 +INFO:master_logger:Epoch[038/300], Step[1400/1602], Avg Loss: 4.0488, Avg Acc: 0.2069 +INFO:local_logger:Epoch[038/300], Step[1450/1602], Avg Loss: 4.0519, Avg Acc: 0.2053 +INFO:local_logger:Epoch[038/300], Step[1450/1602], Avg Loss: 4.0368, Avg Acc: 0.2092 +INFO:local_logger:Epoch[038/300], Step[1450/1602], Avg Loss: 4.0326, Avg Acc: 0.2096 +INFO:master_logger:Epoch[038/300], Step[1450/1602], Avg Loss: 4.0479, Avg Acc: 0.2075 +INFO:local_logger:Epoch[038/300], Step[1450/1602], Avg Loss: 4.0704, Avg Acc: 0.2057 +INFO:local_logger:Epoch[038/300], Step[1500/1602], Avg Loss: 4.0509, Avg Acc: 0.2059 
+INFO:local_logger:Epoch[038/300], Step[1500/1602], Avg Loss: 4.0359, Avg Acc: 0.2096 +INFO:master_logger:Epoch[038/300], Step[1500/1602], Avg Loss: 4.0474, Avg Acc: 0.2079 +INFO:local_logger:Epoch[038/300], Step[1500/1602], Avg Loss: 4.0707, Avg Acc: 0.2065 +INFO:local_logger:Epoch[038/300], Step[1500/1602], Avg Loss: 4.0321, Avg Acc: 0.2096 +INFO:local_logger:Epoch[038/300], Step[1550/1602], Avg Loss: 4.0381, Avg Acc: 0.2087 +INFO:local_logger:Epoch[038/300], Step[1550/1602], Avg Loss: 4.0554, Avg Acc: 0.2054 +INFO:local_logger:Epoch[038/300], Step[1550/1602], Avg Loss: 4.0335, Avg Acc: 0.2086 +INFO:master_logger:Epoch[038/300], Step[1550/1602], Avg Loss: 4.0495, Avg Acc: 0.2075 +INFO:local_logger:Epoch[038/300], Step[1550/1602], Avg Loss: 4.0712, Avg Acc: 0.2073 +INFO:local_logger:Epoch[038/300], Step[1600/1602], Avg Loss: 4.0577, Avg Acc: 0.2056 +INFO:local_logger:Epoch[038/300], Step[1600/1602], Avg Loss: 4.0350, Avg Acc: 0.2083 +INFO:master_logger:Epoch[038/300], Step[1600/1602], Avg Loss: 4.0497, Avg Acc: 0.2073 +INFO:local_logger:Epoch[038/300], Step[1600/1602], Avg Loss: 4.0373, Avg Acc: 0.2085 +INFO:local_logger:Epoch[038/300], Step[1600/1602], Avg Loss: 4.0686, Avg Acc: 0.2070 +INFO:local_logger:----- Epoch[038/300], Train Loss: 4.0578, Train Acc: 0.2056, time: 3707.66 +INFO:master_logger:----- Epoch[038/300], Train Loss: 4.0497, Train Acc: 0.2073, time: 3707.66 +INFO:local_logger:----- Epoch[038/300], Train Loss: 4.0372, Train Acc: 0.2085, time: 3707.92 +INFO:local_logger:Now training epoch 39. LR=0.000378 +INFO:local_logger:----- Epoch[038/300], Train Loss: 4.0351, Train Acc: 0.2082, time: 3708.28 +INFO:local_logger:Now training epoch 39. LR=0.000378 +INFO:local_logger:----- Epoch[038/300], Train Loss: 4.0685, Train Acc: 0.2070, time: 3707.93 +INFO:local_logger:Now training epoch 39. LR=0.000378 +INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-38-Loss-4.05782875058272.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-38-Loss-4.05782875058272.pdopt +INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-38-Loss-4.05782875058272-EMA.pdparams +INFO:local_logger:Now training epoch 39. LR=0.000378 +INFO:master_logger:Now training epoch 39. 
LR=0.000378 +INFO:local_logger:Epoch[039/300], Step[0000/1602], Avg Loss: 3.8859, Avg Acc: 0.3300 +INFO:local_logger:Epoch[039/300], Step[0000/1602], Avg Loss: 4.6346, Avg Acc: 0.2000 +INFO:local_logger:Epoch[039/300], Step[0000/1602], Avg Loss: 3.5365, Avg Acc: 0.0050 +INFO:local_logger:Epoch[039/300], Step[0000/1602], Avg Loss: 4.4419, Avg Acc: 0.0100 +INFO:master_logger:Epoch[039/300], Step[0000/1602], Avg Loss: 4.1247, Avg Acc: 0.1363 +INFO:local_logger:Epoch[039/300], Step[0050/1602], Avg Loss: 4.0246, Avg Acc: 0.1950 +INFO:local_logger:Epoch[039/300], Step[0050/1602], Avg Loss: 3.9517, Avg Acc: 0.1983 +INFO:local_logger:Epoch[039/300], Step[0050/1602], Avg Loss: 4.1128, Avg Acc: 0.1500 +INFO:master_logger:Epoch[039/300], Step[0050/1602], Avg Loss: 4.0399, Avg Acc: 0.1859 +INFO:local_logger:Epoch[039/300], Step[0050/1602], Avg Loss: 4.0707, Avg Acc: 0.2002 +INFO:local_logger:Epoch[039/300], Step[0100/1602], Avg Loss: 4.1263, Avg Acc: 0.1769 +INFO:local_logger:Epoch[039/300], Step[0100/1602], Avg Loss: 4.0017, Avg Acc: 0.2002 +INFO:local_logger:Epoch[039/300], Step[0100/1602], Avg Loss: 4.0823, Avg Acc: 0.2020 +INFO:local_logger:Epoch[039/300], Step[0100/1602], Avg Loss: 4.0610, Avg Acc: 0.1811 +INFO:master_logger:Epoch[039/300], Step[0100/1602], Avg Loss: 4.0678, Avg Acc: 0.1901 +INFO:local_logger:Epoch[039/300], Step[0150/1602], Avg Loss: 4.0904, Avg Acc: 0.1911 +INFO:local_logger:Epoch[039/300], Step[0150/1602], Avg Loss: 4.0390, Avg Acc: 0.2129 +INFO:local_logger:Epoch[039/300], Step[0150/1602], Avg Loss: 4.0555, Avg Acc: 0.1972 +INFO:local_logger:Epoch[039/300], Step[0150/1602], Avg Loss: 3.9987, Avg Acc: 0.2098 +INFO:master_logger:Epoch[039/300], Step[0150/1602], Avg Loss: 4.0459, Avg Acc: 0.2027 +INFO:local_logger:Epoch[039/300], Step[0200/1602], Avg Loss: 4.0686, Avg Acc: 0.1917 +INFO:master_logger:Epoch[039/300], Step[0200/1602], Avg Loss: 4.0461, Avg Acc: 0.2045 +INFO:local_logger:Epoch[039/300], Step[0200/1602], Avg Loss: 4.0544, Avg Acc: 0.2112 +INFO:local_logger:Epoch[039/300], Step[0200/1602], Avg Loss: 4.0430, Avg Acc: 0.2062 +INFO:local_logger:Epoch[039/300], Step[0200/1602], Avg Loss: 4.0183, Avg Acc: 0.2089 +INFO:local_logger:Epoch[039/300], Step[0250/1602], Avg Loss: 4.0446, Avg Acc: 0.2142 +INFO:local_logger:Epoch[039/300], Step[0250/1602], Avg Loss: 4.0567, Avg Acc: 0.1978 +INFO:master_logger:Epoch[039/300], Step[0250/1602], Avg Loss: 4.0439, Avg Acc: 0.2087 +INFO:local_logger:Epoch[039/300], Step[0250/1602], Avg Loss: 4.0786, Avg Acc: 0.2039 +INFO:local_logger:Epoch[039/300], Step[0250/1602], Avg Loss: 3.9956, Avg Acc: 0.2188 +INFO:local_logger:Epoch[039/300], Step[0300/1602], Avg Loss: 4.0690, Avg Acc: 0.1963 +INFO:local_logger:Epoch[039/300], Step[0300/1602], Avg Loss: 4.0112, Avg Acc: 0.2144 +INFO:local_logger:Epoch[039/300], Step[0300/1602], Avg Loss: 4.0625, Avg Acc: 0.2090 +INFO:master_logger:Epoch[039/300], Step[0300/1602], Avg Loss: 4.0552, Avg Acc: 0.2072 +INFO:local_logger:Epoch[039/300], Step[0300/1602], Avg Loss: 4.0779, Avg Acc: 0.2090 +INFO:local_logger:Epoch[039/300], Step[0350/1602], Avg Loss: 4.0826, Avg Acc: 0.1953 +INFO:local_logger:Epoch[039/300], Step[0350/1602], Avg Loss: 4.0767, Avg Acc: 0.2099 +INFO:local_logger:Epoch[039/300], Step[0350/1602], Avg Loss: 4.0613, Avg Acc: 0.2087 +INFO:local_logger:Epoch[039/300], Step[0350/1602], Avg Loss: 4.0134, Avg Acc: 0.2134 +INFO:master_logger:Epoch[039/300], Step[0350/1602], Avg Loss: 4.0585, Avg Acc: 0.2068 +INFO:local_logger:Epoch[039/300], Step[0400/1602], Avg Loss: 4.0768, Avg Acc: 0.2108 
+INFO:local_logger:Epoch[039/300], Step[0400/1602], Avg Loss: 4.0613, Avg Acc: 0.1957 +INFO:local_logger:Epoch[039/300], Step[0400/1602], Avg Loss: 4.0613, Avg Acc: 0.2085 +INFO:local_logger:Epoch[039/300], Step[0400/1602], Avg Loss: 4.0200, Avg Acc: 0.2153 +INFO:master_logger:Epoch[039/300], Step[0400/1602], Avg Loss: 4.0549, Avg Acc: 0.2076 +INFO:local_logger:Epoch[039/300], Step[0450/1602], Avg Loss: 4.0655, Avg Acc: 0.1967 +INFO:local_logger:Epoch[039/300], Step[0450/1602], Avg Loss: 4.0665, Avg Acc: 0.2053 +INFO:local_logger:Epoch[039/300], Step[0450/1602], Avg Loss: 4.0764, Avg Acc: 0.2115 +INFO:master_logger:Epoch[039/300], Step[0450/1602], Avg Loss: 4.0595, Avg Acc: 0.2065 +INFO:local_logger:Epoch[039/300], Step[0450/1602], Avg Loss: 4.0297, Avg Acc: 0.2126 +INFO:local_logger:Epoch[039/300], Step[0500/1602], Avg Loss: 4.0669, Avg Acc: 0.1973 +INFO:local_logger:Epoch[039/300], Step[0500/1602], Avg Loss: 4.0674, Avg Acc: 0.2048 +INFO:local_logger:Epoch[039/300], Step[0500/1602], Avg Loss: 4.0231, Avg Acc: 0.2143 +INFO:master_logger:Epoch[039/300], Step[0500/1602], Avg Loss: 4.0556, Avg Acc: 0.2078 +INFO:local_logger:Epoch[039/300], Step[0500/1602], Avg Loss: 4.0649, Avg Acc: 0.2148 +INFO:local_logger:Epoch[039/300], Step[0550/1602], Avg Loss: 4.0693, Avg Acc: 0.1951 +INFO:local_logger:Epoch[039/300], Step[0550/1602], Avg Loss: 4.0719, Avg Acc: 0.2036 +INFO:local_logger:Epoch[039/300], Step[0550/1602], Avg Loss: 4.0312, Avg Acc: 0.2124 +INFO:master_logger:Epoch[039/300], Step[0550/1602], Avg Loss: 4.0578, Avg Acc: 0.2062 +INFO:local_logger:Epoch[039/300], Step[0550/1602], Avg Loss: 4.0588, Avg Acc: 0.2137 +INFO:local_logger:Epoch[039/300], Step[0600/1602], Avg Loss: 4.0695, Avg Acc: 0.2059 +INFO:local_logger:Epoch[039/300], Step[0600/1602], Avg Loss: 4.0526, Avg Acc: 0.1972 +INFO:master_logger:Epoch[039/300], Step[0600/1602], Avg Loss: 4.0522, Avg Acc: 0.2075 +INFO:local_logger:Epoch[039/300], Step[0600/1602], Avg Loss: 4.0369, Avg Acc: 0.2155 +INFO:local_logger:Epoch[039/300], Step[0600/1602], Avg Loss: 4.0500, Avg Acc: 0.2116 +INFO:local_logger:Epoch[039/300], Step[0650/1602], Avg Loss: 4.0508, Avg Acc: 0.1996 +INFO:local_logger:Epoch[039/300], Step[0650/1602], Avg Loss: 4.0703, Avg Acc: 0.2067 +INFO:local_logger:Epoch[039/300], Step[0650/1602], Avg Loss: 4.0494, Avg Acc: 0.2094 +INFO:local_logger:Epoch[039/300], Step[0650/1602], Avg Loss: 4.0388, Avg Acc: 0.2131 +INFO:master_logger:Epoch[039/300], Step[0650/1602], Avg Loss: 4.0523, Avg Acc: 0.2072 +INFO:local_logger:Epoch[039/300], Step[0700/1602], Avg Loss: 4.0461, Avg Acc: 0.1971 +INFO:local_logger:Epoch[039/300], Step[0700/1602], Avg Loss: 4.0468, Avg Acc: 0.2090 +INFO:local_logger:Epoch[039/300], Step[0700/1602], Avg Loss: 4.0715, Avg Acc: 0.2038 +INFO:local_logger:Epoch[039/300], Step[0700/1602], Avg Loss: 4.0298, Avg Acc: 0.2129 +INFO:master_logger:Epoch[039/300], Step[0700/1602], Avg Loss: 4.0486, Avg Acc: 0.2057 +INFO:local_logger:Epoch[039/300], Step[0750/1602], Avg Loss: 4.0423, Avg Acc: 0.1972 +INFO:local_logger:Epoch[039/300], Step[0750/1602], Avg Loss: 4.0703, Avg Acc: 0.2044 +INFO:local_logger:Epoch[039/300], Step[0750/1602], Avg Loss: 4.0347, Avg Acc: 0.2139 +INFO:master_logger:Epoch[039/300], Step[0750/1602], Avg Loss: 4.0480, Avg Acc: 0.2060 +INFO:local_logger:Epoch[039/300], Step[0750/1602], Avg Loss: 4.0448, Avg Acc: 0.2087 +INFO:local_logger:Epoch[039/300], Step[0800/1602], Avg Loss: 4.0492, Avg Acc: 0.2082 +INFO:local_logger:Epoch[039/300], Step[0800/1602], Avg Loss: 4.0706, Avg Acc: 0.2041 
+INFO:local_logger:Epoch[039/300], Step[0800/1602], Avg Loss: 4.0425, Avg Acc: 0.1983 +INFO:local_logger:Epoch[039/300], Step[0800/1602], Avg Loss: 4.0390, Avg Acc: 0.2126 +INFO:master_logger:Epoch[039/300], Step[0800/1602], Avg Loss: 4.0503, Avg Acc: 0.2058 +INFO:local_logger:Epoch[039/300], Step[0850/1602], Avg Loss: 4.0508, Avg Acc: 0.2079 +INFO:local_logger:Epoch[039/300], Step[0850/1602], Avg Loss: 4.0441, Avg Acc: 0.1997 +INFO:local_logger:Epoch[039/300], Step[0850/1602], Avg Loss: 4.0332, Avg Acc: 0.2130 +INFO:local_logger:Epoch[039/300], Step[0850/1602], Avg Loss: 4.0717, Avg Acc: 0.2035 +INFO:master_logger:Epoch[039/300], Step[0850/1602], Avg Loss: 4.0499, Avg Acc: 0.2060 +INFO:local_logger:Epoch[039/300], Step[0900/1602], Avg Loss: 4.0364, Avg Acc: 0.2132 +INFO:local_logger:Epoch[039/300], Step[0900/1602], Avg Loss: 4.0467, Avg Acc: 0.2017 +INFO:local_logger:Epoch[039/300], Step[0900/1602], Avg Loss: 4.0641, Avg Acc: 0.2040 +INFO:local_logger:Epoch[039/300], Step[0900/1602], Avg Loss: 4.0462, Avg Acc: 0.2085 +INFO:master_logger:Epoch[039/300], Step[0900/1602], Avg Loss: 4.0484, Avg Acc: 0.2068 +INFO:local_logger:Epoch[039/300], Step[0950/1602], Avg Loss: 4.0478, Avg Acc: 0.2012 +INFO:local_logger:Epoch[039/300], Step[0950/1602], Avg Loss: 4.0390, Avg Acc: 0.2124 +INFO:local_logger:Epoch[039/300], Step[0950/1602], Avg Loss: 4.0470, Avg Acc: 0.2065 +INFO:master_logger:Epoch[039/300], Step[0950/1602], Avg Loss: 4.0488, Avg Acc: 0.2059 +INFO:local_logger:Epoch[039/300], Step[0950/1602], Avg Loss: 4.0613, Avg Acc: 0.2034 +INFO:local_logger:Epoch[039/300], Step[1000/1602], Avg Loss: 4.0497, Avg Acc: 0.2013 +INFO:local_logger:Epoch[039/300], Step[1000/1602], Avg Loss: 4.0442, Avg Acc: 0.2113 +INFO:local_logger:Epoch[039/300], Step[1000/1602], Avg Loss: 4.0480, Avg Acc: 0.2073 +INFO:local_logger:Epoch[039/300], Step[1000/1602], Avg Loss: 4.0593, Avg Acc: 0.2038 +INFO:master_logger:Epoch[039/300], Step[1000/1602], Avg Loss: 4.0503, Avg Acc: 0.2059 +INFO:local_logger:Epoch[039/300], Step[1050/1602], Avg Loss: 4.0474, Avg Acc: 0.1999 +INFO:local_logger:Epoch[039/300], Step[1050/1602], Avg Loss: 4.0451, Avg Acc: 0.2107 +INFO:local_logger:Epoch[039/300], Step[1050/1602], Avg Loss: 4.0492, Avg Acc: 0.2072 +INFO:master_logger:Epoch[039/300], Step[1050/1602], Avg Loss: 4.0503, Avg Acc: 0.2053 +INFO:local_logger:Epoch[039/300], Step[1050/1602], Avg Loss: 4.0595, Avg Acc: 0.2033 +INFO:local_logger:Epoch[039/300], Step[1100/1602], Avg Loss: 4.0573, Avg Acc: 0.2032 +INFO:local_logger:Epoch[039/300], Step[1100/1602], Avg Loss: 4.0473, Avg Acc: 0.1998 +INFO:local_logger:Epoch[039/300], Step[1100/1602], Avg Loss: 4.0436, Avg Acc: 0.2063 +INFO:master_logger:Epoch[039/300], Step[1100/1602], Avg Loss: 4.0488, Avg Acc: 0.2050 +INFO:local_logger:Epoch[039/300], Step[1100/1602], Avg Loss: 4.0467, Avg Acc: 0.2108 +INFO:local_logger:Epoch[039/300], Step[1150/1602], Avg Loss: 4.0476, Avg Acc: 0.1991 +INFO:local_logger:Epoch[039/300], Step[1150/1602], Avg Loss: 4.0387, Avg Acc: 0.2115 +INFO:local_logger:Epoch[039/300], Step[1150/1602], Avg Loss: 4.0453, Avg Acc: 0.2069 +INFO:master_logger:Epoch[039/300], Step[1150/1602], Avg Loss: 4.0485, Avg Acc: 0.2052 +INFO:local_logger:Epoch[039/300], Step[1150/1602], Avg Loss: 4.0625, Avg Acc: 0.2032 +INFO:local_logger:Epoch[039/300], Step[1200/1602], Avg Loss: 4.0466, Avg Acc: 0.1999 +INFO:master_logger:Epoch[039/300], Step[1200/1602], Avg Loss: 4.0484, Avg Acc: 0.2053 +INFO:local_logger:Epoch[039/300], Step[1200/1602], Avg Loss: 4.0382, Avg Acc: 0.2119 
+INFO:local_logger:Epoch[039/300], Step[1200/1602], Avg Loss: 4.0460, Avg Acc: 0.2067 +INFO:local_logger:Epoch[039/300], Step[1200/1602], Avg Loss: 4.0629, Avg Acc: 0.2027 +INFO:local_logger:Epoch[039/300], Step[1250/1602], Avg Loss: 4.0467, Avg Acc: 0.2002 +INFO:master_logger:Epoch[039/300], Step[1250/1602], Avg Loss: 4.0467, Avg Acc: 0.2050 +INFO:local_logger:Epoch[039/300], Step[1250/1602], Avg Loss: 4.0577, Avg Acc: 0.2024 +INFO:local_logger:Epoch[039/300], Step[1250/1602], Avg Loss: 4.0450, Avg Acc: 0.2050 +INFO:local_logger:Epoch[039/300], Step[1250/1602], Avg Loss: 4.0375, Avg Acc: 0.2125 +INFO:local_logger:Epoch[039/300], Step[1300/1602], Avg Loss: 4.0451, Avg Acc: 0.2058 +INFO:local_logger:Epoch[039/300], Step[1300/1602], Avg Loss: 4.0441, Avg Acc: 0.2022 +INFO:local_logger:Epoch[039/300], Step[1300/1602], Avg Loss: 4.0531, Avg Acc: 0.2037 +INFO:local_logger:Epoch[039/300], Step[1300/1602], Avg Loss: 4.0360, Avg Acc: 0.2127 +INFO:master_logger:Epoch[039/300], Step[1300/1602], Avg Loss: 4.0446, Avg Acc: 0.2061 +INFO:local_logger:Epoch[039/300], Step[1350/1602], Avg Loss: 4.0367, Avg Acc: 0.2030 +INFO:local_logger:Epoch[039/300], Step[1350/1602], Avg Loss: 4.0415, Avg Acc: 0.2127 +INFO:local_logger:Epoch[039/300], Step[1350/1602], Avg Loss: 4.0464, Avg Acc: 0.2057 +INFO:master_logger:Epoch[039/300], Step[1350/1602], Avg Loss: 4.0453, Avg Acc: 0.2063 +INFO:local_logger:Epoch[039/300], Step[1350/1602], Avg Loss: 4.0567, Avg Acc: 0.2039 +INFO:local_logger:Epoch[039/300], Step[1400/1602], Avg Loss: 4.0383, Avg Acc: 0.2041 +INFO:local_logger:Epoch[039/300], Step[1400/1602], Avg Loss: 4.0394, Avg Acc: 0.2137 +INFO:local_logger:Epoch[039/300], Step[1400/1602], Avg Loss: 4.0433, Avg Acc: 0.2056 +INFO:local_logger:Epoch[039/300], Step[1400/1602], Avg Loss: 4.0580, Avg Acc: 0.2044 +INFO:master_logger:Epoch[039/300], Step[1400/1602], Avg Loss: 4.0447, Avg Acc: 0.2069 +INFO:local_logger:Epoch[039/300], Step[1450/1602], Avg Loss: 4.0383, Avg Acc: 0.2066 +INFO:local_logger:Epoch[039/300], Step[1450/1602], Avg Loss: 4.0437, Avg Acc: 0.2135 +INFO:local_logger:Epoch[039/300], Step[1450/1602], Avg Loss: 4.0346, Avg Acc: 0.2050 +INFO:local_logger:Epoch[039/300], Step[1450/1602], Avg Loss: 4.0582, Avg Acc: 0.2038 +INFO:master_logger:Epoch[039/300], Step[1450/1602], Avg Loss: 4.0437, Avg Acc: 0.2072 +INFO:local_logger:Epoch[039/300], Step[1500/1602], Avg Loss: 4.0306, Avg Acc: 0.2062 +INFO:local_logger:Epoch[039/300], Step[1500/1602], Avg Loss: 4.0471, Avg Acc: 0.2133 +INFO:local_logger:Epoch[039/300], Step[1500/1602], Avg Loss: 4.0563, Avg Acc: 0.2052 +INFO:master_logger:Epoch[039/300], Step[1500/1602], Avg Loss: 4.0421, Avg Acc: 0.2080 +INFO:local_logger:Epoch[039/300], Step[1500/1602], Avg Loss: 4.0345, Avg Acc: 0.2075 +INFO:local_logger:Epoch[039/300], Step[1550/1602], Avg Loss: 4.0267, Avg Acc: 0.2063 +INFO:local_logger:Epoch[039/300], Step[1550/1602], Avg Loss: 4.0436, Avg Acc: 0.2138 +INFO:local_logger:Epoch[039/300], Step[1550/1602], Avg Loss: 4.0567, Avg Acc: 0.2049 +INFO:local_logger:Epoch[039/300], Step[1550/1602], Avg Loss: 4.0284, Avg Acc: 0.2090 +INFO:master_logger:Epoch[039/300], Step[1550/1602], Avg Loss: 4.0389, Avg Acc: 0.2085 +INFO:local_logger:Epoch[039/300], Step[1600/1602], Avg Loss: 4.0437, Avg Acc: 0.2132 +INFO:local_logger:Epoch[039/300], Step[1600/1602], Avg Loss: 4.0575, Avg Acc: 0.2047 +INFO:local_logger:Epoch[039/300], Step[1600/1602], Avg Loss: 4.0263, Avg Acc: 0.2060 +INFO:local_logger:Epoch[039/300], Step[1600/1602], Avg Loss: 4.0268, Avg Acc: 0.2095 
+INFO:master_logger:Epoch[039/300], Step[1600/1602], Avg Loss: 4.0386, Avg Acc: 0.2083 +INFO:local_logger:----- Epoch[039/300], Train Loss: 4.0265, Train Acc: 0.2059, time: 3702.99 +INFO:master_logger:----- Epoch[039/300], Train Loss: 4.0385, Train Acc: 0.2083, time: 3702.99 +INFO:local_logger:----- Epoch[039/300], Train Loss: 4.0267, Train Acc: 0.2095, time: 3703.24 +INFO:local_logger:Now training epoch 40. LR=0.000377 +INFO:local_logger:----- Epoch[039/300], Train Loss: 4.0574, Train Acc: 0.2047, time: 3703.24 +INFO:local_logger:----- Epoch[039/300], Train Loss: 4.0434, Train Acc: 0.2131, time: 3703.24 +INFO:local_logger:Now training epoch 40. LR=0.000377 +INFO:local_logger:Now training epoch 40. LR=0.000377 +INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-39-Loss-4.026458652028463.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-39-Loss-4.026458652028463.pdopt +INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-39-Loss-4.026458652028463-EMA.pdparams +INFO:local_logger:Now training epoch 40. LR=0.000377 +INFO:master_logger:Now training epoch 40. LR=0.000377 +INFO:local_logger:Epoch[040/300], Step[0000/1602], Avg Loss: 4.6628, Avg Acc: 0.1200 +INFO:local_logger:Epoch[040/300], Step[0000/1602], Avg Loss: 3.9394, Avg Acc: 0.3300 +INFO:local_logger:Epoch[040/300], Step[0000/1602], Avg Loss: 3.5539, Avg Acc: 0.0150 +INFO:local_logger:Epoch[040/300], Step[0000/1602], Avg Loss: 3.1963, Avg Acc: 0.4100 +INFO:master_logger:Epoch[040/300], Step[0000/1602], Avg Loss: 3.8381, Avg Acc: 0.2188 +INFO:local_logger:Epoch[040/300], Step[0050/1602], Avg Loss: 3.9922, Avg Acc: 0.2561 +INFO:local_logger:Epoch[040/300], Step[0050/1602], Avg Loss: 4.0659, Avg Acc: 0.2063 +INFO:local_logger:Epoch[040/300], Step[0050/1602], Avg Loss: 4.0158, Avg Acc: 0.2211 +INFO:local_logger:Epoch[040/300], Step[0050/1602], Avg Loss: 4.0163, Avg Acc: 0.2129 +INFO:master_logger:Epoch[040/300], Step[0050/1602], Avg Loss: 4.0225, Avg Acc: 0.2241 +INFO:local_logger:Epoch[040/300], Step[0100/1602], Avg Loss: 4.0276, Avg Acc: 0.2361 +INFO:local_logger:Epoch[040/300], Step[0100/1602], Avg Loss: 4.0193, Avg Acc: 0.2323 +INFO:local_logger:Epoch[040/300], Step[0100/1602], Avg Loss: 4.0748, Avg Acc: 0.2068 +INFO:local_logger:Epoch[040/300], Step[0100/1602], Avg Loss: 4.0107, Avg Acc: 0.2251 +INFO:master_logger:Epoch[040/300], Step[0100/1602], Avg Loss: 4.0331, Avg Acc: 0.2251 +INFO:local_logger:Epoch[040/300], Step[0150/1602], Avg Loss: 4.0358, Avg Acc: 0.2377 +INFO:local_logger:Epoch[040/300], Step[0150/1602], Avg Loss: 3.9941, Avg Acc: 0.2313 +INFO:local_logger:Epoch[040/300], Step[0150/1602], Avg Loss: 4.0543, Avg Acc: 0.2066 +INFO:master_logger:Epoch[040/300], Step[0150/1602], Avg Loss: 4.0305, Avg Acc: 0.2238 +INFO:local_logger:Epoch[040/300], Step[0150/1602], Avg Loss: 4.0379, Avg Acc: 0.2195 +INFO:local_logger:Epoch[040/300], Step[0200/1602], Avg Loss: 4.0657, Avg Acc: 0.2249 +INFO:local_logger:Epoch[040/300], Step[0200/1602], Avg Loss: 4.0476, Avg Acc: 0.2169 +INFO:local_logger:Epoch[040/300], Step[0200/1602], Avg Loss: 4.0761, Avg Acc: 0.2027 +INFO:master_logger:Epoch[040/300], Step[0200/1602], Avg Loss: 4.0524, Avg Acc: 0.2170 +INFO:local_logger:Epoch[040/300], Step[0200/1602], Avg Loss: 4.0200, Avg Acc: 0.2234 +INFO:local_logger:Epoch[040/300], Step[0250/1602], Avg Loss: 4.0659, Avg Acc: 0.2219 +INFO:master_logger:Epoch[040/300], Step[0250/1602], Avg Loss: 4.0389, Avg Acc: 0.2148 +INFO:local_logger:Epoch[040/300], 
Step[0250/1602], Avg Loss: 4.0296, Avg Acc: 0.2080 +INFO:local_logger:Epoch[040/300], Step[0250/1602], Avg Loss: 4.0113, Avg Acc: 0.2229 +INFO:local_logger:Epoch[040/300], Step[0250/1602], Avg Loss: 4.0487, Avg Acc: 0.2064 +INFO:local_logger:Epoch[040/300], Step[0300/1602], Avg Loss: 4.0262, Avg Acc: 0.2063 +INFO:local_logger:Epoch[040/300], Step[0300/1602], Avg Loss: 4.0528, Avg Acc: 0.2186 +INFO:local_logger:Epoch[040/300], Step[0300/1602], Avg Loss: 3.9996, Avg Acc: 0.2246 +INFO:local_logger:Epoch[040/300], Step[0300/1602], Avg Loss: 4.0290, Avg Acc: 0.2116 +INFO:master_logger:Epoch[040/300], Step[0300/1602], Avg Loss: 4.0269, Avg Acc: 0.2153 +INFO:local_logger:Epoch[040/300], Step[0350/1602], Avg Loss: 3.9938, Avg Acc: 0.2223 +INFO:local_logger:Epoch[040/300], Step[0350/1602], Avg Loss: 4.0324, Avg Acc: 0.2169 +INFO:local_logger:Epoch[040/300], Step[0350/1602], Avg Loss: 4.0098, Avg Acc: 0.2124 +INFO:local_logger:Epoch[040/300], Step[0350/1602], Avg Loss: 4.0387, Avg Acc: 0.2123 +INFO:master_logger:Epoch[040/300], Step[0350/1602], Avg Loss: 4.0187, Avg Acc: 0.2160 +INFO:local_logger:Epoch[040/300], Step[0400/1602], Avg Loss: 3.9873, Avg Acc: 0.2237 +INFO:local_logger:Epoch[040/300], Step[0400/1602], Avg Loss: 4.0485, Avg Acc: 0.2076 +INFO:local_logger:Epoch[040/300], Step[0400/1602], Avg Loss: 4.0320, Avg Acc: 0.2169 +INFO:local_logger:Epoch[040/300], Step[0400/1602], Avg Loss: 4.0104, Avg Acc: 0.2092 +INFO:master_logger:Epoch[040/300], Step[0400/1602], Avg Loss: 4.0195, Avg Acc: 0.2143 +INFO:local_logger:Epoch[040/300], Step[0450/1602], Avg Loss: 4.0239, Avg Acc: 0.2190 +INFO:local_logger:Epoch[040/300], Step[0450/1602], Avg Loss: 4.0038, Avg Acc: 0.2129 +INFO:local_logger:Epoch[040/300], Step[0450/1602], Avg Loss: 4.0545, Avg Acc: 0.2107 +INFO:local_logger:Epoch[040/300], Step[0450/1602], Avg Loss: 3.9864, Avg Acc: 0.2200 +INFO:master_logger:Epoch[040/300], Step[0450/1602], Avg Loss: 4.0172, Avg Acc: 0.2157 +INFO:local_logger:Epoch[040/300], Step[0500/1602], Avg Loss: 4.0234, Avg Acc: 0.2180 +INFO:local_logger:Epoch[040/300], Step[0500/1602], Avg Loss: 4.0045, Avg Acc: 0.2159 +INFO:local_logger:Epoch[040/300], Step[0500/1602], Avg Loss: 3.9846, Avg Acc: 0.2199 +INFO:local_logger:Epoch[040/300], Step[0500/1602], Avg Loss: 4.0482, Avg Acc: 0.2095 +INFO:master_logger:Epoch[040/300], Step[0500/1602], Avg Loss: 4.0152, Avg Acc: 0.2158 +INFO:local_logger:Epoch[040/300], Step[0550/1602], Avg Loss: 4.0044, Avg Acc: 0.2183 +INFO:local_logger:Epoch[040/300], Step[0550/1602], Avg Loss: 4.0188, Avg Acc: 0.2197 +INFO:local_logger:Epoch[040/300], Step[0550/1602], Avg Loss: 3.9826, Avg Acc: 0.2185 +INFO:master_logger:Epoch[040/300], Step[0550/1602], Avg Loss: 4.0142, Avg Acc: 0.2161 +INFO:local_logger:Epoch[040/300], Step[0550/1602], Avg Loss: 4.0512, Avg Acc: 0.2078 +INFO:local_logger:Epoch[040/300], Step[0600/1602], Avg Loss: 4.0117, Avg Acc: 0.2175 +INFO:master_logger:Epoch[040/300], Step[0600/1602], Avg Loss: 4.0158, Avg Acc: 0.2146 +INFO:local_logger:Epoch[040/300], Step[0600/1602], Avg Loss: 4.0539, Avg Acc: 0.2076 +INFO:local_logger:Epoch[040/300], Step[0600/1602], Avg Loss: 3.9904, Avg Acc: 0.2160 +INFO:local_logger:Epoch[040/300], Step[0600/1602], Avg Loss: 4.0070, Avg Acc: 0.2172 +INFO:local_logger:Epoch[040/300], Step[0650/1602], Avg Loss: 3.9950, Avg Acc: 0.2139 +INFO:local_logger:Epoch[040/300], Step[0650/1602], Avg Loss: 4.0082, Avg Acc: 0.2190 +INFO:local_logger:Epoch[040/300], Step[0650/1602], Avg Loss: 4.0049, Avg Acc: 0.2181 +INFO:local_logger:Epoch[040/300], Step[0650/1602], Avg 
Loss: 4.0584, Avg Acc: 0.2064 +INFO:master_logger:Epoch[040/300], Step[0650/1602], Avg Loss: 4.0166, Avg Acc: 0.2143 +INFO:local_logger:Epoch[040/300], Step[0700/1602], Avg Loss: 4.0094, Avg Acc: 0.2185 +INFO:master_logger:Epoch[040/300], Step[0700/1602], Avg Loss: 4.0191, Avg Acc: 0.2141 +INFO:local_logger:Epoch[040/300], Step[0700/1602], Avg Loss: 4.0067, Avg Acc: 0.2180 +INFO:local_logger:Epoch[040/300], Step[0700/1602], Avg Loss: 4.0026, Avg Acc: 0.2130 +INFO:local_logger:Epoch[040/300], Step[0700/1602], Avg Loss: 4.0575, Avg Acc: 0.2069 +INFO:local_logger:Epoch[040/300], Step[0750/1602], Avg Loss: 4.0006, Avg Acc: 0.2129 +INFO:local_logger:Epoch[040/300], Step[0750/1602], Avg Loss: 4.0521, Avg Acc: 0.2067 +INFO:local_logger:Epoch[040/300], Step[0750/1602], Avg Loss: 4.0125, Avg Acc: 0.2175 +INFO:local_logger:Epoch[040/300], Step[0750/1602], Avg Loss: 4.0020, Avg Acc: 0.2202 +INFO:master_logger:Epoch[040/300], Step[0750/1602], Avg Loss: 4.0168, Avg Acc: 0.2143 +INFO:local_logger:Epoch[040/300], Step[0800/1602], Avg Loss: 4.0080, Avg Acc: 0.2183 +INFO:local_logger:Epoch[040/300], Step[0800/1602], Avg Loss: 3.9967, Avg Acc: 0.2213 +INFO:local_logger:Epoch[040/300], Step[0800/1602], Avg Loss: 4.0444, Avg Acc: 0.2085 +INFO:local_logger:Epoch[040/300], Step[0800/1602], Avg Loss: 4.0018, Avg Acc: 0.2139 +INFO:master_logger:Epoch[040/300], Step[0800/1602], Avg Loss: 4.0127, Avg Acc: 0.2155 +INFO:local_logger:Epoch[040/300], Step[0850/1602], Avg Loss: 4.0065, Avg Acc: 0.2183 +INFO:local_logger:Epoch[040/300], Step[0850/1602], Avg Loss: 4.0066, Avg Acc: 0.2185 +INFO:local_logger:Epoch[040/300], Step[0850/1602], Avg Loss: 4.0403, Avg Acc: 0.2072 +INFO:local_logger:Epoch[040/300], Step[0850/1602], Avg Loss: 4.0030, Avg Acc: 0.2146 +INFO:master_logger:Epoch[040/300], Step[0850/1602], Avg Loss: 4.0141, Avg Acc: 0.2146 +INFO:local_logger:Epoch[040/300], Step[0900/1602], Avg Loss: 4.0086, Avg Acc: 0.2176 +INFO:local_logger:Epoch[040/300], Step[0900/1602], Avg Loss: 4.0037, Avg Acc: 0.2161 +INFO:local_logger:Epoch[040/300], Step[0900/1602], Avg Loss: 4.0359, Avg Acc: 0.2080 +INFO:local_logger:Epoch[040/300], Step[0900/1602], Avg Loss: 4.0026, Avg Acc: 0.2141 +INFO:master_logger:Epoch[040/300], Step[0900/1602], Avg Loss: 4.0127, Avg Acc: 0.2139 +INFO:local_logger:Epoch[040/300], Step[0950/1602], Avg Loss: 4.0093, Avg Acc: 0.2189 +INFO:local_logger:Epoch[040/300], Step[0950/1602], Avg Loss: 4.0032, Avg Acc: 0.2175 +INFO:local_logger:Epoch[040/300], Step[0950/1602], Avg Loss: 4.0028, Avg Acc: 0.2151 +INFO:local_logger:Epoch[040/300], Step[0950/1602], Avg Loss: 4.0364, Avg Acc: 0.2080 +INFO:master_logger:Epoch[040/300], Step[0950/1602], Avg Loss: 4.0129, Avg Acc: 0.2149 +INFO:local_logger:Epoch[040/300], Step[1000/1602], Avg Loss: 4.0125, Avg Acc: 0.2192 +INFO:local_logger:Epoch[040/300], Step[1000/1602], Avg Loss: 4.0381, Avg Acc: 0.2082 +INFO:local_logger:Epoch[040/300], Step[1000/1602], Avg Loss: 4.0007, Avg Acc: 0.2184 +INFO:local_logger:Epoch[040/300], Step[1000/1602], Avg Loss: 4.0108, Avg Acc: 0.2140 +INFO:master_logger:Epoch[040/300], Step[1000/1602], Avg Loss: 4.0155, Avg Acc: 0.2150 +INFO:local_logger:Epoch[040/300], Step[1050/1602], Avg Loss: 4.0192, Avg Acc: 0.2178 +INFO:local_logger:Epoch[040/300], Step[1050/1602], Avg Loss: 4.0127, Avg Acc: 0.2140 +INFO:local_logger:Epoch[040/300], Step[1050/1602], Avg Loss: 4.0050, Avg Acc: 0.2185 +INFO:local_logger:Epoch[040/300], Step[1050/1602], Avg Loss: 4.0391, Avg Acc: 0.2081 +INFO:master_logger:Epoch[040/300], Step[1050/1602], Avg Loss: 4.0190, Avg 
Acc: 0.2146 +INFO:local_logger:Epoch[040/300], Step[1100/1602], Avg Loss: 4.0122, Avg Acc: 0.2172 +INFO:local_logger:Epoch[040/300], Step[1100/1602], Avg Loss: 4.0177, Avg Acc: 0.2163 +INFO:local_logger:Epoch[040/300], Step[1100/1602], Avg Loss: 4.0351, Avg Acc: 0.2087 +INFO:local_logger:Epoch[040/300], Step[1100/1602], Avg Loss: 4.0110, Avg Acc: 0.2136 +INFO:master_logger:Epoch[040/300], Step[1100/1602], Avg Loss: 4.0190, Avg Acc: 0.2140 +INFO:local_logger:Epoch[040/300], Step[1150/1602], Avg Loss: 4.0203, Avg Acc: 0.2164 +INFO:local_logger:Epoch[040/300], Step[1150/1602], Avg Loss: 4.0179, Avg Acc: 0.2126 +INFO:local_logger:Epoch[040/300], Step[1150/1602], Avg Loss: 4.0343, Avg Acc: 0.2095 +INFO:local_logger:Epoch[040/300], Step[1150/1602], Avg Loss: 4.0158, Avg Acc: 0.2169 +INFO:master_logger:Epoch[040/300], Step[1150/1602], Avg Loss: 4.0221, Avg Acc: 0.2139 +INFO:local_logger:Epoch[040/300], Step[1200/1602], Avg Loss: 4.0126, Avg Acc: 0.2176 +INFO:local_logger:Epoch[040/300], Step[1200/1602], Avg Loss: 4.0160, Avg Acc: 0.2181 +INFO:local_logger:Epoch[040/300], Step[1200/1602], Avg Loss: 4.0232, Avg Acc: 0.2128 +INFO:master_logger:Epoch[040/300], Step[1200/1602], Avg Loss: 4.0204, Avg Acc: 0.2146 +INFO:local_logger:Epoch[040/300], Step[1200/1602], Avg Loss: 4.0299, Avg Acc: 0.2101 +INFO:local_logger:Epoch[040/300], Step[1250/1602], Avg Loss: 4.0146, Avg Acc: 0.2174 +INFO:local_logger:Epoch[040/300], Step[1250/1602], Avg Loss: 4.0260, Avg Acc: 0.2129 +INFO:local_logger:Epoch[040/300], Step[1250/1602], Avg Loss: 4.0280, Avg Acc: 0.2112 +INFO:local_logger:Epoch[040/300], Step[1250/1602], Avg Loss: 4.0082, Avg Acc: 0.2183 +INFO:master_logger:Epoch[040/300], Step[1250/1602], Avg Loss: 4.0192, Avg Acc: 0.2149 +INFO:local_logger:Epoch[040/300], Step[1300/1602], Avg Loss: 4.0123, Avg Acc: 0.2177 +INFO:local_logger:Epoch[040/300], Step[1300/1602], Avg Loss: 4.0165, Avg Acc: 0.2172 +INFO:local_logger:Epoch[040/300], Step[1300/1602], Avg Loss: 4.0232, Avg Acc: 0.2129 +INFO:local_logger:Epoch[040/300], Step[1300/1602], Avg Loss: 4.0246, Avg Acc: 0.2119 +INFO:master_logger:Epoch[040/300], Step[1300/1602], Avg Loss: 4.0192, Avg Acc: 0.2149 +INFO:local_logger:Epoch[040/300], Step[1350/1602], Avg Loss: 4.0171, Avg Acc: 0.2177 +INFO:local_logger:Epoch[040/300], Step[1350/1602], Avg Loss: 4.0228, Avg Acc: 0.2128 +INFO:local_logger:Epoch[040/300], Step[1350/1602], Avg Loss: 4.0126, Avg Acc: 0.2178 +INFO:master_logger:Epoch[040/300], Step[1350/1602], Avg Loss: 4.0185, Avg Acc: 0.2155 +INFO:local_logger:Epoch[040/300], Step[1350/1602], Avg Loss: 4.0215, Avg Acc: 0.2136 +INFO:local_logger:Epoch[040/300], Step[1400/1602], Avg Loss: 4.0168, Avg Acc: 0.2179 +INFO:local_logger:Epoch[040/300], Step[1400/1602], Avg Loss: 4.0095, Avg Acc: 0.2168 +INFO:local_logger:Epoch[040/300], Step[1400/1602], Avg Loss: 4.0225, Avg Acc: 0.2132 +INFO:master_logger:Epoch[040/300], Step[1400/1602], Avg Loss: 4.0175, Avg Acc: 0.2154 +INFO:local_logger:Epoch[040/300], Step[1400/1602], Avg Loss: 4.0210, Avg Acc: 0.2137 +INFO:local_logger:Epoch[040/300], Step[1450/1602], Avg Loss: 4.0169, Avg Acc: 0.2174 +INFO:local_logger:Epoch[040/300], Step[1450/1602], Avg Loss: 4.0218, Avg Acc: 0.2133 +INFO:local_logger:Epoch[040/300], Step[1450/1602], Avg Loss: 4.0108, Avg Acc: 0.2164 +INFO:local_logger:Epoch[040/300], Step[1450/1602], Avg Loss: 4.0236, Avg Acc: 0.2128 +INFO:master_logger:Epoch[040/300], Step[1450/1602], Avg Loss: 4.0183, Avg Acc: 0.2150 +INFO:local_logger:Epoch[040/300], Step[1500/1602], Avg Loss: 4.0198, Avg Acc: 0.2170 
+INFO:local_logger:Epoch[040/300], Step[1500/1602], Avg Loss: 4.0108, Avg Acc: 0.2170 +INFO:local_logger:Epoch[040/300], Step[1500/1602], Avg Loss: 4.0196, Avg Acc: 0.2130 +INFO:local_logger:Epoch[040/300], Step[1500/1602], Avg Loss: 4.0269, Avg Acc: 0.2117 +INFO:master_logger:Epoch[040/300], Step[1500/1602], Avg Loss: 4.0193, Avg Acc: 0.2147 +INFO:local_logger:Epoch[040/300], Step[1550/1602], Avg Loss: 4.0198, Avg Acc: 0.2170 +INFO:local_logger:Epoch[040/300], Step[1550/1602], Avg Loss: 4.0100, Avg Acc: 0.2168 +INFO:local_logger:Epoch[040/300], Step[1550/1602], Avg Loss: 4.0203, Avg Acc: 0.2124 +INFO:local_logger:Epoch[040/300], Step[1550/1602], Avg Loss: 4.0281, Avg Acc: 0.2123 +INFO:master_logger:Epoch[040/300], Step[1550/1602], Avg Loss: 4.0195, Avg Acc: 0.2146 +INFO:local_logger:Epoch[040/300], Step[1600/1602], Avg Loss: 4.0081, Avg Acc: 0.2164 +INFO:local_logger:Epoch[040/300], Step[1600/1602], Avg Loss: 4.0229, Avg Acc: 0.2121 +INFO:local_logger:Epoch[040/300], Step[1600/1602], Avg Loss: 4.0279, Avg Acc: 0.2132 +INFO:local_logger:Epoch[040/300], Step[1600/1602], Avg Loss: 4.0191, Avg Acc: 0.2173 +INFO:master_logger:Epoch[040/300], Step[1600/1602], Avg Loss: 4.0195, Avg Acc: 0.2148 +INFO:local_logger:----- Epoch[040/300], Train Loss: 4.0192, Train Acc: 0.2173, time: 3712.56 +INFO:master_logger:----- Epoch[040/300], Train Loss: 4.0194, Train Acc: 0.2148, time: 3712.56 +INFO:local_logger:----- Validation after Epoch: 40 +INFO:master_logger:----- Validation after Epoch: 40 +INFO:local_logger:----- Epoch[040/300], Train Loss: 4.0079, Train Acc: 0.2164, time: 3712.84 +INFO:local_logger:----- Validation after Epoch: 40 +INFO:local_logger:----- Epoch[040/300], Train Loss: 4.0278, Train Acc: 0.2132, time: 3712.86 +INFO:local_logger:----- Validation after Epoch: 40 +INFO:local_logger:----- Epoch[040/300], Train Loss: 4.0228, Train Acc: 0.2122, time: 3712.90 +INFO:local_logger:----- Validation after Epoch: 40 +INFO:local_logger:Val Step[0000/1563], Avg Loss: 0.4545, Avg Acc@1: 0.8750, Avg Acc@5: 1.0000 +INFO:local_logger:Val Step[0000/1563], Avg Loss: 1.2969, Avg Acc@1: 0.8750, Avg Acc@5: 0.8750 +INFO:local_logger:Val Step[0000/1563], Avg Loss: 0.0864, Avg Acc@1: 1.0000, Avg Acc@5: 1.0000 +INFO:local_logger:Val Step[0000/1563], Avg Loss: 0.9598, Avg Acc@1: 0.8750, Avg Acc@5: 1.0000 +INFO:master_logger:Val Step[0000/1563], Avg Loss: 0.6994, Avg Acc@1: 0.9062, Avg Acc@5: 0.9688 +INFO:local_logger:Val Step[0050/1563], Avg Loss: 1.1879, Avg Acc@1: 0.7377, Avg Acc@5: 0.8848 +INFO:master_logger:Val Step[0050/1563], Avg Loss: 1.1033, Avg Acc@1: 0.7445, Avg Acc@5: 0.9038 +INFO:local_logger:Val Step[0050/1563], Avg Loss: 1.0049, Avg Acc@1: 0.7549, Avg Acc@5: 0.9240 +INFO:local_logger:Val Step[0050/1563], Avg Loss: 1.1890, Avg Acc@1: 0.7108, Avg Acc@5: 0.8873 +INFO:local_logger:Val Step[0050/1563], Avg Loss: 1.0314, Avg Acc@1: 0.7745, Avg Acc@5: 0.9191 +INFO:local_logger:Val Step[0100/1563], Avg Loss: 1.6063, Avg Acc@1: 0.6287, Avg Acc@5: 0.8267 +INFO:local_logger:Val Step[0100/1563], Avg Loss: 1.5188, Avg Acc@1: 0.6473, Avg Acc@5: 0.8416 +INFO:master_logger:Val Step[0100/1563], Avg Loss: 1.5524, Avg Acc@1: 0.6368, Avg Acc@5: 0.8428 +INFO:local_logger:Val Step[0100/1563], Avg Loss: 1.6065, Avg Acc@1: 0.6200, Avg Acc@5: 0.8416 +INFO:local_logger:Val Step[0100/1563], Avg Loss: 1.4780, Avg Acc@1: 0.6510, Avg Acc@5: 0.8614 +INFO:local_logger:Val Step[0150/1563], Avg Loss: 1.3899, Avg Acc@1: 0.6722, Avg Acc@5: 0.8634 +INFO:local_logger:Val Step[0150/1563], Avg Loss: 1.3968, Avg Acc@1: 0.6697, Avg Acc@5: 
0.8684 +INFO:local_logger:Val Step[0150/1563], Avg Loss: 1.4981, Avg Acc@1: 0.6565, Avg Acc@5: 0.8444 +INFO:local_logger:Val Step[0150/1563], Avg Loss: 1.4698, Avg Acc@1: 0.6490, Avg Acc@5: 0.8626 +INFO:master_logger:Val Step[0150/1563], Avg Loss: 1.4386, Avg Acc@1: 0.6618, Avg Acc@5: 0.8597 +INFO:local_logger:Val Step[0200/1563], Avg Loss: 1.5489, Avg Acc@1: 0.6443, Avg Acc@5: 0.8358 +INFO:local_logger:Val Step[0200/1563], Avg Loss: 1.4461, Avg Acc@1: 0.6692, Avg Acc@5: 0.8570 +INFO:master_logger:Val Step[0200/1563], Avg Loss: 1.4880, Avg Acc@1: 0.6507, Avg Acc@5: 0.8532 +INFO:local_logger:Val Step[0200/1563], Avg Loss: 1.4569, Avg Acc@1: 0.6524, Avg Acc@5: 0.8626 +INFO:local_logger:Val Step[0200/1563], Avg Loss: 1.5000, Avg Acc@1: 0.6368, Avg Acc@5: 0.8576 +INFO:local_logger:Val Step[0250/1563], Avg Loss: 1.4811, Avg Acc@1: 0.6514, Avg Acc@5: 0.8466 +INFO:local_logger:Val Step[0250/1563], Avg Loss: 1.4145, Avg Acc@1: 0.6643, Avg Acc@5: 0.8650 +INFO:master_logger:Val Step[0250/1563], Avg Loss: 1.4244, Avg Acc@1: 0.6614, Avg Acc@5: 0.8618 +INFO:local_logger:Val Step[0250/1563], Avg Loss: 1.4224, Avg Acc@1: 0.6524, Avg Acc@5: 0.8660 +INFO:local_logger:Val Step[0250/1563], Avg Loss: 1.3798, Avg Acc@1: 0.6773, Avg Acc@5: 0.8695 +INFO:local_logger:Val Step[0300/1563], Avg Loss: 1.4788, Avg Acc@1: 0.6507, Avg Acc@5: 0.8576 +INFO:local_logger:Val Step[0300/1563], Avg Loss: 1.5107, Avg Acc@1: 0.6242, Avg Acc@5: 0.8580 +INFO:local_logger:Val Step[0300/1563], Avg Loss: 1.5524, Avg Acc@1: 0.6292, Avg Acc@5: 0.8385 +INFO:master_logger:Val Step[0300/1563], Avg Loss: 1.5111, Avg Acc@1: 0.6352, Avg Acc@5: 0.8525 +INFO:local_logger:Val Step[0300/1563], Avg Loss: 1.5026, Avg Acc@1: 0.6366, Avg Acc@5: 0.8559 +INFO:local_logger:Val Step[0350/1563], Avg Loss: 1.5540, Avg Acc@1: 0.6239, Avg Acc@5: 0.8497 +INFO:local_logger:Val Step[0350/1563], Avg Loss: 1.4794, Avg Acc@1: 0.6392, Avg Acc@5: 0.8608 +INFO:local_logger:Val Step[0350/1563], Avg Loss: 1.5341, Avg Acc@1: 0.6161, Avg Acc@5: 0.8579 +INFO:local_logger:Val Step[0350/1563], Avg Loss: 1.5668, Avg Acc@1: 0.6229, Avg Acc@5: 0.8408 +INFO:master_logger:Val Step[0350/1563], Avg Loss: 1.5336, Avg Acc@1: 0.6255, Avg Acc@5: 0.8523 +INFO:local_logger:Val Step[0400/1563], Avg Loss: 1.4879, Avg Acc@1: 0.6334, Avg Acc@5: 0.8619 +INFO:local_logger:Val Step[0400/1563], Avg Loss: 1.5311, Avg Acc@1: 0.6125, Avg Acc@5: 0.8616 +INFO:local_logger:Val Step[0400/1563], Avg Loss: 1.5616, Avg Acc@1: 0.6169, Avg Acc@5: 0.8526 +INFO:local_logger:Val Step[0400/1563], Avg Loss: 1.5775, Avg Acc@1: 0.6138, Avg Acc@5: 0.8426 +INFO:master_logger:Val Step[0400/1563], Avg Loss: 1.5395, Avg Acc@1: 0.6192, Avg Acc@5: 0.8547 +INFO:local_logger:Val Step[0450/1563], Avg Loss: 1.5367, Avg Acc@1: 0.6106, Avg Acc@5: 0.8628 +INFO:local_logger:Val Step[0450/1563], Avg Loss: 1.5787, Avg Acc@1: 0.6106, Avg Acc@5: 0.8451 +INFO:master_logger:Val Step[0450/1563], Avg Loss: 1.5533, Avg Acc@1: 0.6146, Avg Acc@5: 0.8542 +INFO:local_logger:Val Step[0450/1563], Avg Loss: 1.5809, Avg Acc@1: 0.6159, Avg Acc@5: 0.8498 +INFO:local_logger:Val Step[0450/1563], Avg Loss: 1.5169, Avg Acc@1: 0.6214, Avg Acc@5: 0.8592 +INFO:local_logger:Val Step[0500/1563], Avg Loss: 1.5079, Avg Acc@1: 0.6255, Avg Acc@5: 0.8610 +INFO:local_logger:Val Step[0500/1563], Avg Loss: 1.5310, Avg Acc@1: 0.6133, Avg Acc@5: 0.8625 +INFO:local_logger:Val Step[0500/1563], Avg Loss: 1.5656, Avg Acc@1: 0.6200, Avg Acc@5: 0.8535 +INFO:local_logger:Val Step[0500/1563], Avg Loss: 1.5579, Avg Acc@1: 0.6170, Avg Acc@5: 0.8493 +INFO:master_logger:Val 
Step[0500/1563], Avg Loss: 1.5406, Avg Acc@1: 0.6189, Avg Acc@5: 0.8566 +INFO:local_logger:Val Step[0550/1563], Avg Loss: 1.5304, Avg Acc@1: 0.6248, Avg Acc@5: 0.8525 +INFO:local_logger:Val Step[0550/1563], Avg Loss: 1.5374, Avg Acc@1: 0.6279, Avg Acc@5: 0.8548 +INFO:local_logger:Val Step[0550/1563], Avg Loss: 1.4606, Avg Acc@1: 0.6377, Avg Acc@5: 0.8659 +INFO:local_logger:Val Step[0550/1563], Avg Loss: 1.4990, Avg Acc@1: 0.6223, Avg Acc@5: 0.8666 +INFO:master_logger:Val Step[0550/1563], Avg Loss: 1.5069, Avg Acc@1: 0.6282, Avg Acc@5: 0.8600 +INFO:local_logger:Val Step[0600/1563], Avg Loss: 1.4722, Avg Acc@1: 0.6379, Avg Acc@5: 0.8642 +INFO:local_logger:Val Step[0600/1563], Avg Loss: 1.5349, Avg Acc@1: 0.6277, Avg Acc@5: 0.8502 +INFO:local_logger:Val Step[0600/1563], Avg Loss: 1.5373, Avg Acc@1: 0.6285, Avg Acc@5: 0.8536 +INFO:master_logger:Val Step[0600/1563], Avg Loss: 1.5127, Avg Acc@1: 0.6298, Avg Acc@5: 0.8586 +INFO:local_logger:Val Step[0600/1563], Avg Loss: 1.5065, Avg Acc@1: 0.6250, Avg Acc@5: 0.8665 +INFO:local_logger:Val Step[0650/1563], Avg Loss: 1.4965, Avg Acc@1: 0.6342, Avg Acc@5: 0.8614 +INFO:local_logger:Val Step[0650/1563], Avg Loss: 1.5523, Avg Acc@1: 0.6252, Avg Acc@5: 0.8479 +INFO:master_logger:Val Step[0650/1563], Avg Loss: 1.5377, Avg Acc@1: 0.6262, Avg Acc@5: 0.8546 +INFO:local_logger:Val Step[0650/1563], Avg Loss: 1.5612, Avg Acc@1: 0.6246, Avg Acc@5: 0.8487 +INFO:local_logger:Val Step[0650/1563], Avg Loss: 1.5407, Avg Acc@1: 0.6210, Avg Acc@5: 0.8606 +INFO:local_logger:Val Step[0700/1563], Avg Loss: 1.5931, Avg Acc@1: 0.6102, Avg Acc@5: 0.8516 +INFO:local_logger:Val Step[0700/1563], Avg Loss: 1.5996, Avg Acc@1: 0.6155, Avg Acc@5: 0.8408 +INFO:local_logger:Val Step[0700/1563], Avg Loss: 1.5515, Avg Acc@1: 0.6250, Avg Acc@5: 0.8522 +INFO:master_logger:Val Step[0700/1563], Avg Loss: 1.5883, Avg Acc@1: 0.6165, Avg Acc@5: 0.8468 +INFO:local_logger:Val Step[0700/1563], Avg Loss: 1.6091, Avg Acc@1: 0.6154, Avg Acc@5: 0.8427 +INFO:local_logger:Val Step[0750/1563], Avg Loss: 1.6485, Avg Acc@1: 0.6050, Avg Acc@5: 0.8332 +INFO:local_logger:Val Step[0750/1563], Avg Loss: 1.6172, Avg Acc@1: 0.6125, Avg Acc@5: 0.8410 +INFO:master_logger:Val Step[0750/1563], Avg Loss: 1.6435, Avg Acc@1: 0.6061, Avg Acc@5: 0.8378 +INFO:local_logger:Val Step[0750/1563], Avg Loss: 1.6583, Avg Acc@1: 0.6079, Avg Acc@5: 0.8339 +INFO:local_logger:Val Step[0750/1563], Avg Loss: 1.6498, Avg Acc@1: 0.5989, Avg Acc@5: 0.8430 +INFO:local_logger:Val Step[0800/1563], Avg Loss: 1.6766, Avg Acc@1: 0.6002, Avg Acc@5: 0.8326 +INFO:local_logger:Val Step[0800/1563], Avg Loss: 1.7044, Avg Acc@1: 0.5880, Avg Acc@5: 0.8357 +INFO:local_logger:Val Step[0800/1563], Avg Loss: 1.7150, Avg Acc@1: 0.5964, Avg Acc@5: 0.8258 +INFO:local_logger:Val Step[0800/1563], Avg Loss: 1.7139, Avg Acc@1: 0.5921, Avg Acc@5: 0.8224 +INFO:master_logger:Val Step[0800/1563], Avg Loss: 1.7025, Avg Acc@1: 0.5942, Avg Acc@5: 0.8291 +INFO:local_logger:Val Step[0850/1563], Avg Loss: 1.7410, Avg Acc@1: 0.5887, Avg Acc@5: 0.8186 +INFO:local_logger:Val Step[0850/1563], Avg Loss: 1.7153, Avg Acc@1: 0.5940, Avg Acc@5: 0.8278 +INFO:local_logger:Val Step[0850/1563], Avg Loss: 1.7607, Avg Acc@1: 0.5868, Avg Acc@5: 0.8179 +INFO:master_logger:Val Step[0850/1563], Avg Loss: 1.7381, Avg Acc@1: 0.5884, Avg Acc@5: 0.8236 +INFO:local_logger:Val Step[0850/1563], Avg Loss: 1.7354, Avg Acc@1: 0.5840, Avg Acc@5: 0.8299 +INFO:local_logger:Val Step[0900/1563], Avg Loss: 1.7458, Avg Acc@1: 0.5900, Avg Acc@5: 0.8173 +INFO:master_logger:Val Step[0900/1563], Avg Loss: 
1.7447, Avg Acc@1: 0.5887, Avg Acc@5: 0.8221 +INFO:local_logger:Val Step[0900/1563], Avg Loss: 1.7723, Avg Acc@1: 0.5849, Avg Acc@5: 0.8160 +INFO:local_logger:Val Step[0900/1563], Avg Loss: 1.7371, Avg Acc@1: 0.5859, Avg Acc@5: 0.8294 +INFO:local_logger:Val Step[0900/1563], Avg Loss: 1.7235, Avg Acc@1: 0.5941, Avg Acc@5: 0.8259 +INFO:local_logger:Val Step[0950/1563], Avg Loss: 1.7819, Avg Acc@1: 0.5845, Avg Acc@5: 0.8095 +INFO:master_logger:Val Step[0950/1563], Avg Loss: 1.7805, Avg Acc@1: 0.5826, Avg Acc@5: 0.8157 +INFO:local_logger:Val Step[0950/1563], Avg Loss: 1.8119, Avg Acc@1: 0.5782, Avg Acc@5: 0.8099 +INFO:local_logger:Val Step[0950/1563], Avg Loss: 1.7569, Avg Acc@1: 0.5875, Avg Acc@5: 0.8199 +INFO:local_logger:Val Step[0950/1563], Avg Loss: 1.7711, Avg Acc@1: 0.5803, Avg Acc@5: 0.8232 +INFO:local_logger:Val Step[1000/1563], Avg Loss: 1.7941, Avg Acc@1: 0.5819, Avg Acc@5: 0.8148 +INFO:local_logger:Val Step[1000/1563], Avg Loss: 1.8365, Avg Acc@1: 0.5719, Avg Acc@5: 0.8069 +INFO:local_logger:Val Step[1000/1563], Avg Loss: 1.7976, Avg Acc@1: 0.5760, Avg Acc@5: 0.8191 +INFO:local_logger:Val Step[1000/1563], Avg Loss: 1.8198, Avg Acc@1: 0.5780, Avg Acc@5: 0.8048 +INFO:master_logger:Val Step[1000/1563], Avg Loss: 1.8120, Avg Acc@1: 0.5770, Avg Acc@5: 0.8114 +INFO:local_logger:Val Step[1050/1563], Avg Loss: 1.8589, Avg Acc@1: 0.5678, Avg Acc@5: 0.8034 +INFO:local_logger:Val Step[1050/1563], Avg Loss: 1.8143, Avg Acc@1: 0.5737, Avg Acc@5: 0.8179 +INFO:local_logger:Val Step[1050/1563], Avg Loss: 1.8422, Avg Acc@1: 0.5736, Avg Acc@5: 0.8022 +INFO:local_logger:Val Step[1050/1563], Avg Loss: 1.8174, Avg Acc@1: 0.5759, Avg Acc@5: 0.8102 +INFO:master_logger:Val Step[1050/1563], Avg Loss: 1.8332, Avg Acc@1: 0.5728, Avg Acc@5: 0.8084 +INFO:local_logger:Val Step[1100/1563], Avg Loss: 1.8857, Avg Acc@1: 0.5628, Avg Acc@5: 0.7983 +INFO:local_logger:Val Step[1100/1563], Avg Loss: 1.8493, Avg Acc@1: 0.5711, Avg Acc@5: 0.8040 +INFO:local_logger:Val Step[1100/1563], Avg Loss: 1.8375, Avg Acc@1: 0.5703, Avg Acc@5: 0.8129 +INFO:local_logger:Val Step[1100/1563], Avg Loss: 1.8638, Avg Acc@1: 0.5693, Avg Acc@5: 0.7977 +INFO:master_logger:Val Step[1100/1563], Avg Loss: 1.8590, Avg Acc@1: 0.5683, Avg Acc@5: 0.8032 +INFO:local_logger:Val Step[1150/1563], Avg Loss: 1.8838, Avg Acc@1: 0.5672, Avg Acc@5: 0.7952 +INFO:local_logger:Val Step[1150/1563], Avg Loss: 1.8714, Avg Acc@1: 0.5669, Avg Acc@5: 0.7997 +INFO:master_logger:Val Step[1150/1563], Avg Loss: 1.8817, Avg Acc@1: 0.5647, Avg Acc@5: 0.7995 +INFO:local_logger:Val Step[1150/1563], Avg Loss: 1.9110, Avg Acc@1: 0.5581, Avg Acc@5: 0.7942 +INFO:local_logger:Val Step[1150/1563], Avg Loss: 1.8605, Avg Acc@1: 0.5666, Avg Acc@5: 0.8090 +INFO:local_logger:Val Step[1200/1563], Avg Loss: 1.8979, Avg Acc@1: 0.5627, Avg Acc@5: 0.7947 +INFO:local_logger:Val Step[1200/1563], Avg Loss: 1.9065, Avg Acc@1: 0.5636, Avg Acc@5: 0.7914 +INFO:master_logger:Val Step[1200/1563], Avg Loss: 1.9062, Avg Acc@1: 0.5608, Avg Acc@5: 0.7953 +INFO:local_logger:Val Step[1200/1563], Avg Loss: 1.9376, Avg Acc@1: 0.5537, Avg Acc@5: 0.7899 +INFO:local_logger:Val Step[1200/1563], Avg Loss: 1.8828, Avg Acc@1: 0.5632, Avg Acc@5: 0.8052 +INFO:local_logger:Val Step[1250/1563], Avg Loss: 1.9527, Avg Acc@1: 0.5519, Avg Acc@5: 0.7877 +INFO:local_logger:Val Step[1250/1563], Avg Loss: 1.9308, Avg Acc@1: 0.5602, Avg Acc@5: 0.7876 +INFO:local_logger:Val Step[1250/1563], Avg Loss: 1.9013, Avg Acc@1: 0.5600, Avg Acc@5: 0.8017 +INFO:master_logger:Val Step[1250/1563], Avg Loss: 1.9268, Avg Acc@1: 0.5578, Avg 
Acc@5: 0.7918 +INFO:local_logger:Val Step[1250/1563], Avg Loss: 1.9223, Avg Acc@1: 0.5592, Avg Acc@5: 0.7905 +INFO:local_logger:Val Step[1300/1563], Avg Loss: 1.9658, Avg Acc@1: 0.5485, Avg Acc@5: 0.7854 +INFO:local_logger:Val Step[1300/1563], Avg Loss: 1.9499, Avg Acc@1: 0.5536, Avg Acc@5: 0.7858 +INFO:local_logger:Val Step[1300/1563], Avg Loss: 1.9432, Avg Acc@1: 0.5574, Avg Acc@5: 0.7863 +INFO:local_logger:Val Step[1300/1563], Avg Loss: 1.9202, Avg Acc@1: 0.5566, Avg Acc@5: 0.7981 +INFO:master_logger:Val Step[1300/1563], Avg Loss: 1.9448, Avg Acc@1: 0.5540, Avg Acc@5: 0.7889 +INFO:local_logger:Val Step[1350/1563], Avg Loss: 1.9502, Avg Acc@1: 0.5511, Avg Acc@5: 0.7934 +INFO:local_logger:Val Step[1350/1563], Avg Loss: 1.9728, Avg Acc@1: 0.5501, Avg Acc@5: 0.7817 +INFO:local_logger:Val Step[1350/1563], Avg Loss: 1.9953, Avg Acc@1: 0.5417, Avg Acc@5: 0.7805 +INFO:master_logger:Val Step[1350/1563], Avg Loss: 1.9721, Avg Acc@1: 0.5482, Avg Acc@5: 0.7846 +INFO:local_logger:Val Step[1350/1563], Avg Loss: 1.9701, Avg Acc@1: 0.5501, Avg Acc@5: 0.7828 +INFO:local_logger:Val Step[1400/1563], Avg Loss: 1.9781, Avg Acc@1: 0.5477, Avg Acc@5: 0.7819 +INFO:local_logger:Val Step[1400/1563], Avg Loss: 1.9641, Avg Acc@1: 0.5482, Avg Acc@5: 0.7911 +INFO:local_logger:Val Step[1400/1563], Avg Loss: 2.0041, Avg Acc@1: 0.5390, Avg Acc@5: 0.7789 +INFO:local_logger:Val Step[1400/1563], Avg Loss: 1.9843, Avg Acc@1: 0.5455, Avg Acc@5: 0.7806 +INFO:master_logger:Val Step[1400/1563], Avg Loss: 1.9827, Avg Acc@1: 0.5451, Avg Acc@5: 0.7831 +INFO:local_logger:Val Step[1450/1563], Avg Loss: 2.0122, Avg Acc@1: 0.5380, Avg Acc@5: 0.7777 +INFO:local_logger:Val Step[1450/1563], Avg Loss: 1.9690, Avg Acc@1: 0.5472, Avg Acc@5: 0.7902 +INFO:local_logger:Val Step[1450/1563], Avg Loss: 1.9815, Avg Acc@1: 0.5478, Avg Acc@5: 0.7818 +INFO:local_logger:Val Step[1450/1563], Avg Loss: 1.9847, Avg Acc@1: 0.5446, Avg Acc@5: 0.7814 +INFO:master_logger:Val Step[1450/1563], Avg Loss: 1.9869, Avg Acc@1: 0.5444, Avg Acc@5: 0.7828 +INFO:local_logger:Val Step[1500/1563], Avg Loss: 1.9625, Avg Acc@1: 0.5518, Avg Acc@5: 0.7852 +INFO:local_logger:Val Step[1500/1563], Avg Loss: 1.9946, Avg Acc@1: 0.5416, Avg Acc@5: 0.7806 +INFO:local_logger:Val Step[1500/1563], Avg Loss: 1.9649, Avg Acc@1: 0.5480, Avg Acc@5: 0.7847 +INFO:master_logger:Val Step[1500/1563], Avg Loss: 1.9663, Avg Acc@1: 0.5484, Avg Acc@5: 0.7862 +INFO:local_logger:Val Step[1500/1563], Avg Loss: 1.9432, Avg Acc@1: 0.5525, Avg Acc@5: 0.7942 +INFO:local_logger:Val Step[1550/1563], Avg Loss: 1.9504, Avg Acc@1: 0.5513, Avg Acc@5: 0.7866 +INFO:master_logger:Val Step[1550/1563], Avg Loss: 1.9572, Avg Acc@1: 0.5505, Avg Acc@5: 0.7873 +INFO:local_logger:Val Step[1550/1563], Avg Loss: 1.9383, Avg Acc@1: 0.5534, Avg Acc@5: 0.7947 +INFO:local_logger:Val Step[1550/1563], Avg Loss: 1.9812, Avg Acc@1: 0.5442, Avg Acc@5: 0.7822 +INFO:local_logger:Val Step[1550/1563], Avg Loss: 1.9588, Avg Acc@1: 0.5531, Avg Acc@5: 0.7857 +INFO:local_logger:----- Epoch[040/300], Validation Loss: 1.9566, Validation Acc@1: 0.5536, Validation Acc@5: 0.7856, time: 180.16 +INFO:local_logger:Now training epoch 41. LR=0.000377 +INFO:local_logger:----- Epoch[040/300], Validation Loss: 1.9771, Validation Acc@1: 0.5456, Validation Acc@5: 0.7826, time: 180.50 +INFO:local_logger:Now training epoch 41. 
LR=0.000377 +INFO:local_logger:----- Epoch[040/300], Validation Loss: 1.9468, Validation Acc@1: 0.5522, Validation Acc@5: 0.7867, time: 180.66 +INFO:master_logger:----- Epoch[040/300], Validation Loss: 1.9538, Validation Acc@1: 0.5514, Validation Acc@5: 0.7875, time: 180.66 +INFO:local_logger:----- Epoch[040/300], Validation Loss: 1.9349, Validation Acc@1: 0.5543, Validation Acc@5: 0.7952, time: 180.61 +INFO:local_logger:Now training epoch 41. LR=0.000377 +INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-40-Loss-4.019246149644917.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-40-Loss-4.019246149644917.pdopt +INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-40-Loss-4.019246149644917-EMA.pdparams +INFO:local_logger:Now training epoch 41. LR=0.000377 +INFO:master_logger:Now training epoch 41. LR=0.000377 +INFO:local_logger:Epoch[041/300], Step[0000/1602], Avg Loss: 4.6235, Avg Acc: 0.1900 +INFO:local_logger:Epoch[041/300], Step[0000/1602], Avg Loss: 3.1434, Avg Acc: 0.3900 +INFO:local_logger:Epoch[041/300], Step[0000/1602], Avg Loss: 4.3518, Avg Acc: 0.0350 +INFO:master_logger:Epoch[041/300], Step[0000/1602], Avg Loss: 4.0255, Avg Acc: 0.2250 +INFO:local_logger:Epoch[041/300], Step[0000/1602], Avg Loss: 3.9836, Avg Acc: 0.2850 +INFO:local_logger:Epoch[041/300], Step[0050/1602], Avg Loss: 3.9656, Avg Acc: 0.2476 +INFO:local_logger:Epoch[041/300], Step[0050/1602], Avg Loss: 4.1441, Avg Acc: 0.1995 +INFO:local_logger:Epoch[041/300], Step[0050/1602], Avg Loss: 3.9478, Avg Acc: 0.2313 +INFO:local_logger:Epoch[041/300], Step[0050/1602], Avg Loss: 3.9654, Avg Acc: 0.2104 +INFO:master_logger:Epoch[041/300], Step[0050/1602], Avg Loss: 4.0057, Avg Acc: 0.2222 +INFO:local_logger:Epoch[041/300], Step[0100/1602], Avg Loss: 3.9502, Avg Acc: 0.2285 +INFO:local_logger:Epoch[041/300], Step[0100/1602], Avg Loss: 4.0080, Avg Acc: 0.2132 +INFO:local_logger:Epoch[041/300], Step[0100/1602], Avg Loss: 4.0311, Avg Acc: 0.2077 +INFO:local_logger:Epoch[041/300], Step[0100/1602], Avg Loss: 4.0489, Avg Acc: 0.2072 +INFO:master_logger:Epoch[041/300], Step[0100/1602], Avg Loss: 4.0096, Avg Acc: 0.2142 +INFO:local_logger:Epoch[041/300], Step[0150/1602], Avg Loss: 3.9455, Avg Acc: 0.2427 +INFO:local_logger:Epoch[041/300], Step[0150/1602], Avg Loss: 4.0097, Avg Acc: 0.2130 +INFO:local_logger:Epoch[041/300], Step[0150/1602], Avg Loss: 3.9714, Avg Acc: 0.2112 +INFO:local_logger:Epoch[041/300], Step[0150/1602], Avg Loss: 4.0149, Avg Acc: 0.2091 +INFO:master_logger:Epoch[041/300], Step[0150/1602], Avg Loss: 3.9854, Avg Acc: 0.2190 +INFO:local_logger:Epoch[041/300], Step[0200/1602], Avg Loss: 3.9756, Avg Acc: 0.2247 +INFO:local_logger:Epoch[041/300], Step[0200/1602], Avg Loss: 3.9687, Avg Acc: 0.2160 +INFO:local_logger:Epoch[041/300], Step[0200/1602], Avg Loss: 3.9769, Avg Acc: 0.2128 +INFO:local_logger:Epoch[041/300], Step[0200/1602], Avg Loss: 3.9799, Avg Acc: 0.2221 +INFO:master_logger:Epoch[041/300], Step[0200/1602], Avg Loss: 3.9753, Avg Acc: 0.2189 +INFO:local_logger:Epoch[041/300], Step[0250/1602], Avg Loss: 4.0033, Avg Acc: 0.2184 +INFO:local_logger:Epoch[041/300], Step[0250/1602], Avg Loss: 3.9657, Avg Acc: 0.2163 +INFO:local_logger:Epoch[041/300], Step[0250/1602], Avg Loss: 3.9785, Avg Acc: 0.2203 +INFO:local_logger:Epoch[041/300], Step[0250/1602], Avg Loss: 3.9674, Avg Acc: 0.2151 +INFO:master_logger:Epoch[041/300], Step[0250/1602], Avg Loss: 3.9787, Avg Acc: 0.2175 +INFO:local_logger:Epoch[041/300], 
Step[0300/1602], Avg Loss: 3.9962, Avg Acc: 0.2216 +INFO:local_logger:Epoch[041/300], Step[0300/1602], Avg Loss: 3.9857, Avg Acc: 0.2241 +INFO:local_logger:Epoch[041/300], Step[0300/1602], Avg Loss: 3.9743, Avg Acc: 0.2146 +INFO:master_logger:Epoch[041/300], Step[0300/1602], Avg Loss: 3.9843, Avg Acc: 0.2180 +INFO:local_logger:Epoch[041/300], Step[0300/1602], Avg Loss: 3.9810, Avg Acc: 0.2118 +INFO:local_logger:Epoch[041/300], Step[0350/1602], Avg Loss: 3.9938, Avg Acc: 0.2140 +INFO:local_logger:Epoch[041/300], Step[0350/1602], Avg Loss: 3.9900, Avg Acc: 0.2116 +INFO:local_logger:Epoch[041/300], Step[0350/1602], Avg Loss: 3.9902, Avg Acc: 0.2284 +INFO:master_logger:Epoch[041/300], Step[0350/1602], Avg Loss: 3.9861, Avg Acc: 0.2174 +INFO:local_logger:Epoch[041/300], Step[0350/1602], Avg Loss: 3.9702, Avg Acc: 0.2157 +INFO:local_logger:Epoch[041/300], Step[0400/1602], Avg Loss: 3.9849, Avg Acc: 0.2177 +INFO:local_logger:Epoch[041/300], Step[0400/1602], Avg Loss: 3.9860, Avg Acc: 0.2268 +INFO:local_logger:Epoch[041/300], Step[0400/1602], Avg Loss: 3.9897, Avg Acc: 0.2143 +INFO:local_logger:Epoch[041/300], Step[0400/1602], Avg Loss: 3.9737, Avg Acc: 0.2135 +INFO:master_logger:Epoch[041/300], Step[0400/1602], Avg Loss: 3.9836, Avg Acc: 0.2181 +INFO:local_logger:Epoch[041/300], Step[0450/1602], Avg Loss: 3.9972, Avg Acc: 0.2188 +INFO:master_logger:Epoch[041/300], Step[0450/1602], Avg Loss: 3.9861, Avg Acc: 0.2180 +INFO:local_logger:Epoch[041/300], Step[0450/1602], Avg Loss: 3.9827, Avg Acc: 0.2257 +INFO:local_logger:Epoch[041/300], Step[0450/1602], Avg Loss: 4.0016, Avg Acc: 0.2129 +INFO:local_logger:Epoch[041/300], Step[0450/1602], Avg Loss: 3.9627, Avg Acc: 0.2145 +INFO:local_logger:Epoch[041/300], Step[0500/1602], Avg Loss: 3.9906, Avg Acc: 0.2168 +INFO:local_logger:Epoch[041/300], Step[0500/1602], Avg Loss: 3.9756, Avg Acc: 0.2269 +INFO:local_logger:Epoch[041/300], Step[0500/1602], Avg Loss: 3.9738, Avg Acc: 0.2149 +INFO:local_logger:Epoch[041/300], Step[0500/1602], Avg Loss: 3.9928, Avg Acc: 0.2116 +INFO:master_logger:Epoch[041/300], Step[0500/1602], Avg Loss: 3.9832, Avg Acc: 0.2175 +INFO:local_logger:Epoch[041/300], Step[0550/1602], Avg Loss: 3.9795, Avg Acc: 0.2166 +INFO:local_logger:Epoch[041/300], Step[0550/1602], Avg Loss: 3.9800, Avg Acc: 0.2183 +INFO:local_logger:Epoch[041/300], Step[0550/1602], Avg Loss: 3.9829, Avg Acc: 0.2267 +INFO:local_logger:Epoch[041/300], Step[0550/1602], Avg Loss: 3.9900, Avg Acc: 0.2119 +INFO:master_logger:Epoch[041/300], Step[0550/1602], Avg Loss: 3.9831, Avg Acc: 0.2184 +INFO:local_logger:Epoch[041/300], Step[0600/1602], Avg Loss: 3.9716, Avg Acc: 0.2164 +INFO:local_logger:Epoch[041/300], Step[0600/1602], Avg Loss: 3.9862, Avg Acc: 0.2124 +INFO:local_logger:Epoch[041/300], Step[0600/1602], Avg Loss: 3.9820, Avg Acc: 0.2138 +INFO:local_logger:Epoch[041/300], Step[0600/1602], Avg Loss: 3.9841, Avg Acc: 0.2275 +INFO:master_logger:Epoch[041/300], Step[0600/1602], Avg Loss: 3.9810, Avg Acc: 0.2175 +INFO:local_logger:Epoch[041/300], Step[0650/1602], Avg Loss: 3.9826, Avg Acc: 0.2272 +INFO:local_logger:Epoch[041/300], Step[0650/1602], Avg Loss: 3.9922, Avg Acc: 0.2181 +INFO:local_logger:Epoch[041/300], Step[0650/1602], Avg Loss: 3.9902, Avg Acc: 0.2129 +INFO:local_logger:Epoch[041/300], Step[0650/1602], Avg Loss: 3.9732, Avg Acc: 0.2142 +INFO:master_logger:Epoch[041/300], Step[0650/1602], Avg Loss: 3.9845, Avg Acc: 0.2181 +INFO:local_logger:Epoch[041/300], Step[0700/1602], Avg Loss: 3.9858, Avg Acc: 0.2266 +INFO:local_logger:Epoch[041/300], Step[0700/1602], Avg 
Loss: 3.9981, Avg Acc: 0.2118 +INFO:local_logger:Epoch[041/300], Step[0700/1602], Avg Loss: 3.9798, Avg Acc: 0.2149 +INFO:local_logger:Epoch[041/300], Step[0700/1602], Avg Loss: 3.9864, Avg Acc: 0.2187 +INFO:master_logger:Epoch[041/300], Step[0700/1602], Avg Loss: 3.9875, Avg Acc: 0.2180 +INFO:local_logger:Epoch[041/300], Step[0750/1602], Avg Loss: 3.9850, Avg Acc: 0.2199 +INFO:local_logger:Epoch[041/300], Step[0750/1602], Avg Loss: 3.9995, Avg Acc: 0.2092 +INFO:master_logger:Epoch[041/300], Step[0750/1602], Avg Loss: 3.9886, Avg Acc: 0.2176 +INFO:local_logger:Epoch[041/300], Step[0750/1602], Avg Loss: 3.9921, Avg Acc: 0.2280 +INFO:local_logger:Epoch[041/300], Step[0750/1602], Avg Loss: 3.9778, Avg Acc: 0.2135 +INFO:local_logger:Epoch[041/300], Step[0800/1602], Avg Loss: 3.9841, Avg Acc: 0.2172 +INFO:master_logger:Epoch[041/300], Step[0800/1602], Avg Loss: 3.9877, Avg Acc: 0.2161 +INFO:local_logger:Epoch[041/300], Step[0800/1602], Avg Loss: 3.9882, Avg Acc: 0.2086 +INFO:local_logger:Epoch[041/300], Step[0800/1602], Avg Loss: 3.9930, Avg Acc: 0.2258 +INFO:local_logger:Epoch[041/300], Step[0800/1602], Avg Loss: 3.9858, Avg Acc: 0.2128 +INFO:local_logger:Epoch[041/300], Step[0850/1602], Avg Loss: 3.9858, Avg Acc: 0.2174 +INFO:local_logger:Epoch[041/300], Step[0850/1602], Avg Loss: 3.9854, Avg Acc: 0.2092 +INFO:master_logger:Epoch[041/300], Step[0850/1602], Avg Loss: 3.9903, Avg Acc: 0.2162 +INFO:local_logger:Epoch[041/300], Step[0850/1602], Avg Loss: 3.9965, Avg Acc: 0.2124 +INFO:local_logger:Epoch[041/300], Step[0850/1602], Avg Loss: 3.9936, Avg Acc: 0.2256 +INFO:local_logger:Epoch[041/300], Step[0900/1602], Avg Loss: 3.9874, Avg Acc: 0.2164 +INFO:local_logger:Epoch[041/300], Step[0900/1602], Avg Loss: 3.9981, Avg Acc: 0.2135 +INFO:local_logger:Epoch[041/300], Step[0900/1602], Avg Loss: 3.9921, Avg Acc: 0.2262 +INFO:local_logger:Epoch[041/300], Step[0900/1602], Avg Loss: 3.9898, Avg Acc: 0.2089 +INFO:master_logger:Epoch[041/300], Step[0900/1602], Avg Loss: 3.9918, Avg Acc: 0.2162 +INFO:local_logger:Epoch[041/300], Step[0950/1602], Avg Loss: 3.9820, Avg Acc: 0.2178 +INFO:local_logger:Epoch[041/300], Step[0950/1602], Avg Loss: 4.0080, Avg Acc: 0.2133 +INFO:local_logger:Epoch[041/300], Step[0950/1602], Avg Loss: 3.9902, Avg Acc: 0.2254 +INFO:local_logger:Epoch[041/300], Step[0950/1602], Avg Loss: 3.9937, Avg Acc: 0.2076 +INFO:master_logger:Epoch[041/300], Step[0950/1602], Avg Loss: 3.9935, Avg Acc: 0.2160 +INFO:local_logger:Epoch[041/300], Step[1000/1602], Avg Loss: 3.9875, Avg Acc: 0.2182 +INFO:local_logger:Epoch[041/300], Step[1000/1602], Avg Loss: 4.0068, Avg Acc: 0.2136 +INFO:local_logger:Epoch[041/300], Step[1000/1602], Avg Loss: 3.9902, Avg Acc: 0.2093 +INFO:local_logger:Epoch[041/300], Step[1000/1602], Avg Loss: 3.9878, Avg Acc: 0.2263 +INFO:master_logger:Epoch[041/300], Step[1000/1602], Avg Loss: 3.9931, Avg Acc: 0.2169 +INFO:local_logger:Epoch[041/300], Step[1050/1602], Avg Loss: 3.9864, Avg Acc: 0.2261 +INFO:local_logger:Epoch[041/300], Step[1050/1602], Avg Loss: 3.9853, Avg Acc: 0.2193 +INFO:local_logger:Epoch[041/300], Step[1050/1602], Avg Loss: 3.9900, Avg Acc: 0.2093 +INFO:local_logger:Epoch[041/300], Step[1050/1602], Avg Loss: 4.0072, Avg Acc: 0.2122 +INFO:master_logger:Epoch[041/300], Step[1050/1602], Avg Loss: 3.9922, Avg Acc: 0.2167 +INFO:local_logger:Epoch[041/300], Step[1100/1602], Avg Loss: 4.0046, Avg Acc: 0.2116 +INFO:local_logger:Epoch[041/300], Step[1100/1602], Avg Loss: 3.9857, Avg Acc: 0.2198 +INFO:local_logger:Epoch[041/300], Step[1100/1602], Avg Loss: 3.9863, Avg 
Acc: 0.2092 +INFO:local_logger:Epoch[041/300], Step[1100/1602], Avg Loss: 3.9832, Avg Acc: 0.2250 +INFO:master_logger:Epoch[041/300], Step[1100/1602], Avg Loss: 3.9899, Avg Acc: 0.2164 +INFO:local_logger:Epoch[041/300], Step[1150/1602], Avg Loss: 3.9885, Avg Acc: 0.2192 +INFO:local_logger:Epoch[041/300], Step[1150/1602], Avg Loss: 4.0046, Avg Acc: 0.2109 +INFO:local_logger:Epoch[041/300], Step[1150/1602], Avg Loss: 3.9868, Avg Acc: 0.2236 +INFO:local_logger:Epoch[041/300], Step[1150/1602], Avg Loss: 3.9854, Avg Acc: 0.2080 +INFO:master_logger:Epoch[041/300], Step[1150/1602], Avg Loss: 3.9913, Avg Acc: 0.2154 +INFO:local_logger:Epoch[041/300], Step[1200/1602], Avg Loss: 3.9870, Avg Acc: 0.2182 +INFO:master_logger:Epoch[041/300], Step[1200/1602], Avg Loss: 3.9895, Avg Acc: 0.2151 +INFO:local_logger:Epoch[041/300], Step[1200/1602], Avg Loss: 3.9829, Avg Acc: 0.2069 +INFO:local_logger:Epoch[041/300], Step[1200/1602], Avg Loss: 4.0034, Avg Acc: 0.2118 +INFO:local_logger:Epoch[041/300], Step[1200/1602], Avg Loss: 3.9848, Avg Acc: 0.2234 +INFO:local_logger:Epoch[041/300], Step[1250/1602], Avg Loss: 3.9890, Avg Acc: 0.2182 +INFO:local_logger:Epoch[041/300], Step[1250/1602], Avg Loss: 3.9850, Avg Acc: 0.2071 +INFO:local_logger:Epoch[041/300], Step[1250/1602], Avg Loss: 3.9826, Avg Acc: 0.2230 +INFO:master_logger:Epoch[041/300], Step[1250/1602], Avg Loss: 3.9892, Avg Acc: 0.2153 +INFO:local_logger:Epoch[041/300], Step[1250/1602], Avg Loss: 3.9999, Avg Acc: 0.2129 +INFO:local_logger:Epoch[041/300], Step[1300/1602], Avg Loss: 3.9931, Avg Acc: 0.2179 +INFO:local_logger:Epoch[041/300], Step[1300/1602], Avg Loss: 3.9805, Avg Acc: 0.2087 +INFO:local_logger:Epoch[041/300], Step[1300/1602], Avg Loss: 4.0004, Avg Acc: 0.2128 +INFO:local_logger:Epoch[041/300], Step[1300/1602], Avg Loss: 3.9832, Avg Acc: 0.2221 +INFO:master_logger:Epoch[041/300], Step[1300/1602], Avg Loss: 3.9893, Avg Acc: 0.2154 +INFO:local_logger:Epoch[041/300], Step[1350/1602], Avg Loss: 3.9980, Avg Acc: 0.2176 +INFO:local_logger:Epoch[041/300], Step[1350/1602], Avg Loss: 3.9791, Avg Acc: 0.2089 +INFO:local_logger:Epoch[041/300], Step[1350/1602], Avg Loss: 3.9874, Avg Acc: 0.2219 +INFO:local_logger:Epoch[041/300], Step[1350/1602], Avg Loss: 4.0022, Avg Acc: 0.2131 +INFO:master_logger:Epoch[041/300], Step[1350/1602], Avg Loss: 3.9917, Avg Acc: 0.2154 +INFO:local_logger:Epoch[041/300], Step[1400/1602], Avg Loss: 3.9884, Avg Acc: 0.2223 +INFO:local_logger:Epoch[041/300], Step[1400/1602], Avg Loss: 4.0081, Avg Acc: 0.2126 +INFO:local_logger:Epoch[041/300], Step[1400/1602], Avg Loss: 3.9975, Avg Acc: 0.2183 +INFO:local_logger:Epoch[041/300], Step[1400/1602], Avg Loss: 3.9801, Avg Acc: 0.2095 +INFO:master_logger:Epoch[041/300], Step[1400/1602], Avg Loss: 3.9935, Avg Acc: 0.2157 +INFO:local_logger:Epoch[041/300], Step[1450/1602], Avg Loss: 3.9969, Avg Acc: 0.2161 +INFO:local_logger:Epoch[041/300], Step[1450/1602], Avg Loss: 3.9948, Avg Acc: 0.2213 +INFO:local_logger:Epoch[041/300], Step[1450/1602], Avg Loss: 4.0049, Avg Acc: 0.2125 +INFO:local_logger:Epoch[041/300], Step[1450/1602], Avg Loss: 3.9814, Avg Acc: 0.2100 +INFO:master_logger:Epoch[041/300], Step[1450/1602], Avg Loss: 3.9945, Avg Acc: 0.2150 +INFO:local_logger:Epoch[041/300], Step[1500/1602], Avg Loss: 3.9954, Avg Acc: 0.2158 +INFO:local_logger:Epoch[041/300], Step[1500/1602], Avg Loss: 3.9861, Avg Acc: 0.2095 +INFO:local_logger:Epoch[041/300], Step[1500/1602], Avg Loss: 4.0048, Avg Acc: 0.2124 +INFO:local_logger:Epoch[041/300], Step[1500/1602], Avg Loss: 3.9947, Avg Acc: 0.2210 
+INFO:master_logger:Epoch[041/300], Step[1500/1602], Avg Loss: 3.9953, Avg Acc: 0.2147 +INFO:local_logger:Epoch[041/300], Step[1550/1602], Avg Loss: 3.9877, Avg Acc: 0.2099 +INFO:local_logger:Epoch[041/300], Step[1550/1602], Avg Loss: 4.0001, Avg Acc: 0.2150 +INFO:local_logger:Epoch[041/300], Step[1550/1602], Avg Loss: 3.9946, Avg Acc: 0.2208 +INFO:master_logger:Epoch[041/300], Step[1550/1602], Avg Loss: 3.9961, Avg Acc: 0.2148 +INFO:local_logger:Epoch[041/300], Step[1550/1602], Avg Loss: 4.0019, Avg Acc: 0.2136 +INFO:local_logger:Epoch[041/300], Step[1600/1602], Avg Loss: 3.9994, Avg Acc: 0.2132 +INFO:local_logger:Epoch[041/300], Step[1600/1602], Avg Loss: 3.9968, Avg Acc: 0.2148 +INFO:local_logger:Epoch[041/300], Step[1600/1602], Avg Loss: 3.9876, Avg Acc: 0.2094 +INFO:local_logger:Epoch[041/300], Step[1600/1602], Avg Loss: 3.9933, Avg Acc: 0.2213 +INFO:master_logger:Epoch[041/300], Step[1600/1602], Avg Loss: 3.9943, Avg Acc: 0.2147 +INFO:local_logger:----- Epoch[041/300], Train Loss: 3.9876, Train Acc: 0.2093, time: 3729.04 +INFO:local_logger:Now training epoch 42. LR=0.000376 +INFO:local_logger:----- Epoch[041/300], Train Loss: 3.9968, Train Acc: 0.2147, time: 3728.68 +INFO:master_logger:----- Epoch[041/300], Train Loss: 3.9943, Train Acc: 0.2146, time: 3728.68 +INFO:local_logger:----- Epoch[041/300], Train Loss: 3.9995, Train Acc: 0.2132, time: 3729.49 +INFO:local_logger:Now training epoch 42. LR=0.000376 +INFO:local_logger:----- Epoch[041/300], Train Loss: 3.9933, Train Acc: 0.2213, time: 3729.02 +INFO:local_logger:Now training epoch 42. LR=0.000376 +INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-41-Loss-3.9968369505457324.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-41-Loss-3.9968369505457324.pdopt +INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-41-Loss-3.9968369505457324-EMA.pdparams +INFO:local_logger:Now training epoch 42. LR=0.000376 +INFO:master_logger:Now training epoch 42. 
LR=0.000376 +INFO:local_logger:Epoch[042/300], Step[0000/1602], Avg Loss: 3.6585, Avg Acc: 0.3700 +INFO:local_logger:Epoch[042/300], Step[0000/1602], Avg Loss: 3.9849, Avg Acc: 0.3200 +INFO:master_logger:Epoch[042/300], Step[0000/1602], Avg Loss: 3.8311, Avg Acc: 0.3388 +INFO:local_logger:Epoch[042/300], Step[0000/1602], Avg Loss: 3.7118, Avg Acc: 0.3800 +INFO:local_logger:Epoch[042/300], Step[0000/1602], Avg Loss: 3.9691, Avg Acc: 0.2850 +INFO:local_logger:Epoch[042/300], Step[0050/1602], Avg Loss: 4.0594, Avg Acc: 0.1828 +INFO:local_logger:Epoch[042/300], Step[0050/1602], Avg Loss: 4.1019, Avg Acc: 0.1919 +INFO:master_logger:Epoch[042/300], Step[0050/1602], Avg Loss: 4.0414, Avg Acc: 0.2144 +INFO:local_logger:Epoch[042/300], Step[0050/1602], Avg Loss: 3.9581, Avg Acc: 0.2570 +INFO:local_logger:Epoch[042/300], Step[0050/1602], Avg Loss: 4.0462, Avg Acc: 0.2258 +INFO:local_logger:Epoch[042/300], Step[0100/1602], Avg Loss: 3.9768, Avg Acc: 0.2192 +INFO:local_logger:Epoch[042/300], Step[0100/1602], Avg Loss: 4.0100, Avg Acc: 0.2396 +INFO:local_logger:Epoch[042/300], Step[0100/1602], Avg Loss: 4.0060, Avg Acc: 0.2136 +INFO:local_logger:Epoch[042/300], Step[0100/1602], Avg Loss: 4.0169, Avg Acc: 0.1966 +INFO:master_logger:Epoch[042/300], Step[0100/1602], Avg Loss: 4.0024, Avg Acc: 0.2173 +INFO:local_logger:Epoch[042/300], Step[0150/1602], Avg Loss: 3.9625, Avg Acc: 0.2170 +INFO:local_logger:Epoch[042/300], Step[0150/1602], Avg Loss: 4.0162, Avg Acc: 0.2027 +INFO:local_logger:Epoch[042/300], Step[0150/1602], Avg Loss: 4.0025, Avg Acc: 0.2265 +INFO:master_logger:Epoch[042/300], Step[0150/1602], Avg Loss: 4.0056, Avg Acc: 0.2186 +INFO:local_logger:Epoch[042/300], Step[0150/1602], Avg Loss: 4.0412, Avg Acc: 0.2280 +INFO:local_logger:Epoch[042/300], Step[0200/1602], Avg Loss: 3.9437, Avg Acc: 0.2231 +INFO:local_logger:Epoch[042/300], Step[0200/1602], Avg Loss: 4.0188, Avg Acc: 0.2037 +INFO:local_logger:Epoch[042/300], Step[0200/1602], Avg Loss: 3.9914, Avg Acc: 0.2247 +INFO:master_logger:Epoch[042/300], Step[0200/1602], Avg Loss: 3.9940, Avg Acc: 0.2202 +INFO:local_logger:Epoch[042/300], Step[0200/1602], Avg Loss: 4.0219, Avg Acc: 0.2296 +INFO:local_logger:Epoch[042/300], Step[0250/1602], Avg Loss: 4.0027, Avg Acc: 0.2092 +INFO:local_logger:Epoch[042/300], Step[0250/1602], Avg Loss: 3.9486, Avg Acc: 0.2239 +INFO:local_logger:Epoch[042/300], Step[0250/1602], Avg Loss: 3.9898, Avg Acc: 0.2206 +INFO:master_logger:Epoch[042/300], Step[0250/1602], Avg Loss: 3.9849, Avg Acc: 0.2197 +INFO:local_logger:Epoch[042/300], Step[0250/1602], Avg Loss: 3.9984, Avg Acc: 0.2251 +INFO:local_logger:Epoch[042/300], Step[0300/1602], Avg Loss: 4.0099, Avg Acc: 0.2034 +INFO:local_logger:Epoch[042/300], Step[0300/1602], Avg Loss: 3.9745, Avg Acc: 0.2245 +INFO:local_logger:Epoch[042/300], Step[0300/1602], Avg Loss: 3.9856, Avg Acc: 0.2230 +INFO:local_logger:Epoch[042/300], Step[0300/1602], Avg Loss: 3.9569, Avg Acc: 0.2247 +INFO:master_logger:Epoch[042/300], Step[0300/1602], Avg Loss: 3.9817, Avg Acc: 0.2189 +INFO:local_logger:Epoch[042/300], Step[0350/1602], Avg Loss: 4.0041, Avg Acc: 0.2084 +INFO:local_logger:Epoch[042/300], Step[0350/1602], Avg Loss: 3.9715, Avg Acc: 0.2212 +INFO:local_logger:Epoch[042/300], Step[0350/1602], Avg Loss: 3.9750, Avg Acc: 0.2248 +INFO:local_logger:Epoch[042/300], Step[0350/1602], Avg Loss: 3.9892, Avg Acc: 0.2227 +INFO:master_logger:Epoch[042/300], Step[0350/1602], Avg Loss: 3.9850, Avg Acc: 0.2193 +INFO:local_logger:Epoch[042/300], Step[0400/1602], Avg Loss: 3.9628, Avg Acc: 0.2161 
+INFO:local_logger:Epoch[042/300], Step[0400/1602], Avg Loss: 4.0038, Avg Acc: 0.2094 +INFO:local_logger:Epoch[042/300], Step[0400/1602], Avg Loss: 3.9898, Avg Acc: 0.2204 +INFO:local_logger:Epoch[042/300], Step[0400/1602], Avg Loss: 3.9930, Avg Acc: 0.2192 +INFO:master_logger:Epoch[042/300], Step[0400/1602], Avg Loss: 3.9873, Avg Acc: 0.2163 +INFO:local_logger:Epoch[042/300], Step[0450/1602], Avg Loss: 3.9964, Avg Acc: 0.2097 +INFO:local_logger:Epoch[042/300], Step[0450/1602], Avg Loss: 3.9771, Avg Acc: 0.2221 +INFO:local_logger:Epoch[042/300], Step[0450/1602], Avg Loss: 3.9666, Avg Acc: 0.2131 +INFO:local_logger:Epoch[042/300], Step[0450/1602], Avg Loss: 3.9933, Avg Acc: 0.2179 +INFO:master_logger:Epoch[042/300], Step[0450/1602], Avg Loss: 3.9834, Avg Acc: 0.2157 +INFO:local_logger:Epoch[042/300], Step[0500/1602], Avg Loss: 3.9777, Avg Acc: 0.2202 +INFO:local_logger:Epoch[042/300], Step[0500/1602], Avg Loss: 4.0045, Avg Acc: 0.2078 +INFO:local_logger:Epoch[042/300], Step[0500/1602], Avg Loss: 3.9808, Avg Acc: 0.2174 +INFO:local_logger:Epoch[042/300], Step[0500/1602], Avg Loss: 3.9717, Avg Acc: 0.2093 +INFO:master_logger:Epoch[042/300], Step[0500/1602], Avg Loss: 3.9837, Avg Acc: 0.2137 +INFO:local_logger:Epoch[042/300], Step[0550/1602], Avg Loss: 3.9732, Avg Acc: 0.2200 +INFO:local_logger:Epoch[042/300], Step[0550/1602], Avg Loss: 3.9989, Avg Acc: 0.2075 +INFO:local_logger:Epoch[042/300], Step[0550/1602], Avg Loss: 3.9808, Avg Acc: 0.2182 +INFO:local_logger:Epoch[042/300], Step[0550/1602], Avg Loss: 3.9734, Avg Acc: 0.2093 +INFO:master_logger:Epoch[042/300], Step[0550/1602], Avg Loss: 3.9816, Avg Acc: 0.2137 +INFO:local_logger:Epoch[042/300], Step[0600/1602], Avg Loss: 3.9996, Avg Acc: 0.2081 +INFO:local_logger:Epoch[042/300], Step[0600/1602], Avg Loss: 3.9800, Avg Acc: 0.2200 +INFO:master_logger:Epoch[042/300], Step[0600/1602], Avg Loss: 3.9819, Avg Acc: 0.2137 +INFO:local_logger:Epoch[042/300], Step[0600/1602], Avg Loss: 3.9753, Avg Acc: 0.2084 +INFO:local_logger:Epoch[042/300], Step[0600/1602], Avg Loss: 3.9726, Avg Acc: 0.2183 +INFO:local_logger:Epoch[042/300], Step[0650/1602], Avg Loss: 3.9718, Avg Acc: 0.2178 +INFO:local_logger:Epoch[042/300], Step[0650/1602], Avg Loss: 4.0076, Avg Acc: 0.2084 +INFO:local_logger:Epoch[042/300], Step[0650/1602], Avg Loss: 3.9813, Avg Acc: 0.2189 +INFO:local_logger:Epoch[042/300], Step[0650/1602], Avg Loss: 3.9809, Avg Acc: 0.2097 +INFO:master_logger:Epoch[042/300], Step[0650/1602], Avg Loss: 3.9854, Avg Acc: 0.2137 +INFO:local_logger:Epoch[042/300], Step[0700/1602], Avg Loss: 4.0033, Avg Acc: 0.2085 +INFO:local_logger:Epoch[042/300], Step[0700/1602], Avg Loss: 3.9770, Avg Acc: 0.2181 +INFO:local_logger:Epoch[042/300], Step[0700/1602], Avg Loss: 3.9852, Avg Acc: 0.2186 +INFO:local_logger:Epoch[042/300], Step[0700/1602], Avg Loss: 3.9740, Avg Acc: 0.2106 +INFO:master_logger:Epoch[042/300], Step[0700/1602], Avg Loss: 3.9849, Avg Acc: 0.2139 +INFO:local_logger:Epoch[042/300], Step[0750/1602], Avg Loss: 4.0042, Avg Acc: 0.2101 +INFO:local_logger:Epoch[042/300], Step[0750/1602], Avg Loss: 3.9884, Avg Acc: 0.2175 +INFO:local_logger:Epoch[042/300], Step[0750/1602], Avg Loss: 3.9785, Avg Acc: 0.2076 +INFO:local_logger:Epoch[042/300], Step[0750/1602], Avg Loss: 3.9852, Avg Acc: 0.2175 +INFO:master_logger:Epoch[042/300], Step[0750/1602], Avg Loss: 3.9891, Avg Acc: 0.2132 +INFO:local_logger:Epoch[042/300], Step[0800/1602], Avg Loss: 4.0027, Avg Acc: 0.2094 +INFO:local_logger:Epoch[042/300], Step[0800/1602], Avg Loss: 3.9849, Avg Acc: 0.2173 
+INFO:local_logger:Epoch[042/300], Step[0800/1602], Avg Loss: 3.9899, Avg Acc: 0.2175 +INFO:master_logger:Epoch[042/300], Step[0800/1602], Avg Loss: 3.9901, Avg Acc: 0.2129 +INFO:local_logger:Epoch[042/300], Step[0800/1602], Avg Loss: 3.9832, Avg Acc: 0.2073 +INFO:local_logger:Epoch[042/300], Step[0850/1602], Avg Loss: 3.9772, Avg Acc: 0.2197 +INFO:local_logger:Epoch[042/300], Step[0850/1602], Avg Loss: 4.0001, Avg Acc: 0.2115 +INFO:local_logger:Epoch[042/300], Step[0850/1602], Avg Loss: 3.9872, Avg Acc: 0.2056 +INFO:local_logger:Epoch[042/300], Step[0850/1602], Avg Loss: 3.9808, Avg Acc: 0.2194 +INFO:master_logger:Epoch[042/300], Step[0850/1602], Avg Loss: 3.9863, Avg Acc: 0.2140 +INFO:local_logger:Epoch[042/300], Step[0900/1602], Avg Loss: 3.9777, Avg Acc: 0.2175 +INFO:local_logger:Epoch[042/300], Step[0900/1602], Avg Loss: 3.9968, Avg Acc: 0.2118 +INFO:local_logger:Epoch[042/300], Step[0900/1602], Avg Loss: 3.9755, Avg Acc: 0.2070 +INFO:local_logger:Epoch[042/300], Step[0900/1602], Avg Loss: 3.9823, Avg Acc: 0.2196 +INFO:master_logger:Epoch[042/300], Step[0900/1602], Avg Loss: 3.9831, Avg Acc: 0.2140 +INFO:local_logger:Epoch[042/300], Step[0950/1602], Avg Loss: 3.9780, Avg Acc: 0.2173 +INFO:local_logger:Epoch[042/300], Step[0950/1602], Avg Loss: 3.9931, Avg Acc: 0.2122 +INFO:local_logger:Epoch[042/300], Step[0950/1602], Avg Loss: 3.9793, Avg Acc: 0.2075 +INFO:local_logger:Epoch[042/300], Step[0950/1602], Avg Loss: 3.9846, Avg Acc: 0.2214 +INFO:master_logger:Epoch[042/300], Step[0950/1602], Avg Loss: 3.9837, Avg Acc: 0.2146 +INFO:local_logger:Epoch[042/300], Step[1000/1602], Avg Loss: 3.9920, Avg Acc: 0.2130 +INFO:master_logger:Epoch[042/300], Step[1000/1602], Avg Loss: 3.9828, Avg Acc: 0.2152 +INFO:local_logger:Epoch[042/300], Step[1000/1602], Avg Loss: 3.9765, Avg Acc: 0.2088 +INFO:local_logger:Epoch[042/300], Step[1000/1602], Avg Loss: 3.9790, Avg Acc: 0.2175 +INFO:local_logger:Epoch[042/300], Step[1000/1602], Avg Loss: 3.9836, Avg Acc: 0.2214 +INFO:local_logger:Epoch[042/300], Step[1050/1602], Avg Loss: 3.9791, Avg Acc: 0.2180 +INFO:local_logger:Epoch[042/300], Step[1050/1602], Avg Loss: 3.9882, Avg Acc: 0.2120 +INFO:local_logger:Epoch[042/300], Step[1050/1602], Avg Loss: 3.9843, Avg Acc: 0.2215 +INFO:master_logger:Epoch[042/300], Step[1050/1602], Avg Loss: 3.9815, Avg Acc: 0.2150 +INFO:local_logger:Epoch[042/300], Step[1050/1602], Avg Loss: 3.9745, Avg Acc: 0.2085 +INFO:local_logger:Epoch[042/300], Step[1100/1602], Avg Loss: 3.9838, Avg Acc: 0.2117 +INFO:local_logger:Epoch[042/300], Step[1100/1602], Avg Loss: 3.9787, Avg Acc: 0.2191 +INFO:local_logger:Epoch[042/300], Step[1100/1602], Avg Loss: 3.9831, Avg Acc: 0.2216 +INFO:local_logger:Epoch[042/300], Step[1100/1602], Avg Loss: 3.9711, Avg Acc: 0.2106 +INFO:master_logger:Epoch[042/300], Step[1100/1602], Avg Loss: 3.9792, Avg Acc: 0.2157 +INFO:local_logger:Epoch[042/300], Step[1150/1602], Avg Loss: 3.9801, Avg Acc: 0.2220 +INFO:local_logger:Epoch[042/300], Step[1150/1602], Avg Loss: 3.9836, Avg Acc: 0.2110 +INFO:local_logger:Epoch[042/300], Step[1150/1602], Avg Loss: 3.9713, Avg Acc: 0.2105 +INFO:master_logger:Epoch[042/300], Step[1150/1602], Avg Loss: 3.9788, Avg Acc: 0.2155 +INFO:local_logger:Epoch[042/300], Step[1150/1602], Avg Loss: 3.9803, Avg Acc: 0.2186 +INFO:local_logger:Epoch[042/300], Step[1200/1602], Avg Loss: 3.9842, Avg Acc: 0.2116 +INFO:local_logger:Epoch[042/300], Step[1200/1602], Avg Loss: 3.9692, Avg Acc: 0.2109 +INFO:local_logger:Epoch[042/300], Step[1200/1602], Avg Loss: 3.9818, Avg Acc: 0.2214 
+INFO:local_logger:Epoch[042/300], Step[1200/1602], Avg Loss: 3.9793, Avg Acc: 0.2188 +INFO:master_logger:Epoch[042/300], Step[1200/1602], Avg Loss: 3.9786, Avg Acc: 0.2157 +INFO:local_logger:Epoch[042/300], Step[1250/1602], Avg Loss: 3.9776, Avg Acc: 0.2116 +INFO:local_logger:Epoch[042/300], Step[1250/1602], Avg Loss: 3.9827, Avg Acc: 0.2216 +INFO:local_logger:Epoch[042/300], Step[1250/1602], Avg Loss: 3.9759, Avg Acc: 0.2105 +INFO:local_logger:Epoch[042/300], Step[1250/1602], Avg Loss: 3.9853, Avg Acc: 0.2166 +INFO:master_logger:Epoch[042/300], Step[1250/1602], Avg Loss: 3.9804, Avg Acc: 0.2151 +INFO:local_logger:Epoch[042/300], Step[1300/1602], Avg Loss: 3.9780, Avg Acc: 0.2120 +INFO:local_logger:Epoch[042/300], Step[1300/1602], Avg Loss: 3.9801, Avg Acc: 0.2096 +INFO:local_logger:Epoch[042/300], Step[1300/1602], Avg Loss: 3.9848, Avg Acc: 0.2210 +INFO:local_logger:Epoch[042/300], Step[1300/1602], Avg Loss: 3.9841, Avg Acc: 0.2166 +INFO:master_logger:Epoch[042/300], Step[1300/1602], Avg Loss: 3.9817, Avg Acc: 0.2148 +INFO:local_logger:Epoch[042/300], Step[1350/1602], Avg Loss: 3.9734, Avg Acc: 0.2129 +INFO:local_logger:Epoch[042/300], Step[1350/1602], Avg Loss: 3.9808, Avg Acc: 0.2105 +INFO:local_logger:Epoch[042/300], Step[1350/1602], Avg Loss: 3.9869, Avg Acc: 0.2218 +INFO:local_logger:Epoch[042/300], Step[1350/1602], Avg Loss: 3.9835, Avg Acc: 0.2167 +INFO:master_logger:Epoch[042/300], Step[1350/1602], Avg Loss: 3.9811, Avg Acc: 0.2155 +INFO:local_logger:Epoch[042/300], Step[1400/1602], Avg Loss: 3.9752, Avg Acc: 0.2146 +INFO:local_logger:Epoch[042/300], Step[1400/1602], Avg Loss: 3.9839, Avg Acc: 0.2171 +INFO:local_logger:Epoch[042/300], Step[1400/1602], Avg Loss: 3.9794, Avg Acc: 0.2107 +INFO:local_logger:Epoch[042/300], Step[1400/1602], Avg Loss: 3.9830, Avg Acc: 0.2217 +INFO:master_logger:Epoch[042/300], Step[1400/1602], Avg Loss: 3.9804, Avg Acc: 0.2160 +INFO:local_logger:Epoch[042/300], Step[1450/1602], Avg Loss: 3.9813, Avg Acc: 0.2105 +INFO:local_logger:Epoch[042/300], Step[1450/1602], Avg Loss: 3.9749, Avg Acc: 0.2142 +INFO:local_logger:Epoch[042/300], Step[1450/1602], Avg Loss: 3.9862, Avg Acc: 0.2218 +INFO:master_logger:Epoch[042/300], Step[1450/1602], Avg Loss: 3.9813, Avg Acc: 0.2161 +INFO:local_logger:Epoch[042/300], Step[1450/1602], Avg Loss: 3.9828, Avg Acc: 0.2179 +INFO:local_logger:Epoch[042/300], Step[1500/1602], Avg Loss: 3.9778, Avg Acc: 0.2145 +INFO:local_logger:Epoch[042/300], Step[1500/1602], Avg Loss: 3.9834, Avg Acc: 0.2193 +INFO:local_logger:Epoch[042/300], Step[1500/1602], Avg Loss: 3.9848, Avg Acc: 0.2211 +INFO:local_logger:Epoch[042/300], Step[1500/1602], Avg Loss: 3.9818, Avg Acc: 0.2103 +INFO:master_logger:Epoch[042/300], Step[1500/1602], Avg Loss: 3.9820, Avg Acc: 0.2163 +INFO:local_logger:Epoch[042/300], Step[1550/1602], Avg Loss: 3.9843, Avg Acc: 0.2192 +INFO:local_logger:Epoch[042/300], Step[1550/1602], Avg Loss: 3.9860, Avg Acc: 0.2209 +INFO:local_logger:Epoch[042/300], Step[1550/1602], Avg Loss: 3.9780, Avg Acc: 0.2145 +INFO:local_logger:Epoch[042/300], Step[1550/1602], Avg Loss: 3.9837, Avg Acc: 0.2103 +INFO:master_logger:Epoch[042/300], Step[1550/1602], Avg Loss: 3.9830, Avg Acc: 0.2162 +INFO:local_logger:Epoch[042/300], Step[1600/1602], Avg Loss: 3.9815, Avg Acc: 0.2143 +INFO:local_logger:Epoch[042/300], Step[1600/1602], Avg Loss: 3.9839, Avg Acc: 0.2111 +INFO:local_logger:Epoch[042/300], Step[1600/1602], Avg Loss: 3.9786, Avg Acc: 0.2198 +INFO:master_logger:Epoch[042/300], Step[1600/1602], Avg Loss: 3.9824, Avg Acc: 0.2165 
+INFO:local_logger:Epoch[042/300], Step[1600/1602], Avg Loss: 3.9854, Avg Acc: 0.2209 +INFO:local_logger:----- Epoch[042/300], Train Loss: 3.9856, Train Acc: 0.2209, time: 3715.35 +INFO:local_logger:Now training epoch 43. LR=0.000375 +INFO:local_logger:----- Epoch[042/300], Train Loss: 3.9836, Train Acc: 0.2110, time: 3715.61 +INFO:local_logger:Now training epoch 43. LR=0.000375 +INFO:local_logger:----- Epoch[042/300], Train Loss: 3.9816, Train Acc: 0.2143, time: 3715.34 +INFO:master_logger:----- Epoch[042/300], Train Loss: 3.9823, Train Acc: 0.2165, time: 3715.34 +INFO:local_logger:----- Epoch[042/300], Train Loss: 3.9783, Train Acc: 0.2198, time: 3715.49 +INFO:local_logger:Now training epoch 43. LR=0.000375 +INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-42-Loss-3.9815662214862697.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-42-Loss-3.9815662214862697.pdopt +INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-42-Loss-3.9815662214862697-EMA.pdparams +INFO:local_logger:Now training epoch 43. LR=0.000375 +INFO:master_logger:Now training epoch 43. LR=0.000375 +INFO:local_logger:Epoch[043/300], Step[0000/1602], Avg Loss: 3.6645, Avg Acc: 0.3900 +INFO:local_logger:Epoch[043/300], Step[0000/1602], Avg Loss: 4.4737, Avg Acc: 0.2000 +INFO:local_logger:Epoch[043/300], Step[0000/1602], Avg Loss: 3.6497, Avg Acc: 0.3400 +INFO:master_logger:Epoch[043/300], Step[0000/1602], Avg Loss: 4.1105, Avg Acc: 0.2537 +INFO:local_logger:Epoch[043/300], Step[0000/1602], Avg Loss: 4.6539, Avg Acc: 0.0850 +INFO:local_logger:Epoch[043/300], Step[0050/1602], Avg Loss: 3.9754, Avg Acc: 0.1957 +INFO:local_logger:Epoch[043/300], Step[0050/1602], Avg Loss: 3.9231, Avg Acc: 0.1993 +INFO:local_logger:Epoch[043/300], Step[0050/1602], Avg Loss: 3.9457, Avg Acc: 0.2068 +INFO:local_logger:Epoch[043/300], Step[0050/1602], Avg Loss: 3.8983, Avg Acc: 0.2275 +INFO:master_logger:Epoch[043/300], Step[0050/1602], Avg Loss: 3.9356, Avg Acc: 0.2073 +INFO:local_logger:Epoch[043/300], Step[0100/1602], Avg Loss: 3.9553, Avg Acc: 0.2075 +INFO:local_logger:Epoch[043/300], Step[0100/1602], Avg Loss: 3.9853, Avg Acc: 0.1947 +INFO:local_logger:Epoch[043/300], Step[0100/1602], Avg Loss: 3.9215, Avg Acc: 0.2235 +INFO:local_logger:Epoch[043/300], Step[0100/1602], Avg Loss: 3.9257, Avg Acc: 0.2375 +INFO:master_logger:Epoch[043/300], Step[0100/1602], Avg Loss: 3.9469, Avg Acc: 0.2158 +INFO:local_logger:Epoch[043/300], Step[0150/1602], Avg Loss: 3.9318, Avg Acc: 0.2190 +INFO:local_logger:Epoch[043/300], Step[0150/1602], Avg Loss: 3.9469, Avg Acc: 0.2233 +INFO:local_logger:Epoch[043/300], Step[0150/1602], Avg Loss: 3.9750, Avg Acc: 0.2048 +INFO:local_logger:Epoch[043/300], Step[0150/1602], Avg Loss: 3.9483, Avg Acc: 0.2291 +INFO:master_logger:Epoch[043/300], Step[0150/1602], Avg Loss: 3.9505, Avg Acc: 0.2190 +INFO:local_logger:Epoch[043/300], Step[0200/1602], Avg Loss: 3.9474, Avg Acc: 0.2128 +INFO:local_logger:Epoch[043/300], Step[0200/1602], Avg Loss: 3.9662, Avg Acc: 0.2242 +INFO:local_logger:Epoch[043/300], Step[0200/1602], Avg Loss: 3.9678, Avg Acc: 0.2249 +INFO:local_logger:Epoch[043/300], Step[0200/1602], Avg Loss: 3.9901, Avg Acc: 0.2179 +INFO:master_logger:Epoch[043/300], Step[0200/1602], Avg Loss: 3.9679, Avg Acc: 0.2200 +INFO:local_logger:Epoch[043/300], Step[0250/1602], Avg Loss: 3.9542, Avg Acc: 0.2168 +INFO:local_logger:Epoch[043/300], Step[0250/1602], Avg Loss: 3.9682, Avg Acc: 0.2168 +INFO:local_logger:Epoch[043/300], 
Step[0250/1602], Avg Loss: 3.9710, Avg Acc: 0.2224
[Epoch 043/300: per-step Avg Loss / Avg Acc entries from the four local loggers and the master logger, Steps 0250–1600 of 1602]
+INFO:master_logger:Epoch[043/300], Step[1600/1602], Avg Loss: 3.9646, Avg Acc: 0.2169
+INFO:master_logger:----- Epoch[043/300], Train Loss: 3.9644, Train Acc: 0.2169, time: 3704.92
+INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-43-Loss-3.9624913584265506.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-43-Loss-3.9624913584265506.pdopt
+INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-43-Loss-3.9624913584265506-EMA.pdparams
+INFO:master_logger:Now training epoch 44. LR=0.000374
[Epoch 044/300: per-step Avg Loss / Avg Acc entries from the four local loggers and the master logger, Steps 0000–1600 of 1602]
+INFO:master_logger:Epoch[044/300], Step[1600/1602], Avg Loss: 3.9457, Avg Acc: 0.2168
+INFO:master_logger:----- Epoch[044/300], Train Loss: 3.9456, Train Acc: 0.2167, time: 3710.45
+INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-44-Loss-3.9309606375021415.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-44-Loss-3.9309606375021415.pdopt
+INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-44-Loss-3.9309606375021415-EMA.pdparams
+INFO:master_logger:Now training epoch 45. LR=0.000374
[Epoch 045/300: per-step Avg Loss / Avg Acc entries from the four local loggers and the master logger, Steps 0000–1600 of 1602]
+INFO:master_logger:Epoch[045/300], Step[1600/1602], Avg Loss: 3.9355, Avg Acc: 0.2180
+INFO:master_logger:----- Epoch[045/300], Train Loss: 3.9356, Train Acc: 0.2180, time: 3733.52
+INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-45-Loss-3.9356944448782087.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-45-Loss-3.9356944448782087.pdopt
+INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-45-Loss-3.9356944448782087-EMA.pdparams
+INFO:master_logger:Now training epoch 46. LR=0.000373
[Epoch 046/300: per-step Avg Loss / Avg Acc entries from the four local loggers and the master logger, Steps 0000–1600 of 1602]
+INFO:master_logger:Epoch[046/300], Step[1600/1602], Avg Loss: 3.9198, Avg Acc: 0.2211
+INFO:master_logger:----- Epoch[046/300], Train Loss: 3.9199, Train Acc: 0.2211, time: 3713.13
+INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-46-Loss-3.9219932130963167.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-46-Loss-3.9219932130963167.pdopt
+INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-46-Loss-3.9219932130963167-EMA.pdparams
+INFO:master_logger:Now training epoch 47. LR=0.000372
[Epoch 047/300: per-step Avg Loss / Avg Acc entries from the four local loggers and the master logger, Steps 0000–1450 of 1602]
+INFO:master_logger:Epoch[047/300], Step[1450/1602], Avg Loss: 3.9027, Avg Acc: 0.2238
+INFO:local_logger:Epoch[047/300], Step[1500/1602], Avg Loss: 3.9032, Avg Acc: 0.2271
+INFO:local_logger:Epoch[047/300], Step[1500/1602], Avg Loss: 3.8953, Avg Acc: 0.2220 +INFO:local_logger:Epoch[047/300], Step[1500/1602], Avg Loss: 3.8997, Avg Acc: 0.2225 +INFO:local_logger:Epoch[047/300], Step[1500/1602], Avg Loss: 3.9068, Avg Acc: 0.2247 +INFO:master_logger:Epoch[047/300], Step[1500/1602], Avg Loss: 3.9013, Avg Acc: 0.2241 +INFO:local_logger:Epoch[047/300], Step[1550/1602], Avg Loss: 3.9071, Avg Acc: 0.2261 +INFO:local_logger:Epoch[047/300], Step[1550/1602], Avg Loss: 3.8968, Avg Acc: 0.2232 +INFO:local_logger:Epoch[047/300], Step[1550/1602], Avg Loss: 3.8995, Avg Acc: 0.2226 +INFO:local_logger:Epoch[047/300], Step[1550/1602], Avg Loss: 3.9082, Avg Acc: 0.2250 +INFO:master_logger:Epoch[047/300], Step[1550/1602], Avg Loss: 3.9029, Avg Acc: 0.2242 +INFO:local_logger:Epoch[047/300], Step[1600/1602], Avg Loss: 3.8985, Avg Acc: 0.2231 +INFO:local_logger:Epoch[047/300], Step[1600/1602], Avg Loss: 3.9075, Avg Acc: 0.2261 +INFO:local_logger:Epoch[047/300], Step[1600/1602], Avg Loss: 3.8983, Avg Acc: 0.2246 +INFO:local_logger:Epoch[047/300], Step[1600/1602], Avg Loss: 3.9055, Avg Acc: 0.2258 +INFO:master_logger:Epoch[047/300], Step[1600/1602], Avg Loss: 3.9025, Avg Acc: 0.2249 +INFO:local_logger:----- Epoch[047/300], Train Loss: 3.9073, Train Acc: 0.2262, time: 3691.70 +INFO:master_logger:----- Epoch[047/300], Train Loss: 3.9025, Train Acc: 0.2249, time: 3691.70 +INFO:local_logger:----- Epoch[047/300], Train Loss: 3.9057, Train Acc: 0.2258, time: 3691.96 +INFO:local_logger:----- Epoch[047/300], Train Loss: 3.8985, Train Acc: 0.2231, time: 3691.96 +INFO:local_logger:Now training epoch 48. LR=0.000371 +INFO:local_logger:Now training epoch 48. LR=0.000371 +INFO:local_logger:----- Epoch[047/300], Train Loss: 3.8984, Train Acc: 0.2246, time: 3692.31 +INFO:local_logger:Now training epoch 48. LR=0.000371 +INFO:master_logger:----- Save model: ./output/train-20211019-17-32-41/DeiT-Epoch-47-Loss-3.9072817984569763.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211019-17-32-41/DeiT-Epoch-47-Loss-3.9072817984569763.pdopt +INFO:master_logger:----- Save ema model: ./output/train-20211019-17-32-41/DeiT-Epoch-47-Loss-3.9072817984569763-EMA.pdparams +INFO:local_logger:Now training epoch 48. LR=0.000371 +INFO:master_logger:Now training epoch 48. 
LR=0.000371 +INFO:local_logger:Epoch[048/300], Step[0000/1602], Avg Loss: 3.4853, Avg Acc: 0.4350 +INFO:master_logger:Epoch[048/300], Step[0000/1602], Avg Loss: 3.7585, Avg Acc: 0.3525 +INFO:local_logger:Epoch[048/300], Step[0000/1602], Avg Loss: 3.3312, Avg Acc: 0.4150 +INFO:local_logger:Epoch[048/300], Step[0000/1602], Avg Loss: 4.2127, Avg Acc: 0.2250 +INFO:local_logger:Epoch[048/300], Step[0000/1602], Avg Loss: 4.0049, Avg Acc: 0.3350 +INFO:local_logger:Epoch[048/300], Step[0050/1602], Avg Loss: 3.8119, Avg Acc: 0.2387 +INFO:local_logger:Epoch[048/300], Step[0050/1602], Avg Loss: 3.9108, Avg Acc: 0.1822 +INFO:local_logger:Epoch[048/300], Step[0050/1602], Avg Loss: 3.9080, Avg Acc: 0.2126 +INFO:master_logger:Epoch[048/300], Step[0050/1602], Avg Loss: 3.8787, Avg Acc: 0.2190 +INFO:local_logger:Epoch[048/300], Step[0050/1602], Avg Loss: 3.8842, Avg Acc: 0.2425 +INFO:local_logger:Epoch[048/300], Step[0100/1602], Avg Loss: 3.8033, Avg Acc: 0.2432 +INFO:local_logger:Epoch[048/300], Step[0100/1602], Avg Loss: 3.8314, Avg Acc: 0.2187 +INFO:local_logger:Epoch[048/300], Step[0100/1602], Avg Loss: 3.8992, Avg Acc: 0.2452 +INFO:local_logger:Epoch[048/300], Step[0100/1602], Avg Loss: 3.9043, Avg Acc: 0.2323 +INFO:master_logger:Epoch[048/300], Step[0100/1602], Avg Loss: 3.8595, Avg Acc: 0.2349 +INFO:local_logger:Epoch[048/300], Step[0150/1602], Avg Loss: 3.8201, Avg Acc: 0.2337 +INFO:local_logger:Epoch[048/300], Step[0150/1602], Avg Loss: 3.8861, Avg Acc: 0.2346 +INFO:local_logger:Epoch[048/300], Step[0150/1602], Avg Loss: 3.8954, Avg Acc: 0.2451 +INFO:local_logger:Epoch[048/300], Step[0150/1602], Avg Loss: 3.8569, Avg Acc: 0.2094 +INFO:master_logger:Epoch[048/300], Step[0150/1602], Avg Loss: 3.8646, Avg Acc: 0.2307 +INFO:local_logger:Epoch[048/300], Step[0200/1602], Avg Loss: 3.8375, Avg Acc: 0.2365 +INFO:local_logger:Epoch[048/300], Step[0200/1602], Avg Loss: 3.8983, Avg Acc: 0.2268 +INFO:local_logger:Epoch[048/300], Step[0200/1602], Avg Loss: 3.8843, Avg Acc: 0.2391 +INFO:local_logger:Epoch[048/300], Step[0200/1602], Avg Loss: 3.8602, Avg Acc: 0.2070 +INFO:master_logger:Epoch[048/300], Step[0200/1602], Avg Loss: 3.8701, Avg Acc: 0.2273 +INFO:local_logger:Epoch[048/300], Step[0250/1602], Avg Loss: 3.8543, Avg Acc: 0.2353 +INFO:local_logger:Epoch[048/300], Step[0250/1602], Avg Loss: 3.8954, Avg Acc: 0.2254 +INFO:local_logger:Epoch[048/300], Step[0250/1602], Avg Loss: 3.8793, Avg Acc: 0.2371 +INFO:master_logger:Epoch[048/300], Step[0250/1602], Avg Loss: 3.8788, Avg Acc: 0.2273 +INFO:local_logger:Epoch[048/300], Step[0250/1602], Avg Loss: 3.8862, Avg Acc: 0.2114 +INFO:local_logger:Epoch[048/300], Step[0300/1602], Avg Loss: 3.8863, Avg Acc: 0.2173 +INFO:local_logger:Epoch[048/300], Step[0300/1602], Avg Loss: 3.8580, Avg Acc: 0.2368 +INFO:local_logger:Epoch[048/300], Step[0300/1602], Avg Loss: 3.8940, Avg Acc: 0.2292 +INFO:local_logger:Epoch[048/300], Step[0300/1602], Avg Loss: 3.8836, Avg Acc: 0.2410 +INFO:master_logger:Epoch[048/300], Step[0300/1602], Avg Loss: 3.8805, Avg Acc: 0.2311 +INFO:local_logger:Epoch[048/300], Step[0350/1602], Avg Loss: 3.8813, Avg Acc: 0.2311 +INFO:local_logger:Epoch[048/300], Step[0350/1602], Avg Loss: 3.8965, Avg Acc: 0.2388 +INFO:local_logger:Epoch[048/300], Step[0350/1602], Avg Loss: 3.8761, Avg Acc: 0.2333 +INFO:local_logger:Epoch[048/300], Step[0350/1602], Avg Loss: 3.8782, Avg Acc: 0.2177 +INFO:master_logger:Epoch[048/300], Step[0350/1602], Avg Loss: 3.8830, Avg Acc: 0.2302 +INFO:local_logger:Epoch[048/300], Step[0400/1602], Avg Loss: 3.8594, Avg Acc: 0.2337 
+INFO:local_logger:Epoch[048/300], Step[0400/1602], Avg Loss: 3.8899, Avg Acc: 0.2434 +INFO:local_logger:Epoch[048/300], Step[0400/1602], Avg Loss: 3.8843, Avg Acc: 0.2300 +INFO:local_logger:Epoch[048/300], Step[0400/1602], Avg Loss: 3.8875, Avg Acc: 0.2223 +INFO:master_logger:Epoch[048/300], Step[0400/1602], Avg Loss: 3.8803, Avg Acc: 0.2324 +INFO:local_logger:Epoch[048/300], Step[0450/1602], Avg Loss: 3.8818, Avg Acc: 0.2306 +INFO:local_logger:Epoch[048/300], Step[0450/1602], Avg Loss: 3.8793, Avg Acc: 0.2201 +INFO:master_logger:Epoch[048/300], Step[0450/1602], Avg Loss: 3.8780, Avg Acc: 0.2305 +INFO:local_logger:Epoch[048/300], Step[0450/1602], Avg Loss: 3.8903, Avg Acc: 0.2390 +INFO:local_logger:Epoch[048/300], Step[0450/1602], Avg Loss: 3.8605, Avg Acc: 0.2325 +INFO:local_logger:Epoch[048/300], Step[0500/1602], Avg Loss: 3.8862, Avg Acc: 0.2317 +INFO:local_logger:Epoch[048/300], Step[0500/1602], Avg Loss: 3.8903, Avg Acc: 0.2369 +INFO:local_logger:Epoch[048/300], Step[0500/1602], Avg Loss: 3.8572, Avg Acc: 0.2302 +INFO:master_logger:Epoch[048/300], Step[0500/1602], Avg Loss: 3.8777, Avg Acc: 0.2299 +INFO:local_logger:Epoch[048/300], Step[0500/1602], Avg Loss: 3.8772, Avg Acc: 0.2211 +INFO:local_logger:Epoch[048/300], Step[0550/1602], Avg Loss: 3.8640, Avg Acc: 0.2292 +INFO:local_logger:Epoch[048/300], Step[0550/1602], Avg Loss: 3.8878, Avg Acc: 0.2355 +INFO:local_logger:Epoch[048/300], Step[0550/1602], Avg Loss: 3.8958, Avg Acc: 0.2378 +INFO:local_logger:Epoch[048/300], Step[0550/1602], Avg Loss: 3.8889, Avg Acc: 0.2210 +INFO:master_logger:Epoch[048/300], Step[0550/1602], Avg Loss: 3.8841, Avg Acc: 0.2309 +INFO:local_logger:Epoch[048/300], Step[0600/1602], Avg Loss: 3.8914, Avg Acc: 0.2357 +INFO:local_logger:Epoch[048/300], Step[0600/1602], Avg Loss: 3.8873, Avg Acc: 0.2201 +INFO:local_logger:Epoch[048/300], Step[0600/1602], Avg Loss: 3.8944, Avg Acc: 0.2355 +INFO:local_logger:Epoch[048/300], Step[0600/1602], Avg Loss: 3.8737, Avg Acc: 0.2271 +INFO:master_logger:Epoch[048/300], Step[0600/1602], Avg Loss: 3.8867, Avg Acc: 0.2296 +INFO:local_logger:Epoch[048/300], Step[0650/1602], Avg Loss: 3.8964, Avg Acc: 0.2338 +INFO:local_logger:Epoch[048/300], Step[0650/1602], Avg Loss: 3.8937, Avg Acc: 0.2352 +INFO:local_logger:Epoch[048/300], Step[0650/1602], Avg Loss: 3.8818, Avg Acc: 0.2290 +INFO:local_logger:Epoch[048/300], Step[0650/1602], Avg Loss: 3.8968, Avg Acc: 0.2187 +INFO:master_logger:Epoch[048/300], Step[0650/1602], Avg Loss: 3.8922, Avg Acc: 0.2292 +INFO:local_logger:Epoch[048/300], Step[0700/1602], Avg Loss: 3.8926, Avg Acc: 0.2358 +INFO:local_logger:Epoch[048/300], Step[0700/1602], Avg Loss: 3.8897, Avg Acc: 0.2292 +INFO:local_logger:Epoch[048/300], Step[0700/1602], Avg Loss: 3.8979, Avg Acc: 0.2220 +INFO:local_logger:Epoch[048/300], Step[0700/1602], Avg Loss: 3.8964, Avg Acc: 0.2329 +INFO:master_logger:Epoch[048/300], Step[0700/1602], Avg Loss: 3.8941, Avg Acc: 0.2300 +INFO:local_logger:Epoch[048/300], Step[0750/1602], Avg Loss: 3.8941, Avg Acc: 0.2213 +INFO:local_logger:Epoch[048/300], Step[0750/1602], Avg Loss: 3.8969, Avg Acc: 0.2343 +INFO:local_logger:Epoch[048/300], Step[0750/1602], Avg Loss: 3.8883, Avg Acc: 0.2323 +INFO:master_logger:Epoch[048/300], Step[0750/1602], Avg Loss: 3.8931, Avg Acc: 0.2304 +INFO:local_logger:Epoch[048/300], Step[0750/1602], Avg Loss: 3.8932, Avg Acc: 0.2339 +INFO:local_logger:Epoch[048/300], Step[0800/1602], Avg Loss: 3.8904, Avg Acc: 0.2327 +INFO:local_logger:Epoch[048/300], Step[0800/1602], Avg Loss: 3.8981, Avg Acc: 0.2191 
+INFO:local_logger:Epoch[048/300], Step[0800/1602], Avg Loss: 3.9013, Avg Acc: 0.2324 +INFO:local_logger:Epoch[048/300], Step[0800/1602], Avg Loss: 3.8839, Avg Acc: 0.2325 +INFO:master_logger:Epoch[048/300], Step[0800/1602], Avg Loss: 3.8934, Avg Acc: 0.2292 +INFO:local_logger:Epoch[048/300], Step[0850/1602], Avg Loss: 3.8922, Avg Acc: 0.2314 +INFO:local_logger:Epoch[048/300], Step[0850/1602], Avg Loss: 3.8924, Avg Acc: 0.2318 +INFO:local_logger:Epoch[048/300], Step[0850/1602], Avg Loss: 3.9061, Avg Acc: 0.2179 +INFO:master_logger:Epoch[048/300], Step[0850/1602], Avg Loss: 3.8964, Avg Acc: 0.2293 +INFO:local_logger:Epoch[048/300], Step[0850/1602], Avg Loss: 3.8952, Avg Acc: 0.2360 +INFO:local_logger:Epoch[048/300], Step[0900/1602], Avg Loss: 3.8930, Avg Acc: 0.2300 +INFO:local_logger:Epoch[048/300], Step[0900/1602], Avg Loss: 3.9041, Avg Acc: 0.2341 +INFO:local_logger:Epoch[048/300], Step[0900/1602], Avg Loss: 3.8952, Avg Acc: 0.2309 +INFO:local_logger:Epoch[048/300], Step[0900/1602], Avg Loss: 3.9058, Avg Acc: 0.2198 +INFO:master_logger:Epoch[048/300], Step[0900/1602], Avg Loss: 3.8995, Avg Acc: 0.2287 +INFO:local_logger:Epoch[048/300], Step[0950/1602], Avg Loss: 3.8957, Avg Acc: 0.2292 +INFO:local_logger:Epoch[048/300], Step[0950/1602], Avg Loss: 3.9039, Avg Acc: 0.2289 +INFO:local_logger:Epoch[048/300], Step[0950/1602], Avg Loss: 3.9062, Avg Acc: 0.2330 +INFO:local_logger:Epoch[048/300], Step[0950/1602], Avg Loss: 3.9007, Avg Acc: 0.2226 +INFO:master_logger:Epoch[048/300], Step[0950/1602], Avg Loss: 3.9016, Avg Acc: 0.2284 + + +-------------------------------------- +C++ Traceback (most recent call last): +-------------------------------------- +0 paddle::platform::GpuMemcpySync(void*, void const*, unsigned long, cudaMemcpyKind) + +---------------------- +Error Message Summary: +---------------------- +FatalError: `Termination signal` is detected by the operating system. + [TimeInfo: *** Aborted at 1634813243 (unix time) try "date -d @1634813243" if you are using GNU date ***] + [SignalInfo: *** SIGTERM (@0x6791) received by PID 19967 (TID 0x7f3946d1e700) from PID 26513 ***] + +/opt/conda/envs/py36/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 11 leaked semaphores to clean up at shutdown + len(cache)) + + +-------------------------------------- +C++ Traceback (most recent call last): +-------------------------------------- +No stack trace in paddle, may be caused by external reasons. + +---------------------- +Error Message Summary: +---------------------- +FatalError: `Termination signal` is detected by the operating system. 
+ [TimeInfo: *** Aborted at 1634813248 (unix time) try "date -d @1634813248" if you are using GNU date ***] + [SignalInfo: *** SIGTERM (@0x6791) received by PID 19928 (TID 0x7f31aa8ab700) from PID 26513 ***] + +/opt/conda/envs/py36/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 16 leaked semaphores to clean up at shutdown + len(cache)) +run_train_multi_tiny.sh: line 9: 19928 Terminated CUDA_VISIBLE_DEVICES=4,5,6,7 python main_multi_gpu.py -cfg='./configs/deit_tiny_patch16_224.yaml' -dataset='imagenet2012' -batch_size=200 -data_path='/dataset/imagenet' -teacher_model='./regnety_160' -amp +/opt/conda/envs/py36/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 11 leaked semaphores to clean up at shutdown + len(cache)) +/opt/conda/envs/py36/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 11 leaked semaphores to clean up at shutdown + len(cache)) +/opt/conda/envs/py36/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 11 leaked semaphores to clean up at shutdown + len(cache)) diff --git a/image_classification/DeiT/port_weights/__init__.py b/image_classification/DeiT/port_weights/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/image_classification/DeiT/port_weights/load_pytorch_weights.py b/image_classification/DeiT/port_weights/load_pytorch_weights.py new file mode 100644 index 00000000..d6135d57 --- /dev/null +++ b/image_classification/DeiT/port_weights/load_pytorch_weights.py @@ -0,0 +1,192 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
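+"""Port DeiT weights from the official PyTorch release to PaddlePaddle.
+
+Overview of the steps below: build the Paddle DeiT model from the local config, load the
+reference model via torch.hub ('facebookresearch/deit:main'), copy parameters following
+torch_to_paddle_mapping() (2-D weights are transposed for paddle.nn.Linear), check that the
+two models agree on a random input within atol=1e-5, then save the .pdparams file and report
+FLOPs using the custom ops from stats.py.
+
+Run from image_classification/DeiT/ so that deit.py, config.py, stats.py and ./configs/ resolve.
+"""
+import os  # main() uses os.path.join; imported explicitly rather than relying on the star imports below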
+ +import argparse +import numpy as np +import paddle +import torch +from deit import * +from config import * +from stats import count_gelu, count_softmax, count_layernorm + + +model_name = 'deit_tiny_distilled_patch16_224' +cfg_name = 'deit_tiny_patch16_224' +sz = 224 + +config = get_config() +parser = argparse.ArgumentParser('') +parser.add_argument('-cfg', type=str, default=f'./configs/{cfg_name}.yaml') +parser.add_argument('-dataset', type=str, default=None) +parser.add_argument('-batch_size', type=int, default=None) +parser.add_argument('-image_size', type=int, default=None) +parser.add_argument('-data_path', type=str, default=None) +parser.add_argument('-ngpus', type=int, default=None) +parser.add_argument('-eval', action="store_true") +parser.add_argument('-pretrained', type=str, default=None) +parser.add_argument('-resume', type=str, default=None) +parser.add_argument('-teacher_model', type=str, default=None) +parser.add_argument('-last_epoch', type=int, default=None) +args = parser.parse_args() + +config = get_config() +config = update_config(config, args) +print(config) + + +def print_model_named_params(model): + for name, param in model.named_parameters(): + print(name, param.shape) + +def print_model_named_buffers(model): + for name, buff in model.named_buffers(): + print(name, buff.shape) + +def torch_to_paddle_mapping(): + mapping = [ + ('cls_token', 'class_token'), + ('dist_token', 'distill_token'), + ('pos_embed', 'pos_embed'), + ('patch_embed.proj', f'patch_embed.proj'), + ] + + num_layers = config.MODEL.TRANS.DEPTH + for idx in range(num_layers): + th_prefix = f'blocks.{idx}' + pp_prefix = f'layers.{idx}' + layer_mapping = [ + (f'{th_prefix}.norm1', f'{pp_prefix}.norm1'), + (f'{th_prefix}.attn.qkv', f'{pp_prefix}.attn.qkv'), + (f'{th_prefix}.attn.proj', f'{pp_prefix}.attn.proj'), + (f'{th_prefix}.norm2', f'{pp_prefix}.norm2'), + (f'{th_prefix}.mlp.fc1', f'{pp_prefix}.mlp.fc1'), + (f'{th_prefix}.mlp.fc2', f'{pp_prefix}.mlp.fc2'), + ] + mapping.extend(layer_mapping) + + head_mapping = [ + ('norm', 'norm'), + ('head', 'head'), + ('head_dist', 'head_distill') + ] + mapping.extend(head_mapping) + + return mapping + + + +def convert(torch_model, paddle_model): + def _set_value(th_name, pd_name): + th_shape = th_params[th_name].shape + pd_shape = tuple(pd_params[pd_name].shape) # paddle shape default type is list + #assert th_shape == pd_shape, f'{th_shape} != {pd_shape}' + print(f'set {th_name} {th_shape} to {pd_name} {pd_shape}') + value = th_params[th_name].cpu().data.numpy() + if len(value.shape) == 2: + value = value.transpose((1, 0)) + pd_params[pd_name].set_value(value) + + # 1. get paddle and torch model parameters + pd_params = {} + th_params = {} + for name, param in paddle_model.named_parameters(): + pd_params[name] = param + for name, param in paddle_model.named_buffers(): + pd_params[name] = param + + for name, param in torch_model.named_parameters(): + th_params[name] = param + for name, param in torch_model.named_buffers(): + th_params[name] = param + + # 2. get name mapping pairs + mapping = torch_to_paddle_mapping() + # 3. 
set torch param values to paddle params: may needs transpose on weights + for th_name, pd_name in mapping: + if th_name in th_params.keys(): # nn.Parameters + _set_value(th_name, pd_name) + else: # weight & bias + th_name_w = f'{th_name}.weight' + pd_name_w = f'{pd_name}.weight' + _set_value(th_name_w, pd_name_w) + + th_name_b = f'{th_name}.bias' + pd_name_b = f'{pd_name}.bias' + _set_value(th_name_b, pd_name_b) + + return paddle_model + + +def main(): + + #paddle.set_device('cpu') + paddle_model = build_deit(config) + paddle_model.eval() + + print_model_named_params(paddle_model) + print('--------------') + print_model_named_buffers(paddle_model) + print('----------------------------------') + + device = torch.device('cuda') + torch_model = torch.hub.load('facebookresearch/deit:main', + f'{model_name}', #'deit_base_distilled_patch16_224', + pretrained=True) + torch_model = torch_model.to(device) + torch_model.eval() + + print_model_named_params(torch_model) + print('--------------') + print_model_named_buffers(torch_model) + print('----------------------------------') + + + #return + + # convert weights + paddle_model = convert(torch_model, paddle_model) + + # check correctness + x = np.random.randn(2, 3, sz, sz).astype('float32') + #x = np.random.randn(2, 3, 224, 224).astype('float32') + x_paddle = paddle.to_tensor(x) + x_torch = torch.Tensor(x).to(device) + + out_torch = torch_model(x_torch) + out_paddle = paddle_model(x_paddle) + + out_torch = out_torch.data.cpu().numpy() + out_paddle = out_paddle.cpu().numpy() + + print(out_torch.shape, out_paddle.shape) + print(out_torch[0, 0:100]) + print(out_paddle[0, 0:100]) + assert np.allclose(out_torch, out_paddle, atol = 1e-5) + + # save weights for paddle model + #model_path = os.path.join('./deit_base_distilled_patch16_224.pdparams') + model_path = os.path.join(f'./{model_name}.pdparams') + paddle.save(paddle_model.state_dict(), model_path) + + custom_ops = {paddle.nn.GELU: count_gelu, + paddle.nn.LayerNorm: count_layernorm, + paddle.nn.Softmax: count_softmax, + } + paddle.flops(paddle_model, + input_size=(1, 3, sz, sz), + custom_ops=custom_ops, + print_detail=False) + +if __name__ == "__main__": + main() diff --git a/image_classification/DeiT/random_erasing.py b/image_classification/DeiT/random_erasing.py index 1252f85d..31eea465 100644 --- a/image_classification/DeiT/random_erasing.py +++ b/image_classification/DeiT/random_erasing.py @@ -22,10 +22,9 @@ def _get_pixels(per_pixel, rand_color, patch_size, dtype="float32"): if per_pixel: return paddle.normal(shape=patch_size).astype(dtype) - elif rand_color: + if rand_color: return paddle.normal(shape=(patch_size[0], 1, 1)).astype(dtype) - else: - return paddle.zeros((patch_size[0], 1, 1)).astype(dtype) + return paddle.zeros((patch_size[0], 1, 1)).astype(dtype) class RandomErasing(object): diff --git a/image_classification/DeiT/run_eval_multi.sh b/image_classification/DeiT/run_eval_multi.sh index fc36cb0c..04188391 100644 --- a/image_classification/DeiT/run_eval_multi.sh +++ b/image_classification/DeiT/run_eval_multi.sh @@ -1,9 +1,10 @@ -CUDA_VISIBLE_DEVICES=0,1,2,3 \ +#CUDA_VISIBLE_DEVICES=0,1,2,3 \ +CUDA_VISIBLE_DEVICES=4,5,6,7 \ python main_multi_gpu.py \ --cfg='./configs/deit_base_patch16_224.yaml' \ +-cfg='./configs/deit_tiny_patch16_224.yaml' \ -dataset='imagenet2012' \ --batch_size=16 \ +-batch_size=32 \ -data_path='/dataset/imagenet' \ -eval \ --pretrained='./deit_base_distilled_patch16_224' \ +-pretrained='./deit_tiny_distilled_patch16_224' \ -ngpus=4 diff --git 
a/image_classification/DeiT/run_train.sh b/image_classification/DeiT/run_train.sh index 8452dd92..7736318b 100644 --- a/image_classification/DeiT/run_train.sh +++ b/image_classification/DeiT/run_train.sh @@ -4,5 +4,6 @@ python main_single_gpu.py \ -dataset='imagenet2012' \ -batch_size=4 \ -data_path='/dataset/imagenet' \ --teacher_model='./regnety_160' +-teacher_model='./regnety_160' \ +-amp #-pretrained='./deit_base_distilled_patch16_224' diff --git a/image_classification/DeiT/run_train_multi.sh b/image_classification/DeiT/run_train_multi.sh index 7ce3a4ab..1acc56ee 100644 --- a/image_classification/DeiT/run_train_multi.sh +++ b/image_classification/DeiT/run_train_multi.sh @@ -1,8 +1,9 @@ -CUDA_VISIBLE_DEVICES=4,5,6,7 \ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ python main_multi_gpu.py \ -cfg='./configs/deit_base_patch16_224.yaml' \ -dataset='imagenet2012' \ -batch_size=4 \ -data_path='/dataset/imagenet' \ --teacher_model='./regnety_160' +-teacher_model='./regnety_160' \ +-amp #-pretrained='./deit_base_distilled_patch16_224' diff --git a/image_classification/DeiT/run_train_multi_tiny.sh b/image_classification/DeiT/run_train_multi_tiny.sh new file mode 100644 index 00000000..964d9619 --- /dev/null +++ b/image_classification/DeiT/run_train_multi_tiny.sh @@ -0,0 +1,9 @@ +CUDA_VISIBLE_DEVICES=4,5,6,7 \ +python main_multi_gpu.py \ +-cfg='./configs/deit_tiny_patch16_224.yaml' \ +-dataset='imagenet2012' \ +-batch_size=200 \ +-data_path='/dataset/imagenet' \ +-teacher_model='./regnety_160' \ +-amp \ +#-pretrained='./deit_base_distilled_patch16_224' diff --git a/image_classification/DeiT/stats.py b/image_classification/DeiT/stats.py new file mode 100644 index 00000000..aa621a20 --- /dev/null +++ b/image_classification/DeiT/stats.py @@ -0,0 +1,61 @@ +import os +import glob +import paddle +from config import get_config +from deit import build_deit as build_model + +def count_gelu(layer, input, output): + activation_flops = 8 + x = input[0] + num = x.numel() + layer.total_ops += num * activation_flops + + +def count_softmax(layer, input, output): + softmax_flops = 5 # max/substract, exp, sum, divide + x = input[0] + num = x.numel() + layer.total_ops += num * softmax_flops + + +def count_layernorm(layer, input, output): + layer_norm_flops = 5 # get mean (sum), get variance (square and sum), scale(multiply) + x = input[0] + num = x.numel() + layer.total_ops += num * layer_norm_flops + + +cfg = './configs/deit_base_patch16_384.yaml' +input_size = (1, 3, 384, 384) +#input_size = (1, 3, 224, 224) +config = get_config(cfg) +model = build_model(config) + +custom_ops = {paddle.nn.GELU: count_gelu, + paddle.nn.LayerNorm: count_layernorm, + paddle.nn.Softmax: count_softmax, + } +print(os.path.basename(cfg)) +paddle.flops(model, + input_size=input_size, + custom_ops=custom_ops, + print_detail=False) + + +#for cfg in glob.glob('./configs/*.yaml'): +# #cfg = './configs/swin_base_patch4_window7_224.yaml' +# input_size = (1, 3, int(cfg[-8:-5]), int(cfg[-8:-5])) +# config = get_config(cfg) +# model = build_model(config) +# +# +# custom_ops = {paddle.nn.GELU: count_gelu, +# paddle.nn.LayerNorm: count_layernorm, +# paddle.nn.Softmax: count_softmax, +# } +# print(os.path.basename(cfg)) +# paddle.flops(model, +# input_size=input_size, +# custom_ops=custom_ops, +# print_detail=False) +# print('-----------') diff --git a/image_classification/DeiT/transforms.py b/image_classification/DeiT/transforms.py new file mode 100644 index 00000000..5a046912 --- /dev/null +++ b/image_classification/DeiT/transforms.py @@ -0,0 +1,14 @@ +import 
random +import paddle +import paddle.nn +import paddle.vision.transforms as T + + +class RandomHorizontalFlip(): + def __init__(self, p=0.5): + self.p = p + + def __call__(self, image): + if random.random() < self.p: + return T.hflip(image) + return image diff --git a/image_classification/CaiT/.config.py.swp b/image_classification/FF_Only/.run_eval.sh.swp similarity index 65% rename from image_classification/CaiT/.config.py.swp rename to image_classification/FF_Only/.run_eval.sh.swp index ed536a52..4613295c 100644 Binary files a/image_classification/CaiT/.config.py.swp and b/image_classification/FF_Only/.run_eval.sh.swp differ diff --git a/image_classification/FF_Only/README.md b/image_classification/FF_Only/README.md new file mode 100644 index 00000000..b682b0cb --- /dev/null +++ b/image_classification/FF_Only/README.md @@ -0,0 +1,171 @@ +# Do You Even Need Attention? A Stack of Feed-Forward Layers Does +Surprisingly Well on ImageNet, [arxiv](https://arxiv.org/abs/2105.02723) + +PaddlePaddle training/validation code and pretrained models for **FF_Only**. + +The official pytorch implementation is [here](https://github.com/lukemelas/do-you-even-need-attention). + + +This implementation is developed by [PaddleViT](https://github.com/BR-IDL/PaddleViT.git). + +

+drawing +

FF_Only Model Overview

+

+ + + + + +### Update +Update (2021-09-14): Code is released and ported weights are uploaded. + +## Models Zoo + +| Model | Acc@1 | Acc@5 | Image Size | Crop_pct | Interpolation | Link | +|--------------------------------|-------|-------|------------|----------|--------------|---------------| +| ff_tiny | 61.28 | 84.06 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/14bPRCwuY_nT852fBZxb9wzXzbPWNfbCG/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1nNE4Hh1Nrzl7FEiyaZutDA)(mjgd) | +| ff_base | 74.82 | 91.71 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1DHUg4oCi41ELazPCvYxCFeShPXE4wU3p/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1l-h6Cq4B8kZRvHKDTzhhUg)(m1jc) | + +> *The results are evaluated on ImageNet2012 validation set. +> +> Note: FF_Only weights are ported from [here](https://github.com/lukemelas/do-you-even-need-attention). + + + +## Notebooks +We provide a few notebooks in aistudio to help you get started: + +**\*(coming soon)\*** + + +## Requirements +- Python>=3.6 +- yaml>=0.2.5 +- [PaddlePaddle](https://www.paddlepaddle.org.cn/documentation/docs/en/install/index_en.html)>=2.1.0 +- [yacs](https://github.com/rbgirshick/yacs)>=0.1.8 + +## Data +ImageNet2012 dataset is used in the following folder structure: +``` +│imagenet/ +├──train/ +│ ├── n01440764 +│ │ ├── n01440764_10026.JPEG +│ │ ├── n01440764_10027.JPEG +│ │ ├── ...... +│ ├── ...... +├──val/ +│ ├── n01440764 +│ │ ├── ILSVRC2012_val_00000293.JPEG +│ │ ├── ILSVRC2012_val_00002138.JPEG +│ │ ├── ...... +│ ├── ...... +``` + +## Usage +To use the model with pretrained weights, download the `.pdparam` weight file and change related file paths in the following python scripts. The model config files are located in `./configs/`. + +For example, assume the downloaded weight file is stored in `./linear_base.pdparams`, to use the `linear_base` model in python: +```python +from config import get_config +from resmlp import build_res_mlp as build_model +# config files in ./configs/ +config = get_config('./configs/ff_base.yaml') +# build model +model = build_model(config) +# load pretrained weights +model_state_dict = paddle.load('./linear_base.pdparams') +model.set_dict(model_state_dict) +``` + +## Evaluation +To evaluate FF_Only model performance on ImageNet2012 with a single GPU, run the following script using command line: +```shell +sh run_eval.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg=./configs/ff_base.yaml \ + -dataset=imagenet2012 \ + -batch_size=8 \ + -data_path=/path/to/dataset/imagenet/val \ + -eval \ + -pretrained=/path/to/pretrained/model/linear_base # .pdparams is NOT needed +``` + +
+ + +Run evaluation using multi-GPUs: + + + +```shell +sh run_eval_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=4,5,6,7 \ +python main_multi_gpu.py \ + -cfg=./configs/ff_base.yaml \ + -dataset='imagenet2012' \ + -batch_size=16 \ + -data_path=/path/to/dataset/imagenet/val \ + -eval \ + -pretrained=/path/to/pretrained/model/linear_base # .pdparams is NOT needed +``` + +
+ +## Training +To train the FF_Only model on ImageNet2012 with single GPUs, run the following script using command line: +```shell +sh run_train.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg=./configs/ff_base.yaml \ + -dataset=imagenet2012 \ + -batch_size=32 \ + -data_path=/path/to/dataset/imagenet/train +``` + +
+ +Run training using multi-GPUs: + + + +```shell +sh run_train_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg=./configs/ff_base.yaml \ + -dataset=imagenet2012 \ + -batch_size=16 \ + -data_path=/path/to/dataset/imagenet/train +``` + +
+ + +## Visualization Attention Map +**(coming soon)** + +## Reference +``` +@article{melaskyriazi2021doyoueven, + title={Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet}, + author={Luke Melas-Kyriazi}, + journal=arxiv, + year=2021 +} +``` diff --git a/image_classification/FF_Only/__init__.py b/image_classification/FF_Only/__init__.py new file mode 100644 index 00000000..e2cbd538 --- /dev/null +++ b/image_classification/FF_Only/__init__.py @@ -0,0 +1 @@ +#init diff --git a/image_classification/FF_Only/augment.py b/image_classification/FF_Only/augment.py new file mode 100644 index 00000000..7a7f081c --- /dev/null +++ b/image_classification/FF_Only/augment.py @@ -0,0 +1,285 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Augmentation""" +""" Rand Augmentation """ +# reference: RandAugment: Practical automated data augmentation with a reduced search space +# https://arxiv.org/abs/1909.13719 + +""" Auto Augmentation """ +# reference: AutoAugment: Learning Augmentation Policies from Data +# https://arxiv.org/abs/1805.09501 + +import random +import numpy as np +from PIL import Image, ImageEnhance, ImageOps + + +def auto_augment_policy_original(): + """25 types of augment policies in original paper""" + policy = [ + [('Posterize', 0.4, 8), ('Rotate', 0.6, 9)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + [('Posterize', 0.6, 7), ('Posterize', 0.6, 6)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Equalize', 0.4, 4), ('Rotate', 0.8, 8)], + [('Solarize', 0.6, 3), ('Equalize', 0.6, 7)], + [('Posterize', 0.8, 5), ('Equalize', 1.0, 2)], + [('Rotate', 0.2, 3), ('Solarize', 0.6, 8)], + [('Equalize', 0.6, 8), ('Posterize', 0.4, 6)], + [('Rotate', 0.8, 8), ('Color', 0.4, 0)], + [('Rotate', 0.4, 9), ('Equalize', 0.6, 2)], + [('Equalize', 0.0, 7), ('Equalize', 0.8, 8)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Rotate', 0.8, 8), ('Color', 1.0, 2)], + [('Color', 0.8, 8), ('Solarize', 0.8, 7)], + [('Sharpness', 0.4, 7), ('Invert', 0.6, 8)], + [('ShearX', 0.6, 5), ('Equalize', 1.0, 9)], + [('Color', 0.4, 0), ('Equalize', 0.6, 3)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + ] + policy = [[SubPolicy(*args) for args in subpolicy] for subpolicy in policy] + return policy + + +def rand_augment_policy_original(magnitude_idx=9): + """ + 14 types of augment policies in original paper + Args: + magnitude_idx: M + """ + policy = [ + ('Posterize', 1, magnitude_idx), ('Rotate', 1, magnitude_idx), + ('Solarize', 1, magnitude_idx), ('AutoContrast', 1, magnitude_idx), + ('Equalize', 1, magnitude_idx), ('Contrast', 1, magnitude_idx), + ('Color', 1, magnitude_idx), ('Invert', 1, magnitude_idx), + ('Sharpness', 1, 
magnitude_idx), ('Brightness', 1, magnitude_idx), + ('ShearX', 1, magnitude_idx), ('ShearY', 1, magnitude_idx), + ('TranslateX', 1, magnitude_idx), ('TranslateY', 1, magnitude_idx), + ] + policy = [SubPolicy(*args) for args in policy] + return policy + + +class AutoAugment(): + """Auto Augment + Randomly choose a tuple of augment ops from a list of policy + Then apply the tuple of augment ops to input image + + Examples: + policy = auto_augment_policy_original() + augment = AutoAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy): + self.policy = policy + + def __call__(self, image, policy_idx=None): + if policy_idx is None: + policy_idx = random.randint(0, len(self.policy) - 1) + + sub_policy = self.policy[policy_idx] + for op in sub_policy: + image = op(image) + return image + + +class RandAugment(): + """Rand Augment + Randomly choose N augment ops from a list of K policies + Then apply the N ops to input image + + Examples: + policy = rand_augment_policy_original(magnitude_idx) + augment = RandAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy, num_layers=2): + """ + Args: + policy: list of SubPolicy + num_layers: int + """ + self.policy = policy + self.num_layers = num_layers + + def __call__(self, image): + selected_idx = np.random.choice(len(self.policy), self.num_layers) + + for policy_idx in selected_idx: + sub_policy = self.policy[policy_idx] + image = sub_policy(image) + return image + + +class SubPolicy: + """Subpolicy + Read augment name and magnitude, apply augment with probability + Args: + op_name: str, augment operation name + prob: float, if prob > random prob, apply augment + magnitude_idx: int, index of magnitude in preset magnitude ranges + """ + + def __init__(self, op_name, prob, magnitude_idx): + # ranges of operations' magnitude + ranges = { + 'ShearX': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'ShearY': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'TranslateX': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'TranslateY': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'Rotate': np.linspace(0, 30, 10), # [-30, 30] (by random negative) + 'Color': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Posterize': np.round(np.linspace(8, 4, 10), 0).astype(np.int), # [0, 4] + 'Solarize': np.linspace(256, 0, 10), # [0, 256] + 'Contrast': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Sharpness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Brightness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'AutoContrast': [0] * 10, # no range + 'Equalize': [0] * 10, # no range + 'Invert': [0] * 10, # no range + } + + # augmentation operations + # Lambda is not pickleable for DDP + # image_ops = { + # 'ShearX': lambda image, magnitude: shear_x(image, magnitude), + # 'ShearY': lambda image, magnitude: shear_y(image, magnitude), + # 'TranslateX': lambda image, magnitude: translate_x(image, magnitude), + # 'TranslateY': lambda image, magnitude: translate_y(image, magnitude), + # 'Rotate': lambda image, magnitude: rotate(image, magnitude), + # 'AutoContrast': lambda image, magnitude: auto_contrast(image, magnitude), + # 'Invert': lambda image, magnitude: invert(image, magnitude), + # 'Equalize': lambda image, magnitude: equalize(image, magnitude), + # 'Solarize': lambda image, magnitude: solarize(image, magnitude), + # 'Posterize': lambda image, magnitude: posterize(image, 
magnitude), + # 'Contrast': lambda image, magnitude: contrast(image, magnitude), + # 'Color': lambda image, magnitude: color(image, magnitude), + # 'Brightness': lambda image, magnitude: brightness(image, magnitude), + # 'Sharpness': lambda image, magnitude: sharpness(image, magnitude), + # } + image_ops = { + 'ShearX': shear_x, + 'ShearY': shear_y, + 'TranslateX': translate_x_relative, + 'TranslateY': translate_y_relative, + 'Rotate': rotate, + 'AutoContrast': auto_contrast, + 'Invert': invert, + 'Equalize': equalize, + 'Solarize': solarize, + 'Posterize': posterize, + 'Contrast': contrast, + 'Color': color, + 'Brightness': brightness, + 'Sharpness': sharpness, + } + + self.prob = prob + self.magnitude = ranges[op_name][magnitude_idx] + self.op = image_ops[op_name] + + def __call__(self, image): + if self.prob > random.random(): + image = self.op(image, self.magnitude) + return image + + +# PIL Image transforms +# https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.transform +def shear_x(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, factor, 0, 0, 1, 0), fillcolor=fillcolor) + + +def shear_y(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, factor, 1, 0), fillcolor=fillcolor) + + +def translate_x_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, pixels, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, pixels), fillcolor=fillcolor) + + +def translate_x_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, magnitude, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, magnitude), fillcolor=fillcolor) + + +def rotate(image, magnitude): + rot = image.convert("RGBA").rotate(magnitude) + return Image.composite(rot, + Image.new('RGBA', rot.size, (128,) * 4), + rot).convert(image.mode) + + +def auto_contrast(image, magnitude=None): + return ImageOps.autocontrast(image) + + +def invert(image, magnitude=None): + return ImageOps.invert(image) + + +def equalize(image, magnitude=None): + return ImageOps.equalize(image) + + +def solarize(image, magnitude): + return ImageOps.solarize(image, magnitude) + + +def posterize(image, magnitude): + return ImageOps.posterize(image, magnitude) + + +def contrast(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Contrast(image).enhance(1 + magnitude) + + +def color(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Color(image).enhance(1 + magnitude) + + +def brightness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Brightness(image).enhance(1 + 
magnitude) + + +def sharpness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Sharpness(image).enhance(1 + magnitude) + diff --git a/image_classification/FF_Only/config.py b/image_classification/FF_Only/config.py new file mode 100644 index 00000000..47ceef42 --- /dev/null +++ b/image_classification/FF_Only/config.py @@ -0,0 +1,174 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Configuration + +Configuration for data, model archtecture, and training, etc. +Config can be set by .yaml file or by argparser(limited usage) + + +""" + +import os +from yacs.config import CfgNode as CN +import yaml + +_C = CN() +_C.BASE = [''] + +# data settings +_C.DATA = CN() +_C.DATA.BATCH_SIZE = 8 #1024 batch_size for single GPU +_C.DATA.BATCH_SIZE_EVAL = 8 #1024 batch_size for single GPU +_C.DATA.DATA_PATH = '/dataset/imagenet/' # path to dataset +_C.DATA.DATASET = 'imagenet2012' # dataset name +_C.DATA.IMAGE_SIZE = 224 # input image size +_C.DATA.CROP_PCT = 0.9 # input image scale ratio, scale is applied before centercrop in eval mode +_C.DATA.NUM_WORKERS = 4 # number of data loading threads +_C.DATA.IMAGENET_MEAN = [0.485, 0.456, 0.406] # [0.5, 0.5, 0.5] +_C.DATA.IMAGENET_STD = [0.229, 0.224, 0.225] # [0.5, 0.5, 0.5] + +# model settings +_C.MODEL = CN() +_C.MODEL.TYPE = 'FF_Only' +_C.MODEL.NAME = 'ff_tiny' +_C.MODEL.RESUME = None +_C.MODEL.PRETRAINED = None +_C.MODEL.NUM_CLASSES = 1000 +_C.MODEL.DROPOUT = 0.0 +_C.MODEL.ATTENTION_DROPOUT = 0.0 +_C.MODEL.DROP_PATH = 0.1 + +# transformer settings +_C.MODEL.MIXER = CN() +_C.MODEL.MIXER.EMBED_DIM = 768 + +# training settings +_C.TRAIN = CN() +_C.TRAIN.LAST_EPOCH = 0 +_C.TRAIN.NUM_EPOCHS = 300 +_C.TRAIN.WARMUP_EPOCHS = 20 +_C.TRAIN.WEIGHT_DECAY = 0.05 +_C.TRAIN.BASE_LR = 5e-4 +_C.TRAIN.WARMUP_START_LR = 5e-7 +_C.TRAIN.END_LR = 5e-6 +_C.TRAIN.GRAD_CLIP = 5.0 +_C.TRAIN.ACCUM_ITER = 1 +_C.TRAIN.LINEAR_SCALED_LR = None + +_C.TRAIN.LR_SCHEDULER = CN() +_C.TRAIN.LR_SCHEDULER.NAME = 'warmupcosine' +_C.TRAIN.LR_SCHEDULER.MILESTONES = "30, 60, 90" # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_EPOCHS = 30 # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_RATE = 0.1 # only used in StepLRScheduler + +_C.TRAIN.OPTIMIZER = CN() +_C.TRAIN.OPTIMIZER.NAME = 'AdamW' +_C.TRAIN.OPTIMIZER.EPS = 1e-8 +_C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999) # for adamW +_C.TRAIN.OPTIMIZER.MOMENTUM = 0.9 + +# train augmentation +_C.TRAIN.MIXUP_ALPHA = 0.8 +_C.TRAIN.CUTMIX_ALPHA = 1.0 +_C.TRAIN.CUTMIX_MINMAX = None +_C.TRAIN.MIXUP_PROB = 1.0 +_C.TRAIN.MIXUP_SWITCH_PROB = 0.5 +_C.TRAIN.MIXUP_MODE = 'batch' + +_C.TRAIN.SMOOTHING = 0.1 +_C.TRAIN.COLOR_JITTER = 0.4 +_C.TRAIN.AUTO_AUGMENT = False #'rand-m9-mstd0.5-inc1' +_C.TRAIN.RAND_AUGMENT = False + +_C.TRAIN.RANDOM_ERASE_PROB = 0.25 +_C.TRAIN.RANDOM_ERASE_MODE = 'pixel' +_C.TRAIN.RANDOM_ERASE_COUNT = 1 +_C.TRAIN.RANDOM_ERASE_SPLIT = False + +# misc +_C.SAVE = "./output" +_C.TAG = "default" +_C.SAVE_FREQ = 20 # 
freq to save chpt +_C.REPORT_FREQ = 50 # freq to logging info +_C.VALIDATE_FREQ = 20 # freq to do validation +_C.SEED = 0 +_C.EVAL = False # run evaluation only +_C.AMP = False +_C.LOCAL_RANK = 0 +_C.NGPUS = -1 + + +def _update_config_from_file(config, cfg_file): + config.defrost() + with open(cfg_file, 'r') as infile: + yaml_cfg = yaml.load(infile, Loader=yaml.FullLoader) + for cfg in yaml_cfg.setdefault('BASE', ['']): + if cfg: + _update_config_from_file( + config, os.path.join(os.path.dirname(cfg_file), cfg) + ) + print('merging config from {}'.format(cfg_file)) + config.merge_from_file(cfg_file) + config.freeze() + +def update_config(config, args): + """Update config by ArgumentParser + Args: + args: ArgumentParser contains options + Return: + config: updated config + """ + if args.cfg: + _update_config_from_file(config, args.cfg) + config.defrost() + if args.dataset: + config.DATA.DATASET = args.dataset + if args.batch_size: + config.DATA.BATCH_SIZE = args.batch_size + if args.image_size: + config.DATA.IMAGE_SIZE = args.image_size + if args.num_classes: + config.MODEL.NUM_CLASSES = args.num_classes + if args.data_path: + config.DATA.DATA_PATH = args.data_path + if args.output is not None: + config.SAVE = args.output + if args.ngpus: + config.NGPUS = args.ngpus + if args.eval: + config.EVAL = True + config.DATA.BATCH_SIZE_EVAL = args.batch_size + if args.pretrained: + config.MODEL.PRETRAINED = args.pretrained + if args.resume: + config.MODEL.RESUME = args.resume + if args.last_epoch: + config.TRAIN.LAST_EPOCH = args.last_epoch + if args.amp: # only during training + if config.EVAL is True: + config.AMP = False + else: + config.AMP = True + + #config.freeze() + return config + + +def get_config(cfg_file=None): + """Return a clone of config or load from yaml file""" + config = _C.clone() + if cfg_file: + _update_config_from_file(config, cfg_file) + return config diff --git a/image_classification/FF_Only/configs/ff_base.yaml b/image_classification/FF_Only/configs/ff_base.yaml new file mode 100644 index 00000000..b227d443 --- /dev/null +++ b/image_classification/FF_Only/configs/ff_base.yaml @@ -0,0 +1,10 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: FF_Only + NAME: ff_base + MIXER: + EMBED_DIM: 768 + + diff --git a/image_classification/FF_Only/configs/ff_tiny.yaml b/image_classification/FF_Only/configs/ff_tiny.yaml new file mode 100644 index 00000000..72d5c0c4 --- /dev/null +++ b/image_classification/FF_Only/configs/ff_tiny.yaml @@ -0,0 +1,9 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: FF_Only + NAME: ff_tiny + MIXER: + EMBED_DIM: 192 + diff --git a/image_classification/FF_Only/datasets.py b/image_classification/FF_Only/datasets.py new file mode 100644 index 00000000..304df9a3 --- /dev/null +++ b/image_classification/FF_Only/datasets.py @@ -0,0 +1,222 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +""" +Dataset related classes and methods for ViT training and validation +Cifar10, Cifar100 and ImageNet2012 are supported +""" + +import os +import math +from PIL import Image +from paddle.io import Dataset +from paddle.io import DataLoader +from paddle.io import DistributedBatchSampler +from paddle.vision import transforms +from paddle.vision import datasets +from paddle.vision import image_load +from augment import auto_augment_policy_original +from augment import AutoAugment +from augment import rand_augment_policy_original +from augment import RandAugment +from transforms import RandomHorizontalFlip +from random_erasing import RandomErasing + + +class ImageNet2012Dataset(Dataset): + """Build ImageNet2012 dataset + + This class gets train/val imagenet datasets, which loads transfomed data and labels. + + Attributes: + file_folder: path where imagenet images are stored + transform: preprocessing ops to apply on image + img_path_list: list of full path of images in whole dataset + label_list: list of labels of whole dataset + """ + + def __init__(self, file_folder, mode="train", transform=None): + """Init ImageNet2012 Dataset with dataset file path, mode(train/val), and transform""" + super(ImageNet2012Dataset, self).__init__() + assert mode in ["train", "val"] + self.file_folder = file_folder + self.transform = transform + self.img_path_list = [] + self.label_list = [] + + if mode == "train": + self.list_file = os.path.join(self.file_folder, "train_list.txt") + else: + self.list_file = os.path.join(self.file_folder, "val_list.txt") + + with open(self.list_file, 'r') as infile: + for line in infile: + img_path = line.strip().split()[0] + img_label = int(line.strip().split()[1]) + self.img_path_list.append(os.path.join(self.file_folder, img_path)) + self.label_list.append(img_label) + print(f'----- Imagenet2012 image {mode} list len = {len(self.label_list)}') + + def __len__(self): + return len(self.label_list) + + def __getitem__(self, index): + data = image_load(self.img_path_list[index]).convert('RGB') + data = self.transform(data) + label = self.label_list[index] + + return data, label + + +def get_train_transforms(config): + """ Get training transforms + + For training, a RandomResizedCrop is applied, then normalization is applied with + [0.5, 0.5, 0.5] mean and std. The input pixel values must be rescaled to [0, 1.] 
+ Outputs is converted to tensor + + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + + aug_op_list = [] + # STEP1: random crop and resize + aug_op_list.append( + transforms.RandomResizedCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE), + scale=(0.05, 1.0), interpolation='bicubic')) + # STEP2: auto_augment or color jitter + if config.TRAIN.AUTO_AUGMENT: + policy = auto_augment_policy_original() + auto_augment = AutoAugment(policy) + aug_op_list.append(auto_augment) + elif config.TRAIN.RAND_AUGMENT: + policy = rand_augment_policy_original() + rand_augment = RandAugment(policy) + aug_op_list.append(rand_augment) + else: + jitter = (float(config.TRAIN.COLOR_JITTER), ) * 3 + aug_op_list.append(transforms.ColorJitter(*jitter)) + # STEP3: other ops + aug_op_list.append(transforms.ToTensor()) + aug_op_list.append(transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, + std=config.DATA.IMAGENET_STD)) + # STEP4: random erasing + if config.TRAIN.RANDOM_ERASE_PROB > 0.: + random_erasing = RandomErasing(prob=config.TRAIN.RANDOM_ERASE_PROB, + mode=config.TRAIN.RANDOM_ERASE_MODE, + max_count=config.TRAIN.RANDOM_ERASE_COUNT, + num_splits=config.TRAIN.RANDOM_ERASE_SPLIT) + aug_op_list.append(random_erasing) + # Final: compose transforms and return + transforms_train = transforms.Compose(aug_op_list) + return transforms_train + + +def get_val_transforms(config): + """ Get training transforms + + For validation, image is first Resize then CenterCrop to image_size. + Then normalization is applied with [0.5, 0.5, 0.5] mean and std. + The input pixel values must be rescaled to [0, 1.] + Outputs is converted to tensor + + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + + scale_size = int(math.floor(config.DATA.IMAGE_SIZE / config.DATA.CROP_PCT)) + transforms_val = transforms.Compose([ + transforms.Resize(scale_size, interpolation='bicubic'), + transforms.CenterCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE)), + transforms.ToTensor(), + transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD), + ]) + return transforms_val + + +def get_dataset(config, mode='train'): + """ Get dataset from config and mode (train/val) + + Returns the related dataset object according to configs and mode(train/val) + + Args: + config: configs contains dataset related settings. 
+            see config.py for details
+        mode: string, 'train' or 'val'
+    Returns:
+        dataset: dataset object
+    """
+
+    assert mode in ['train', 'val']
+    if config.DATA.DATASET == "cifar10":
+        if mode == 'train':
+            dataset = datasets.Cifar10(mode=mode, transform=get_train_transforms(config))
+        else:
+            mode = 'test'
+            dataset = datasets.Cifar10(mode=mode, transform=get_val_transforms(config))
+    elif config.DATA.DATASET == "cifar100":
+        if mode == 'train':
+            dataset = datasets.Cifar100(mode=mode, transform=get_train_transforms(config))
+        else:
+            mode = 'test'
+            dataset = datasets.Cifar100(mode=mode, transform=get_val_transforms(config))
+    elif config.DATA.DATASET == "imagenet2012":
+        if mode == 'train':
+            dataset = ImageNet2012Dataset(config.DATA.DATA_PATH,
+                                          mode=mode,
+                                          transform=get_train_transforms(config))
+        else:
+            dataset = ImageNet2012Dataset(config.DATA.DATA_PATH,
+                                          mode=mode,
+                                          transform=get_val_transforms(config))
+    else:
+        raise NotImplementedError(
+            f"[{config.DATA.DATASET}] Only cifar10, cifar100, imagenet2012 are supported now")
+    return dataset
+
+
+def get_dataloader(config, dataset, mode='train', multi_process=False):
+    """Get dataloader from config, dataset and mode (train/val); supports multi-GPU settings.
+
+    The multi-GPU loader is implemented with DistributedBatchSampler.
+
+    Args:
+        config: see config.py for details
+        dataset: paddle.io.dataset object
+        mode: train/val
+        multi_process: if True, use DistributedBatchSampler to support multi-processing
+    Returns:
+        dataloader: paddle.io.DataLoader object.
+    """
+
+    if mode == 'train':
+        batch_size = config.DATA.BATCH_SIZE
+    else:
+        batch_size = config.DATA.BATCH_SIZE_EVAL
+
+    if multi_process is True:
+        sampler = DistributedBatchSampler(dataset,
+                                          batch_size=batch_size,
+                                          shuffle=(mode == 'train'))
+        dataloader = DataLoader(dataset,
+                                batch_sampler=sampler,
+                                num_workers=config.DATA.NUM_WORKERS)
+    else:
+        dataloader = DataLoader(dataset,
+                                batch_size=batch_size,
+                                num_workers=config.DATA.NUM_WORKERS,
+                                shuffle=(mode == 'train'))
+    return dataloader
diff --git a/image_classification/FF_Only/droppath.py b/image_classification/FF_Only/droppath.py
new file mode 100644
index 00000000..c8fe8048
--- /dev/null
+++ b/image_classification/FF_Only/droppath.py
@@ -0,0 +1,50 @@
+# Copyright (c) 2021 PPViT Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+DropPath, reimplemented from https://github.com/yueatsprograms/Stochastic_Depth
+"""
+
+import paddle
+import paddle.nn as nn
+
+def drop_path(inputs, drop_prob=0., training=False):
+    """drop path op
+    Args:
+        inputs: tensor with arbitrary shape
+        drop_prob: float, drop path probability, default: 0.0
+        training: bool, if current mode is training, default: False
+    Returns:
+        output: output tensor after drop path
+    """
+    # if prob is 0 or eval mode, return original input
+    if drop_prob == 0.
or not training: + return inputs + keep_prob = 1 - drop_prob + keep_prob = paddle.to_tensor(keep_prob) + shape = (inputs.shape[0], ) + (1, ) * (inputs.ndim - 1) # shape=(N, 1, 1, 1) + random_tensor = keep_prob + paddle.rand(shape, dtype=inputs.dtype) + random_tensor = random_tensor.floor() # mask + output = inputs.divide(keep_prob) * random_tensor # divide is to keep same output expectation + return output + + +class DropPath(nn.Layer): + """DropPath class""" + def __init__(self, drop_prob=None): + super(DropPath, self).__init__() + self.drop_prob = drop_prob + + def forward(self, inputs): + return drop_path(inputs, self.drop_prob, self.training) diff --git a/image_classification/FF_Only/ffonly.png b/image_classification/FF_Only/ffonly.png new file mode 100644 index 00000000..8be9ea76 Binary files /dev/null and b/image_classification/FF_Only/ffonly.png differ diff --git a/image_classification/FF_Only/ffonly.py b/image_classification/FF_Only/ffonly.py new file mode 100644 index 00000000..91936f59 --- /dev/null +++ b/image_classification/FF_Only/ffonly.py @@ -0,0 +1,293 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Implement MLP Class for FF_Only +""" + +from functools import partial + +import paddle +import paddle.nn.functional as F +from paddle import nn + +from droppath import DropPath + +trunc_normal_ = nn.initializer.TruncatedNormal(std=0.02) +zeros_ = nn.initializer.Constant(value=0.0) +ones_ = nn.initializer.Constant(value=1.0) +kaiming_normal_ = nn.initializer.KaimingNormal() + + +class Identity(nn.Layer): + """Identity layer + + The output of this layer is the input without any change. + Use this layer to avoid if condition in some forward methods. + """ + + def __init__(self): + super().__init__() + + def forward(self, x): + return x + + +class Mlp(nn.Layer): + """MLP module + + MLP using nn.Linear and activation is GELU, dropout is applied. + Ops: fc -> dwconv -> act -> dropout -> fc -> dropout + + Args: + in_features (int): input features. + hidden_features (int): hidden features. + out_features (int): output features. + act_layer (nn.Layer): activation. + drop (float): dropout. 
+ """ + + def __init__( + self, + in_features, + hidden_features=None, + out_features=None, + act_layer=nn.GELU, + drop=0.0, + ): + + super().__init__() + out_features = out_features or in_features + hidden_features = hidden_features or in_features + self.fc1 = nn.Linear(in_features, hidden_features) + self.act = act_layer() + self.fc2 = nn.Linear(hidden_features, out_features) + self.drop = nn.Dropout(drop) + + def forward(self, x): + x = self.fc1(x) + x = self.act(x) + x = self.drop(x) + x = self.fc2(x) + x = self.drop(x) + return x + + +class LinearBlock(nn.Layer): + """Basic model components""" + + def __init__( + self, + dim, + mlp_ratio=4.0, + drop=0.0, + drop_path=0.0, + act_layer=nn.GELU, + norm_layer=nn.LayerNorm, + num_tokens=197, + ): + super().__init__() + + # First stage + self.mlp1 = Mlp( + in_features=dim, + hidden_features=int(dim * mlp_ratio), + act_layer=act_layer, + drop=drop, + ) + self.norm1 = norm_layer(dim) + + # Second stage + self.mlp2 = Mlp( + in_features=num_tokens, + hidden_features=int(num_tokens * mlp_ratio), + act_layer=act_layer, + drop=drop, + ) + self.norm2 = norm_layer(num_tokens) + + # Dropout (or a variant) + self.drop_path = DropPath(drop_path) if drop_path > 0.0 else Identity() + + def forward(self, x): + x = x + self.drop_path(self.mlp1(self.norm1(x))) + x = x.transpose([0, 2, 1]) + x = x + self.drop_path(self.mlp2(self.norm2(x))) + x = x.transpose([0, 2, 1]) + return x + + +class PatchEmbed(nn.Layer): + """Wraps a convolution""" + + def __init__(self, patch_size=16, in_chans=3, embed_dim=768): + super().__init__() + self.proj = nn.Conv2D( + in_chans, embed_dim, kernel_size=patch_size, stride=patch_size + ) + + def forward(self, x): + x = self.proj(x) + return x + + +class LearnedPositionalEncoding(nn.Layer): + """Learned positional encoding with dynamic interpolation at runtime""" + + def __init__(self, height, width, embed_dim): + super().__init__() + self.height = height + self.width = width + + self.pos_embed = self.create_parameter( + shape=[1, embed_dim, height, width], default_initializer=trunc_normal_ + ) + self.add_parameter("pos_embed", self.pos_embed) + + self.cls_pos_embed = self.create_parameter( + shape=[1, 1, embed_dim], default_initializer=trunc_normal_ + ) + self.add_parameter("cls_pos_embed", self.cls_pos_embed) + + def forward(self, x): + _, _, H, W = x.shape + if H == self.height and W == self.width: + pos_embed = self.pos_embed + else: + pos_embed = F.interpolate( + self.pos_embed, size=[H, W], mode="bilinear", align_corners=False + ) + return self.cls_pos_embed, pos_embed + + +class LinearVisionTransformer(nn.Layer): + """ + Basically the same as the standard Vision Transformer, but with support for resizable + or sinusoidal positional embeddings. 
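+
+    Note: in place of self-attention, each LinearBlock mixes information across tokens
+    with a second MLP applied over the token dimension (see mlp2/norm2 and the transposes
+    in LinearBlock.forward), so the blocks are purely feed-forward ("FF only").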
+ """ + + def __init__( + self, + *, + patch_size=16, + in_chans=3, + num_classes=1000, + embed_dim=768, + depth=12, + mlp_ratio=4.0, + drop_rate=0.0, + drop_path_rate=0.0, + norm_layer=partial(nn.LayerNorm, epsilon=1e-6), + positional_encoding="learned", + learned_positional_encoding_size=(14, 14), + block_cls=LinearBlock + ): + super().__init__() + + # Config + self.num_classes = num_classes + self.patch_size = patch_size + self.num_features = self.embed_dim = embed_dim + + # Patch embedding + self.patch_embed = PatchEmbed( + patch_size=patch_size, in_chans=in_chans, embed_dim=embed_dim + ) + + # Class token + self.cls_token = self.create_parameter( + shape=[1, 1, embed_dim], default_initializer=trunc_normal_ + ) + self.add_parameter("cls_token", self.cls_token) + + # Positional encoding + if positional_encoding == "learned": + ( + height, + width, + ) = self.learned_positional_encoding_size = learned_positional_encoding_size + self.pos_encoding = LearnedPositionalEncoding(height, width, embed_dim) + else: + raise NotImplementedError("Unsupposed positional encoding") + self.pos_drop = nn.Dropout(p=drop_rate) + + # Stochastic depth + dpr = [x.item() for x in paddle.linspace(0, drop_path_rate, depth)] + self.blocks = nn.LayerList( + [ + block_cls( + dim=embed_dim, + mlp_ratio=mlp_ratio, + drop=drop_rate, + drop_path=dpr[i], + norm_layer=norm_layer, + num_tokens=1 + (224 // patch_size) ** 2, + ) + for i in range(depth) + ] + ) + self.norm = norm_layer(embed_dim) + + # Classifier head + self.head = nn.Linear(embed_dim, num_classes) if num_classes > 0 else Identity() + self.apply(self._init_weights) + + def _init_weights(self, m): + if isinstance(m, nn.Linear): + trunc_normal_(m.weight) + if isinstance(m, nn.Linear) and m.bias is not None: + zeros_(m.bias) + elif isinstance(m, nn.LayerNorm): + zeros_(m.bias) + ones_(m.weight) + + def forward_features(self, x): + + # Patch embedding + B, _, _, _ = x.shape # B x C x H x W + x = self.patch_embed(x) # B x E x H//p x W//p + + # Positional encoding + # NOTE: cls_pos_embed for compatibility with pretrained models + cls_pos_embed, pos_embed = self.pos_encoding(x) + + # Flatten image, append class token, add positional encoding + cls_tokens = self.cls_token.expand([B, -1, -1]) + x = x.flatten(2).transpose([0, 2, 1]) # flatten + x = paddle.concat((cls_tokens, x), axis=1) # class token + pos_embed = pos_embed.flatten(2).transpose([0, 2, 1]) # flatten + pos_embed = paddle.concat([cls_pos_embed, pos_embed], axis=1) # class pos emb + x = x + pos_embed + x = self.pos_drop(x) + + # Transformer + for blk in self.blocks: + x = blk(x) + + # Final layernorm + x = self.norm(x) + return x[:, 0] + + def forward(self, x): + x = self.forward_features(x) + x = self.head(x) + return x + + +def build_ffonly(config): + model = LinearVisionTransformer( + num_classes=config.MODEL.NUM_CLASSES, + embed_dim=config.MODEL.MIXER.EMBED_DIM, + ) + return model diff --git a/image_classification/FF_Only/losses.py b/image_classification/FF_Only/losses.py new file mode 100644 index 00000000..082467a3 --- /dev/null +++ b/image_classification/FF_Only/losses.py @@ -0,0 +1,123 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" Implement Loss functions """ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class LabelSmoothingCrossEntropyLoss(nn.Layer): + """ cross entropy loss for label smoothing + Args: + smoothing: float, smoothing rate + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, target label with shape [N] + Return: + loss: float, cross entropy loss value + """ + def __init__(self, smoothing=0.1): + super().__init__() + assert 0 <= smoothing < 1.0 + self.smoothing = smoothing + self.confidence = 1 - smoothing + + def forward(self, x, target): + log_probs = F.log_softmax(x) # [N, num_classes] + # target_index is used to get prob for each of the N samples + target_index = paddle.zeros([x.shape[0], 2], dtype='int64') # [N, 2] + target_index[:, 0] = paddle.arange(x.shape[0]) + target_index[:, 1] = target + + nll_loss = -log_probs.gather_nd(index=target_index) # index: [N] + smooth_loss = -log_probs.mean(axis=-1) + loss = self.confidence * nll_loss + self.smoothing * smooth_loss + return loss.mean() + + +class SoftTargetCrossEntropyLoss(nn.Layer): + """ cross entropy loss for soft target + Args: + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, soft target with shape [N, num_classes] + Returns: + loss: float, the mean loss value + """ + def __init__(self): + super().__init__() + + def forward(self, x, target): + loss = paddle.sum(-target * F.log_softmax(x, axis=-1), axis=-1) + return loss.mean() + + +class DistillationLoss(nn.Layer): + """Distillation loss function + This layer includes the orginal loss (criterion) and a extra + distillation loss (criterion), which computes the loss with + different type options, between current model and + a teacher model as its supervision. 
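+
+    The combined loss computed in forward() is
+    base_loss * (1 - alpha) + distillation_loss * alpha.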
+
+    Args:
+        base_criterion: nn.Layer, the original criterion
+        teacher_model: nn.Layer, the teacher model as supervision
+        distillation_type: str, one of ['none', 'soft', 'hard']
+        alpha: float, ratio between base loss (* (1 - alpha))
+            and distillation loss (* alpha)
+        tau: float, temperature in distillation
+    """
+    def __init__(self,
+                 base_criterion,
+                 teacher_model,
+                 distillation_type,
+                 alpha,
+                 tau):
+        super().__init__()
+        assert distillation_type in ['none', 'soft', 'hard']
+        self.base_criterion = base_criterion
+        self.teacher_model = teacher_model
+        self.type = distillation_type
+        self.alpha = alpha
+        self.tau = tau
+
+    def forward(self, inputs, outputs, targets):
+        """
+        Args:
+            inputs: tensor, the original model inputs
+            outputs: the model outputs; a pair (outputs, outputs_kd), where
+                outputs_kd is the distillation output of the model, usually
+                obtained by a separate branch in the last layer of the model
+            targets: tensor, the labels for the base criterion
+        """
+        outputs, outputs_kd = outputs[0], outputs[1]
+        base_loss = self.base_criterion(outputs, targets)
+        if self.type == 'none':
+            return base_loss
+
+        with paddle.no_grad():
+            teacher_outputs = self.teacher_model(inputs)
+
+        if self.type == 'soft':
+            distillation_loss = F.kl_div(
+                F.log_softmax(outputs_kd / self.tau, axis=1),
+                F.log_softmax(teacher_outputs / self.tau, axis=1),
+                reduction='sum') * (self.tau * self.tau) / outputs_kd.numel()
+        elif self.type == 'hard':
+            distillation_loss = F.cross_entropy(outputs_kd, teacher_outputs.argmax(axis=1))
+
+        loss = base_loss * (1 - self.alpha) + distillation_loss * self.alpha
+        return loss
+
+
diff --git a/image_classification/FF_Only/main_multi_gpu.py b/image_classification/FF_Only/main_multi_gpu.py
new file mode 100644
index 00000000..489688a3
--- /dev/null
+++ b/image_classification/FF_Only/main_multi_gpu.py
@@ -0,0 +1,581 @@
+# Copyright (c) 2021 PPViT Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+ +"""FFOnly training/validation using multiple GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +import paddle.distributed as dist +from datasets import get_dataloader +from datasets import get_dataset +from utils import AverageMeter +from utils import WarmupCosineScheduler +from config import get_config +from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from ffonly import build_ffonly as build_model + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('FFOnly') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + mixup_fn=None, + amp=False, + local_logger=None, + master_logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + train_loss_meter.avg + train_acc_meter.avg + train_time + """ + model.train() + train_loss_meter = AverageMeter() + train_acc_meter = AverageMeter() + master_train_loss_meter = AverageMeter() + master_train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + 
label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() + + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = paddle.to_tensor(image.shape[0]) + + # sync from other gpus for overall loss and acc + master_loss = loss.clone() + master_acc = acc.clone() + master_batch_size = batch_size.clone() + dist.all_reduce(master_loss) + dist.all_reduce(master_acc) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc = master_acc / dist.get_world_size() + master_train_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_train_acc_meter.update(master_acc.numpy()[0], master_batch_size.numpy()[0]) + + train_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + train_acc_meter.update(acc.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_train_loss_meter.avg:.4f}, " + + f"Avg Acc: {master_train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return (train_loss_meter.avg, + train_acc_meter.avg, + master_train_loss_meter.avg, + master_train_acc_meter.avg, + train_time) + + +def validate(dataloader, + model, + criterion, + total_batch, + debug_steps=100, + local_logger=None, + master_logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + total_epoch: int, total num of epoch, for logging + debug_steps: int, num of iters to log info, default: 100 + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + master_val_loss_meter.avg: float, average loss on all processes/gpus + master_val_acc1_meter.avg: float, average top1 accuracy on all processes/gpus + master_val_acc5_meter.avg: float, average top5 accuracy on all processes/gpus + val_time: float, validation time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = AverageMeter() + 
master_val_loss_meter = AverageMeter() + master_val_acc1_meter = AverageMeter() + master_val_acc5_meter = AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = paddle.to_tensor(image.shape[0]) + + master_loss = loss.clone() + master_acc1 = acc1.clone() + master_acc5 = acc5.clone() + master_batch_size = batch_size.clone() + + dist.all_reduce(master_loss) + dist.all_reduce(master_acc1) + dist.all_reduce(master_acc5) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc1 = master_acc1 / dist.get_world_size() + master_acc5 = master_acc5 / dist.get_world_size() + + master_val_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc1_meter.update(master_acc1.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc5_meter.update(master_acc5.numpy()[0], master_batch_size.numpy()[0]) + + val_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + val_acc1_meter.update(acc1.numpy()[0], batch_size.numpy()[0]) + val_acc5_meter.update(acc5.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {master_val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {master_val_acc5_meter.avg:.4f}") + val_time = time.time() - time_st + return (val_loss_meter.avg, + val_acc1_meter.avg, + val_acc5_meter.avg, + master_val_loss_meter.avg, + master_val_acc1_meter.avg, + master_val_acc5_meter.avg, + val_time) + + +def main_worker(*args): + # STEP 0: Preparation + config = args[0] + dist.init_parallel_env() + last_epoch = config.TRAIN.LAST_EPOCH + world_size = dist.get_world_size() + local_rank = dist.get_rank() + seed = config.SEED + local_rank + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + # logger for each process/gpu + local_logger = get_logger( + filename=os.path.join(config.SAVE, 'log_{}.txt'.format(local_rank)), + logger_name='local_logger') + # overall logger + if local_rank == 0: + master_logger = get_logger( + filename=os.path.join(config.SAVE, 'log.txt'), + logger_name='master_logger') + master_logger.info(f'\n{config}') + else: + master_logger = None + local_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + if local_rank == 0: + master_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + + # STEP 1: Create model + model = build_model(config) + model = paddle.DataParallel(model) + + # STEP 2: Create train and val dataloader + dataset_train, dataset_val = args[1], args[2] + # Create training dataloader + if not config.EVAL: + dataloader_train = get_dataloader(config, dataset_train, 'train', True) + total_batch_train = len(dataloader_train) + local_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + if local_rank == 0: + master_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + # 
Create validation dataloader + dataloader_val = get_dataloader(config, dataset_val, 'test', True) + total_batch_val = len(dataloader_val) + local_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + if local_rank == 0: + master_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH, + ) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + local_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if 
scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip, + #apply_decay_param_fun=get_exclude_from_weight_decay_fn([ + # 'absolute_pos_embed', 'relative_position_bias_table']), + ) + else: + local_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 6: Load pretrained model / load resumt model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + model.set_dict(model_state) + local_logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + if local_rank == 0: + master_logger.info( + f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') + optimizer.set_state_dict(opt_state) + local_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + if local_rank == 0: + master_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + + # STEP 7: Validation (eval mode) + if config.EVAL: + local_logger.info('----- Start Validating') + if local_rank == 0: + master_logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 8: Start training and validation (train mode) + local_logger.info(f"Start training from epoch {last_epoch+1}.") + if local_rank == 0: + master_logger.info(f"Start training from epoch {last_epoch+1}.") + for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + local_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + if local_rank == 0: + master_logger.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") + train_loss, train_acc, avg_loss, avg_acc, train_time = train( + dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=total_batch_train, + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + mixup_fn=mixup_fn, + amp=config.AMP, + local_logger=local_logger, + master_logger=master_logger) + + scheduler.step() + + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {avg_loss:.4f}, " + + f"Train Acc: {avg_acc:.4f}, " + + f"time: {train_time:.2f}") + + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + local_logger.info(f'----- Validation after Epoch: {epoch}') + if local_rank == 0: + master_logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if local_rank == 0: + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + master_logger.info(f"----- Save model: {model_path}.pdparams") + master_logger.info(f"----- Save optim: {model_path}.pdopt") + + +def main(): + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + + # get dataset and start DDP + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + else: + dataset_train = None + dataset_val = get_dataset(config, mode='val') + config.NGPUS = len(paddle.static.cuda_places()) if config.NGPUS == -1 else config.NGPUS + dist.spawn(main_worker, args=(config, dataset_train, dataset_val, ), nprocs=config.NGPUS) + + +if __name__ == "__main__": + main() diff --git a/image_classification/FF_Only/main_single_gpu.py b/image_classification/FF_Only/main_single_gpu.py new file mode 100644 index 00000000..4a9cbd27 --- /dev/null +++ b/image_classification/FF_Only/main_single_gpu.py @@ -0,0 +1,423 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""FFOnly training/validation using single GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from datasets import get_dataloader +from datasets import get_dataset +from utils import AverageMeter +from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn +from config import get_config +from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from ffonly import build_ffonly as build_model + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('FFOnly') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + mixup_fn=None, + amp=False, + logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + logger: 
logger for logging, default: None + Returns: + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + train_time: float, training time + """ + model.train() + train_loss_meter = AverageMeter() + train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() + + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = image.shape[0] + train_loss_meter.update(loss.numpy()[0], batch_size) + train_acc_meter.update(acc.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return train_loss_meter.avg, train_acc_meter.avg, train_time + + +def validate(dataloader, model, criterion, total_batch, debug_steps=100, logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + logger: logger for logging, default: None + Returns: + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + val_time: float, valitaion time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = image.shape[0] + val_loss_meter.update(loss.numpy()[0], batch_size) + val_acc1_meter.update(acc1.numpy()[0], batch_size) + val_acc5_meter.update(acc5.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: 
{val_acc5_meter.avg:.4f}") + + val_time = time.time() - time_st + return val_loss_meter.avg, val_acc1_meter.avg, val_acc5_meter.avg, val_time + + +def main(): + # STEP 0: Preparation + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + last_epoch = config.TRAIN.LAST_EPOCH + seed = config.SEED + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + logger = get_logger(filename=os.path.join(config.SAVE, 'log.txt')) + logger.info(f'\n{config}') + + # STEP 1: Create model + model = build_model(config) + + # STEP 2: Create train and val dataloader + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + dataloader_train = get_dataloader(config, dataset_train, 'train', False) + dataset_val = get_dataset(config, mode='val') + dataloader_val = get_dataloader(config, dataset_val, 'val', False) + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + 
last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip) + else: + logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 6: Load pretrained model or load resume model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + model.set_dict(model_state) + logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME + '.pdopt') + optimizer.set_state_dict(opt_state) + logger.info( + f"----- Resume: Load model and optmizer from {config.MODEL.RESUME}") + + # STEP 7: Validation (eval mode) + if config.EVAL: + logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 8: Start training and validation (train mode) + logger.info(f"Start training from epoch {last_epoch+1}.") + for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + logger.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") + train_loss, train_acc, train_time = train(dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=len(dataloader_train), + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + mixup_fn=mixup_fn, + amp=config.AMP, + logger=logger) + scheduler.step() + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + logger.info(f"----- Save model: {model_path}.pdparams") + logger.info(f"----- Save optim: {model_path}.pdopt") + + +if __name__ == "__main__": + main() diff --git a/image_classification/FF_Only/mixup.py b/image_classification/FF_Only/mixup.py new file mode 100644 index 00000000..1d2db493 --- /dev/null +++ b/image_classification/FF_Only/mixup.py @@ -0,0 +1,225 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""mixup and cutmix for batch data""" +import numpy as np +import paddle + + +def rand_bbox(image_shape, lam, count=None): + """ CutMix bbox by lam value + Generate 1 random bbox by value lam. lam is the cut size rate. + The cut_size is computed by sqrt(1-lam) * image_size. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + count: int, number of bbox to generate + """ + image_h, image_w = image_shape[-2:] + cut_rate = np.sqrt(1. 
- lam) + cut_h = int(cut_rate * image_h) + cut_w = int(cut_rate * image_w) + + # get random bbox center + cy = np.random.randint(0, image_h, size=count) + cx = np.random.randint(0, image_w, size=count) + + # get bbox coords + bbox_x1 = np.clip(cx - cut_w // 2, 0, image_w) + bbox_y1 = np.clip(cy - cut_h // 2, 0, image_h) + bbox_x2 = np.clip(cx + cut_w // 2, 0, image_w) + bbox_y2 = np.clip(cy + cut_h // 2, 0, image_h) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # while in pytorch, it will return [] tensor + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def rand_bbox_minmax(image_shape, minmax, count=None): + """ CutMix bbox by min and max value + Generate 1 random bbox by min and max percentage values. + Minmax is a tuple/list of min and max percentage vlaues + applied to the image width and height. + + Args: + image_shape: tuple/list, image height and width + minmax: tuple/list, min and max percentage values of image size + count: int, number of bbox to generate + """ + assert len(minmax) == 2 + image_h, image_w = image_shape[-2:] + min_ratio = minmax[0] + max_ratio = minmax[1] + cut_h = np.random.randint(int(image_h * min_ratio), int(image_h * max_ratio), size=count) + cut_w = np.random.randint(int(image_w * min_ratio), int(image_w * max_ratio), size=count) + + bbox_x1 = np.random.randint(0, image_w - cut_w, size=count) + bbox_y1 = np.random.randint(0, image_h - cut_h, size=count) + bbox_x2 = bbox_x1 + cut_w + bbox_y2 = bbox_y1 + cut_h + + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def cutmix_generate_bbox_adjust_lam(image_shape, lam, minmax=None, correct_lam=True, count=None): + """Generate bbox and apply correction for lambda + If the mimmax is None, apply the standard cutmix by lam value, + If the minmax is set, apply the cutmix by min and max percentage values. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + minmax: tuple/list, min and max percentage values of image size + correct_lam: bool, if True, correct the lam value by the generated bbox + count: int, number of bbox to generate + """ + if minmax is not None: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox_minmax(image_shape, minmax, count) + else: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox(image_shape, lam, count) + + if correct_lam or minmax is not None: + image_h, image_w = image_shape[-2:] + bbox_area = (bbox_y2 - bbox_y1) * (bbox_x2 - bbox_x1) + lam = 1. - bbox_area / float(image_h * image_w) + return (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam + + +def one_hot(x, num_classes, on_value=1., off_value=0.): + """ Generate one-hot vector for label smoothing + Args: + x: tensor, contains label/class indices + num_classes: int, num of classes (len of the one-hot vector) + on_value: float, the vector value at label index, default=1. + off_value: float, the vector value at non-label indices, default=0. + Returns: + one_hot: tensor, tensor with on value at label index and off value + at non-label indices. + """ + x = x.reshape_([-1, 1]) + x_smoothed = paddle.full((x.shape[0], num_classes), fill_value=off_value) + for i in range(x.shape[0]): + x_smoothed[i, x[i]] = on_value + return x_smoothed + + +def mixup_one_hot(label, num_classes, lam=1., smoothing=0.): + """ mixup and label smoothing in batch + label smoothing is firstly applied, then + mixup is applied by mixing the bacth and its flip, + with a mixup rate. 
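+    For example, with lam = 0.8 and smoothing = 0.1, each returned target row is
+    0.8 * smoothed_onehot(label) + 0.2 * smoothed_onehot(flipped label).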
+ + Args: + label: tensor, label tensor with shape [N], contains the class indices + num_classes: int, num of all classes + lam: float, mixup rate, default=1.0 + smoothing: float, label smoothing rate + """ + off_value = smoothing / num_classes + on_value = 1. - smoothing + off_value + y1 = one_hot(label, num_classes, on_value, off_value) + y2 = one_hot(label.flip(axis=[0]), num_classes, on_value, off_value) + return y2 * (1 - lam) + y1 * lam + + +class Mixup: + """Mixup class + Args: + mixup_alpha: float, mixup alpha for beta distribution, default=1.0, + cutmix_alpha: float, cutmix alpha for beta distribution, default=0.0, + cutmix_minmax: list/tuple, min and max value for cutmix ratio, default=None, + prob: float, if random prob < prob, do not use mixup, default=1.0, + switch_prob: float, prob of switching mixup and cutmix, default=0.5, + mode: string, mixup up, now only 'batch' is supported, default='batch', + correct_lam: bool, if True, apply correction of lam, default=True, + label_smoothing: float, label smoothing rate, default=0.1, + num_classes: int, num of classes, default=1000 + """ + def __init__(self, + mixup_alpha=1.0, + cutmix_alpha=0.0, + cutmix_minmax=None, + prob=1.0, + switch_prob=0.5, + mode='batch', + correct_lam=True, + label_smoothing=0.1, + num_classes=1000): + self.mixup_alpha = mixup_alpha + self.cutmix_alpha = cutmix_alpha + self.cutmix_minmax = cutmix_minmax + if cutmix_minmax is not None: + assert len(cutmix_minmax) == 2 + self.cutmix_alpha = 1.0 + self.mix_prob = prob + self.switch_prob = switch_prob + self.label_smoothing = label_smoothing + self.num_classes = num_classes + self.mode = mode + self.correct_lam = correct_lam + assert mode == 'batch', 'Now only batch mode is supported!' + + def __call__(self, x, target): + assert x.shape[0] % 2 == 0, "Batch size should be even" + lam = self._mix_batch(x) + target = mixup_one_hot(target, self.num_classes, lam, self.label_smoothing) + return x, target + + def get_params(self): + """Decide to use cutmix or regular mixup by sampling and + sample lambda for mixup + """ + lam = 1. + use_cutmix = False + use_mixup = np.random.rand() < self.mix_prob + if use_mixup: + if self.mixup_alpha > 0. and self.cutmix_alpha > 0.: + use_cutmix = np.random.rand() < self.switch_prob + alpha = self.cutmix_alpha if use_cutmix else self.mixup_alpha + lam_mix = np.random.beta(alpha, alpha) + elif self.mixup_alpha == 0. and self.cutmix_alpha > 0.: + use_cutmix=True + lam_mix = np.random.beta(self.cutmix_alpha, self.cutmix_alpha) + elif self.mixup_alpha > 0. 
and self.cutmix_alpha == 0.: + lam_mix = np.random.beta(self.mixup_alpha, self.mixup_alpha) + else: + raise ValueError('mixup_alpha and cutmix_alpha cannot be all 0') + lam = float(lam_mix) + return lam, use_cutmix + + def _mix_batch(self, x): + """mixup/cutmix by adding batch data and its flipped version""" + lam, use_cutmix = self.get_params() + if lam == 1.: + return lam + if use_cutmix: + (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam = cutmix_generate_bbox_adjust_lam( + x.shape, + lam, + minmax=self.cutmix_minmax, + correct_lam=self.correct_lam) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # but in pytorch, it will return [] tensor without errors + if int(bbox_x1) != int(bbox_x2) and int(bbox_y1) != int(bbox_y2): + x[:, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] = x.flip(axis=[0])[ + :, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] + else: + x_flipped = x.flip(axis=[0]) + x_flipped = x_flipped * (1 - lam) + x.set_value(x * (lam) + x_flipped) + return lam diff --git a/image_classification/FF_Only/random_erasing.py b/image_classification/FF_Only/random_erasing.py new file mode 100644 index 00000000..31eea465 --- /dev/null +++ b/image_classification/FF_Only/random_erasing.py @@ -0,0 +1,118 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Random Erasing for image tensor""" + +import random +import math +import paddle + + +def _get_pixels(per_pixel, rand_color, patch_size, dtype="float32"): + if per_pixel: + return paddle.normal(shape=patch_size).astype(dtype) + if rand_color: + return paddle.normal(shape=(patch_size[0], 1, 1)).astype(dtype) + return paddle.zeros((patch_size[0], 1, 1)).astype(dtype) + + +class RandomErasing(object): + """ + Args: + prob: probability of performing random erasing + min_area: Minimum percentage of erased area wrt input image area + max_area: Maximum percentage of erased area wrt input image area + min_aspect: Minimum aspect ratio of earsed area + max_aspect: Maximum aspect ratio of earsed area + mode: pixel color mode, in ['const', 'rand', 'pixel'] + 'const' - erase block is constant valued 0 for all channels + 'rand' - erase block is valued random color (same per-channel) + 'pixel' - erase block is vauled random color per pixel + min_count: Minimum # of ereasing blocks per image. + max_count: Maximum # of ereasing blocks per image. 
Area per box is scaled by count + per-image count is randomly chosen between min_count to max_count + """ + def __init__(self, prob=0.5, min_area=0.02, max_area=1/3, min_aspect=0.3, max_aspect=None, + mode='const', min_count=1, max_count=None, num_splits=0): + self.prob = prob + self.min_area = min_area + self.max_area = max_area + max_aspect = max_aspect or 1 / min_aspect + self.log_aspect_ratio = (math.log(min_aspect), math.log(max_aspect)) + self.min_count = min_count + self.max_count = max_count or min_count + self.num_splits = num_splits + mode = mode.lower() + self.rand_color = False + self.per_pixel = False + if mode == "rand": + self.rand_color = True + elif mode == "pixel": + self.per_pixel = True + else: + assert not mode or mode == "const" + + def _erase(self, img, chan, img_h, img_w, dtype): + if random.random() > self.prob: + return + area = img_h * img_w + count = self.min_count if self.min_count == self.max_count else \ + random.randint(self.min_count, self.max_count) + for _ in range(count): + for attempt in range(10): + target_area = random.uniform(self.min_area, self.max_area) * area / count + aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio)) + h = int(round(math.sqrt(target_area * aspect_ratio))) + w = int(round(math.sqrt(target_area / aspect_ratio))) + if w < img_w and h < img_h: + top = random.randint(0, img_h - h) + left = random.randint(0, img_w - w) + img[:, top:top+h, left:left+w] = _get_pixels( + self.per_pixel, self.rand_color, (chan, h, w), + dtype=dtype) + break + + def __call__(self, input): + if len(input.shape) == 3: + self._erase(input, *input.shape, input.dtype) + else: + batch_size, chan, img_h, img_w = input.shape + batch_start = batch_size // self.num_splits if self.num_splits > 1 else 0 + for i in range(batch_start, batch_size): + self._erase(input[i], chan, img_h, img_w, input.dtype) + return input + + + +#def main(): +# re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='rand') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='const') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='pixel') +# import PIL.Image as Image +# import numpy as np +# paddle.set_device('cpu') +# img = paddle.to_tensor(np.asarray(Image.open('./lenna.png'))).astype('float32') +# img = img / 255.0 +# img = paddle.transpose(img, [2, 0, 1]) +# new_img = re(img) +# new_img = new_img * 255.0 +# new_img = paddle.transpose(new_img, [1, 2, 0]) +# new_img = new_img.cpu().numpy() +# new_img = Image.fromarray(new_img.astype('uint8')) +# new_img.save('./res.png') +# +# +# +#if __name__ == "__main__": +# main() diff --git a/image_classification/FF_Only/run_eval.sh b/image_classification/FF_Only/run_eval.sh new file mode 100644 index 00000000..236b01d5 --- /dev/null +++ b/image_classification/FF_Only/run_eval.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ +-cfg='./configs/ff_base.yaml' \ +-dataset='imagenet2012' \ +-batch_size=8 \ +-data_path='/dataset/imagenet' \ +-eval \ +-pretrained='./linear_base' diff --git a/image_classification/FF_Only/run_eval_multi.sh b/image_classification/FF_Only/run_eval_multi.sh new file mode 100644 index 00000000..dc74c861 --- /dev/null +++ b/image_classification/FF_Only/run_eval_multi.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=4,5,6,7 \ +python main_multi_gpu.py \ +-cfg='./configs/ff_base.yaml' \ +-dataset='imagenet2012' \ +-batch_size=16 \ +-data_path='/dataset/imagenet' \ +-eval \ +-pretrained='./linear_base' diff --git a/image_classification/FF_Only/run_eval_tiny.sh 
b/image_classification/FF_Only/run_eval_tiny.sh new file mode 100644 index 00000000..36761236 --- /dev/null +++ b/image_classification/FF_Only/run_eval_tiny.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ +-cfg='./configs/ff_tiny.yaml' \ +-dataset='imagenet2012' \ +-batch_size=8 \ +-data_path='dataset/imagenet' \ +-eval \ +-pretrained='./linear_tiny' diff --git a/image_classification/FF_Only/run_train_multi_tiny.sh b/image_classification/FF_Only/run_train_multi_tiny.sh new file mode 100644 index 00000000..5bd0c366 --- /dev/null +++ b/image_classification/FF_Only/run_train_multi_tiny.sh @@ -0,0 +1,7 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ +-cfg='./configs/ff_tiny.yaml' \ +-dataset='imagenet2012' \ +-batch_size=8 \ +-data_path='/dataset/imagenet' \ +-amp diff --git a/image_classification/FF_Only/run_train_tiny.sh b/image_classification/FF_Only/run_train_tiny.sh new file mode 100644 index 00000000..113af664 --- /dev/null +++ b/image_classification/FF_Only/run_train_tiny.sh @@ -0,0 +1,7 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ +-cfg='./configs/ff_tiny.yaml' \ +-dataset='imagenet2012' \ +-batch_size=8 \ +-data_path='/dataset/imagenet' \ +-amp diff --git a/image_classification/FF_Only/transforms.py b/image_classification/FF_Only/transforms.py new file mode 100644 index 00000000..5a046912 --- /dev/null +++ b/image_classification/FF_Only/transforms.py @@ -0,0 +1,14 @@ +import random +import paddle +import paddle.nn +import paddle.vision.transforms as T + + +class RandomHorizontalFlip(): + def __init__(self, p=0.5): + self.p = p + + def __call__(self, image): + if random.random() < self.p: + return T.hflip(image) + return image diff --git a/image_classification/FF_Only/utils.py b/image_classification/FF_Only/utils.py new file mode 100644 index 00000000..44800527 --- /dev/null +++ b/image_classification/FF_Only/utils.py @@ -0,0 +1,120 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""utils for ViT + +Contains AverageMeter for monitoring, get_exclude_from_decay_fn for training +and WarmupCosineScheduler for training + +""" + +import math +from paddle.optimizer.lr import LRScheduler + + +class AverageMeter(): + """ Meter for monitoring losses""" + def __init__(self): + self.avg = 0 + self.sum = 0 + self.cnt = 0 + self.reset() + + def reset(self): + """reset all values to zeros""" + self.avg = 0 + self.sum = 0 + self.cnt = 0 + + def update(self, val, n=1): + """update avg by val and n, where val is the avg of n values""" + self.sum += val * n + self.cnt += n + self.avg = self.sum / self.cnt + + + +def get_exclude_from_weight_decay_fn(exclude_list=[]): + """ Set params with no weight decay during the training + + For certain params, e.g., positional encoding in ViT, weight decay + may not needed during the learning, this method is used to find + these params. + + Args: + exclude_list: a list of params names which need to exclude + from weight decay. 
+ Returns: + exclude_from_weight_decay_fn: a function returns True if param + will be excluded from weight decay + """ + if len(exclude_list) == 0: + exclude_from_weight_decay_fn = None + else: + def exclude_fn(param): + for name in exclude_list: + if param.endswith(name): + return False + return True + exclude_from_weight_decay_fn = exclude_fn + return exclude_from_weight_decay_fn + + +class WarmupCosineScheduler(LRScheduler): + """Warmup Cosine Scheduler + + First apply linear warmup, then apply cosine decay schedule. + Linearly increase learning rate from "warmup_start_lr" to "start_lr" over "warmup_epochs" + Cosinely decrease learning rate from "start_lr" to "end_lr" over remaining + "total_epochs - warmup_epochs" + + Attributes: + learning_rate: the starting learning rate (without warmup), not used here! + warmup_start_lr: warmup starting learning rate + start_lr: the starting learning rate (without warmup) + end_lr: the ending learning rate after whole loop + warmup_epochs: # of epochs for warmup + total_epochs: # of total epochs (include warmup) + """ + def __init__(self, + learning_rate, + warmup_start_lr, + start_lr, + end_lr, + warmup_epochs, + total_epochs, + cycles=0.5, + last_epoch=-1, + verbose=False): + """init WarmupCosineScheduler """ + self.warmup_epochs = warmup_epochs + self.total_epochs = total_epochs + self.warmup_start_lr = warmup_start_lr + self.start_lr = start_lr + self.end_lr = end_lr + self.cycles = cycles + super(WarmupCosineScheduler, self).__init__(learning_rate, last_epoch, verbose) + + def get_lr(self): + """ return lr value """ + if self.last_epoch < self.warmup_epochs: + val = (self.start_lr - self.warmup_start_lr) * float( + self.last_epoch)/float(self.warmup_epochs) + self.warmup_start_lr + return val + + progress = float(self.last_epoch - self.warmup_epochs) / float( + max(1, self.total_epochs - self.warmup_epochs)) + val = max(0.0, 0.5 * (1. + math.cos(math.pi * float(self.cycles) * 2.0 * progress))) + val = max(0.0, val * (self.start_lr - self.end_lr) + self.end_lr) + return val diff --git a/image_classification/Focal_Transformer/README.md b/image_classification/Focal_Transformer/README.md new file mode 100644 index 00000000..f39f4c91 --- /dev/null +++ b/image_classification/Focal_Transformer/README.md @@ -0,0 +1,207 @@ +# Focal Self-attention for Local-Global Interactions in Vision Transformers, [arxiv](https://arxiv.org/pdf/2107.00641) + + +PaddlePaddle training/validation code and pretrained models for Focal Transformer. + +The official pytorch implementation is [here](https://github.com/microsoft/Focal-Transformer). + +This implementation is developed by [PaddleViT](https://github.com/BR-IDL/PaddleViT.git). + +
+ +

+<h4 align="center">Focal Transformer Model Overview</h4>

+
+ +### Update + +- Update(2021-10-21): Code is released and ported weights are uploaded. + +## Models Zoo + +| Model | Acc@1 | Acc@5 | #Params | FLOPs | Image Size | Crop_pct | Interpolation | Link | +|-------------------------------|-------|-------|---------|--------|------------|----------|---------------|--------------| +| Focal-T | 82.03 | 95.86 | 28.9M | 4.9G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1HzZJbYH_eIo94h0wLUhqTyJ6AYthNKRh/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1JCr2qIA-SZvTqbTO-m2OwA)(i8c2) | +| Focal-T (use conv) | 82.70 | 96.14 | 30.8M | 4.9G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1PS0-gdXHGl95LqH5k5DG62AH6D3i7v0D/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1tVztox4bVJuJEjkD1fLaHQ)(smrk) | +| Focal-S | 83.55 | 96.29 | 51.1M | 9.4G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1HnVAYsI_hmiomyS4Ax3ccPE7gk4mlTU8/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1b7uugAY9RhrgTkUwYcvvow)(dwd8) | +| Focal-S (use conv) | 83.85 | 96.47 | 53.1M | 9.4G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1vcHjYiGNMayoSTPoM8z39XRH6h89TB9V/view?usp=sharing)/[baidu](https://pan.baidu.com/s/174a2aZzCEt3teLuAnIzMtA)(nr7n) | +| Focal-B | 83.98 | 96.48 | 89.8M | 16.4G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1bNMegxetWpwZNcmDEC3MHCal6SNXSgWR/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1piBslNhxWR78aQJIdoZjEw)(8akn) | +| Focal-B (use conv) | 84.18 | 96.61 | 93.3M | 16.4G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1-J2gDnKrvZGtasvsAYozrbMXR2LtIJ43/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1GTLfnTlt6I6drPdfSWB1Iw)(5nfi) | + +> *The results are evaluated on ImageNet2012 validation set. + +### Models trained from scratch using PaddleViT +(coming soon) + +## Notebooks +We provide a few notebooks in aistudio to help you get started: + +**(coming soon)** + +## Requirements +- Python>=3.6 +- yaml>=0.2.5 +- PaddlePaddle>=2.1.0 +- yacs>=0.1.8 + +## Data +`ImageNet2012 dataset` is used in the following folder structure: +``` +│imagenet/ +├──train/ +│ ├── n01440764 +│ │ ├── n01440764_10026.JPEG +│ │ ├── n01440764_10027.JPEG +│ │ ├── ...... +│ ├── ...... +├──val/ +│ ├── n01440764 +│ │ ├── ILSVRC2012_val_00000293.JPEG +│ │ ├── ILSVRC2012_val_00002138.JPEG +│ │ ├── ...... +│ ├── ...... +``` + +## Usage +To use the model with pretrained weights, download the .pdparam weight file and change related file paths in the following python scripts. The model config files are located in `./configs/`. 
+ +For example, assume the downloaded weight file is stored in `./focal_tiny_patch4_window7_224.pdparams`, to use the `focal_tiny_patch4_window7_224` model in python: + +```python +from config import get_config +from focal_transformer import build_focal as build_model +# config files in ./configs/ +config = get_config('./configs/focal_tiny_patch4_window7_224.yaml') +# build model +model = build_model(config) +# load pretrained weights +model_state_dict = paddle.load('./focal_tiny_patch4_window7_224.pdparams') +model.set_dict(model_state_dict) +``` + +## Evaluation +To evaluate `Focal Transformer` model performance on ImageNet2012 with a `single GPU`, run the following script using command line: + +```shell +sh run_eval.sh +``` + +or + +```shell +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg=./configs/focal_tiny_patch4_window7_224.yaml \ + -dataset=imagenet2012 \ + -num_classes=1000 \ + -batch_size=64 \ + -image_size=224 \ + -data_path=/path/to/dataset/imagenet/val \ + -eval \ + -pretrained=/path/to/pretrained/model/focal_tiny_patch4_window7_224 # .pdparams is NOT needed +``` + +
+ + +Run evaluation using multi-GPUs: + + + +```shell +sh run_eval_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg=./configs/focal_tiny_patch4_window7_224.yaml \ + -dataset=imagenet2012 \ + -num_classes=1000 \ + -batch_size=32 \ + -image_size=224 \ + -data_path=/path/to/dataset/imagenet/val \ + -eval \ + -pretrained=/path/to/pretrained/model/focal_tiny_patch4_window7_224 # .pdparams is NOT needed +``` + +
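+The evaluation commands above sweep the whole ImageNet2012 validation set. For a quick sanity
+check of downloaded weights, the ported model can also be called directly on a single image.
+The snippet below is only a minimal sketch (it is not one of the released scripts): it reuses
+`get_val_transforms` from `datasets.py` and `build_focal` from `focal_transformer.py`, and the
+image path is a placeholder.
+
+```python
+import paddle
+from PIL import Image
+from config import get_config
+from datasets import get_val_transforms
+from focal_transformer import build_focal as build_model
+
+# build the model and load the ported weights (see Usage above)
+config = get_config('./configs/focal_tiny_patch4_window7_224.yaml')
+model = build_model(config)
+model.set_dict(paddle.load('./focal_tiny_patch4_window7_224.pdparams'))
+model.eval()
+
+# preprocess one image with the same resize / center-crop / normalize used for validation
+image = Image.open('./example.jpg').convert('RGB')       # placeholder image path
+image = get_val_transforms(config)(image).unsqueeze(0)   # [1, 3, 224, 224]
+
+with paddle.no_grad():
+    logits = model(image)
+probs = paddle.nn.functional.softmax(logits, axis=-1)
+print(paddle.argmax(probs, axis=-1).item())              # predicted ImageNet class index
+```
+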
+ +## Training +To train the `Focal Transformer` model on ImageNet2012 with `single GPU`, run the following script using command line: +```shell +sh run_train.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg=./configs/focal_tiny_patch4_window7_224.yaml \ + -dataset=imagenet2012 \ + -num_classes=1000 \ + -batch_size=32 \ + -image_size=224 \ + -data_path=/path/to/dataset/imagenet/train \ + -output=./output +``` + +
+ + +Run training using multi-GPUs: + + + +```shell +sh run_train_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_single_gpu.py \ + -cfg=./configs/focal_tiny_patch4_window7_224.yaml \ + -dataset=imagenet2012 \ + -num_classes=1000 \ + -batch_size=4 \ + -image_size=224 \ + -data_path=/path/to/dataset/imagenet/train \ + -output=./output +``` + +
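+By default the config selects a linear-warmup + cosine-decay schedule
+(`TRAIN.LR_SCHEDULER.NAME = 'warmupcosine'` in `config.py`): the learning rate rises from
+`WARMUP_START_LR = 5e-7` to `BASE_LR = 5e-4` over the first 20 epochs, then decays to
+`END_LR = 5e-6` over the remaining epochs. The sketch below only reproduces that curve with
+the default values so the per-epoch learning rate can be inspected; it assumes the
+'warmupcosine' scheduler matches the `WarmupCosineScheduler` implementation used elsewhere in
+this PR (e.g. `FF_Only/utils.py`) and ignores any linear LR scaling the training scripts may
+apply on top.
+
+```python
+import math
+
+# defaults from config.py: BASE_LR, WARMUP_START_LR, END_LR, WARMUP_EPOCHS, NUM_EPOCHS
+base_lr, warmup_start_lr, end_lr = 5e-4, 5e-7, 5e-6
+warmup_epochs, total_epochs = 20, 300
+
+for epoch in range(total_epochs):
+    if epoch < warmup_epochs:
+        # linear warmup from warmup_start_lr up to base_lr
+        lr = (base_lr - warmup_start_lr) * epoch / warmup_epochs + warmup_start_lr
+    else:
+        # half-cosine decay from base_lr down to end_lr
+        progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
+        lr = 0.5 * (1.0 + math.cos(math.pi * progress)) * (base_lr - end_lr) + end_lr
+    if epoch % 50 == 0:
+        print(f"epoch {epoch:3d}: lr = {lr:.2e}")
+```
+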
+ +## Arguments +- *`-cfg`*: path of model config file (.yaml), stored in `./configs`. +- *`-dataset`*: dataset name, e.g., `imagenet2012`, `cifar10`, `cifar100`. +- *`-data_path`*: dataset folder path +- `-batch_size`: batch size,default: `32`. +- `-image_size`: input image size,default`224`. +- `-num_classes`: number of classes, default: `1000`. +- `-output`: output folder for storing weights and logs,default: `./output`. +- `-pretrained`: pretrain model weights file path, (`.pdparams` file ext is NOT needed) default: `None`. +- `-resume`: resume model weight and opt file path, (`.paparams` and `.pdopts` file ext are NOT needed, default: `None`. +- `-last_epoch`: start epoch,default: `None`. +- `-save_freq`: number of epochs to save checkpoint,default: `1`. +- `-log_freq`: number of iters to print logging,default: `100`. +- `-validate_freq`: number of epochs to do validation during training,default: `10`. +- `-accum_iter`: number of iteration for iter accumulation, default: 1. +- `-num_workers`: number of workers for data loading,default: `1`. +- `-ngpus`: number of GPUs to use,you can control GPUs by CUDA_VISIBLE_DEVICES, just set this to -1 default: `-1`. +- `-eval`: start eval mode. +- `-amp`: start amp training. + +> `-cfg`,`-dataset` and `-data_path` in `main_single_gpu.py` and `main_multi_gpu.py` are MUST-HAVE settings. + +## Visualization Attention Map +**(coming soon)** + +## Reference +``` +@misc{yang2021focal, + title={Focal Self-attention for Local-Global Interactions in Vision Transformers}, + author={Jianwei Yang and Chunyuan Li and Pengchuan Zhang and Xiyang Dai and Bin Xiao and Lu Yuan and Jianfeng Gao}, + year={2021}, + eprint={2107.00641}, + archivePrefix={arXiv}, + primaryClass={cs.CV} +} +``` diff --git a/image_classification/Focal_Transformer/__init__.py b/image_classification/Focal_Transformer/__init__.py new file mode 100644 index 00000000..e2cbd538 --- /dev/null +++ b/image_classification/Focal_Transformer/__init__.py @@ -0,0 +1 @@ +#init diff --git a/image_classification/Focal_Transformer/augment.py b/image_classification/Focal_Transformer/augment.py new file mode 100644 index 00000000..7a7f081c --- /dev/null +++ b/image_classification/Focal_Transformer/augment.py @@ -0,0 +1,285 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +"""Augmentation""" +""" Rand Augmentation """ +# reference: RandAugment: Practical automated data augmentation with a reduced search space +# https://arxiv.org/abs/1909.13719 + +""" Auto Augmentation """ +# reference: AutoAugment: Learning Augmentation Policies from Data +# https://arxiv.org/abs/1805.09501 + +import random +import numpy as np +from PIL import Image, ImageEnhance, ImageOps + + +def auto_augment_policy_original(): + """25 types of augment policies in original paper""" + policy = [ + [('Posterize', 0.4, 8), ('Rotate', 0.6, 9)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + [('Posterize', 0.6, 7), ('Posterize', 0.6, 6)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Equalize', 0.4, 4), ('Rotate', 0.8, 8)], + [('Solarize', 0.6, 3), ('Equalize', 0.6, 7)], + [('Posterize', 0.8, 5), ('Equalize', 1.0, 2)], + [('Rotate', 0.2, 3), ('Solarize', 0.6, 8)], + [('Equalize', 0.6, 8), ('Posterize', 0.4, 6)], + [('Rotate', 0.8, 8), ('Color', 0.4, 0)], + [('Rotate', 0.4, 9), ('Equalize', 0.6, 2)], + [('Equalize', 0.0, 7), ('Equalize', 0.8, 8)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Rotate', 0.8, 8), ('Color', 1.0, 2)], + [('Color', 0.8, 8), ('Solarize', 0.8, 7)], + [('Sharpness', 0.4, 7), ('Invert', 0.6, 8)], + [('ShearX', 0.6, 5), ('Equalize', 1.0, 9)], + [('Color', 0.4, 0), ('Equalize', 0.6, 3)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + ] + policy = [[SubPolicy(*args) for args in subpolicy] for subpolicy in policy] + return policy + + +def rand_augment_policy_original(magnitude_idx=9): + """ + 14 types of augment policies in original paper + Args: + magnitude_idx: M + """ + policy = [ + ('Posterize', 1, magnitude_idx), ('Rotate', 1, magnitude_idx), + ('Solarize', 1, magnitude_idx), ('AutoContrast', 1, magnitude_idx), + ('Equalize', 1, magnitude_idx), ('Contrast', 1, magnitude_idx), + ('Color', 1, magnitude_idx), ('Invert', 1, magnitude_idx), + ('Sharpness', 1, magnitude_idx), ('Brightness', 1, magnitude_idx), + ('ShearX', 1, magnitude_idx), ('ShearY', 1, magnitude_idx), + ('TranslateX', 1, magnitude_idx), ('TranslateY', 1, magnitude_idx), + ] + policy = [SubPolicy(*args) for args in policy] + return policy + + +class AutoAugment(): + """Auto Augment + Randomly choose a tuple of augment ops from a list of policy + Then apply the tuple of augment ops to input image + + Examples: + policy = auto_augment_policy_original() + augment = AutoAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy): + self.policy = policy + + def __call__(self, image, policy_idx=None): + if policy_idx is None: + policy_idx = random.randint(0, len(self.policy) - 1) + + sub_policy = self.policy[policy_idx] + for op in sub_policy: + image = op(image) + return image + + +class RandAugment(): + """Rand Augment + Randomly choose N augment ops from a list of K policies + Then apply the N ops to input image + + Examples: + policy = rand_augment_policy_original(magnitude_idx) + augment = RandAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy, num_layers=2): + """ + Args: + policy: list of SubPolicy + num_layers: int + """ + self.policy = policy + self.num_layers = num_layers + + def __call__(self, image): + selected_idx = 
np.random.choice(len(self.policy), self.num_layers) + + for policy_idx in selected_idx: + sub_policy = self.policy[policy_idx] + image = sub_policy(image) + return image + + +class SubPolicy: + """Subpolicy + Read augment name and magnitude, apply augment with probability + Args: + op_name: str, augment operation name + prob: float, if prob > random prob, apply augment + magnitude_idx: int, index of magnitude in preset magnitude ranges + """ + + def __init__(self, op_name, prob, magnitude_idx): + # ranges of operations' magnitude + ranges = { + 'ShearX': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'ShearY': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'TranslateX': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'TranslateY': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'Rotate': np.linspace(0, 30, 10), # [-30, 30] (by random negative) + 'Color': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Posterize': np.round(np.linspace(8, 4, 10), 0).astype(np.int), # [0, 4] + 'Solarize': np.linspace(256, 0, 10), # [0, 256] + 'Contrast': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Sharpness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Brightness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'AutoContrast': [0] * 10, # no range + 'Equalize': [0] * 10, # no range + 'Invert': [0] * 10, # no range + } + + # augmentation operations + # Lambda is not pickleable for DDP + # image_ops = { + # 'ShearX': lambda image, magnitude: shear_x(image, magnitude), + # 'ShearY': lambda image, magnitude: shear_y(image, magnitude), + # 'TranslateX': lambda image, magnitude: translate_x(image, magnitude), + # 'TranslateY': lambda image, magnitude: translate_y(image, magnitude), + # 'Rotate': lambda image, magnitude: rotate(image, magnitude), + # 'AutoContrast': lambda image, magnitude: auto_contrast(image, magnitude), + # 'Invert': lambda image, magnitude: invert(image, magnitude), + # 'Equalize': lambda image, magnitude: equalize(image, magnitude), + # 'Solarize': lambda image, magnitude: solarize(image, magnitude), + # 'Posterize': lambda image, magnitude: posterize(image, magnitude), + # 'Contrast': lambda image, magnitude: contrast(image, magnitude), + # 'Color': lambda image, magnitude: color(image, magnitude), + # 'Brightness': lambda image, magnitude: brightness(image, magnitude), + # 'Sharpness': lambda image, magnitude: sharpness(image, magnitude), + # } + image_ops = { + 'ShearX': shear_x, + 'ShearY': shear_y, + 'TranslateX': translate_x_relative, + 'TranslateY': translate_y_relative, + 'Rotate': rotate, + 'AutoContrast': auto_contrast, + 'Invert': invert, + 'Equalize': equalize, + 'Solarize': solarize, + 'Posterize': posterize, + 'Contrast': contrast, + 'Color': color, + 'Brightness': brightness, + 'Sharpness': sharpness, + } + + self.prob = prob + self.magnitude = ranges[op_name][magnitude_idx] + self.op = image_ops[op_name] + + def __call__(self, image): + if self.prob > random.random(): + image = self.op(image, self.magnitude) + return image + + +# PIL Image transforms +# https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.transform +def shear_x(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, factor, 0, 0, 1, 0), fillcolor=fillcolor) + + +def shear_y(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * 
random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, factor, 1, 0), fillcolor=fillcolor) + + +def translate_x_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, pixels, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, pixels), fillcolor=fillcolor) + + +def translate_x_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, magnitude, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, magnitude), fillcolor=fillcolor) + + +def rotate(image, magnitude): + rot = image.convert("RGBA").rotate(magnitude) + return Image.composite(rot, + Image.new('RGBA', rot.size, (128,) * 4), + rot).convert(image.mode) + + +def auto_contrast(image, magnitude=None): + return ImageOps.autocontrast(image) + + +def invert(image, magnitude=None): + return ImageOps.invert(image) + + +def equalize(image, magnitude=None): + return ImageOps.equalize(image) + + +def solarize(image, magnitude): + return ImageOps.solarize(image, magnitude) + + +def posterize(image, magnitude): + return ImageOps.posterize(image, magnitude) + + +def contrast(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Contrast(image).enhance(1 + magnitude) + + +def color(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Color(image).enhance(1 + magnitude) + + +def brightness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Brightness(image).enhance(1 + magnitude) + + +def sharpness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Sharpness(image).enhance(1 + magnitude) + diff --git a/image_classification/Focal_Transformer/config.py b/image_classification/Focal_Transformer/config.py new file mode 100644 index 00000000..2bb5e081 --- /dev/null +++ b/image_classification/Focal_Transformer/config.py @@ -0,0 +1,240 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Configuration + +Configuration for data, model archtecture, and training, etc. 
+Config can be set by .yaml file or by argparser(limited usage) + +""" + +import os +import yaml +from yacs.config import CfgNode as CN + +_C = CN() +# Base config files +_C.BASE = [''] + +# ----------------------------------------------------------------------------- +# Data settings -- ok +# ----------------------------------------------------------------------------- +_C.DATA = CN() +_C.DATA.BATCH_SIZE = 128 #1024 batch_size for single GPU +_C.DATA.BATCH_SIZE_EVAL = 128 #1024 batch_size for single GPU +_C.DATA.DATA_PATH = '/home/aistudio/ILSVRC2012_val' # path to dataset +_C.DATA.DATASET = 'imagenet2012' # dataset name +_C.DATA.IMAGE_SIZE = 224 # input image size +_C.DATA.CROP_PCT = 0.875 # input image scale ratio, scale is applied before centercrop in eval mode +_C.DATA.NUM_WORKERS = 4 # number of data loading threads +_C.DATA.IMAGENET_MEAN = [0.485, 0.456, 0.406] # [0.5, 0.5, 0.5] +_C.DATA.IMAGENET_STD = [0.229, 0.224, 0.225] # [0.5, 0.5, 0.5] + + +# ----------------------------------------------------------------------------- +# Model settings -- maybe ok +# ----------------------------------------------------------------------------- +_C.MODEL = CN() +_C.MODEL.TYPE = 'focal' +_C.MODEL.NAME = 'focal_tiny_patch4_window7_224' +_C.MODEL.RESUME = None +_C.MODEL.PRETRAINED = None +_C.MODEL.NUM_CLASSES = 1000 +_C.MODEL.DROPOUT = 0.0 +_C.MODEL.ATTENTION_DROPOUT = 0.0 +_C.MODEL.DROP_PATH = 0.1 + + +# Focal Transformer parameters +# These hyperparams are the same to Swin Transformer, but we do not use shift by default +_C.MODEL.FOCAL = CN() +_C.MODEL.FOCAL.PATCH_SIZE = 4 +_C.MODEL.FOCAL.IN_CHANS = 3 +_C.MODEL.FOCAL.EMBED_DIM = 96 +_C.MODEL.FOCAL.DEPTHS = [2, 2, 6, 2] +_C.MODEL.FOCAL.NUM_HEADS = [3, 6, 12, 24] +_C.MODEL.FOCAL.WINDOW_SIZE = 7 +_C.MODEL.FOCAL.MLP_RATIO = 4. 
+_C.MODEL.FOCAL.QKV_BIAS = True +_C.MODEL.FOCAL.QK_SCALE = False +_C.MODEL.FOCAL.APE = False +_C.MODEL.FOCAL.PATCH_NORM = True +_C.MODEL.FOCAL.USE_SHIFT = False + + +# Below are specifical for Focal Transformers +_C.MODEL.FOCAL.FOCAL_POOL = "none" +_C.MODEL.FOCAL.FOCAL_STAGES = [0, 1, 2, 3] +_C.MODEL.FOCAL.FOCAL_LEVELS = [1, 1, 1, 1] +_C.MODEL.FOCAL.FOCAL_WINDOWS = [7, 5, 3, 1] +_C.MODEL.FOCAL.EXPAND_STAGES = [0, 1, 2, 3] +_C.MODEL.FOCAL.EXPAND_SIZES = [3, 3, 3, 3] +_C.MODEL.FOCAL.EXPAND_LAYER = "all" +_C.MODEL.FOCAL.USE_CONV_EMBED = False +_C.MODEL.FOCAL.USE_LAYERSCALE = False +_C.MODEL.FOCAL.USE_PRE_NORM = False + + +# ----------------------------------------------------------------------------- +# Training settings +# ----------------------------------------------------------------------------- +_C.TRAIN = CN() +_C.TRAIN.LAST_EPOCH = 0 +_C.TRAIN.NUM_EPOCHS = 300 +_C.TRAIN.WARMUP_EPOCHS = 20 +_C.TRAIN.WEIGHT_DECAY = 0.05 +_C.TRAIN.BASE_LR = 5e-4 +_C.TRAIN.WARMUP_START_LR = 5e-7 +_C.TRAIN.END_LR = 5e-6 +_C.TRAIN.GRAD_CLIP = 5.0 +_C.TRAIN.ACCUM_ITER = 1 +_C.TRAIN.LINEAR_SCALED_LR = None + +# LR scheduler +_C.TRAIN.LR_SCHEDULER = CN() +_C.TRAIN.LR_SCHEDULER.NAME = 'warmupcosine' # origin is cosine +_C.TRAIN.LR_SCHEDULER.MILESTONES = "30, 60, 90" # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_EPOCHS = 30 # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_RATE = 0.1 # only used in StepLRScheduler + + +# Optimizer +_C.TRAIN.OPTIMIZER = CN() +_C.TRAIN.OPTIMIZER.NAME = 'AdamW' +_C.TRAIN.OPTIMIZER.EPS = 1e-8 +_C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999) +_C.TRAIN.OPTIMIZER.MOMENTUM = 0.9 + + +# ----------------------------------------------------------------------------- +# Augmentation settings +# ----------------------------------------------------------------------------- +_C.TRAIN.MIXUP_ALPHA = 0.8 +_C.TRAIN.CUTMIX_ALPHA = 1.0 +_C.TRAIN.CUTMIX_MINMAX = None +_C.TRAIN.MIXUP_PROB = 1.0 +_C.TRAIN.MIXUP_SWITCH_PROB = 0.5 +_C.TRAIN.MIXUP_MODE = 'batch' + +_C.TRAIN.SMOOTHING = 0.1 +_C.TRAIN.COLOR_JITTER = 0.4 +_C.TRAIN.AUTO_AUGMENT = False #'rand-m9-mstd0.5-inc1' +_C.TRAIN.RAND_AUGMENT = False + +_C.TRAIN.RANDOM_ERASE_PROB = 0.25 +_C.TRAIN.RANDOM_ERASE_MODE = 'pixel' # How to apply mixup/cutmix params. 
Per "batch", "pair", or "elem" +_C.TRAIN.RANDOM_ERASE_COUNT = 1 +_C.TRAIN.RANDOM_ERASE_SPLIT = False + +# ----------------------------------------------------------------------------- +# Misc +# ----------------------------------------------------------------------------- +_C.TEST = CN() +_C.TEST.CROP = True # 预测时,是否使用裁剪 + +# ----------------------------------------------------------------------------- +# Misc +# ----------------------------------------------------------------------------- +_C.AMP = False +_C.SAVE = "./output" +_C.TAG = 'default' +_C.SAVE_FREQ = 1 # Frequency to save checkpoint +_C.REPORT_FREQ = 100 # Frequency to logging info +_C.VALIDATE_FREQ = 10 # freq to do validation +_C.SEED = 0 # Fixed random seed +_C.EVAL = False +_C.THROUGHPUT_MODE = False +_C.LOCAL_RANK = 0 +_C.NGPUS = -1 + + +def _update_config_from_file(config, cfg_file): + config.defrost() + with open(cfg_file, 'r') as f: + yaml_cfg = yaml.load(f, Loader=yaml.FullLoader) + + for cfg in yaml_cfg.setdefault('BASE', ['']): + if cfg: + _update_config_from_file( + config, os.path.join(os.path.dirname(cfg_file), cfg) + ) + print('=> merge config from {}'.format(cfg_file)) + config.merge_from_file(cfg_file) + config.freeze() + + +def update_config(config, args): + """Update config by ArgumentParser + Args: + args: ArgumentParser contains options + Return: + config: updated config + """ + if args.cfg: + _update_config_from_file(config, args.cfg) + config.defrost() + # merge from specific arguments + if args.dataset: + config.DATA.DATASET = args.dataset + if args.batch_size: + config.DATA.BATCH_SIZE = args.batch_size + if args.image_size: + config.DATA.IMAGE_SIZE = args.image_size + if args.num_classes: + config.MODEL.NUM_CLASSES = args.num_classes + if args.data_path: + config.DATA.DATA_PATH = args.data_path + if args.output is not None: + config.SAVE = args.output + if args.ngpus: + config.NGPUS = args.ngpus + if args.eval: + config.EVAL = True + config.DATA.BATCH_SIZE_EVAL = args.batch_size + if args.pretrained: + config.MODEL.PRETRAINED = args.pretrained + if args.resume: + config.MODEL.RESUME = args.resume + if args.last_epoch: + config.TRAIN.LAST_EPOCH = args.last_epoch + if args.output is not None: + config.SAVE = args.output + if args.save_freq: + config.SAVE_FREQ = args.save_freq + if args.log_freq: + config.REPORT_FREQ = args.log_freq + if args.validate_freq: + config.VALIDATE_FREQ = args.validate_freq + if args.num_workers: + config.DATA.NUM_WORKERS = args.num_workers + if args.accum_iter: + config.TRAIN.ACCUM_ITER = args.accum_iter + if args.amp: # only during training + if config.EVAL is True: + config.AMP = False + else: + config.AMP = True + + # output folder + config.SAVE = os.path.join(config.SAVE, config.MODEL.NAME, config.TAG) + # config.freeze() + + return config + + +def get_config(cfg_file=None): + """Return a clone of config or load from yaml file""" + config = _C.clone() + if cfg_file: + _update_config_from_file(config, cfg_file) + return config diff --git a/image_classification/Focal_Transformer/configs/focal_base_patch4_window7_224.yaml b/image_classification/Focal_Transformer/configs/focal_base_patch4_window7_224.yaml new file mode 100644 index 00000000..225fc2aa --- /dev/null +++ b/image_classification/Focal_Transformer/configs/focal_base_patch4_window7_224.yaml @@ -0,0 +1,14 @@ +MODEL: + TYPE: focal + NAME: focal_base_patch4_window7_224 + DROP_PATH: 0.5 + FOCAL: + EMBED_DIM: 128 + DEPTHS: [2, 2, 18, 2] + NUM_HEADS: [4, 8, 16, 32] + WINDOW_SIZE: 7 + FOCAL_POOL: "fc" + FOCAL_STAGES: [0, 1, 
2, 3] + FOCAL_LEVELS: [2, 2, 2, 2] + FOCAL_WINDOWS: [7, 5, 3, 1] + EXPAND_SIZES: [3, 3, 3, 3] \ No newline at end of file diff --git a/image_classification/Focal_Transformer/configs/focal_base_useconv_patch4_window7_224.yaml b/image_classification/Focal_Transformer/configs/focal_base_useconv_patch4_window7_224.yaml new file mode 100644 index 00000000..5459dcf1 --- /dev/null +++ b/image_classification/Focal_Transformer/configs/focal_base_useconv_patch4_window7_224.yaml @@ -0,0 +1,15 @@ +MODEL: + TYPE: focal + NAME: focal_base_patch4_window7_224 + DROP_PATH: 0.5 + FOCAL: + USE_CONV_EMBED: True + EMBED_DIM: 128 + DEPTHS: [2, 2, 18, 2] + NUM_HEADS: [4, 8, 16, 32] + WINDOW_SIZE: 7 + FOCAL_POOL: "fc" + FOCAL_STAGES: [0, 1, 2, 3] + FOCAL_LEVELS: [2, 2, 2, 2] + FOCAL_WINDOWS: [7, 5, 3, 1] + EXPAND_SIZES: [3, 3, 3, 3] \ No newline at end of file diff --git a/image_classification/Focal_Transformer/configs/focal_small_patch4_window7_224.yaml b/image_classification/Focal_Transformer/configs/focal_small_patch4_window7_224.yaml new file mode 100644 index 00000000..2f66b304 --- /dev/null +++ b/image_classification/Focal_Transformer/configs/focal_small_patch4_window7_224.yaml @@ -0,0 +1,14 @@ +MODEL: + TYPE: focal + NAME: focal_small_patch4_window7_224 + DROP_PATH: 0.3 + FOCAL: + EMBED_DIM: 96 + DEPTHS: [2, 2, 18, 2] + NUM_HEADS: [3, 6, 12, 24] + WINDOW_SIZE: 7 + FOCAL_POOL: "fc" + FOCAL_STAGES: [0, 1, 2, 3] + FOCAL_LEVELS: [2, 2, 2, 2] + FOCAL_WINDOWS: [7, 5, 3, 1] + EXPAND_SIZES: [3, 3, 3, 3] \ No newline at end of file diff --git a/image_classification/Focal_Transformer/configs/focal_small_useconv_patch4_window7_224.yaml b/image_classification/Focal_Transformer/configs/focal_small_useconv_patch4_window7_224.yaml new file mode 100644 index 00000000..698ae7d0 --- /dev/null +++ b/image_classification/Focal_Transformer/configs/focal_small_useconv_patch4_window7_224.yaml @@ -0,0 +1,15 @@ +MODEL: + TYPE: focal + NAME: focal_small_patch4_window7_224 + DROP_PATH: 0.3 + FOCAL: + USE_CONV_EMBED: True + EMBED_DIM: 96 + DEPTHS: [2, 2, 18, 2] + NUM_HEADS: [3, 6, 12, 24] + WINDOW_SIZE: 7 + FOCAL_POOL: "fc" + FOCAL_STAGES: [0, 1, 2, 3] + FOCAL_LEVELS: [2, 2, 2, 2] + FOCAL_WINDOWS: [7, 5, 3, 1] + EXPAND_SIZES: [3, 3, 3, 3] \ No newline at end of file diff --git a/image_classification/Focal_Transformer/configs/focal_tiny_patch4_window7_224.yaml b/image_classification/Focal_Transformer/configs/focal_tiny_patch4_window7_224.yaml new file mode 100644 index 00000000..f9387a87 --- /dev/null +++ b/image_classification/Focal_Transformer/configs/focal_tiny_patch4_window7_224.yaml @@ -0,0 +1,14 @@ +MODEL: + TYPE: focal + NAME: focal_tiny_patch4_window7_224 + DROP_PATH: 0.2 + FOCAL: + EMBED_DIM: 96 + DEPTHS: [2, 2, 6, 2] + NUM_HEADS: [3, 6, 12, 24] + WINDOW_SIZE: 7 + FOCAL_POOL: "fc" + FOCAL_STAGES: [0, 1, 2, 3] + FOCAL_LEVELS: [2, 2, 2, 2] + FOCAL_WINDOWS: [7, 5, 3, 1] + EXPAND_SIZES: [3, 3, 3, 3] \ No newline at end of file diff --git a/image_classification/Focal_Transformer/configs/focal_tiny_useconv_patch4_window7_224.yaml b/image_classification/Focal_Transformer/configs/focal_tiny_useconv_patch4_window7_224.yaml new file mode 100644 index 00000000..eb3739c4 --- /dev/null +++ b/image_classification/Focal_Transformer/configs/focal_tiny_useconv_patch4_window7_224.yaml @@ -0,0 +1,15 @@ +MODEL: + TYPE: focal + NAME: focal_tiny_patch4_window7_224 + DROP_PATH: 0.2 + FOCAL: + USE_CONV_EMBED: True + EMBED_DIM: 96 + DEPTHS: [2, 2, 6, 2] + NUM_HEADS: [3, 6, 12, 24] + WINDOW_SIZE: 7 + FOCAL_POOL: "fc" + FOCAL_STAGES: [0, 1, 2, 3] + 
FOCAL_LEVELS: [2, 2, 2, 2] + FOCAL_WINDOWS: [7, 5, 3, 1] + EXPAND_SIZES: [3, 3, 3, 3] \ No newline at end of file diff --git a/image_classification/Focal_Transformer/datasets.py b/image_classification/Focal_Transformer/datasets.py new file mode 100644 index 00000000..cc793941 --- /dev/null +++ b/image_classification/Focal_Transformer/datasets.py @@ -0,0 +1,216 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Dataset related classes and methods for ViT training and validation +Cifar10, Cifar100 and ImageNet2012 are supported +""" + +import os +import math +from paddle.io import Dataset +from paddle.io import DataLoader +from paddle.io import DistributedBatchSampler +from paddle.vision import transforms +from paddle.vision import datasets +from paddle.vision import image_load +from augment import auto_augment_policy_original +from augment import AutoAugment +from augment import rand_augment_policy_original +from augment import RandAugment +from transforms import RandomHorizontalFlip +from random_erasing import RandomErasing + + +class ImageNet2012Dataset(Dataset): + """Build ImageNet2012 dataset + This class gets train/val imagenet datasets, which loads transfomed data and labels. + Attributes: + file_folder: path where imagenet images are stored + transform: preprocessing ops to apply on image + img_path_list: list of full path of images in whole dataset + label_list: list of labels of whole dataset + """ + + def __init__(self, file_folder, mode="train", transform=None): + """Init ImageNet2012 Dataset with dataset file path, mode(train/val), and transform""" + super(ImageNet2012Dataset, self).__init__() + assert mode in ["train", "val"] + self.file_folder = file_folder + self.transform = transform + self.img_path_list = [] + self.label_list = [] + + if mode == "train": + self.list_file = os.path.join(self.file_folder, "train_list.txt") + else: + self.list_file = os.path.join(self.file_folder, "val_list.txt") + + with open(self.list_file, 'r') as infile: + for line in infile: + img_path = line.strip().split()[0] + img_label = int(line.strip().split()[1]) + self.img_path_list.append(os.path.join(self.file_folder, img_path)) + self.label_list.append(img_label) + print(f'----- Imagenet2012 image {mode} list len = {len(self.label_list)}') + + def __len__(self): + return len(self.label_list) + + def __getitem__(self, index): + data = image_load(self.img_path_list[index]).convert('RGB') + data = self.transform(data) + label = self.label_list[index] + + return data, label + + +def get_train_transforms(config): + """ Get training transforms + For training, a RandomResizedCrop is applied, then normalization is applied with + [0.5, 0.5, 0.5] mean and std. The input pixel values must be rescaled to [0, 1.] 
+ Outputs is converted to tensor + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + + aug_op_list = [] + # STEP1: random crop and resize + aug_op_list.append( + transforms.RandomResizedCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE), + scale=(0.05, 1.0), interpolation='bicubic')) + # STEP2: auto_augment or color jitter + if config.TRAIN.AUTO_AUGMENT: + policy = auto_augment_policy_original() + auto_augment = AutoAugment(policy) + aug_op_list.append(auto_augment) + elif config.TRAIN.RAND_AUGMENT: + policy = rand_augment_policy_original() + rand_augment = RandAugment(policy) + aug_op_list.append(rand_augment) + else: + jitter = (float(config.TRAIN.COLOR_JITTER), ) * 3 + aug_op_list.append(transforms.ColorJitter(*jitter)) + # STEP3: other ops + aug_op_list.append(transforms.ToTensor()) + aug_op_list.append(transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, + std=config.DATA.IMAGENET_STD)) + # STEP4: random erasing + if config.TRAIN.RANDOM_ERASE_PROB > 0.: + random_erasing = RandomErasing(prob=config.TRAIN.RANDOM_ERASE_PROB, + mode=config.TRAIN.RANDOM_ERASE_MODE, + max_count=config.TRAIN.RANDOM_ERASE_COUNT, + num_splits=config.TRAIN.RANDOM_ERASE_SPLIT) + aug_op_list.append(random_erasing) + # Final: compose transforms and return + transforms_train = transforms.Compose(aug_op_list) + return transforms_train + + +def get_val_transforms(config): + """ Get training transforms + For validation, image is first Resize then CenterCrop to image_size. + Then normalization is applied with [0.5, 0.5, 0.5] mean and std. + The input pixel values must be rescaled to [0, 1.] + Outputs is converted to tensor + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + if config.TEST.CROP: + scale_size = int(math.floor(config.DATA.IMAGE_SIZE / config.DATA.CROP_PCT)) + transforms_val = transforms.Compose([ + transforms.Resize(scale_size, interpolation='bicubic'), + transforms.CenterCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE)), + transforms.ToTensor(), + transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD), + ]) + else: + transforms_val = transforms.Compose([ + transforms.Resize(config.DATA.IMAGE_SIZE, interpolation='bicubic'), + transforms.ToTensor(), + transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD), + ]) + return transforms_val + + +def get_dataset(config, mode='train'): + """ Get dataset from config and mode (train/val) + Returns the related dataset object according to configs and mode(train/val) + Args: + config: configs contains dataset related settings. 
see config.py for details + Returns: + dataset: dataset object + """ + + assert mode in ['train', 'val'] + if config.DATA.DATASET == "cifar10": + if mode == 'train': + dataset = datasets.Cifar10(mode=mode, transform=get_train_transforms(config)) + else: + mode = 'test' + dataset = datasets.Cifar10(mode=mode, transform=get_val_transforms(config)) + elif config.DATA.DATASET == "cifar100": + if mode == 'train': + dataset = datasets.Cifar100(mode=mode, transform=get_train_transforms(config)) + else: + mode = 'test' + dataset = datasets.Cifar100(mode=mode, transform=get_val_transforms(config)) + elif config.DATA.DATASET == "imagenet2012": + if mode == 'train': + dataset = ImageNet2012Dataset(config.DATA.DATA_PATH, + mode=mode, + transform=get_train_transforms(config)) + else: + dataset = ImageNet2012Dataset(config.DATA.DATA_PATH, + mode=mode, + transform=get_val_transforms(config)) + else: + raise NotImplementedError( + "[{config.DATA.DATASET}] Only cifar10, cifar100, imagenet2012 are supported now") + return dataset + + +def get_dataloader(config, dataset, mode='train', multi_process=False): + """Get dataloader with config, dataset, mode as input, allows multiGPU settings. + Multi-GPU loader is implements as distributedBatchSampler. + Args: + config: see config.py for details + dataset: paddle.io.dataset object + mode: train/val + multi_process: if True, use DistributedBatchSampler to support multi-processing + Returns: + dataloader: paddle.io.DataLoader object. + """ + + if mode == 'train': + batch_size = config.DATA.BATCH_SIZE + else: + batch_size = config.DATA.BATCH_SIZE_EVAL + + if multi_process is True: + sampler = DistributedBatchSampler(dataset, + batch_size=batch_size, + shuffle=(mode == 'train')) + dataloader = DataLoader(dataset, + batch_sampler=sampler, + num_workers=config.DATA.NUM_WORKERS) + else: + dataloader = DataLoader(dataset, + batch_size=batch_size, + num_workers=config.DATA.NUM_WORKERS, + shuffle=(mode == 'train')) + return dataloader diff --git a/image_classification/Focal_Transformer/focal_transformer.py b/image_classification/Focal_Transformer/focal_transformer.py new file mode 100644 index 00000000..d93cc4a7 --- /dev/null +++ b/image_classification/Focal_Transformer/focal_transformer.py @@ -0,0 +1,1180 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import math +import numpy as np +import paddle +from paddle import nn +from paddle.nn import functional as F + +class DropPath(nn.Layer): + r"""DropPath class""" + def __init__(self, drop_prob=None): + super(DropPath, self).__init__() + self.drop_prob = drop_prob + + def drop_path(self, inputs): + """drop path op + Args: + input: tensor with arbitrary shape + drop_prob: float number of drop path probability, default: 0.0 + training: bool, set if current mode is training, default: False + Returns: + output: output tensor after drop path + """ + # if prob is 0 or eval mode, return original input + if self.drop_prob == 0. 
or not self.training: + return inputs + keep_prob = 1 - self.drop_prob + keep_prob = paddle.to_tensor(keep_prob, dtype='float32') + shape = (inputs.shape[0], ) + (1, ) * (inputs.ndim - 1) # shape=(N, 1, 1, 1) + random_tensor = keep_prob + paddle.rand(shape, dtype=inputs.dtype) + random_tensor = random_tensor.floor() # mask + # divide is to keep same output expectation + output = inputs.divide(keep_prob) * random_tensor + return output + + def forward(self, inputs): + return self.drop_path(inputs) + + +class Identity(nn.Layer): + r""" Identity layer + The output of this layer is the input without any change. + Use this layer to avoid using 'if' condition in forward methods + """ + def __init__(self): + super().__init__() + + def forward(self, x): + return x + + +class Mlp(nn.Layer): + r""" MLP module + """ + def __init__(self, in_features, hidden_features=None, + out_features=None, act_layer=nn.GELU, drop=0.): + super().__init__() + out_features = out_features or in_features + hidden_features = hidden_features or in_features + + weight_attr, bias_attr = self._init_weights() + + self.fc1 = nn.Linear(in_features, hidden_features, + weight_attr=weight_attr, bias_attr=bias_attr) + self.act = act_layer() + self.fc2 = nn.Linear(hidden_features, out_features, + weight_attr=weight_attr, bias_attr=bias_attr) + self.drop = nn.Dropout(drop) + + def forward(self, x): + x = self.fc1(x) + x = self.act(x) + x = self.drop(x) + x = self.fc2(x) + x = self.drop(x) + return x + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(0)) + return weight_attr, bias_attr + + +def window_partition(x, window_size): + r"""window_partition + Args: + x: (B, H, W, C) + window_size (int): window size + Returns: + windows: (num_windows*B, window_size, window_size, C) + """ + B, H, W, C = x.shape + x = x.reshape((B, H // window_size, window_size, W // window_size, window_size, C)) + windows = x.transpose((0, 1, 3, 2, 4, 5)).reshape((-1, window_size, window_size, C)) + return windows + + +def window_partition_noreshape(x, window_size): + r"""window_partition_noreshape + Args: + x: (B, H, W, C) + window_size (int): window size + Returns: + windows: (B, num_windows_h, num_windows_w, window_size, window_size, C) + """ + B, H, W, C = x.shape + x = x.reshape((B, H // window_size, window_size, W // window_size, window_size, C)) + windows = x.transpose((0, 1, 3, 2, 4, 5)) + return windows + + +def window_reverse(windows, window_size, H, W): + r"""window_reverse + Args: + windows: (num_windows*B, window_size, window_size, C) + window_size (int): Window size + H (int): Height of image + W (int): Width of image + Returns: + x: (B, H, W, C) + """ + B = int(windows.shape[0] / (H * W / window_size / window_size)) + x = windows.reshape((B, H // window_size, W // window_size, window_size, window_size, -1)) + x = x.transpose((0, 1, 3, 2, 4, 5)).reshape((B, H, W, -1)) + return x + + +def get_relative_position_index(q_windows, k_windows): + r""" + Args: + q_windows: tuple (query_window_height, query_window_width) + k_windows: tuple (key_window_height, key_window_width) + Returns: + relative_position_index: + query_window_height*query_window_width, key_window_height*key_window_width + """ + # get pair-wise relative position index for each token inside the window + coords_h_q = paddle.arange(q_windows[0]) + coords_w_q = paddle.arange(q_windows[1]) + coords_q = paddle.stack(paddle.meshgrid([coords_h_q, coords_w_q])) # 2, 
Wh_q, Ww_q + + coords_h_k = paddle.arange(k_windows[0]) + coords_w_k = paddle.arange(k_windows[1]) + coords_k = paddle.stack(paddle.meshgrid([coords_h_k, coords_w_k])) # 2, Wh, Ww + + coords_flatten_q = paddle.flatten(coords_q, 1) # 2, Wh_q*Ww_q + coords_flatten_k = paddle.flatten(coords_k, 1) # 2, Wh_k*Ww_k + + coords_flatten_q = paddle.unsqueeze(coords_flatten_q, axis=-1) # 2, Wh_q*Ww_q, 1 + coords_flatten_k = paddle.unsqueeze(coords_flatten_k, axis=-2) # 2, 1, Ww_k*Ww_k + + relative_coords = coords_flatten_q - coords_flatten_k # 2, Wh_q*Ww_q, Wh_k*Ww_k + relative_coords = relative_coords.transpose((1, 2, 0)) # Wh_q*Ww_q, Wh_k*Ww_k, 2 + relative_coords[:, :, 0] += k_windows[0] - 1 # shift to start from 0 + relative_coords[:, :, 1] += k_windows[1] - 1 + relative_coords[:, :, 0] *= (q_windows[1] + k_windows[1]) - 1 + relative_position_index = relative_coords.sum(-1) # Wh_q*Ww_q, Wh_k*Ww_k + return relative_position_index + + +class WindowAttention(nn.Layer): + r""" Window based multi-head self attention (W-MSA) module with relative position bias. + Args: + dim (int): Number of input channels. + expand_size (int): The expand size at focal level 1. + window_size (tuple[int]): The height and width of the window. + focal_window (int): Focal region size. + focal_level (int): Focal attention level. + num_heads (int): Number of attention heads. + qkv_bias (bool, optional): If True, add a learnable bias to query, key, value. + Default: True + qk_scale (float | None, optional): Override default qk scale of head_dim ** -0.5 if set + attn_drop (float, optional): Dropout ratio of attention weight. Default: 0.0 + proj_drop (float, optional): Dropout ratio of output. Default: 0.0 + pool_method (str): window pooling method. Default: none + """ + def __init__(self, dim, expand_size, window_size, focal_window, + focal_level, num_heads, qkv_bias=True, qk_scale=None, + attn_drop=0., proj_drop=0., pool_method="none"): + super().__init__() + self.dim = dim + self.expand_size = expand_size + self.window_size = window_size # Wh, Ww + self.pool_method = pool_method + self.num_heads = num_heads + head_dim = dim // num_heads + self.scale = qk_scale or head_dim ** -0.5 + self.focal_level = focal_level + self.focal_window = focal_window + + weight_attr, bias_attr = self._init_weights() + + # define a parameter table of relative position bias for each window + self.relative_position_bias_table = paddle.create_parameter( + shape=((2 * window_size[0] - 1) * (2 * window_size[1] - 1), num_heads), + dtype=np.float32, is_bias=True) # 2*Wh-1 * 2*Ww-1, nH + + # get pair-wise relative position index for each token inside the window + coords_h = paddle.arange(self.window_size[0]) + coords_w = paddle.arange(self.window_size[1]) + coords = paddle.stack(paddle.meshgrid([coords_h, coords_w])) # 2, Wh, Ww + coords_flatten = paddle.flatten(coords, 1) # 2, Wh*Ww + + coords_flatten_l = paddle.unsqueeze(coords_flatten, axis=-1) # 2, Wh*Ww, 1 + coords_flatten_r = paddle.unsqueeze(coords_flatten, axis=-2) # 2, 1, Wh*Ww + relative_coords = coords_flatten_l - coords_flatten_r # 2, Wh*Ww, Wh*Ww + + relative_coords = relative_coords.transpose((1, 2, 0)) # Wh*Ww, Wh*Ww, 2 + relative_coords[:, :, 0] += self.window_size[0] - 1 # shift to start from 0 + relative_coords[:, :, 1] += self.window_size[1] - 1 + relative_coords[:, :, 0] *= 2 * self.window_size[1] - 1 + relative_position_index = relative_coords.sum(-1) # Wh*Ww, Wh*Ww + self.register_buffer("relative_position_index", relative_position_index) + + if self.expand_size > 0 and focal_level > 
0: + # define a parameter table of position bias between window + # and its fine-grained surroundings + self.window_size_of_key = self.window_size[0] * \ + self.window_size[1] if self.expand_size == 0 else \ + (4 * self.window_size[0] * self.window_size[1] - 4 * \ + (self.window_size[0] - self.expand_size) * \ + (self.window_size[0] - self.expand_size)) + + self.relative_position_bias_table_to_neighbors = paddle.create_parameter( + shape=(1, num_heads, + self.window_size[0] * self.window_size[1], self.window_size_of_key), + dtype=np.float32, is_bias=True, + attr=nn.initializer.TruncatedNormal(std=.02)) # Wh*Ww, nH, nSurrounding + + # get mask for rolled k and rolled v + mask_tl = paddle.ones((self.window_size[0], self.window_size[1])) + mask_tl[:-self.expand_size, :-self.expand_size] = 0 + mask_tr = paddle.ones((self.window_size[0], self.window_size[1])) + mask_tr[:-self.expand_size, self.expand_size:] = 0 + mask_bl = paddle.ones((self.window_size[0], self.window_size[1])) + mask_bl[self.expand_size:, :-self.expand_size] = 0 + mask_br = paddle.ones((self.window_size[0], self.window_size[1])) + mask_br[self.expand_size:, self.expand_size:] = 0 + mask_rolled = paddle.stack((mask_tl, mask_tr, mask_bl, mask_br), 0).flatten(0) + self.register_buffer("valid_ind_rolled", paddle.flatten(mask_rolled.nonzero())) + + if pool_method != "none" and focal_level > 1: + self.relative_position_bias_table_to_windows = nn.ParameterList() + self.unfolds = nn.LayerList() + + # build relative position bias between local patch and pooled windows + for k in range(focal_level-1): + stride = 2**k + kernel_size = 2*(self.focal_window // 2) + 2**k + (2**k-1) + # define unfolding operations + self.unfolds.append( + nn.Unfold( + kernel_sizes=[kernel_size, kernel_size], + strides=stride, paddings=kernel_size // 2) + ) + + # define relative position bias table + relative_position_bias_table_to_windows = paddle.create_parameter( + shape=(self.num_heads, + (self.window_size[0] + self.focal_window + 2**k - 2) * \ + (self.window_size[1] + self.focal_window + 2**k - 2), ), + dtype=np.float32, is_bias=True, + attr=nn.initializer.TruncatedNormal(std=.02)) # Wh*Ww, nH, nSurrounding + self.relative_position_bias_table_to_windows.append( + relative_position_bias_table_to_windows) + + # define relative position bias index + relative_position_index_k = get_relative_position_index(self.window_size, + (self.focal_window + 2**k - 1, + self.focal_window + 2**k - 1)) + self.register_buffer("relative_position_index_{}".format(k), + relative_position_index_k) + + # define unfolding index for focal_level > 0 + if k > 0: + mask = paddle.zeros((kernel_size, kernel_size)) + mask[(2**k)-1:, (2**k)-1:] = 1 + self.register_buffer("valid_ind_unfold_{}".format(k), + paddle.flatten(mask.flatten(0).nonzero())) + + self.qkv = nn.Linear(dim, dim * 3, weight_attr=weight_attr, + bias_attr=bias_attr if qkv_bias else False) + self.attn_drop = nn.Dropout(attn_drop) + self.proj = nn.Linear(dim, dim, weight_attr=weight_attr, bias_attr=bias_attr) + self.proj_drop = nn.Dropout(proj_drop) + self.softmax = nn.Softmax(axis=-1) + + def forward(self, x_all, mask_all=None): + """ + Args: + x_all (list[Tensors]): input features at different granularity + mask_all (list[Tensors/None]): masks for input features at different granularity + """ + x = x_all[0] + + B, nH, nW, C = x.shape + qkv = self.qkv(x).reshape((B, nH, nW, 3, C)).transpose((3, 0, 1, 2, 4)) + q, k, v = qkv[0], qkv[1], qkv[2] # B, nH, nW, C + + + # partition q map + q_windows = window_partition(q, 
self.window_size[0]).reshape( + (-1, self.window_size[0] * self.window_size[0], + self.num_heads, C // self.num_heads)).transpose((0, 2, 1, 3)) + k_windows = window_partition(k, self.window_size[0]).reshape( + (-1, self.window_size[0] * self.window_size[0], + self.num_heads, C // self.num_heads)).transpose((0, 2, 1, 3)) + v_windows = window_partition(v, self.window_size[0]).reshape( + (-1, self.window_size[0] * self.window_size[0], + self.num_heads, C // self.num_heads)).transpose((0, 2, 1, 3)) + + if self.expand_size > 0 and self.focal_level > 0: + k_tl = paddle.roll(k, shifts=(-self.expand_size, -self.expand_size), axis=(1, 2)) + v_tl = paddle.roll(v, shifts=(-self.expand_size, -self.expand_size), axis=(1, 2)) + + k_tr = paddle.roll(k, shifts=(-self.expand_size, self.expand_size), axis=(1, 2)) + v_tr = paddle.roll(v, shifts=(-self.expand_size, self.expand_size), axis=(1, 2)) + + k_bl = paddle.roll(k, shifts=(self.expand_size, -self.expand_size), axis=(1, 2)) + v_bl = paddle.roll(v, shifts=(self.expand_size, -self.expand_size), axis=(1, 2)) + + k_br = paddle.roll(k, shifts=(self.expand_size, self.expand_size), axis=(1, 2)) + v_br = paddle.roll(v, shifts=(self.expand_size, self.expand_size), axis=(1, 2)) + + + k_tl_windows = window_partition(k_tl, self.window_size[0]).reshape( + (-1, self.window_size[0] * self.window_size[0], self.num_heads, C // self.num_heads)) + k_tr_windows = window_partition(k_tr, self.window_size[0]).reshape( + (-1, self.window_size[0] * self.window_size[0], self.num_heads, C // self.num_heads)) + k_bl_windows = window_partition(k_bl, self.window_size[0]).reshape( + (-1, self.window_size[0] * self.window_size[0], self.num_heads, C // self.num_heads)) + k_br_windows = window_partition(k_br, self.window_size[0]).reshape( + (-1, self.window_size[0] * self.window_size[0], self.num_heads, C // self.num_heads)) + + v_tl_windows = window_partition(v_tl, self.window_size[0]).reshape( + (-1, self.window_size[0] * self.window_size[0], self.num_heads, C // self.num_heads)) + v_tr_windows = window_partition(v_tr, self.window_size[0]).reshape( + (-1, self.window_size[0] * self.window_size[0], self.num_heads, C // self.num_heads)) + v_bl_windows = window_partition(v_bl, self.window_size[0]).reshape( + (-1, self.window_size[0] * self.window_size[0], self.num_heads, C // self.num_heads)) + v_br_windows = window_partition(v_br, self.window_size[0]).reshape( + (-1, self.window_size[0] * self.window_size[0], self.num_heads, C // self.num_heads)) + + k_rolled = paddle.concat((k_tl_windows, k_tr_windows, + k_bl_windows, k_br_windows), 1).transpose((0, 2, 1, 3)) + v_rolled = paddle.concat((v_tl_windows, v_tr_windows, + v_bl_windows, v_br_windows), 1).transpose((0, 2, 1, 3)) + + # mask out tokens in current window + k_rolled = paddle.gather(k_rolled, self.valid_ind_rolled.flatten(), axis=2) + v_rolled = paddle.gather(v_rolled, self.valid_ind_rolled.flatten(), axis=2) + k_rolled = paddle.concat((k_windows, k_rolled), 2) + v_rolled = paddle.concat((v_windows, v_rolled), 2) + else: + k_rolled = k_windows + v_rolled = v_windows + + if self.pool_method != "none" and self.focal_level > 1: + k_pooled = [] + v_pooled = [] + for k in range(self.focal_level-1): + stride = 2**k + x_window_pooled = x_all[k+1] # B, nWh, nWw, C + nWh, nWw = x_window_pooled.shape[1:3] + + # generate mask for pooled windows + mask = paddle.ones(shape=(nWh, nWw)).astype(x_window_pooled.dtype) + unfolded_mask = self.unfolds[k](mask.unsqueeze(0).unsqueeze(1)).reshape(( + 1, 1, self.unfolds[k].kernel_sizes[0], + 
self.unfolds[k].kernel_sizes[1], -1)).transpose((0, 4, 2, 3, 1)).\ + reshape((nWh*nWw // stride // stride, -1, 1)) + + if k > 0: + valid_ind_unfold_k = getattr(self, "valid_ind_unfold_{}".format(k)) + unfolded_mask = paddle.gather(unfolded_mask, valid_ind_unfold_k, axis=1) + # unfolded_mask = unfolded_mask[:, valid_ind_unfold_k] + + x_window_masks = unfolded_mask.flatten(1).unsqueeze(0) + # from numpy to paddle + x_window_masks = x_window_masks.numpy() + x_window_masks[x_window_masks==0] = -100.0 + x_window_masks[x_window_masks>0] = 0.0 + x_window_masks = paddle.to_tensor(x_window_masks.astype(np.float32)) + mask_all[k+1] = x_window_masks + + # generate k and v for pooled windows + qkv_pooled = self.qkv(x_window_pooled).reshape((B, nWh, nWw, 3, C)).transpose( + (3, 0, 4, 1, 2)) + k_pooled_k, v_pooled_k = qkv_pooled[1], qkv_pooled[2] # B, C, nWh, nWw + + # (B x (nH*nW)) x nHeads x (unfold_wsize x unfold_wsize) x head_dim + k_pooled_k = self.unfolds[k](k_pooled_k).reshape(( + B, C, self.unfolds[k].kernel_sizes[0], + self.unfolds[k].kernel_sizes[1], -1)).transpose( + (0, 4, 2, 3, 1)).reshape((-1, + self.unfolds[k].kernel_sizes[0]*self.unfolds[k].kernel_sizes[1], + self.num_heads, C // self.num_heads)).transpose((0, 2, 1, 3)) + v_pooled_k = self.unfolds[k](v_pooled_k).reshape(( + B, C, self.unfolds[k].kernel_sizes[0], + self.unfolds[k].kernel_sizes[1], -1)).transpose( + (0, 4, 2, 3, 1)).reshape((-1, + self.unfolds[k].kernel_sizes[0]*self.unfolds[k].kernel_sizes[1], + self.num_heads, C // self.num_heads)).transpose((0, 2, 1, 3)) + + if k > 0: + k_pooled_k = paddle.gather(k_pooled_k, valid_ind_unfold_k, axis=2) + v_pooled_k = paddle.gather(v_pooled_k, valid_ind_unfold_k, axis=2) + # k_pooled_k = k_pooled_k[:, :, valid_ind_unfold_k] + # v_pooled_k = v_pooled_k[:, :, valid_ind_unfold_k] + + k_pooled += [k_pooled_k] + v_pooled += [v_pooled_k] + k_all = paddle.concat([k_rolled] + k_pooled, 2) + v_all = paddle.concat([v_rolled] + v_pooled, 2) + else: + k_all = k_rolled + v_all = v_rolled + + N = k_all.shape[-2] + q_windows = q_windows * self.scale + # B*nW, nHead, window_size*window_size, focal_window_size*focal_window_size + attn = (paddle.mm(q_windows, k_all.transpose((0, 1, 3, 2)))) + + window_area = self.window_size[0] * self.window_size[1] + window_area_rolled = k_rolled.shape[2] + + # add relative position bias for tokens inside window + # Wh*Ww,Wh*Ww,nH + relative_position_bias = self.relative_position_bias_table[ + self.relative_position_index.flatten()].reshape(( + self.window_size[0] * self.window_size[1], + self.window_size[0] * self.window_size[1], -1)) + # nH, Wh*Ww, Wh*Ww + relative_position_bias = relative_position_bias.transpose((2, 0, 1)) + attn[:, :, :window_area, :window_area] = attn[:, :, :window_area, :window_area] + \ + relative_position_bias.unsqueeze(0) + + # add relative position bias for patches inside a window + if self.expand_size > 0 and self.focal_level > 0: + attn[:, :, :window_area, window_area:window_area_rolled] = attn[:, :, :window_area, + window_area:window_area_rolled] + self.relative_position_bias_table_to_neighbors + + if self.pool_method != "none" and self.focal_level > 1: + # add relative position bias for different windows in an image + offset = window_area_rolled + for k in range(self.focal_level-1): + # add relative position bias + relative_position_index_k = getattr(self, 'relative_position_index_{}'.format(k)) + relative_position_bias_to_windows = self.relative_position_bias_table_to_windows[k] + relative_position_bias_to_windows = paddle.gather( + 
relative_position_bias_to_windows, relative_position_index_k.flatten(), + axis=1).reshape((-1, self.window_size[0] * self.window_size[1], + (self.focal_window+2**k-1)**2, + )) # nH, NWh*NWw,focal_region*focal_region + attn[:, :, :window_area, offset:(offset + (self.focal_window+2**k-1)**2)] = \ + attn[:, :, :window_area, offset:(offset + (self.focal_window+2**k-1)**2)] + \ + relative_position_bias_to_windows.unsqueeze(0) + # add attentional mask + if mask_all[k+1] is not None: + attn[:, :, :window_area, offset:(offset + (self.focal_window+2**k-1)**2)] = \ + attn[:, :, :window_area, offset:(offset + \ + (self.focal_window+2**k-1)**2)] + \ + paddle.stack([mask_all[k+1].unsqueeze(-2).unsqueeze(-2)] * \ + (attn.shape[0] // mask_all[k+1].shape[1]), axis=0).\ + reshape((-1, 1, 1, mask_all[k+1].shape[-1])) + offset += (self.focal_window+2**k-1)**2 + + if mask_all[0] is not None: + nW = mask_all[0].shape[0] + attn = attn.reshape((attn.shape[0] // nW, nW, self.num_heads, window_area, N)) + attn[:, :, :, :, :window_area] = attn[:, :, :, :, :window_area] + \ + mask_all[0].unsqueeze(0).unsqueeze(2) + attn = attn.reshape((-1, self.num_heads, window_area, N)) + attn = self.softmax(attn) + else: + attn = self.softmax(attn) + + attn = self.attn_drop(attn) + x = paddle.mm(attn, v_all).transpose((0, 2, 1, 3)).reshape( + (attn.shape[0], window_area, C)) + x = self.proj(x) + x = self.proj_drop(x) + return x + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(0)) + return weight_attr, bias_attr + + +class FocalTransformerBlock(nn.Layer): + r""" Focal Transformer Block. + Args: + dim (int): Number of input channels. + input_resolution (tuple[int]): Input resulotion. + num_heads (int): Number of attention heads. + window_size (int): Window size. + expand_size (int): expand size at first focal level (finest level). + shift_size (int): Shift size for SW-MSA. + mlp_ratio (float): Ratio of mlp hidden dim to embedding dim. + qkv_bias (bool, optional): If True, add a learnable bias to query, key, value. + Default: True + qk_scale (float | None, optional): Override default qk scale of head_dim ** -0.5 if set. + drop (float, optional): Dropout rate. Default: 0.0 + attn_drop (float, optional): Attention dropout rate. Default: 0.0 + drop_path (float, optional): Stochastic depth rate. Default: 0.0 + act_layer (nn.Module, optional): Activation layer. Default: nn.GELU + norm_layer (nn.Module, optional): Normalization layer. Default: nn.LayerNorm + pool_method (str): window pooling method. Default: none, options: [none|fc|conv] + focal_level (int): number of focal levels. Default: 1. + focal_window (int): region size of focal attention. Default: 1 + use_layerscale (bool): whether use layer scale for training stability. Default: False + layerscale_value (float): scaling value for layer scale. 
Default: 1e-4 + """ + def __init__(self, dim, input_resolution, num_heads, window_size=7, expand_size=0, + shift_size=0, mlp_ratio=4., qkv_bias=True, qk_scale=None, drop=0., + attn_drop=0., drop_path=0., act_layer=nn.GELU, norm_layer=nn.LayerNorm, + pool_method="none", focal_level=1, focal_window=1, use_layerscale=False, + layerscale_value=1e-4): + super(FocalTransformerBlock, self).__init__() + self.dim = dim + self.input_resolution = input_resolution + self.num_heads = num_heads + self.window_size = window_size + self.shift_size = shift_size + self.expand_size = expand_size + self.mlp_ratio = mlp_ratio + self.pool_method = pool_method + self.focal_level = focal_level + self.focal_window = focal_window + self.use_layerscale = use_layerscale + + weight_attr, bias_attr = self._init_weights() + + if min(self.input_resolution) <= self.window_size: + # if window size is larger than input resolution, we don't partition windows + self.expand_size = 0 + self.shift_size = 0 + self.window_size = min(self.input_resolution) + assert 0 <= self.shift_size < self.window_size, "shift_size must in 0-window_size" + + self.window_size_glo = self.window_size + + self.pool_layers = nn.LayerList() + if self.pool_method != "none": + for k in range(self.focal_level-1): + window_size_glo = math.floor(self.window_size_glo / (2 ** k)) + if self.pool_method == "fc": + self.pool_layers.append(nn.Linear(window_size_glo * window_size_glo, 1, + weight_attr=weight_attr, bias_attr=bias_attr)) + self.pool_layers[len(self.pool_layers)-1].weight.set_value( + paddle.full_like(self.pool_layers[len(self.pool_layers)-1].weight, + 1./(window_size_glo * window_size_glo)) + ) + self.pool_layers[len(self.pool_layers)-1].bias.set_value( + paddle.full_like(self.pool_layers[len(self.pool_layers)-1].bias, 0) + ) + + elif self.pool_method == "conv": + self.pool_layers.append(nn.Conv2D(dim, dim, + kernel_size=window_size_glo, + stride=window_size_glo, groups=dim)) + + self.norm1 = norm_layer(dim, + weight_attr=paddle.ParamAttr(initializer=nn.initializer.Constant(1.0)), + bias_attr=bias_attr) + + self.attn = WindowAttention( + dim, expand_size=self.expand_size, + window_size=(self.window_size, self.window_size), + focal_window=focal_window, focal_level=focal_level, + num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale, + attn_drop=attn_drop,proj_drop=drop, pool_method=pool_method) + + self.drop_path = DropPath(drop_path) if drop_path > 0. 
else Identity() + self.norm2 = norm_layer(dim, + weight_attr=paddle.ParamAttr(initializer=nn.initializer.Constant(1.0)), + bias_attr=bias_attr) + mlp_hidden_dim = int(dim * mlp_ratio) + self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, + act_layer=act_layer, drop=drop) + + if self.shift_size > 0: + # calculate attention mask for SW-MSA + H, W = self.input_resolution + img_mask = paddle.zeros((1, H, W, 1)) # 1 H W 1 + h_slices = (slice(0, -self.window_size), + slice(-self.window_size, -self.shift_size), + slice(-self.shift_size, None)) + w_slices = (slice(0, -self.window_size), + slice(-self.window_size, -self.shift_size), + slice(-self.shift_size, None)) + cnt = 0 + for h in h_slices: + for w in w_slices: + img_mask[:, h, w, :] = cnt + cnt += 1 + + # nW, window_size, window_size, 1 + mask_windows = window_partition(img_mask, self.window_size) + mask_windows = mask_windows.reshape((-1, self.window_size * self.window_size)) + attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2) + # from numpy to paddle + attn_mask = attn_mask.numpy() + attn_mask[attn_mask!=0] = -100.0 + attn_mask[attn_mask==0] = 0.0 + attn_mask = paddle.to_tensor(attn_mask.astype(np.float32)) + else: + attn_mask = None + self.register_buffer("attn_mask", attn_mask) + + if self.use_layerscale: + self.gamma_1 = paddle.create_parameter(layerscale_value * paddle.ones((dim))) + self.gamma_2 = paddle.create_parameter(layerscale_value * paddle.ones((dim))) + + def forward(self, x): + H, W = self.input_resolution + B, L, C = x.shape + assert L == H * W, "input feature has wrong size" + + shortcut = x + x = self.norm1(x) + x = x.reshape((B, H, W, C)) + + # pad feature maps to multiples of window size + pad_l = pad_t = 0 + pad_r = (self.window_size - W % self.window_size) % self.window_size + pad_b = (self.window_size - H % self.window_size) % self.window_size + if pad_r > 0 or pad_b > 0: + x = F.pad(x, [0, 0, pad_l, pad_r, pad_t, pad_b, 0, 0]) + + B, H, W, C = x.shape + + if self.shift_size > 0: + shifted_x = paddle.roll(x, shifts=(-self.shift_size, -self.shift_size), axis=(1, 2)) + else: + shifted_x = x + + x_windows_all = [shifted_x] + x_window_masks_all = [self.attn_mask] + + if self.focal_level > 1 and self.pool_method != "none": + # if we add coarser granularity and the pool method is not none + for k in range(self.focal_level-1): + window_size_glo = math.floor(self.window_size_glo / (2 ** k)) + pooled_h = math.ceil(H / self.window_size) * (2 ** k) + pooled_w = math.ceil(W / self.window_size) * (2 ** k) + H_pool = pooled_h * window_size_glo + W_pool = pooled_w * window_size_glo + + x_level_k = shifted_x + # trim or pad shifted_x depending on the required size + if H > H_pool: + trim_t = (H - H_pool) // 2 + trim_b = H - H_pool - trim_t + x_level_k = x_level_k[:, trim_t:-trim_b] + elif H < H_pool: + pad_t = (H_pool - H) // 2 + pad_b = H_pool - H - pad_t + x_level_k = F.pad(x_level_k, [0, 0, 0, 0, pad_t, pad_b, 0, 0]) + + if W > W_pool: + trim_l = (W - W_pool) // 2 + trim_r = W - W_pool - trim_l + x_level_k = x_level_k[:, :, trim_l:-trim_r] + elif W < W_pool: + pad_l = (W_pool - W) // 2 + pad_r = W_pool - W - pad_l + x_level_k = F.pad(x_level_k, [0, 0, pad_l, pad_r, 0, 0, 0, 0]) + + # B, nw, nw, window_size, window_size, C + x_windows_noreshape = window_partition_noreshape(x_level_k, window_size_glo) + nWh, nWw = x_windows_noreshape.shape[1:3] + + if self.pool_method == "mean": + # B, nWh, nWw, C + x_windows_pooled = x_windows_noreshape.mean([3, 4]) + elif self.pool_method == "max": + # B, nWh, nWw, C + 
x_windows_pooled = x_windows_noreshape.max(-2)[0].max(-2)[0].reshape( + (B, nWh, nWw, C)) + elif self.pool_method == "fc": + # B, nWh, nWw, C, wsize**2 + x_windows_noreshape = x_windows_noreshape.reshape((B, nWh, nWw, + window_size_glo*window_size_glo, C)).transpose( + (0, 1, 2, 4, 3)) + # B, nWh, nWw, C + x_windows_pooled = self.pool_layers[k](x_windows_noreshape).flatten(-2) + elif self.pool_method == "conv": + # B * nw * nw, C, wsize, wsize + x_windows_noreshape = x_windows_noreshape.reshape((-1, + window_size_glo, window_size_glo, C)).transpose( + (0, 3, 1, 2)) + # B, nWh, nWw, C + x_windows_pooled = self.pool_layers[k](x_windows_noreshape).reshape( + (B, nWh, nWw, C)) + + x_windows_all += [x_windows_pooled] + x_window_masks_all += [None] + + # nW*B, window_size*window_size, C + attn_windows = self.attn(x_windows_all, mask_all=x_window_masks_all) + attn_windows = attn_windows[:, :self.window_size ** 2] + + x = self.merge_windows_and_ffn(attn_windows, shortcut, B, C, H, W) + + return x + + + def merge_windows_and_ffn(self, attn_windows, shortcut, B, C, H, W): + attn_windows = attn_windows.reshape((-1, self.window_size, self.window_size, C)) + shifted_x = window_reverse(attn_windows, self.window_size, H, W) # B H' W' C + + # reverse cyclic shift + x = self.reverse_cyclic_shift(shifted_x) + x = x[:, :self.input_resolution[0], :self.input_resolution[1]].reshape((B, -1, C)) + + # FFN + x = self.ffn(x, shortcut) + + return x + + + def reverse_cyclic_shift(self, shifted_x): + if self.shift_size > 0: + x = paddle.roll(shifted_x, shifts=(self.shift_size, self.shift_size), axis=(1, 2)) + else: + x = shifted_x + return x + + + def ffn(self, x, shortcut): + x = shortcut + self.drop_path(x if (not self.use_layerscale) else (self.gamma_1 * x)) + x = x + self.drop_path(self.mlp(self.norm2(x)) if (not self.use_layerscale) else ( + self.gamma_2 * self.mlp(self.norm2(x)))) + return x + + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(0)) + return weight_attr, bias_attr + + +class PatchMerging(nn.Layer): + r""" Patch Merging Layer. + Args: + img_size (tuple[int]): Resolution of input feature. + in_chans (int): Number of input channels. + norm_layer (nn.Module, optional): Normalization layer. Default: nn.LayerNorm + """ + def __init__(self, img_size, in_chans=3, norm_layer=nn.LayerNorm, **kwargs): + super().__init__() + self.input_resolution = img_size + self.dim = in_chans + weight_attr, bias_attr = self._init_weights() + self.reduction = nn.Linear(4 * in_chans, 2 * in_chans, bias_attr=False) + self.norm = norm_layer(4 * in_chans, + weight_attr=paddle.ParamAttr(initializer=nn.initializer.Constant(1.0)), + bias_attr=bias_attr) + + def forward(self, x): + """ + x: B, C, H, W + """ + B, C, H, W = x.shape + + x = x.transpose((0, 2, 3, 1)) + + x0 = x[:, 0::2, 0::2, :] # B H/2 W/2 C + x1 = x[:, 1::2, 0::2, :] # B H/2 W/2 C + x2 = x[:, 0::2, 1::2, :] # B H/2 W/2 C + x3 = x[:, 1::2, 1::2, :] # B H/2 W/2 C + x = paddle.concat([x0, x1, x2, x3], -1) # B H/2 W/2 4*C + x = x.reshape((B, -1, 4 * C)) # B H/2*W/2 4*C + + x = self.norm(x) + x = self.reduction(x) + + return x + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(0)) + return weight_attr, bias_attr + + +class BasicLayer(nn.Layer): + """ A basic Focal Transformer layer for one stage. 
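+ A stage stacks `depth` FocalTransformerBlock layers at a fixed resolution and,
+ when a `downsample` layer is given, ends with a patch-merging/embedding step
+ that halves the spatial resolution and doubles the channel dimension.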
+ Args: + dim (int): Number of input channels. + input_resolution (tuple[int]): Input resolution. + depth (int): Number of blocks. + num_heads (int): Number of attention heads. + window_size (int): Local window size. + expand_size (int): expand size for focal level 1. + expand_layer (str): expand layer. Default: all + mlp_ratio (float): Ratio of mlp hidden dim to embedding dim. Default: 4.0. + qkv_bias (bool, optional): If True, add a learnable bias to query, key, value. + Default: True + qk_scale (float | None, optional): Override default qk scale of head_dim ** -0.5 if set. + drop (float, optional): Dropout rate. Default: 0.0 + attn_drop (float, optional): Attention dropout rate. Default: 0.0 + drop_path (float | tuple[float], optional): Stochastic depth rate. Default: 0.0 + norm_layer (nn.Module, optional): Normalization layer. Default: nn.LayerNorm + pool_method (str): Window pooling method. Default: none. + focal_level (int): Number of focal levels. Default: 1. + focal_window (int): region size at each focal level. Default: 1. + use_conv_embed (bool): whether use overlapped convolutional patch embedding layer. + Default: False + use_shift (bool): Whether use window shift as in Swin Transformer. Default: False + use_pre_norm (bool): Whether use pre-norm before patch embedding projection for stability. + Default: False + downsample (nn.Module | None, optional): Downsample layer at the end of the layer. + Default: None + use_layerscale (bool): Whether use layer scale for stability. Default: False. + layerscale_value (float): Layerscale value. Default: 1e-4. + """ + def __init__(self, dim, input_resolution, depth, num_heads, window_size, + expand_size, expand_layer="all", mlp_ratio=4., qkv_bias=True, + qk_scale=None, drop=0., attn_drop=0., drop_path=0., norm_layer=nn.LayerNorm, + pool_method="none", focal_level=1, focal_window=1, use_conv_embed=False, + use_shift=False, use_pre_norm=False,downsample=None, use_layerscale=False, + layerscale_value=1e-4): + + super(BasicLayer, self).__init__() + self.dim = dim + self.input_resolution = input_resolution + self.depth = depth + + if expand_layer == "even": + expand_factor = 0 + elif expand_layer == "odd": + expand_factor = 1 + elif expand_layer == "all": + expand_factor = -1 + + # build blocks + self.blocks = nn.LayerList([ + FocalTransformerBlock(dim=dim, input_resolution=input_resolution, + num_heads=num_heads, window_size=window_size, + shift_size=(0 if (i % 2 == 0) else window_size // 2) if use_shift else 0, + expand_size=0 if (i % 2 == expand_factor) else expand_size, + mlp_ratio=mlp_ratio, + qkv_bias=qkv_bias, qk_scale=qk_scale, + drop=drop, + attn_drop=attn_drop, + drop_path=drop_path[i] if isinstance(drop_path, list) else drop_path, + norm_layer=norm_layer, + pool_method=pool_method, + focal_level=focal_level, + focal_window=focal_window, + use_layerscale=use_layerscale, + layerscale_value=layerscale_value) + for i in range(depth)]) + + # patch merging layer + if downsample is not None: + self.downsample = downsample( + img_size=input_resolution, patch_size=2, in_chans=dim, embed_dim=2*dim, + use_conv_embed=use_conv_embed, norm_layer=norm_layer, use_pre_norm=use_pre_norm, + is_stem=False + ) + else: + self.downsample = None + + def forward(self, x): + for blk in self.blocks: + x = blk(x) + + if self.downsample is not None: + x = x.reshape((x.shape[0], self.input_resolution[0], + self.input_resolution[1], -1)).transpose((0, 3, 1, 2)) + x = self.downsample(x) + return x + + +class PatchEmbed(nn.Layer): + r""" Image to Patch Embedding + 
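+ Projects an input image into a sequence of patch tokens, using either a plain
+ non-overlapping convolution (kernel = stride = patch_size) or, when
+ use_conv_embed is True, an overlapping convolution (7x7/stride 4 for the stem,
+ 3x3/stride 2 otherwise), optionally followed by a normalization layer.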
Args: + img_size (int): Image size. Default: 224. + patch_size (int): Patch token size. Default: 4. + in_chans (int): Number of input image channels. Default: 3. + embed_dim (int): Number of linear projection output channels. Default: 96. + use_conv_embed (bool): Wherther use overlapped convolutional embedding layer. + Default: False. + norm_layer (nn.Module, optional): Normalization layer. Default: None + use_pre_norm (bool): Whether use pre-normalization before projection. Default: False + is_stem (bool): Whether current patch embedding is stem. Default: False + """ + + def __init__(self, img_size=(224, 224), patch_size=4, in_chans=3, embed_dim=96, + use_conv_embed=False, norm_layer=None, use_pre_norm=False, is_stem=False): + super().__init__() + patch_size = (patch_size, patch_size) + patches_resolution = [img_size[0] // patch_size[0], img_size[1] // patch_size[1]] + self.img_size = img_size + self.patch_size = patch_size + self.patches_resolution = patches_resolution + self.num_patches = patches_resolution[0] * patches_resolution[1] + + self.in_chans = in_chans + self.embed_dim = embed_dim + self.use_pre_norm = use_pre_norm + + weight_attr, bias_attr = self._init_weights() + + if use_conv_embed: + # if we choose to use conv embedding, + # then we treat the stem and non-stem differently + if is_stem: + kernel_size = 7 + padding = 2 + stride = 4 + else: + kernel_size = 3 + padding = 1 + stride = 2 + self.proj = nn.Conv2D(in_chans, embed_dim, kernel_size=kernel_size, + stride=stride, padding=padding) + else: + self.proj = nn.Conv2D(in_chans, embed_dim, + kernel_size=patch_size, stride=patch_size) + + + if self.use_pre_norm: + if norm_layer is not None: + self.pre_norm = nn.GroupNorm(1, in_chans) + else: + self.pre_norm = None + + if norm_layer is not None: + self.norm = norm_layer(embed_dim, + weight_attr=paddle.ParamAttr(initializer=nn.initializer.Constant(1.0)), + bias_attr=bias_attr) + else: + self.norm = None + + def forward(self, x): + B, C, H, W = x.shape + + assert H == self.img_size[0] and W == self.img_size[1], \ + f"Input image size ({H}*{W}) doesn't match model ({self.img_size[0]}*{self.img_size[1]})." + + if self.use_pre_norm: + x = self.pre_norm(x) + + x = self.proj(x).flatten(2).transpose((0, 2, 1)) # B Ph*Pw C + if self.norm is not None: + x = self.norm(x) + return x + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(0)) + return weight_attr, bias_attr + + +class FocalTransformer(nn.Layer): + r"""Focal Transformer:Focal Self-attention for Local-Global Interactions in Vision Transformer + Args: + img_size (int | tuple(int)): Input image size. Default 224 + patch_size (int | tuple(int)): Patch size. Default: 4 + in_chans (int): Number of input image channels. Default: 3 + num_classes (int): Number of classes for classification head. Default: 1000 + embed_dim (int): Patch embedding dimension. Default: 96 + depths (tuple(int)): Depth of each Focal Transformer layer. + num_heads (tuple(int)): Number of attention heads in different layers. + window_size (int): Window size. Default: 7 + mlp_ratio (float): Ratio of mlp hidden dim to embedding dim. Default: 4 + qkv_bias (bool): If True, add a learnable bias to query, key, value. Default: True + qk_scale (float): Override default qk scale of head_dim ** -0.5 if set. Default: None + drop_rate (float): Dropout rate. Default: 0 + attn_drop_rate (float): Attention dropout rate. 
Default: 0 + drop_path_rate (float): Stochastic depth rate. Default: 0.1 + norm_layer (nn.Module): Normalization layer. Default: nn.LayerNorm. + ape (bool): If True, add absolute position embedding to + the patch embedding. Default: False + patch_norm (bool): If True, add normalization after patch embedding. Default: True + use_shift (bool): Whether to use window shift proposed by Swin Transformer. + We observe that using shift or not does not make difference to + our Focal Transformer.Default: False + focal_stages (list): Which stages to perform focal attention. + Default: [0, 1, 2, 3], means all stages + focal_levels (list): How many focal levels at all stages. + Note that this excludes the finest-grain level. Default: [1, 1, 1, 1] + focal_windows (list): The focal window size at all stages. Default: [7, 5, 3, 1] + expand_stages (list): Which stages to expand the finest grain window. + Default: [0, 1, 2, 3], means all stages + expand_sizes (list): The expand size for the finest grain level. Default: [3, 3, 3, 3] + expand_layer (str): Which layers we want to expand the window for the finest grain leve. + This can save computational and memory cost + without the loss of performance. Default: "all" + use_conv_embed (bool): Whether use convolutional embedding. + We noted that using convolutional embedding + usually improve the performance, + but we do not use it by default. Default: False + use_layerscale (bool): Whether use layerscale proposed in CaiT. Default: False + layerscale_value (float): Value for layer scale. Default: 1e-4 + use_pre_norm (bool): Whether use pre-norm in patch merging/embedding layer to + control the feature magtigute. Default: False + """ + def __init__(self, + img_size=224, + patch_size=4, + in_chans=3, + num_classes=1000, + embed_dim=96, + depths=[2, 2, 6, 2], + num_heads=[3, 6, 12, 24], + window_size=7, + mlp_ratio=4., + qkv_bias=True, + qk_scale=None, + drop_rate=0., + attn_drop_rate=0., + drop_path_rate=0.1, + norm_layer=nn.LayerNorm, + ape=False, + patch_norm=True, + use_shift=False, + focal_stages=[0, 1, 2, 3], + focal_levels=[1, 1, 1, 1], + focal_windows=[7, 5, 3, 1], + focal_pool="fc", + expand_stages=[0, 1, 2, 3], + expand_sizes=[3, 3, 3, 3], + expand_layer="all", + use_conv_embed=False, + use_layerscale=False, + layerscale_value=1e-4, + use_pre_norm=False, + **kwargs): + super().__init__() + + self.num_classes = num_classes + self.num_layers = len(depths) + self.embed_dim = embed_dim + self.ape = ape + self.patch_norm = patch_norm + self.num_features = int(embed_dim * 2 ** (self.num_layers - 1)) + self.mlp_ratio = mlp_ratio + + weight_attr, bias_attr = self._init_weights() + + # split image into patches using either non-overlapped embedding + # or overlapped embedding + self.patch_embed = PatchEmbed( + img_size=(img_size, img_size), + patch_size=patch_size, + in_chans=in_chans, + embed_dim=embed_dim, + use_conv_embed=use_conv_embed, is_stem=True, + norm_layer=norm_layer if self.patch_norm else None) + + num_patches = self.patch_embed.num_patches + patches_resolution = self.patch_embed.patches_resolution + self.patches_resolution = patches_resolution + + # absolute position embedding + if self.ape: + self.absolute_pos_embed = paddle.create_parameter(shape=(1, num_patches, embed_dim), + dtype=np.float32, is_bias=True, + attr=nn.initializer.TruncatedNormal(std=.02)) + + self.pos_drop = nn.Dropout(p=drop_rate) + + # stochastic depth + # stochastic depth decay rule + dpr = [x.numpy().item() for x in paddle.linspace(0, drop_path_rate, sum(depths))] + + # build 
layers + self.layers = nn.LayerList() + for i_layer in range(self.num_layers): + layer = BasicLayer(dim=int(embed_dim * 2 ** i_layer), + input_resolution=(patches_resolution[0] // (2 ** i_layer), + patches_resolution[1] // (2 ** i_layer)), + depth=depths[i_layer], + num_heads=num_heads[i_layer], + window_size=window_size, + mlp_ratio=self.mlp_ratio, + qkv_bias=qkv_bias, + qk_scale=qk_scale, + drop=drop_rate, + attn_drop=attn_drop_rate, + drop_path=dpr[sum(depths[:i_layer]):sum(depths[:i_layer + 1])], + norm_layer=norm_layer, + pool_method=focal_pool if i_layer in focal_stages else "none", + downsample=PatchEmbed if (i_layer < self.num_layers - 1) else None, + focal_level=focal_levels[i_layer], + focal_window=focal_windows[i_layer], + expand_size=expand_sizes[i_layer], + expand_layer=expand_layer, + use_conv_embed=use_conv_embed, + use_shift=use_shift, + use_pre_norm=use_pre_norm, + use_layerscale=use_layerscale, + layerscale_value=layerscale_value) + self.layers.append(layer) + + self.norm = norm_layer(self.num_features, + weight_attr=paddle.ParamAttr(initializer=nn.initializer.Constant(1.0)), + bias_attr=bias_attr) + self.avgpool = nn.AdaptiveAvgPool1D(1) + self.head = nn.Linear(self.num_features, num_classes, + weight_attr=weight_attr, bias_attr=bias_attr) if num_classes > 0 else Identity() + + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(0)) + return weight_attr, bias_attr + + def no_weight_decay(self): + return {'absolute_pos_embed'} + + def no_weight_decay_keywords(self): + return {'relative_position_bias_table', + 'relative_position_bias_table_to_neighbors', + 'relative_position_bias_table_to_windows'} + + def forward_features(self, x): + x = self.patch_embed(x) + if self.ape: + x = x + self.absolute_pos_embed + x = self.pos_drop(x) + + for layer in self.layers: + x = layer(x) + x = self.norm(x) # B L C + x = self.avgpool(x.transpose((0, 2, 1))) # B C 1 + x = paddle.flatten(x, 1) + return x + + def forward(self, x): + x = self.forward_features(x) + x = self.head(x) + return x + +def build_focal(config): + model = FocalTransformer( + img_size=config.DATA.IMAGE_SIZE, + patch_size=config.MODEL.FOCAL.PATCH_SIZE, + in_chans=config.MODEL.FOCAL.IN_CHANS, + num_classes=config.MODEL.NUM_CLASSES, + embed_dim=config.MODEL.FOCAL.EMBED_DIM, + depths=config.MODEL.FOCAL.DEPTHS, + num_heads=config.MODEL.FOCAL.NUM_HEADS, + window_size=config.MODEL.FOCAL.WINDOW_SIZE, + mlp_ratio=config.MODEL.FOCAL.MLP_RATIO, + qkv_bias=config.MODEL.FOCAL.QKV_BIAS, + qk_scale=config.MODEL.FOCAL.QK_SCALE, + drop_rate=config.MODEL.DROPOUT, + drop_path_rate=config.MODEL.DROP_PATH, + ape=config.MODEL.FOCAL.APE, + patch_norm=config.MODEL.FOCAL.PATCH_NORM, + use_shift=config.MODEL.FOCAL.USE_SHIFT, + expand_stages=config.MODEL.FOCAL.EXPAND_STAGES, + expand_sizes=config.MODEL.FOCAL.EXPAND_SIZES, + expand_layer=config.MODEL.FOCAL.EXPAND_LAYER, + focal_pool=config.MODEL.FOCAL.FOCAL_POOL, + focal_stages=config.MODEL.FOCAL.FOCAL_STAGES, + focal_windows=config.MODEL.FOCAL.FOCAL_WINDOWS, + focal_levels=config.MODEL.FOCAL.FOCAL_LEVELS, + use_conv_embed=config.MODEL.FOCAL.USE_CONV_EMBED, + use_layerscale=config.MODEL.FOCAL.USE_LAYERSCALE, + use_pre_norm=config.MODEL.FOCAL.USE_PRE_NORM + ) + return model diff --git a/image_classification/Focal_Transformer/losses.py b/image_classification/Focal_Transformer/losses.py new file mode 100644 index 00000000..09a8ef28 --- /dev/null +++ 
b/image_classification/Focal_Transformer/losses.py @@ -0,0 +1,119 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Implement Loss functions """ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class LabelSmoothingCrossEntropyLoss(nn.Layer): + """ cross entropy loss for label smoothing + Args: + smoothing: float, smoothing rate + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, target label with shape [N] + Returns: + loss: float, cross entropy loss value + """ + def __init__(self, smoothing=0.1): + super().__init__() + assert 0 <= smoothing < 1.0 + self.smoothing = smoothing + self.confidence = 1 - smoothing + + def forward(self, x, target): + log_probs = F.log_softmax(x) # [N, num_classes] + # target_index is used to get prob for each of the N samples + target_index = paddle.zeros([x.shape[0], 2], dtype='int64') # [N, 2] + target_index[:, 0] = paddle.arange(x.shape[0]) + target_index[:, 1] = target + + nll_loss = -log_probs.gather_nd(index=target_index) # index: [N] + smooth_loss = -log_probs.mean(axis=-1) + loss = self.confidence * nll_loss + self.smoothing * smooth_loss + return loss.mean() + + +class SoftTargetCrossEntropyLoss(nn.Layer): + """ cross entropy loss for soft target + Args: + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, soft target with shape [N, num_classes] + Returns: + loss: float, the mean loss value + """ + def __init__(self): + super().__init__() + + def forward(self, x, target): + loss = paddle.sum(-target * F.log_softmax(x, axis=-1), axis=-1) + return loss.mean() + + +class DistillationLoss(nn.Layer): + """Distillation loss function + This layer combines the original loss (criterion) with an extra + distillation loss (criterion), computed with one of several type options + between the current model and + a teacher model that acts as its supervision.
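+ Roughly, for the 'soft' option the returned value follows the DeiT-style blend
+ loss = (1 - alpha) * base_criterion(outputs, targets) + alpha * tau^2 * KL_soft,
+ where KL_soft is the KL divergence between the tau-softened student and teacher
+ distributions; the 'hard' option instead uses cross entropy against the
+ teacher's argmax predictions.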
+ Args: + base_criterion: nn.Layer, the original criterion + teacher_model: nn.Layer, the teacher model as supervision + distillation_type: str, one of ['none', 'soft', 'hard'] + alpha: float, ratio of base loss (* (1-alpha)) + and distillation loss (* alpha) + tau: float, temperature in distillation + """ + def __init__(self, + base_criterion, + teacher_model, + distillation_type, + alpha, + tau): + super().__init__() + assert distillation_type in ['none', 'soft', 'hard'] + self.base_criterion = base_criterion + self.teacher_model = teacher_model + self.type = distillation_type + self.alpha = alpha + self.tau = tau + + def forward(self, inputs, outputs, targets): + """ + Args: + inputs: tensor, the original model inputs + outputs: tensor, the outputs of the model + outputs_kd: tensor, the distillation outputs of the model, + this is usually obtained by a separate branch + in the last layer of the model + targets: tensor, the labels for the base criterion + """ + outputs, outputs_kd = outputs[0], outputs[1] + base_loss = self.base_criterion(outputs, targets) + if self.type == 'none': + return base_loss + + with paddle.no_grad(): + teacher_outputs = self.teacher_model(inputs) + + if self.type == 'soft': + distillation_loss = F.kl_div( + F.log_softmax(outputs_kd / self.tau, axis=1), + F.log_softmax(teacher_outputs / self.tau, axis=1), + reduction='sum') * (self.tau * self.tau) / outputs_kd.numel() + elif self.type == 'hard': + distillation_loss = F.cross_entropy(outputs_kd, teacher_outputs.argmax(axis=1)) + + loss = base_loss * (1 - self.alpha) + distillation_loss * self.alpha + return loss \ No newline at end of file diff --git a/image_classification/Focal_Transformer/main_multi_gpu.py b/image_classification/Focal_Transformer/main_multi_gpu.py new file mode 100644 index 00000000..1429c427 --- /dev/null +++ b/image_classification/Focal_Transformer/main_multi_gpu.py @@ -0,0 +1,593 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.
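+# Typical launch (sketch only; the config file name and data path are illustrative,
+# all flags are defined in get_arguments() below):
+#   python main_multi_gpu.py -cfg ./configs/focal_tiny_patch4_window7_224.yaml \
+#       -dataset imagenet2012 -batch_size 32 -data_path /path/to/ILSVRC2012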
+ +"""Focal Transformer training/validation using multiple GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +import paddle.distributed as dist +from datasets import get_dataloader +from datasets import get_dataset +from utils import AverageMeter +from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn +from config import get_config +from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from focal_transformer import build_focal as build_model + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('Focal Transformer') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=32) + parser.add_argument('-image_size', type=int, default=224) + parser.add_argument('-num_classes', type=int, default=1000) + parser.add_argument('-data_path', type=str, default=None) + + parser.add_argument('-output', type=str, default=None) + + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + + parser.add_argument('-save_freq', type=int, default=1) + parser.add_argument('-log_freq', type=int, default=100) + parser.add_argument('-validate_freq', type=int, default=10) + parser.add_argument('-accum_iter', type=int, default=1) + parser.add_argument('-num_workers', type=int, default=1) + parser.add_argument('-ngpus', type=int, default=-1) + + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + mixup_fn=None, + amp=False, + local_logger=None, + master_logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + train_loss_meter.avg: float, average loss 
on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + master_train_loss_meter.avg: float, average loss on all processes/gpus + master_train_acc_meter.avg: float, average top1 accuracy on all processes/gpus + train_time: float, training time + """ + model.train() + train_loss_meter = AverageMeter() + train_acc_meter = AverageMeter() + master_train_loss_meter = AverageMeter() + master_train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() + + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = paddle.to_tensor(image.shape[0]) + + # sync from other gpus for overall loss and acc + master_loss = loss.clone() + master_acc = acc.clone() + master_batch_size = batch_size.clone() + dist.all_reduce(master_loss) + dist.all_reduce(master_acc) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc = master_acc / dist.get_world_size() + master_train_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_train_acc_meter.update(master_acc.numpy()[0], master_batch_size.numpy()[0]) + + train_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + train_acc_meter.update(acc.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_train_loss_meter.avg:.4f}, " + + f"Avg Acc: {master_train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return (train_loss_meter.avg, + train_acc_meter.avg, + master_train_loss_meter.avg, + master_train_acc_meter.avg, + train_time) + + +def validate(dataloader, + model, + criterion, + total_batch, + debug_steps=100, + local_logger=None, + master_logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + total_epoch: int, total num of epoch, for logging + debug_steps: int, num of iters to log info, default: 100 + local_logger: logger for local process/gpu, default: None + master_logger: logger for main 
process, default: None + Returns: + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + master_val_loss_meter.avg: float, average loss on all processes/gpus + master_val_acc1_meter.avg: float, average top1 accuracy on all processes/gpus + master_val_acc5_meter.avg: float, average top5 accuracy on all processes/gpus + val_time: float, validation time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = AverageMeter() + master_val_loss_meter = AverageMeter() + master_val_acc1_meter = AverageMeter() + master_val_acc5_meter = AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = paddle.to_tensor(image.shape[0]) + + master_loss = loss.clone() + master_acc1 = acc1.clone() + master_acc5 = acc5.clone() + master_batch_size = batch_size.clone() + + dist.all_reduce(master_loss) + dist.all_reduce(master_acc1) + dist.all_reduce(master_acc5) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc1 = master_acc1 / dist.get_world_size() + master_acc5 = master_acc5 / dist.get_world_size() + + master_val_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc1_meter.update(master_acc1.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc5_meter.update(master_acc5.numpy()[0], master_batch_size.numpy()[0]) + + val_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + val_acc1_meter.update(acc1.numpy()[0], batch_size.numpy()[0]) + val_acc5_meter.update(acc5.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {master_val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {master_val_acc5_meter.avg:.4f}") + val_time = time.time() - time_st + return (val_loss_meter.avg, + val_acc1_meter.avg, + val_acc5_meter.avg, + master_val_loss_meter.avg, + master_val_acc1_meter.avg, + master_val_acc5_meter.avg, + val_time) + + +def main_worker(*args): + # STEP 0: Preparation + config = args[0] + dist.init_parallel_env() + last_epoch = config.TRAIN.LAST_EPOCH + world_size = dist.get_world_size() + local_rank = dist.get_rank() + seed = config.SEED + local_rank + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + # logger for each process/gpu + local_logger = get_logger( + filename=os.path.join(config.SAVE, 'log_{}.txt'.format(local_rank)), + logger_name='local_logger') + # overall logger + if local_rank == 0: + master_logger = get_logger( + filename=os.path.join(config.SAVE, 'log.txt'), + logger_name='master_logger') + master_logger.info(f'\n{config}') + else: + master_logger = None + local_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + if local_rank == 0: + 
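+ # only the rank-0 process writes to the shared master log (local logs are kept per GPU)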
master_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + + # STEP 1: Create model + model = build_model(config) + model = paddle.DataParallel(model) + + # STEP 2: Create train and val dataloader + dataset_train, dataset_val = args[1], args[2] + # Create training dataloader + if not config.EVAL: + dataloader_train = get_dataloader(config, dataset_train, 'train', True) + total_batch_train = len(dataloader_train) + local_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + if local_rank == 0: + master_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + # Create validation dataloader + dataloader_val = get_dataloader(config, dataset_val, 'test', True) + total_batch_val = len(dataloader_val) + local_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + if local_rank == 0: + master_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH, + ) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.TRAIN.LR_SCHEDULER.NAME == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler =
paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + local_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip, + apply_decay_param_fun=get_exclude_from_weight_decay_fn([ + 'absolute_pos_embed', 'relative_position_bias_table']), + ) + else: + local_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 6: Load pretrained model / load resumt model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + model.set_dict(model_state) + local_logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + if local_rank == 0: + master_logger.info( + f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') + optimizer.set_state_dict(opt_state) + local_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + if local_rank == 0: + master_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + + # STEP 7: Validation (eval mode) + if config.EVAL: + local_logger.info('----- Start Validating') + if local_rank == 0: + master_logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if 
local_rank == 0: + master_logger.info(f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 8: Start training and validation (train mode) + local_logger.info(f"Start training from epoch {last_epoch+1}.") + if local_rank == 0: + master_logger.info(f"Start training from epoch {last_epoch+1}.") + for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + local_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + if local_rank == 0: + master_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + train_loss, train_acc, avg_loss, avg_acc, train_time = train( + dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=total_batch_train, + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + mixup_fn=mixup_fn, + amp=config.AMP, + local_logger=local_logger, + master_logger=master_logger) + + scheduler.step() + + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {avg_loss:.4f}, " + + f"Train Acc: {avg_acc:.4f}, " + + f"time: {train_time:.2f}") + + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + local_logger.info(f'----- Validation after Epoch: {epoch}') + if local_rank == 0: + master_logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if local_rank == 0: + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + master_logger.info(f"----- Save model: {model_path}.pdparams") + master_logger.info(f"----- Save optim: {model_path}.pdopt") + + +def main(): + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + + # get dataset and start DDP + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + 
else: + dataset_train = None + dataset_val = get_dataset(config, mode='val') + config.NGPUS = len(paddle.static.cuda_places()) if config.NGPUS == -1 else config.NGPUS + dist.spawn(main_worker, args=(config, dataset_train, dataset_val, ), nprocs=config.NGPUS) + + +if __name__ == "__main__": + main() diff --git a/image_classification/Focal_Transformer/main_single_gpu.py b/image_classification/Focal_Transformer/main_single_gpu.py new file mode 100644 index 00000000..10d47f16 --- /dev/null +++ b/image_classification/Focal_Transformer/main_single_gpu.py @@ -0,0 +1,435 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Focal Transformer training/validation using single GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from datasets import get_dataloader +from datasets import get_dataset +from utils import AverageMeter +from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn +from config import get_config +from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from focal_transformer import build_focal as build_model + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('Focal Transformer') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=32) + parser.add_argument('-image_size', type=int, default=224) + parser.add_argument('-num_classes', type=int, default=1000) + parser.add_argument('-data_path', type=str, default=None) + + parser.add_argument('-output', type=str, default=None) + + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + + parser.add_argument('-save_freq', type=int, default=1) + parser.add_argument('-log_freq', type=int, default=100) + parser.add_argument('-validate_freq', type=int, default=10) + parser.add_argument('-accum_iter', type=int, default=1) + parser.add_argument('-num_workers', type=int, default=1) + parser.add_argument('-ngpus', type=int, default=-1) + + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # 
different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + mixup_fn=None, + amp=False, + logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + logger: logger for logging, default: None + Returns: + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + train_time: float, training time + """ + model.train() + train_loss_meter = AverageMeter() + train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() + + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = image.shape[0] + train_loss_meter.update(loss.numpy()[0], batch_size) + train_acc_meter.update(acc.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return train_loss_meter.avg, train_acc_meter.avg, train_time + + +def validate(dataloader, model, criterion, total_batch, debug_steps=100, logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + logger: logger for logging, default: None + Returns: + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, 
average top5 accuracy on current process/gpu + val_time: float, valitaion time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = image.shape[0] + val_loss_meter.update(loss.numpy()[0], batch_size) + val_acc1_meter.update(acc1.numpy()[0], batch_size) + val_acc5_meter.update(acc5.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + + val_time = time.time() - time_st + return val_loss_meter.avg, val_acc1_meter.avg, val_acc5_meter.avg, val_time + + +def main(): + # STEP 0: Preparation + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + last_epoch = config.TRAIN.LAST_EPOCH + seed = config.SEED + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + logger = get_logger(filename=os.path.join(config.SAVE, 'log.txt')) + logger.info(f'\n{config}') + + # STEP 1: Create model + model = build_model(config) + + # STEP 2: Create train and val dataloader + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + dataloader_train = get_dataloader(config, dataset_train, 'train', False) + dataset_val = get_dataset(config, mode='val') + dataloader_val = get_dataloader(config, dataset_val, 'val', False) + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + + if 
config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip, + apply_decay_param_fun=get_exclude_from_weight_decay_fn([ + 'absolute_pos_embed', 'relative_position_bias_table']), + ) + else: + logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 6: Load pretrained model or load resume model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + model.set_dict(model_state) + logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') + optimizer.set_state_dict(opt_state) + logger.info( + f"----- Resume: Load model and optmizer from 
{config.MODEL.RESUME}") + + # STEP 7: Validation (eval mode) + if config.EVAL: + logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 8: Start training and validation (train mode) + logger.info(f"Start training from epoch {last_epoch+1}.") + for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + train_loss, train_acc, train_time = train(dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=len(dataloader_train), + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + mixup_fn=mixup_fn, + amp=config.AMP, + logger=logger) + scheduler.step() + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + logger.info(f"----- Save model: {model_path}.pdparams") + logger.info(f"----- Save optim: {model_path}.pdopt") + + +if __name__ == "__main__": + main() diff --git a/image_classification/Focal_Transformer/mixup.py b/image_classification/Focal_Transformer/mixup.py new file mode 100644 index 00000000..7dea0867 --- /dev/null +++ b/image_classification/Focal_Transformer/mixup.py @@ -0,0 +1,221 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""mixup and cutmix for batch data""" +import numpy as np +import paddle + + +def rand_bbox(image_shape, lam, count=None): + """ CutMix bbox by lam value + Generate 1 random bbox by value lam. lam is the cut size rate. + The cut_size is computed by sqrt(1-lam) * image_size. 
+ Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + count: int, number of bbox to generate + """ + image_h, image_w = image_shape[-2:] + cut_rate = np.sqrt(1. - lam) + cut_h = int(cut_rate * image_h) + cut_w = int(cut_rate * image_w) + + # get random bbox center + cy = np.random.randint(0, image_h, size=count) + cx = np.random.randint(0, image_w, size=count) + + # get bbox coords + bbox_x1 = np.clip(cx - cut_w // 2, 0, image_w) + bbox_y1 = np.clip(cy - cut_h // 2, 0, image_h) + bbox_x2 = np.clip(cx + cut_w // 2, 0, image_w) + bbox_y2 = np.clip(cy + cut_h // 2, 0, image_h) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # while in pytorch, it will return [] tensor + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def rand_bbox_minmax(image_shape, minmax, count=None): + """ CutMix bbox by min and max value + Generate 1 random bbox by min and max percentage values. + Minmax is a tuple/list of min and max percentage vlaues + applied to the image width and height. + Args: + image_shape: tuple/list, image height and width + minmax: tuple/list, min and max percentage values of image size + count: int, number of bbox to generate + """ + assert len(minmax) == 2 + image_h, image_w = image_shape[-2:] + min_ratio = minmax[0] + max_ratio = minmax[1] + cut_h = np.random.randint(int(image_h * min_ratio), int(image_h * max_ratio), size=count) + cut_w = np.random.randint(int(image_w * min_ratio), int(image_w * max_ratio), size=count) + + bbox_x1 = np.random.randint(0, image_w - cut_w, size=count) + bbox_y1 = np.random.randint(0, image_h - cut_h, size=count) + bbox_x2 = bbox_x1 + cut_w + bbox_y2 = bbox_y1 + cut_h + + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def cutmix_generate_bbox_adjust_lam(image_shape, lam, minmax=None, correct_lam=True, count=None): + """Generate bbox and apply correction for lambda + If the mimmax is None, apply the standard cutmix by lam value, + If the minmax is set, apply the cutmix by min and max percentage values. + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + minmax: tuple/list, min and max percentage values of image size + correct_lam: bool, if True, correct the lam value by the generated bbox + count: int, number of bbox to generate + """ + if minmax is not None: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox_minmax(image_shape, minmax, count) + else: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox(image_shape, lam, count) + + if correct_lam or minmax is not None: + image_h, image_w = image_shape[-2:] + bbox_area = (bbox_y2 - bbox_y1) * (bbox_x2 - bbox_x1) + lam = 1. - bbox_area / float(image_h * image_w) + return (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam + + +def one_hot(x, num_classes, on_value=1., off_value=0.): + """ Generate one-hot vector for label smoothing + Args: + x: tensor, contains label/class indices + num_classes: int, num of classes (len of the one-hot vector) + on_value: float, the vector value at label index, default=1. + off_value: float, the vector value at non-label indices, default=0. + Returns: + one_hot: tensor, tensor with on value at label index and off value + at non-label indices. 
+ """ + x = x.reshape_([-1, 1]) + x_smoothed = paddle.full((x.shape[0], num_classes), fill_value=off_value) + for i in range(x.shape[0]): + x_smoothed[i, x[i]] = on_value + return x_smoothed + + +def mixup_one_hot(label, num_classes, lam=1., smoothing=0.): + """ mixup and label smoothing in batch + label smoothing is firstly applied, then + mixup is applied by mixing the bacth and its flip, + with a mixup rate. + Args: + label: tensor, label tensor with shape [N], contains the class indices + num_classes: int, num of all classes + lam: float, mixup rate, default=1.0 + smoothing: float, label smoothing rate + """ + off_value = smoothing / num_classes + on_value = 1. - smoothing + off_value + y1 = one_hot(label, num_classes, on_value, off_value) + y2 = one_hot(label.flip(axis=[0]), num_classes, on_value, off_value) + return y2 * (1 - lam) + y1 * lam + + +class Mixup: + """Mixup class + Args: + mixup_alpha: float, mixup alpha for beta distribution, default=1.0, + cutmix_alpha: float, cutmix alpha for beta distribution, default=0.0, + cutmix_minmax: list/tuple, min and max value for cutmix ratio, default=None, + prob: float, if random prob < prob, do not use mixup, default=1.0, + switch_prob: float, prob of switching mixup and cutmix, default=0.5, + mode: string, mixup up, now only 'batch' is supported, default='batch', + correct_lam: bool, if True, apply correction of lam, default=True, + label_smoothing: float, label smoothing rate, default=0.1, + num_classes: int, num of classes, default=1000 + """ + def __init__(self, + mixup_alpha=1.0, + cutmix_alpha=0.0, + cutmix_minmax=None, + prob=1.0, + switch_prob=0.5, + mode='batch', + correct_lam=True, + label_smoothing=0.1, + num_classes=1000): + self.mixup_alpha = mixup_alpha + self.cutmix_alpha = cutmix_alpha + self.cutmix_minmax = cutmix_minmax + if cutmix_minmax is not None: + assert len(cutmix_minmax) == 2 + self.cutmix_alpha = 1.0 + self.mix_prob = prob + self.switch_prob = switch_prob + self.label_smoothing = label_smoothing + self.num_classes = num_classes + self.mode = mode + self.correct_lam = correct_lam + assert mode == 'batch', 'Now only batch mode is supported!' + + def __call__(self, x, target): + assert x.shape[0] % 2 == 0, "Batch size should be even" + lam = self._mix_batch(x) + target = mixup_one_hot(target, self.num_classes, lam, self.label_smoothing) + return x, target + + def get_params(self): + """Decide to use cutmix or regular mixup by sampling and + sample lambda for mixup + """ + lam = 1. + use_cutmix = False + use_mixup = np.random.rand() < self.mix_prob + if use_mixup: + if self.mixup_alpha > 0. and self.cutmix_alpha > 0.: + use_cutmix = np.random.rand() < self.switch_prob + alpha = self.cutmix_alpha if use_cutmix else self.mixup_alpha + lam_mix = np.random.beta(alpha, alpha) + elif self.mixup_alpha == 0. and self.cutmix_alpha > 0.: + use_cutmix=True + lam_mix = np.random.beta(self.cutmix_alpha, self.cutmix_alpha) + elif self.mixup_alpha > 0. 
and self.cutmix_alpha == 0.: + lam_mix = np.random.beta(self.mixup_alpha, self.mixup_alpha) + else: + raise ValueError('mixup_alpha and cutmix_alpha cannot be all 0') + lam = float(lam_mix) + return lam, use_cutmix + + def _mix_batch(self, x): + """mixup/cutmix by adding batch data and its flipped version""" + lam, use_cutmix = self.get_params() + if lam == 1.: + return lam + if use_cutmix: + (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam = cutmix_generate_bbox_adjust_lam( + x.shape, + lam, + minmax=self.cutmix_minmax, + correct_lam=self.correct_lam) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # but in pytorch, it will return [] tensor without errors + if int(bbox_x1) != int(bbox_x2) and int(bbox_y1) != int(bbox_y2): + x[:, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] = x.flip(axis=[0])[ + :, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] + else: + x_flipped = x.flip(axis=[0]) + x_flipped = x_flipped * (1 - lam) + x.set_value(x * (lam) + x_flipped) + return lam \ No newline at end of file diff --git a/image_classification/Focal_Transformer/model.png b/image_classification/Focal_Transformer/model.png new file mode 100644 index 00000000..44071cef Binary files /dev/null and b/image_classification/Focal_Transformer/model.png differ diff --git a/image_classification/Focal_Transformer/port_weights/__init__.py b/image_classification/Focal_Transformer/port_weights/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/image_classification/Focal_Transformer/port_weights/load_pytorch_weights_base.py b/image_classification/Focal_Transformer/port_weights/load_pytorch_weights_base.py new file mode 100644 index 00000000..30c33fdc --- /dev/null +++ b/image_classification/Focal_Transformer/port_weights/load_pytorch_weights_base.py @@ -0,0 +1,143 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
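Before moving on to the weight-porting scripts, here is a minimal, hypothetical usage of the `Mixup` class defined in `mixup.py` above on a synthetic batch; the alpha/probability values are illustrative examples, not the training defaults from the configs.

```python
# Illustration only: apply Mixup/CutMix to a synthetic batch (values are examples).
import paddle
from mixup import Mixup

mixup_fn = Mixup(mixup_alpha=0.8,
                 cutmix_alpha=1.0,
                 prob=1.0,
                 switch_prob=0.5,
                 label_smoothing=0.1,
                 num_classes=1000)

images = paddle.randn([8, 3, 224, 224])        # batch size must be even
labels = paddle.randint(0, 1000, shape=[8])
images, soft_labels = mixup_fn(images, labels)
print(soft_labels.shape)                       # [8, 1000]: smoothed, mixed one-hot targets
```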
+import os +import argparse +import paddle +import torch +import timm +import numpy as np +from classification.build import build_model +from focal_transformer import build_focal +from config import * + +config = get_config('./configs/focal_base_patch4_window7_224.yaml') +print(config) + +def print_model_named_params(model): + print('----------------------------------') + for name, param in model.named_parameters(): + print(name, param.shape) + print('----------------------------------') + + +def print_model_named_buffers(model): + print('----------------------------------') + for name, param in model.named_buffers(): + print(name, param.shape) + print('----------------------------------') + + +def perpare_mapping(paddle_model,torch_model): + mapping=[] + for (name, param),(name2, param2) in zip(paddle_model.named_parameters(),torch_model.named_parameters()): + layer_mapping = [ + (name2, name) + ] + mapping.extend(layer_mapping) + return mapping + +def convert(torch_model, paddle_model): + def _set_value(th_name, pd_name, transpose=True): + th_shape = th_params[th_name].shape + pd_shape = tuple(pd_params[pd_name].shape) # paddle shape default type is list + # assert th_shape == pd_shape, f'{th_shape} != {pd_shape}' + print(f'**SET** {th_name} {th_shape} **TO** {pd_name} {pd_shape}') + if isinstance(th_params[th_name], torch.nn.parameter.Parameter): + value = th_params[th_name].data.numpy() + else: + value = th_params[th_name].numpy() + + if len(value.shape) == 2 and transpose: + value = value.transpose((1, 0)) + pd_params[pd_name].set_value(value) + + # 1. get paddle and torch model parameters + pd_params = {} + th_params = {} + for name, param in paddle_model.named_parameters(): + pd_params[name] = param + for name, param in torch_model.named_parameters(): + th_params[name] = param + + for name, param in paddle_model.named_buffers(): + pd_params[name] = param + for name, param in torch_model.named_buffers(): + th_params[name] = param + + # 2. get name mapping pairs + mapping = perpare_mapping(paddle_model,torch_model) + + # 3. 
set torch param values to paddle params: may needs transpose on weights + for th_name, pd_name in mapping: + if th_name in th_params.keys(): # nn.Parameters + _set_value(th_name, pd_name) + else: # weight & bias + th_name_w = f'{th_name}.weight' + pd_name_w = f'{pd_name}.weight' + _set_value(th_name_w, pd_name_w) + + if f'{th_name}.bias' in th_params.keys(): + th_name_b = f'{th_name}.bias' + pd_name_b = f'{pd_name}.bias' + _set_value(th_name_b, pd_name_b) + + return paddle_model + + +def main(): + paddle.set_device('cpu') + paddle_model = build_focal(config) # create model + paddle_model.eval() + print_model_named_params(paddle_model) + print_model_named_buffers(paddle_model) + + print('+++++++++++++++++++++++++++++++++++') + device = torch.device('cpu') + torch_model = build_model(config) # create model + torch_model = torch_model.to(device) + torch_model.eval() + print_model_named_params(torch_model) + print_model_named_buffers(torch_model) + + # convert weights + paddle_model = convert(torch_model, paddle_model) + + print("model convert done...") + + # check correctness + x = np.random.randn(2, 3, 224, 224).astype('float32') + x_paddle = paddle.to_tensor(x) + x_torch = torch.Tensor(x).to(device) + + out_torch = torch_model(x_torch) + print('torch infer done...') + out_paddle = paddle_model(x_paddle) + + out_torch = out_torch.data.cpu().numpy() + out_paddle = out_paddle.cpu().numpy() + + print(out_torch.shape, out_paddle.shape) + print(out_torch[0, 0:100]) + print('========================================================') + print(out_paddle[0, 0:100]) + assert np.allclose(out_torch, out_paddle, atol=1e-4) + + # save weights for paddle model + model_path = os.path.join('./focal_base_patch4_window7_224.pdparams') # save params + paddle.save(paddle_model.state_dict(), model_path) + print('all done!') + + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/image_classification/Focal_Transformer/port_weights/load_pytorch_weights_small.py b/image_classification/Focal_Transformer/port_weights/load_pytorch_weights_small.py new file mode 100644 index 00000000..f316e141 --- /dev/null +++ b/image_classification/Focal_Transformer/port_weights/load_pytorch_weights_small.py @@ -0,0 +1,144 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
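A quick aside on the transpose in `_set_value` above: PyTorch's `nn.Linear` stores its weight as `(out_features, in_features)`, while Paddle's `nn.Linear` uses `(in_features, out_features)`, so 2-D weights have to be transposed when copied across frameworks. A self-contained sanity check of that rule (illustration only, not part of the port script):

```python
# Illustration: copying a torch Linear weight into a paddle Linear requires a transpose.
import numpy as np
import paddle
import torch

th_fc = torch.nn.Linear(4, 3)   # weight shape: (3, 4) = (out, in)
pd_fc = paddle.nn.Linear(4, 3)  # weight shape: (4, 3) = (in, out)

pd_fc.weight.set_value(th_fc.weight.detach().numpy().transpose((1, 0)))
pd_fc.bias.set_value(th_fc.bias.detach().numpy())

x = np.random.randn(2, 4).astype('float32')
out_th = th_fc(torch.from_numpy(x)).detach().numpy()
out_pd = pd_fc(paddle.to_tensor(x)).numpy()
assert np.allclose(out_th, out_pd, atol=1e-5)   # same outputs once transposed
```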
+import os +import argparse +import paddle +import torch +import timm +import numpy as np +from classification.build import build_model +from focal_transformer import build_focal +from config import * + +config = get_config('./configs/focal_small_patch4_window7_224.yaml') +print(config) + + +def print_model_named_params(model): + print('----------------------------------') + for name, param in model.named_parameters(): + print(name, param.shape) + print('----------------------------------') + + +def print_model_named_buffers(model): + print('----------------------------------') + for name, param in model.named_buffers(): + print(name, param.shape) + print('----------------------------------') + + +def perpare_mapping(paddle_model,torch_model): + mapping=[] + for (name, param),(name2, param2) in zip(paddle_model.named_parameters(),torch_model.named_parameters()): + layer_mapping = [ + (name2, name) + ] + mapping.extend(layer_mapping) + return mapping + +def convert(torch_model, paddle_model): + def _set_value(th_name, pd_name, transpose=True): + th_shape = th_params[th_name].shape + pd_shape = tuple(pd_params[pd_name].shape) # paddle shape default type is list + # assert th_shape == pd_shape, f'{th_shape} != {pd_shape}' + print(f'**SET** {th_name} {th_shape} **TO** {pd_name} {pd_shape}') + if isinstance(th_params[th_name], torch.nn.parameter.Parameter): + value = th_params[th_name].data.numpy() + else: + value = th_params[th_name].numpy() + + if len(value.shape) == 2 and transpose: + value = value.transpose((1, 0)) + pd_params[pd_name].set_value(value) + + # 1. get paddle and torch model parameters + pd_params = {} + th_params = {} + for name, param in paddle_model.named_parameters(): + pd_params[name] = param + for name, param in torch_model.named_parameters(): + th_params[name] = param + + for name, param in paddle_model.named_buffers(): + pd_params[name] = param + for name, param in torch_model.named_buffers(): + th_params[name] = param + + # 2. get name mapping pairs + mapping = perpare_mapping(paddle_model,torch_model) + + # 3. 
set torch param values to paddle params: may needs transpose on weights + for th_name, pd_name in mapping: + if th_name in th_params.keys(): # nn.Parameters + _set_value(th_name, pd_name) + else: # weight & bias + th_name_w = f'{th_name}.weight' + pd_name_w = f'{pd_name}.weight' + _set_value(th_name_w, pd_name_w) + + if f'{th_name}.bias' in th_params.keys(): + th_name_b = f'{th_name}.bias' + pd_name_b = f'{pd_name}.bias' + _set_value(th_name_b, pd_name_b) + + return paddle_model + + +def main(): + paddle.set_device('cpu') + paddle_model = build_focal(config) # create model + paddle_model.eval() + print_model_named_params(paddle_model) + print_model_named_buffers(paddle_model) + + print('+++++++++++++++++++++++++++++++++++') + device = torch.device('cpu') + torch_model = build_model(config) # create model + torch_model = torch_model.to(device) + torch_model.eval() + print_model_named_params(torch_model) + print_model_named_buffers(torch_model) + + # convert weights + paddle_model = convert(torch_model, paddle_model) + + print("model convert done...") + + # check correctness + x = np.random.randn(2, 3, 224, 224).astype('float32') + x_paddle = paddle.to_tensor(x) + x_torch = torch.Tensor(x).to(device) + + out_torch = torch_model(x_torch) + print('torch infer done...') + out_paddle = paddle_model(x_paddle) + + out_torch = out_torch.data.cpu().numpy() + out_paddle = out_paddle.cpu().numpy() + + print(out_torch.shape, out_paddle.shape) + print(out_torch[0, 0:100]) + print('========================================================') + print(out_paddle[0, 0:100]) + assert np.allclose(out_torch, out_paddle, atol=1e-4) + + # save weights for paddle model + model_path = os.path.join('./focal_small_patch4_window7_224.pdparams') # save params + paddle.save(paddle_model.state_dict(), model_path) + print('all done!') + + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/image_classification/Focal_Transformer/port_weights/load_pytorch_weights_tiny.py b/image_classification/Focal_Transformer/port_weights/load_pytorch_weights_tiny.py new file mode 100644 index 00000000..61e39c2b --- /dev/null +++ b/image_classification/Focal_Transformer/port_weights/load_pytorch_weights_tiny.py @@ -0,0 +1,145 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
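Note that `perpare_mapping` pairs parameters purely by iteration order, so the Paddle and PyTorch models must define their layers in the same order for the mapping to be valid. A toy illustration of that assumption (hypothetical two-layer models, not the Focal Transformer):

```python
# Illustration: order-based name mapping between two models with matching layer order.
import paddle
import torch

pd_model = paddle.nn.Sequential(paddle.nn.Linear(4, 8), paddle.nn.Linear(8, 2))
th_model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.Linear(8, 2))

mapping = [(th_name, pd_name)
           for (pd_name, _), (th_name, _) in zip(pd_model.named_parameters(),
                                                 th_model.named_parameters())]
print(mapping)  # [('0.weight', '0.weight'), ('0.bias', '0.bias'), ('1.weight', '1.weight'), ...]
```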
+import os +import argparse +import paddle +import torch +import timm +import numpy as np +from classification.build import build_model +from focal_transformer import build_focal +from config import * + + +config = get_config('./configs/focal_tiny_patch4_window7_224.yaml') +print(config) + + +def print_model_named_params(model): + print('----------------------------------') + for name, param in model.named_parameters(): + print(name, param.shape) + print('----------------------------------') + + +def print_model_named_buffers(model): + print('----------------------------------') + for name, param in model.named_buffers(): + print(name, param.shape) + print('----------------------------------') + + +def perpare_mapping(paddle_model,torch_model): + mapping=[] + for (name, param),(name2, param2) in zip(paddle_model.named_parameters(),torch_model.named_parameters()): + layer_mapping = [ + (name2, name) + ] + mapping.extend(layer_mapping) + return mapping + +def convert(torch_model, paddle_model): + def _set_value(th_name, pd_name, transpose=True): + th_shape = th_params[th_name].shape + pd_shape = tuple(pd_params[pd_name].shape) # paddle shape default type is list + # assert th_shape == pd_shape, f'{th_shape} != {pd_shape}' + print(f'**SET** {th_name} {th_shape} **TO** {pd_name} {pd_shape}') + if isinstance(th_params[th_name], torch.nn.parameter.Parameter): + value = th_params[th_name].data.numpy() + else: + value = th_params[th_name].numpy() + + if len(value.shape) == 2 and transpose: + value = value.transpose((1, 0)) + pd_params[pd_name].set_value(value) + + # 1. get paddle and torch model parameters + pd_params = {} + th_params = {} + for name, param in paddle_model.named_parameters(): + pd_params[name] = param + for name, param in torch_model.named_parameters(): + th_params[name] = param + + for name, param in paddle_model.named_buffers(): + pd_params[name] = param + for name, param in torch_model.named_buffers(): + th_params[name] = param + + # 2. get name mapping pairs + mapping = perpare_mapping(paddle_model,torch_model) + + # 3. 
set torch param values to paddle params: may needs transpose on weights + for th_name, pd_name in mapping: + if th_name in th_params.keys(): # nn.Parameters + _set_value(th_name, pd_name) + else: # weight & bias + th_name_w = f'{th_name}.weight' + pd_name_w = f'{pd_name}.weight' + _set_value(th_name_w, pd_name_w) + + if f'{th_name}.bias' in th_params.keys(): + th_name_b = f'{th_name}.bias' + pd_name_b = f'{pd_name}.bias' + _set_value(th_name_b, pd_name_b) + + return paddle_model + + +def main(): + paddle.set_device('cpu') + paddle_model = build_focal(config) # create model + paddle_model.eval() + print_model_named_params(paddle_model) + print_model_named_buffers(paddle_model) + + print('+++++++++++++++++++++++++++++++++++') + device = torch.device('cpu') + torch_model = build_model(config) # create model + torch_model = torch_model.to(device) + torch_model.eval() + print_model_named_params(torch_model) + print_model_named_buffers(torch_model) + + # convert weights + paddle_model = convert(torch_model, paddle_model) + + print("model convert done...") + + # check correctness + x = np.random.randn(2, 3, 224, 224).astype('float32') + x_paddle = paddle.to_tensor(x) + x_torch = torch.Tensor(x).to(device) + + out_torch = torch_model(x_torch) + print('torch infer done...') + out_paddle = paddle_model(x_paddle) + + out_torch = out_torch.data.cpu().numpy() + out_paddle = out_paddle.cpu().numpy() + + print(out_torch.shape, out_paddle.shape) + print(out_torch[0, 0:100]) + print('========================================================') + print(out_paddle[0, 0:100]) + assert np.allclose(out_torch, out_paddle, atol=1e-4) + + # save weights for paddle model + model_path = os.path.join('./focal_tiny_patch4_window7_224.pdparams') # save params + paddle.save(paddle_model.state_dict(), model_path) + print('all done!') + + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/image_classification/Focal_Transformer/port_weights/load_pytorch_weights_useconv_base.py b/image_classification/Focal_Transformer/port_weights/load_pytorch_weights_useconv_base.py new file mode 100644 index 00000000..8a2ca921 --- /dev/null +++ b/image_classification/Focal_Transformer/port_weights/load_pytorch_weights_useconv_base.py @@ -0,0 +1,150 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import os +import argparse +import paddle +import torch +import timm +import numpy as np +from classification.build import build_model +from focal_transformer import build_focal +from config import * + + +config = get_config('./configs/focal_base_useconv_patch4_window7_224.yaml') +print(config) + + +def print_model_named_params(model): + print('----------------------------------') + for name, param in model.named_parameters(): + print(name, param.shape) + print('----------------------------------') + + +def print_model_named_buffers(model): + print('----------------------------------') + for name, param in model.named_buffers(): + print(name, param.shape) + print('----------------------------------') + + +def perpare_mapping(paddle_model,torch_model): + mapping=[] + for (name, param),(name2, param2) in zip(paddle_model.named_parameters(),torch_model.named_parameters()): + layer_mapping = [ + (name2, name) + ] + mapping.extend(layer_mapping) + return mapping + +def convert(torch_model, paddle_model): + def _set_value(th_name, pd_name, transpose=True): + th_shape = th_params[th_name].shape + pd_shape = tuple(pd_params[pd_name].shape) # paddle shape default type is list + # assert th_shape == pd_shape, f'{th_shape} != {pd_shape}' + print(f'**SET** {th_name} {th_shape} **TO** {pd_name} {pd_shape}') + if isinstance(th_params[th_name], torch.nn.parameter.Parameter): + value = th_params[th_name].data.numpy() + else: + value = th_params[th_name].numpy() + + if len(value.shape) == 2 and transpose: + value = value.transpose((1, 0)) + pd_params[pd_name].set_value(value) + + # 1. get paddle and torch model parameters + pd_params = {} + th_params = {} + for name, param in paddle_model.named_parameters(): + pd_params[name] = param + for name, param in torch_model.named_parameters(): + th_params[name] = param + + for name, param in paddle_model.named_buffers(): + pd_params[name] = param + for name, param in torch_model.named_buffers(): + th_params[name] = param + + # 2. get name mapping pairs + mapping = perpare_mapping(paddle_model,torch_model) + + # 3. 
set torch param values to paddle params: may needs transpose on weights + for th_name, pd_name in mapping: + if th_name in th_params.keys(): # nn.Parameters + if 'relative_position_bias_' in th_name: + _set_value(th_name, pd_name, False) + else: + _set_value(th_name, pd_name) + else: # weight & bias + th_name_w = f'{th_name}.weight' + pd_name_w = f'{pd_name}.weight' + _set_value(th_name_w, pd_name_w) + + if f'{th_name}.bias' in th_params.keys(): + th_name_b = f'{th_name}.bias' + pd_name_b = f'{pd_name}.bias' + _set_value(th_name_b, pd_name_b) + + return paddle_model + + +def main(): + paddle.set_device('cpu') + paddle_model = build_focal(config) # create model + paddle_model.eval() + print_model_named_params(paddle_model) + print_model_named_buffers(paddle_model) + + print('+++++++++++++++++++++++++++++++++++') + device = torch.device('cpu') + torch_model = build_model(config) # create model + torch_model = torch_model.to(device) + model_state = torch.load('./focal_models/focal-base-useconv-is224-ws7.pth') + torch_model.load_state_dict(model_state['model']) + torch_model.eval() + print_model_named_params(torch_model) + print_model_named_buffers(torch_model) + + # convert weights + paddle_model = convert(torch_model, paddle_model) + + print("model convert done...") + + # check correctness + x = np.random.randn(2, 3, 224, 224).astype('float32') + x_paddle = paddle.to_tensor(x) + x_torch = torch.Tensor(x).to(device) + + out_torch = torch_model(x_torch) + print('torch infer done...') + out_paddle = paddle_model(x_paddle) + + out_torch = out_torch.data.cpu().numpy() + out_paddle = out_paddle.cpu().numpy() + + print(out_torch.shape, out_paddle.shape) + print(out_torch[0, 0:100]) + print('========================================================') + print(out_paddle[0, 0:100]) + assert np.allclose(out_torch, out_paddle, atol=1e-4) + + # save weights for paddle model + model_path = os.path.join('./focal_base_useconv_patch4_window7_224.pdparams') # save params + paddle.save(paddle_model.state_dict(), model_path) + print('all done!') + + +if __name__ == "__main__": + main() diff --git a/image_classification/Focal_Transformer/port_weights/load_pytorch_weights_useconv_small.py b/image_classification/Focal_Transformer/port_weights/load_pytorch_weights_useconv_small.py new file mode 100644 index 00000000..7b630966 --- /dev/null +++ b/image_classification/Focal_Transformer/port_weights/load_pytorch_weights_useconv_small.py @@ -0,0 +1,145 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
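Unlike the other port scripts, the `useconv_base` variant above loads a released `.pth` checkpoint before converting, and it copies any tensor whose name contains `relative_position_bias_` without transposing, since those bias tables are not Linear weights and share the same layout in both frameworks. A small sketch of that naming rule (shapes are made up for the example):

```python
# Illustration: decide whether a 2-D tensor should be transposed during conversion.
import numpy as np

def needs_transpose(name, value):
    # Mirrors the rule used above: transpose 2-D weights, but leave
    # relative position bias tables in their original layout.
    return value.ndim == 2 and 'relative_position_bias_' not in name

bias_table = np.zeros((169, 4))   # hypothetical (num_relative_positions, num_heads)
fc_weight = np.zeros((96, 384))   # hypothetical torch Linear weight (out, in)
assert not needs_transpose('layers.0.blocks.0.attn.relative_position_bias_table', bias_table)
assert needs_transpose('layers.0.blocks.0.mlp.fc1.weight', fc_weight)
```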
+import os +import argparse +import paddle +import torch +import timm +import numpy as np +from classification.build import build_model +from focal_transformer import build_focal +from config import * + + +config = get_config('./configs/focal_small_useconv_patch4_window7_224.yaml') +print(config) + + +def print_model_named_params(model): + print('----------------------------------') + for name, param in model.named_parameters(): + print(name, param.shape) + print('----------------------------------') + + +def print_model_named_buffers(model): + print('----------------------------------') + for name, param in model.named_buffers(): + print(name, param.shape) + print('----------------------------------') + + +def perpare_mapping(paddle_model,torch_model): + mapping=[] + for (name, param),(name2, param2) in zip(paddle_model.named_parameters(),torch_model.named_parameters()): + layer_mapping = [ + (name2, name) + ] + mapping.extend(layer_mapping) + return mapping + +def convert(torch_model, paddle_model): + def _set_value(th_name, pd_name, transpose=True): + th_shape = th_params[th_name].shape + pd_shape = tuple(pd_params[pd_name].shape) # paddle shape default type is list + # assert th_shape == pd_shape, f'{th_shape} != {pd_shape}' + print(f'**SET** {th_name} {th_shape} **TO** {pd_name} {pd_shape}') + if isinstance(th_params[th_name], torch.nn.parameter.Parameter): + value = th_params[th_name].data.numpy() + else: + value = th_params[th_name].numpy() + + if len(value.shape) == 2 and transpose: + value = value.transpose((1, 0)) + pd_params[pd_name].set_value(value) + + # 1. get paddle and torch model parameters + pd_params = {} + th_params = {} + for name, param in paddle_model.named_parameters(): + pd_params[name] = param + for name, param in torch_model.named_parameters(): + th_params[name] = param + + for name, param in paddle_model.named_buffers(): + pd_params[name] = param + for name, param in torch_model.named_buffers(): + th_params[name] = param + + # 2. get name mapping pairs + mapping = perpare_mapping(paddle_model,torch_model) + + # 3. 
set torch param values to paddle params: may needs transpose on weights + for th_name, pd_name in mapping: + if th_name in th_params.keys(): # nn.Parameters + _set_value(th_name, pd_name) + else: # weight & bias + th_name_w = f'{th_name}.weight' + pd_name_w = f'{pd_name}.weight' + _set_value(th_name_w, pd_name_w) + + if f'{th_name}.bias' in th_params.keys(): + th_name_b = f'{th_name}.bias' + pd_name_b = f'{pd_name}.bias' + _set_value(th_name_b, pd_name_b) + + return paddle_model + + +def main(): + paddle.set_device('cpu') + paddle_model = build_focal(config) # create model + paddle_model.eval() + print_model_named_params(paddle_model) + print_model_named_buffers(paddle_model) + + print('+++++++++++++++++++++++++++++++++++') + device = torch.device('cpu') + torch_model = build_model(config) # create model + torch_model = torch_model.to(device) + torch_model.eval() + print_model_named_params(torch_model) + print_model_named_buffers(torch_model) + + # convert weights + paddle_model = convert(torch_model, paddle_model) + + print("model convert done...") + + # check correctness + x = np.random.randn(2, 3, 224, 224).astype('float32') + x_paddle = paddle.to_tensor(x) + x_torch = torch.Tensor(x).to(device) + + out_torch = torch_model(x_torch) + print('torch infer done...') + out_paddle = paddle_model(x_paddle) + + out_torch = out_torch.data.cpu().numpy() + out_paddle = out_paddle.cpu().numpy() + + print(out_torch.shape, out_paddle.shape) + print(out_torch[0, 0:100]) + print('========================================================') + print(out_paddle[0, 0:100]) + assert np.allclose(out_torch, out_paddle, atol=1e-4) + + # save weights for paddle model + model_path = os.path.join('./focal_small_useconv_patch4_window7_224.pdparams') # save params + paddle.save(paddle_model.state_dict(), model_path) + print('all done!') + + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/image_classification/Focal_Transformer/port_weights/load_pytorch_weights_useconv_tiny.py b/image_classification/Focal_Transformer/port_weights/load_pytorch_weights_useconv_tiny.py new file mode 100644 index 00000000..3bd886e0 --- /dev/null +++ b/image_classification/Focal_Transformer/port_weights/load_pytorch_weights_useconv_tiny.py @@ -0,0 +1,145 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import os +import argparse +import paddle +import torch +import timm +import numpy as np +from classification.build import build_model +from focal_transformer import build_focal +from config import * + + +config = get_config('./configs/focal_tiny_useconv_patch4_window7_224.yaml') +print(config) + + +def print_model_named_params(model): + print('----------------------------------') + for name, param in model.named_parameters(): + print(name, param.shape) + print('----------------------------------') + + +def print_model_named_buffers(model): + print('----------------------------------') + for name, param in model.named_buffers(): + print(name, param.shape) + print('----------------------------------') + + +def perpare_mapping(paddle_model,torch_model): + mapping=[] + for (name, param),(name2, param2) in zip(paddle_model.named_parameters(),torch_model.named_parameters()): + layer_mapping = [ + (name2, name) + ] + mapping.extend(layer_mapping) + return mapping + +def convert(torch_model, paddle_model): + def _set_value(th_name, pd_name, transpose=True): + th_shape = th_params[th_name].shape + pd_shape = tuple(pd_params[pd_name].shape) # paddle shape default type is list + # assert th_shape == pd_shape, f'{th_shape} != {pd_shape}' + print(f'**SET** {th_name} {th_shape} **TO** {pd_name} {pd_shape}') + if isinstance(th_params[th_name], torch.nn.parameter.Parameter): + value = th_params[th_name].data.numpy() + else: + value = th_params[th_name].numpy() + + if len(value.shape) == 2 and transpose: + value = value.transpose((1, 0)) + pd_params[pd_name].set_value(value) + + # 1. get paddle and torch model parameters + pd_params = {} + th_params = {} + for name, param in paddle_model.named_parameters(): + pd_params[name] = param + for name, param in torch_model.named_parameters(): + th_params[name] = param + + for name, param in paddle_model.named_buffers(): + pd_params[name] = param + for name, param in torch_model.named_buffers(): + th_params[name] = param + + # 2. get name mapping pairs + mapping = perpare_mapping(paddle_model,torch_model) + + # 3. 
set torch param values to paddle params: may needs transpose on weights + for th_name, pd_name in mapping: + if th_name in th_params.keys(): # nn.Parameters + _set_value(th_name, pd_name) + else: # weight & bias + th_name_w = f'{th_name}.weight' + pd_name_w = f'{pd_name}.weight' + _set_value(th_name_w, pd_name_w) + + if f'{th_name}.bias' in th_params.keys(): + th_name_b = f'{th_name}.bias' + pd_name_b = f'{pd_name}.bias' + _set_value(th_name_b, pd_name_b) + + return paddle_model + + +def main(): + paddle.set_device('cpu') + paddle_model = build_focal(config) # create model + paddle_model.eval() + print_model_named_params(paddle_model) + print_model_named_buffers(paddle_model) + + print('+++++++++++++++++++++++++++++++++++') + device = torch.device('cpu') + torch_model = build_model(config) # create model + torch_model = torch_model.to(device) + torch_model.eval() + print_model_named_params(torch_model) + print_model_named_buffers(torch_model) + + # convert weights + paddle_model = convert(torch_model, paddle_model) + + print("model convert done...") + + # check correctness + x = np.random.randn(2, 3, 224, 224).astype('float32') + x_paddle = paddle.to_tensor(x) + x_torch = torch.Tensor(x).to(device) + + out_torch = torch_model(x_torch) + print('torch infer done...') + out_paddle = paddle_model(x_paddle) + + out_torch = out_torch.data.cpu().numpy() + out_paddle = out_paddle.cpu().numpy() + + print(out_torch.shape, out_paddle.shape) + print(out_torch[0, 0:100]) + print('========================================================') + print(out_paddle[0, 0:100]) + assert np.allclose(out_torch, out_paddle, atol=1e-4) + + # save weights for paddle model + model_path = os.path.join('./focal_tiny_useconv_patch4_window7_224.pdparams') # save params + paddle.save(paddle_model.state_dict(), model_path) + print('all done!') + + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/image_classification/Focal_Transformer/random_erasing.py b/image_classification/Focal_Transformer/random_erasing.py new file mode 100644 index 00000000..f24c2ed4 --- /dev/null +++ b/image_classification/Focal_Transformer/random_erasing.py @@ -0,0 +1,94 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
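+# Usage note (illustrative): RandomErasing is designed to be appended to the
+# train transform list after ToTensor/Normalize, e.g.
+#   transform_list.append(RandomErasing(prob=0.25, mode='pixel', max_count=1))
+# The values shown here are only an example; the real ones come from the
+# TRAIN.RANDOM_ERASE_* options in the model config.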
+ +"""Random Erasing for image tensor""" + +import random +import math +import paddle + + +def _get_pixels(per_pixel, rand_color, patch_size, dtype="float32"): + if per_pixel: + return paddle.normal(shape=patch_size).astype(dtype) + if rand_color: + return paddle.normal(shape=(patch_size[0], 1, 1)).astype(dtype) + return paddle.zeros((patch_size[0], 1, 1)).astype(dtype) + + +class RandomErasing(object): + """ + Args: + prob: probability of performing random erasing + min_area: Minimum percentage of erased area wrt input image area + max_area: Maximum percentage of erased area wrt input image area + min_aspect: Minimum aspect ratio of earsed area + max_aspect: Maximum aspect ratio of earsed area + mode: pixel color mode, in ['const', 'rand', 'pixel'] + 'const' - erase block is constant valued 0 for all channels + 'rand' - erase block is valued random color (same per-channel) + 'pixel' - erase block is vauled random color per pixel + min_count: Minimum # of ereasing blocks per image. + max_count: Maximum # of ereasing blocks per image. Area per box is scaled by count + per-image count is randomly chosen between min_count to max_count + """ + def __init__(self, prob=0.5, min_area=0.02, max_area=1/3, min_aspect=0.3, max_aspect=None, + mode='const', min_count=1, max_count=None, num_splits=0): + self.prob = prob + self.min_area = min_area + self.max_area = max_area + max_aspect = max_aspect or 1 / min_aspect + self.log_aspect_ratio = (math.log(min_aspect), math.log(max_aspect)) + self.min_count = min_count + self.max_count = max_count or min_count + self.num_splits = num_splits + mode = mode.lower() + self.rand_color = False + self.per_pixel = False + if mode == "rand": + self.rand_color = True + elif mode == "pixel": + self.per_pixel = True + else: + assert not mode or mode == "const" + + def _erase(self, img, chan, img_h, img_w, dtype): + if random.random() > self.prob: + return + area = img_h * img_w + count = self.min_count if self.min_count == self.max_count else \ + random.randint(self.min_count, self.max_count) + for _ in range(count): + for attempt in range(10): + target_area = random.uniform(self.min_area, self.max_area) * area / count + aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio)) + h = int(round(math.sqrt(target_area * aspect_ratio))) + w = int(round(math.sqrt(target_area / aspect_ratio))) + if w < img_w and h < img_h: + top = random.randint(0, img_h - h) + left = random.randint(0, img_w - w) + img[:, top:top+h, left:left+w] = _get_pixels( + self.per_pixel, self.rand_color, (chan, h, w), + dtype=dtype) + break + + def __call__(self, input_): + if len(input_.shape) == 3: + self._erase(input_, *input_.shape, input_.dtype) + else: + batch_size, chan, img_h, img_w = input_.shape + batch_start = batch_size // self.num_splits if self.num_splits > 1 else 0 + for i in range(batch_start, batch_size): + self._erase(input_[i], chan, img_h, img_w, input_.dtype) + return input_ diff --git a/image_classification/Focal_Transformer/run_eval.sh b/image_classification/Focal_Transformer/run_eval.sh new file mode 100644 index 00000000..a8b4a711 --- /dev/null +++ b/image_classification/Focal_Transformer/run_eval.sh @@ -0,0 +1,10 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg='./configs/focal_tiny_patch4_window7_224.yaml' \ + -dataset='imagenet2012' \ + -num_classes=1000 \ + -batch_size=64 \ + -image_size=224 \ + -data_path='/dataset/imagenet' \ + -eval \ + -pretrained='./focal_tiny_patch4_window7_224' \ No newline at end of file diff --git 
a/image_classification/Focal_Transformer/run_eval_multi.sh b/image_classification/Focal_Transformer/run_eval_multi.sh new file mode 100644 index 00000000..f87b62f3 --- /dev/null +++ b/image_classification/Focal_Transformer/run_eval_multi.sh @@ -0,0 +1,10 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg='./configs/focal_tiny_patch4_window7_224.yaml' \ + -dataset='imagenet2012' \ + -num_classes=1000 \ + -batch_size=32 \ + -image_size=224 \ + -data_path='/dataset/imagenet' \ + -eval \ + -pretrained='./focal_tiny_patch4_window7_224' \ No newline at end of file diff --git a/image_classification/Focal_Transformer/run_train.sh b/image_classification/Focal_Transformer/run_train.sh new file mode 100644 index 00000000..18f98411 --- /dev/null +++ b/image_classification/Focal_Transformer/run_train.sh @@ -0,0 +1,9 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg='./configs/focal_tiny_patch4_window7_224.yaml' \ + -dataset='imagenet2012' \ + -num_classes=1000 \ + -batch_size=4 \ + -image_size=224 \ + -data_path='/dataset/imagenet' \ + -output='./output' \ No newline at end of file diff --git a/image_classification/Focal_Transformer/run_train_amp.sh b/image_classification/Focal_Transformer/run_train_amp.sh new file mode 100644 index 00000000..28a25227 --- /dev/null +++ b/image_classification/Focal_Transformer/run_train_amp.sh @@ -0,0 +1,10 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ +-cfg='./configs/focal_tiny_patch4_window7_224.yaml' \ +-dataset='imagenet2012' \ +-num_classes=1000 \ +-batch_size=4 \ +-image_size=224 \ +-data_path='/dataset/imagenet' \ +-output='./output' \ +-amp \ No newline at end of file diff --git a/image_classification/Focal_Transformer/run_train_multi.sh b/image_classification/Focal_Transformer/run_train_multi.sh new file mode 100644 index 00000000..a65a1943 --- /dev/null +++ b/image_classification/Focal_Transformer/run_train_multi.sh @@ -0,0 +1,9 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg='./configs/focal_tiny_patch4_window7_224.yaml' \ + -dataset='imagenet2012' \ + -num_classes=1000 \ + -batch_size=4 \ + -image_size=224 \ + -data_path='/dataset/imagenet' \ + -output='./output' diff --git a/image_classification/Focal_Transformer/stat.py b/image_classification/Focal_Transformer/stat.py new file mode 100644 index 00000000..935b54f5 --- /dev/null +++ b/image_classification/Focal_Transformer/stat.py @@ -0,0 +1,41 @@ +import os +import glob +import paddle +from config import get_config +from swin_transformer import build_swin as build_model + +def count_gelu(layer, input_, output): + activation_flops = 8 + x = input_[0] + num = x.numel() + layer.total_ops += num * activation_flops + + +def count_softmax(layer, input_, output): + softmax_flops = 5 # max/substract, exp, sum, divide + x = input_[0] + num = x.numel() + layer.total_ops += num * softmax_flops + + +def count_layernorm(layer, input_, output): + layer_norm_flops = 5 # get mean (sum), get variance (square and sum), scale(multiply) + x = input_[0] + num = x.numel() + layer.total_ops += num * layer_norm_flops + + +cfg = './configs/swin_tiny_patch4_window7_224.yaml' +input_size = (1, 3, 224, 224) +config = get_config(cfg) +model = build_model(config) + +custom_ops = {paddle.nn.GELU: count_gelu, + paddle.nn.LayerNorm: count_layernorm, + paddle.nn.Softmax: count_softmax, + } +print(os.path.basename(cfg)) +paddle.flops(model, + input_size=input_size, + custom_ops=custom_ops, + print_detail=False) diff --git 
a/image_classification/Focal_Transformer/transforms.py b/image_classification/Focal_Transformer/transforms.py new file mode 100644 index 00000000..5a046912 --- /dev/null +++ b/image_classification/Focal_Transformer/transforms.py @@ -0,0 +1,14 @@ +import random +import paddle +import paddle.nn +import paddle.vision.transforms as T + + +class RandomHorizontalFlip(): + def __init__(self, p=0.5): + self.p = p + + def __call__(self, image): + if random.random() < self.p: + return T.hflip(image) + return image diff --git a/image_classification/Focal_Transformer/utils.py b/image_classification/Focal_Transformer/utils.py new file mode 100644 index 00000000..54bd3a13 --- /dev/null +++ b/image_classification/Focal_Transformer/utils.py @@ -0,0 +1,114 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""utils for ViT +Contains AverageMeter for monitoring, get_exclude_from_decay_fn for training +and WarmupCosineScheduler for training +""" + +import math +from paddle.optimizer.lr import LRScheduler + + +class AverageMeter(): + """ Meter for monitoring losses""" + def __init__(self): + self.avg = 0 + self.sum = 0 + self.cnt = 0 + self.reset() + + def reset(self): + """reset all values to zeros""" + self.avg = 0 + self.sum = 0 + self.cnt = 0 + + def update(self, val, n=1): + """update avg by val and n, where val is the avg of n values""" + self.sum += val * n + self.cnt += n + self.avg = self.sum / self.cnt + + + +def get_exclude_from_weight_decay_fn(exclude_list=[]): + """ Set params with no weight decay during the training + For certain params, e.g., positional encoding in ViT, weight decay + may not needed during the learning, this method is used to find + these params. + Args: + exclude_list: a list of params names which need to exclude + from weight decay. + Returns: + exclude_from_weight_decay_fn: a function returns True if param + will be excluded from weight decay + """ + if len(exclude_list) == 0: + exclude_from_weight_decay_fn = None + else: + def exclude_fn(param): + for name in exclude_list: + if param.endswith(name): + return False + return True + exclude_from_weight_decay_fn = exclude_fn + return exclude_from_weight_decay_fn + + +class WarmupCosineScheduler(LRScheduler): + """Warmup Cosine Scheduler + First apply linear warmup, then apply cosine decay schedule. + Linearly increase learning rate from "warmup_start_lr" to "start_lr" over "warmup_epochs" + Cosinely decrease learning rate from "start_lr" to "end_lr" over remaining + "total_epochs - warmup_epochs" + Attributes: + learning_rate: the starting learning rate (without warmup), not used here! 
+ warmup_start_lr: warmup starting learning rate + start_lr: the starting learning rate (without warmup) + end_lr: the ending learning rate after whole loop + warmup_epochs: # of epochs for warmup + total_epochs: # of total epochs (include warmup) + """ + def __init__(self, + learning_rate, + warmup_start_lr, + start_lr, + end_lr, + warmup_epochs, + total_epochs, + cycles=0.5, + last_epoch=-1, + verbose=False): + """init WarmupCosineScheduler """ + self.warmup_epochs = warmup_epochs + self.total_epochs = total_epochs + self.warmup_start_lr = warmup_start_lr + self.start_lr = start_lr + self.end_lr = end_lr + self.cycles = cycles + super(WarmupCosineScheduler, self).__init__(learning_rate, last_epoch, verbose) + + def get_lr(self): + """ return lr value """ + if self.last_epoch < self.warmup_epochs: + val = (self.start_lr - self.warmup_start_lr) * float( + self.last_epoch)/float(self.warmup_epochs) + self.warmup_start_lr + return val + + progress = float(self.last_epoch - self.warmup_epochs) / float( + max(1, self.total_epochs - self.warmup_epochs)) + val = max(0.0, 0.5 * (1. + math.cos(math.pi * float(self.cycles) * 2.0 * progress))) + val = max(0.0, val * (self.start_lr - self.end_lr) + self.end_lr) + return val \ No newline at end of file diff --git a/image_classification/HVT/README.md b/image_classification/HVT/README.md new file mode 100644 index 00000000..61418c5c --- /dev/null +++ b/image_classification/HVT/README.md @@ -0,0 +1,174 @@ +# Scalable Vision Transformers with Hierarchical Pooling [arxiv](https://arxiv.org/abs/2103.10619) + +PaddlePaddle training/validation code and pretrained models for **HVT**. + +The official pytorch implementation is [here](https://github.com/zhuang-group/HVT). + +This implementation is developed by [PaddleViT](https://github.com/BR-IDL/PaddleViT.git). + + +

+![HVT Model Overview](./hvt.png)

+ +### Update +- Update (2021-12-28): Code is released and ported weights are uploaded. + +## Models Zoo +| Model | Acc@1 | Acc@5 | #Params | FLOPs | Image Size | Crop_pct | Interpolation | Link | +|----------------|-------|-------|---------|--------|------------|----------|---------------|--------------| +| HVT-Ti-1 | 69.45 | 89.28 | 5.7M | 0.6G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/11BW-qLBMu_1TDAavlrAbfVlXB53dgm42/view?usp=sharing)/[baidu](https://pan.baidu.com/s/16rZvJqL-UVuWFsCDuxFDqg?pwd=egds)(egds) | +| HVT-S-0 | 80.30 | 95.15 | 22.0M | 4.6G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1GlJ2j2QVFye1tAQoUJlgKTR_KELq3mSa/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1L-tjDxkQx00jg7BsDClabA?pwd=hj7a)(hj7a) | +| HVT-S-1 | 78.06 | 93.84 | 22.1M | 2.4G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/16H33zNIpNrHBP1YhCq4zmLjRYQJ0XEmX/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1quOsgVuxTcauISQ3SehysQ?pwd=tva8)(tva8) | +| HVT-S-2 | 77.41 | 93.48 | 22.1M | 1.9G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1U14LA7SXJtFep_SdUCjAV-cDOQ9A_OFk/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1nooWTBzaXyBtEgadn9VDmw?pwd=bajp)(bajp) | +| HVT-S-3 | 76.30 | 92.88 | 22.1M | 1.6G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1m1CjOcZfPMLDRyX4QBgMhHV1m6rtu44v/view?usp=sharing)/[baidu](https://pan.baidu.com/s/15sAOmQN6Hx0GLelYDuMQXw?pwd=rjch)(rjch) | +| HVT-S-4 | 75.21 | 92.34 | 22.1M | 1.6G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/14comGo9lO12dUeGGL52MuIJWZPSit7I0/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1o31hMRWR7FTCjUk7_fAOgA?pwd=ki4j)(ki4j) | + +> *The results are evaluated on ImageNet2012 validation set. + + +## Notebooks +We provide a few notebooks in aistudio to help you get started: + +**\*(coming soon)\*** + + +## Requirements +- Python>=3.6 +- yaml>=0.2.5 +- [PaddlePaddle](https://www.paddlepaddle.org.cn/documentation/docs/en/install/index_en.html)>=2.1.0 +- [yacs](https://github.com/rbgirshick/yacs)>=0.1.8 + +## Data +ImageNet2012 dataset is used in the following folder structure: +``` +│imagenet/ +├──train/ +│ ├── n01440764 +│ │ ├── n01440764_10026.JPEG +│ │ ├── n01440764_10027.JPEG +│ │ ├── ...... +│ ├── ...... +├──val/ +│ ├── n01440764 +│ │ ├── ILSVRC2012_val_00000293.JPEG +│ │ ├── ILSVRC2012_val_00002138.JPEG +│ │ ├── ...... +│ ├── ...... +``` + +## Usage +To use the model with pretrained weights, download the `.pdparam` weight file and change related file paths in the following python scripts. The model config files are located in `./configs/`. 
+ +For example, assume the downloaded weight file is stored in `./hvt_s2_patch16_224.pdparams`, to use the `hvt_s2_patch16_224` model in python: +```python +from config import get_config +from hvt import build_hvt as build_model +# config files in ./configs/ +config = get_config('./configs/hvt_s2_patch16_224.yaml') +# build model +model = build_model(config) +# load pretrained weights, .pdparams is NOT needed +model_state_dict = paddle.load('./hvt_s2_patch16_224') +model.set_dict(model_state_dict) +``` + +## Evaluation +To evaluate HVT model performance on ImageNet2012 with a single GPU, run the following script using command line: +```shell +sh run_eval.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg='./configs/hvt_s2_patch16_224.yaml' \ + -dataset='imagenet2012' \ + -batch_size=16 \ + -data_path='/dataset/imagenet' \ + -eval \ + -pretrained='./hvt_s2_patch16_224' +``` + +
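+To sanity-check the loaded weights without the full ImageNet pipeline, a minimal
+forward pass can be run in Python (illustrative sketch; it assumes `model` was
+built and loaded as in the Usage snippet above):
+
+```python
+import paddle
+
+model.eval()
+dummy = paddle.randn([1, 3, 224, 224])   # a fake batch of one 224x224 RGB image
+with paddle.no_grad():
+    logits = model(dummy)                # expected shape: [1, 1000]
+print(logits.shape)
+```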
+ + +Run evaluation using multi-GPUs: + + + +```shell +sh run_eval_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg='./configs/hvt_s2_patch16_224.yaml' \ + -dataset='imagenet2012' \ + -batch_size=16 \ + -data_path='/dataset/imagenet' \ + -eval \ + -pretrained='./hvt_s2_patch16_224' +``` + +
+ + + +## Training +To train the HVT Transformer model on ImageNet2012 with single GPU, run the following script using command line: + +```shell +sh run_train.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg='./configs/hvt_s2_patch16_224.yaml' \ + -dataset='imagenet2012' \ + -batch_size=16 \ + -data_path='/dataset/imagenet' +``` + +
+ + +Run training using multi-GPUs: + + + +```shell +sh run_train_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg='./configs/hvt_s2_patch16_224.yaml' \ + -dataset='imagenet2012' \ + -batch_size=16 \ + -data_path='/dataset/imagenet' +``` + +
+ +## Visualization Attention Map + +

+Feature visualization of ResNet50, DeiT-S and HVT-S-1 trained on ImageNet

+ +## Reference +``` +@inproceedings{pan2021scalable, + title={Scalable vision transformers with hierarchical pooling}, + author={Pan, Zizheng and Zhuang, Bohan and Liu, Jing and He, Haoyu and Cai, Jianfei}, + booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision}, + pages={377--386}, + year={2021} +} +``` diff --git a/image_classification/HVT/augment.py b/image_classification/HVT/augment.py new file mode 100644 index 00000000..7a7f081c --- /dev/null +++ b/image_classification/HVT/augment.py @@ -0,0 +1,285 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Augmentation""" +""" Rand Augmentation """ +# reference: RandAugment: Practical automated data augmentation with a reduced search space +# https://arxiv.org/abs/1909.13719 + +""" Auto Augmentation """ +# reference: AutoAugment: Learning Augmentation Policies from Data +# https://arxiv.org/abs/1805.09501 + +import random +import numpy as np +from PIL import Image, ImageEnhance, ImageOps + + +def auto_augment_policy_original(): + """25 types of augment policies in original paper""" + policy = [ + [('Posterize', 0.4, 8), ('Rotate', 0.6, 9)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + [('Posterize', 0.6, 7), ('Posterize', 0.6, 6)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Equalize', 0.4, 4), ('Rotate', 0.8, 8)], + [('Solarize', 0.6, 3), ('Equalize', 0.6, 7)], + [('Posterize', 0.8, 5), ('Equalize', 1.0, 2)], + [('Rotate', 0.2, 3), ('Solarize', 0.6, 8)], + [('Equalize', 0.6, 8), ('Posterize', 0.4, 6)], + [('Rotate', 0.8, 8), ('Color', 0.4, 0)], + [('Rotate', 0.4, 9), ('Equalize', 0.6, 2)], + [('Equalize', 0.0, 7), ('Equalize', 0.8, 8)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Rotate', 0.8, 8), ('Color', 1.0, 2)], + [('Color', 0.8, 8), ('Solarize', 0.8, 7)], + [('Sharpness', 0.4, 7), ('Invert', 0.6, 8)], + [('ShearX', 0.6, 5), ('Equalize', 1.0, 9)], + [('Color', 0.4, 0), ('Equalize', 0.6, 3)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + ] + policy = [[SubPolicy(*args) for args in subpolicy] for subpolicy in policy] + return policy + + +def rand_augment_policy_original(magnitude_idx=9): + """ + 14 types of augment policies in original paper + Args: + magnitude_idx: M + """ + policy = [ + ('Posterize', 1, magnitude_idx), ('Rotate', 1, magnitude_idx), + ('Solarize', 1, magnitude_idx), ('AutoContrast', 1, magnitude_idx), + ('Equalize', 1, magnitude_idx), ('Contrast', 1, magnitude_idx), + ('Color', 1, magnitude_idx), ('Invert', 1, magnitude_idx), + ('Sharpness', 1, magnitude_idx), ('Brightness', 1, magnitude_idx), + ('ShearX', 1, magnitude_idx), ('ShearY', 1, magnitude_idx), + ('TranslateX', 1, magnitude_idx), ('TranslateY', 1, magnitude_idx), + ] 
+ policy = [SubPolicy(*args) for args in policy] + return policy + + +class AutoAugment(): + """Auto Augment + Randomly choose a tuple of augment ops from a list of policy + Then apply the tuple of augment ops to input image + + Examples: + policy = auto_augment_policy_original() + augment = AutoAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy): + self.policy = policy + + def __call__(self, image, policy_idx=None): + if policy_idx is None: + policy_idx = random.randint(0, len(self.policy) - 1) + + sub_policy = self.policy[policy_idx] + for op in sub_policy: + image = op(image) + return image + + +class RandAugment(): + """Rand Augment + Randomly choose N augment ops from a list of K policies + Then apply the N ops to input image + + Examples: + policy = rand_augment_policy_original(magnitude_idx) + augment = RandAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy, num_layers=2): + """ + Args: + policy: list of SubPolicy + num_layers: int + """ + self.policy = policy + self.num_layers = num_layers + + def __call__(self, image): + selected_idx = np.random.choice(len(self.policy), self.num_layers) + + for policy_idx in selected_idx: + sub_policy = self.policy[policy_idx] + image = sub_policy(image) + return image + + +class SubPolicy: + """Subpolicy + Read augment name and magnitude, apply augment with probability + Args: + op_name: str, augment operation name + prob: float, if prob > random prob, apply augment + magnitude_idx: int, index of magnitude in preset magnitude ranges + """ + + def __init__(self, op_name, prob, magnitude_idx): + # ranges of operations' magnitude + ranges = { + 'ShearX': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'ShearY': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'TranslateX': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'TranslateY': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'Rotate': np.linspace(0, 30, 10), # [-30, 30] (by random negative) + 'Color': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Posterize': np.round(np.linspace(8, 4, 10), 0).astype(np.int), # [0, 4] + 'Solarize': np.linspace(256, 0, 10), # [0, 256] + 'Contrast': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Sharpness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Brightness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'AutoContrast': [0] * 10, # no range + 'Equalize': [0] * 10, # no range + 'Invert': [0] * 10, # no range + } + + # augmentation operations + # Lambda is not pickleable for DDP + # image_ops = { + # 'ShearX': lambda image, magnitude: shear_x(image, magnitude), + # 'ShearY': lambda image, magnitude: shear_y(image, magnitude), + # 'TranslateX': lambda image, magnitude: translate_x(image, magnitude), + # 'TranslateY': lambda image, magnitude: translate_y(image, magnitude), + # 'Rotate': lambda image, magnitude: rotate(image, magnitude), + # 'AutoContrast': lambda image, magnitude: auto_contrast(image, magnitude), + # 'Invert': lambda image, magnitude: invert(image, magnitude), + # 'Equalize': lambda image, magnitude: equalize(image, magnitude), + # 'Solarize': lambda image, magnitude: solarize(image, magnitude), + # 'Posterize': lambda image, magnitude: posterize(image, magnitude), + # 'Contrast': lambda image, magnitude: contrast(image, magnitude), + # 'Color': lambda image, magnitude: color(image, magnitude), + # 'Brightness': lambda image, 
magnitude: brightness(image, magnitude), + # 'Sharpness': lambda image, magnitude: sharpness(image, magnitude), + # } + image_ops = { + 'ShearX': shear_x, + 'ShearY': shear_y, + 'TranslateX': translate_x_relative, + 'TranslateY': translate_y_relative, + 'Rotate': rotate, + 'AutoContrast': auto_contrast, + 'Invert': invert, + 'Equalize': equalize, + 'Solarize': solarize, + 'Posterize': posterize, + 'Contrast': contrast, + 'Color': color, + 'Brightness': brightness, + 'Sharpness': sharpness, + } + + self.prob = prob + self.magnitude = ranges[op_name][magnitude_idx] + self.op = image_ops[op_name] + + def __call__(self, image): + if self.prob > random.random(): + image = self.op(image, self.magnitude) + return image + + +# PIL Image transforms +# https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.transform +def shear_x(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, factor, 0, 0, 1, 0), fillcolor=fillcolor) + + +def shear_y(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, factor, 1, 0), fillcolor=fillcolor) + + +def translate_x_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, pixels, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, pixels), fillcolor=fillcolor) + + +def translate_x_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, magnitude, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, magnitude), fillcolor=fillcolor) + + +def rotate(image, magnitude): + rot = image.convert("RGBA").rotate(magnitude) + return Image.composite(rot, + Image.new('RGBA', rot.size, (128,) * 4), + rot).convert(image.mode) + + +def auto_contrast(image, magnitude=None): + return ImageOps.autocontrast(image) + + +def invert(image, magnitude=None): + return ImageOps.invert(image) + + +def equalize(image, magnitude=None): + return ImageOps.equalize(image) + + +def solarize(image, magnitude): + return ImageOps.solarize(image, magnitude) + + +def posterize(image, magnitude): + return ImageOps.posterize(image, magnitude) + + +def contrast(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Contrast(image).enhance(1 + magnitude) + + +def color(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Color(image).enhance(1 + magnitude) + + +def brightness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Brightness(image).enhance(1 + magnitude) + + +def sharpness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Sharpness(image).enhance(1 + magnitude) 
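+# How a training pipeline typically chooses between the two (illustrative,
+# mirroring the logic in datasets.py): TRAIN.AUTO_AUGMENT selects
+# AutoAugment(auto_augment_policy_original()), TRAIN.RAND_AUGMENT selects
+# RandAugment(rand_augment_policy_original()), and otherwise ColorJitter is used.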
+ diff --git a/image_classification/HVT/config.py b/image_classification/HVT/config.py new file mode 100644 index 00000000..9c9dacf9 --- /dev/null +++ b/image_classification/HVT/config.py @@ -0,0 +1,184 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Configuration +Configuration for data, model archtecture, and training, etc. +Config can be set by .yaml file or by argparser(limited usage) +""" + +import os +from yacs.config import CfgNode as CN +import yaml + +_C = CN() +_C.BASE = [''] + +# data settings +_C.DATA = CN() +_C.DATA.BATCH_SIZE = 256 #256 # train batch_size for single GPU +_C.DATA.BATCH_SIZE_EVAL = 8 #64 # val batch_size for single GPU +_C.DATA.DATA_PATH = '/dataset/imagenet/' # path to dataset +_C.DATA.DATASET = 'imagenet2012' # dataset name +_C.DATA.IMAGE_SIZE = 224 # input image size: 224 for pretrain, 384 for finetune +_C.DATA.CROP_PCT = 0.875 # input image scale ratio, scale is applied before centercrop in eval mode +_C.DATA.NUM_WORKERS = 1 # number of data loading threads +_C.DATA.IMAGENET_MEAN = [0.485, 0.456, 0.406] +_C.DATA.IMAGENET_STD = [0.229, 0.224, 0.225] # [0.5, 0.5, 0.5] + +# model settings +_C.MODEL = CN() +_C.MODEL.TYPE = 'HVT' +_C.MODEL.NAME = 'HVT' +_C.MODEL.RESUME = None +_C.MODEL.PRETRAINED = None +_C.MODEL.NUM_CLASSES = 1000 +_C.MODEL.DROPOUT = 0.0 +_C.MODEL.DROPPATH = 0.1 +_C.MODEL.ATTENTION_DROPOUT = 0.0 + +# transformer settings +_C.MODEL.TRANS = CN() +_C.MODEL.TRANS.PATCH_SIZE = 16 +_C.MODEL.TRANS.IN_CHANNELS = 3 +_C.MODEL.TRANS.EMBED_DIM = 384 +_C.MODEL.TRANS.DEPTH = 12 +_C.MODEL.TRANS.MLP_RATIO = 4.0 +_C.MODEL.TRANS.NUM_HEADS = 6 +_C.MODEL.TRANS.QKV_BIAS = True +_C.MODEL.TRANS.INIT_VALUES = 1e-5 +_C.MODEL.TRANS.POOL_KERNEL_SIZE = 3 +_C.MODEL.TRANS.POOL_STRIDE = 2 +_C.MODEL.TRANS.POOL_BLOCK_WIDTH = 6 + + +# training settings +_C.TRAIN = CN() +_C.TRAIN.LAST_EPOCH = 0 +_C.TRAIN.NUM_EPOCHS = 300 +_C.TRAIN.WARMUP_EPOCHS = 20 +_C.TRAIN.WEIGHT_DECAY = 0.025 +_C.TRAIN.BASE_LR = 0.0005 +_C.TRAIN.WARMUP_START_LR = 1e-6 +_C.TRAIN.END_LR = 1e-5 +_C.TRAIN.GRAD_CLIP = None +_C.TRAIN.ACCUM_ITER = 1 +_C.TRAIN.MODEL_EMA = False +_C.TRAIN.MODEL_EMA_DECAY = 0.99996 +_C.TRAIN.LINEAR_SCALED_LR = None + +_C.TRAIN.LR_SCHEDULER = CN() +_C.TRAIN.LR_SCHEDULER.NAME = 'warmupcosine' +_C.TRAIN.LR_SCHEDULER.MILESTONES = "30, 60, 90" # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_EPOCHS = 30 # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_RATE = 0.1 # only used in StepLRScheduler + +_C.TRAIN.OPTIMIZER = CN() +_C.TRAIN.OPTIMIZER.NAME = 'AdamW' +_C.TRAIN.OPTIMIZER.EPS = 1e-8 +_C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999) # for adamW +_C.TRAIN.OPTIMIZER.MOMENTUM = 0.9 + +# train augmentation +_C.TRAIN.MIXUP_ALPHA = 0.8 +_C.TRAIN.CUTMIX_ALPHA = 1.0 +_C.TRAIN.CUTMIX_MINMAX = None +_C.TRAIN.MIXUP_PROB = 1.0 +_C.TRAIN.MIXUP_SWITCH_PROB = 0.5 +_C.TRAIN.MIXUP_MODE = 'batch' + +_C.TRAIN.SMOOTHING = 0.1 +_C.TRAIN.COLOR_JITTER = 0.4 +_C.TRAIN.AUTO_AUGMENT = True #'rand-m9-mstd0.5-inc1' +_C.TRAIN.RAND_AUGMENT = 
False + +_C.TRAIN.RANDOM_ERASE_PROB = 0.25 +_C.TRAIN.RANDOM_ERASE_MODE = 'pixel' +_C.TRAIN.RANDOM_ERASE_COUNT = 1 +_C.TRAIN.RANDOM_ERASE_SPLIT = False + +# misc +_C.SAVE = "./output" +_C.TAG = "default" +_C.SAVE_FREQ = 1 # freq to save chpt +_C.REPORT_FREQ = 50 # freq to logging info +_C.VALIDATE_FREQ = 10 # freq to do validation +_C.SEED = 42 +_C.EVAL = False # run evaluation only +_C.AMP = False +_C.LOCAL_RANK = 0 +_C.NGPUS = -1 + + +def _update_config_from_file(config, cfg_file): + config.defrost() + with open(cfg_file, 'r') as infile: + yaml_cfg = yaml.load(infile, Loader=yaml.FullLoader) + for cfg in yaml_cfg.setdefault('BASE', ['']): + if cfg: + _update_config_from_file( + config, os.path.join(os.path.dirname(cfg_file), cfg) + ) + print('merging config from {}'.format(cfg_file)) + config.merge_from_file(cfg_file) + config.freeze() + +def update_config(config, args): + """Update config by ArgumentParser + Args: + args: ArgumentParser contains options + Return: + config: updated config + """ + if args.cfg: + _update_config_from_file(config, args.cfg) + config.defrost() + if args.dataset: + config.DATA.DATASET = args.dataset + if args.batch_size: + config.DATA.BATCH_SIZE = args.batch_size + if args.image_size: + config.DATA.IMAGE_SIZE = args.image_size + if args.num_classes: + config.MODEL.NUM_CLASSES = args.num_classes + if args.data_path: + config.DATA.DATA_PATH = args.data_path + if args.output is not None: + config.SAVE = args.output + if args.ngpus: + config.NGPUS = args.ngpus + if args.eval: + config.EVAL = True + config.DATA.BATCH_SIZE_EVAL = args.batch_size + if args.pretrained: + config.MODEL.PRETRAINED = args.pretrained + if args.resume: + config.MODEL.RESUME = args.resume + if args.last_epoch: + config.TRAIN.LAST_EPOCH = args.last_epoch + if args.amp: # only during training + if config.EVAL is True: + config.AMP = False + else: + config.AMP = True + + #config.freeze() + return config + + +def get_config(cfg_file=None): + """Return a clone of config or load from yaml file""" + config = _C.clone() + if cfg_file: + _update_config_from_file(config, cfg_file) + return config diff --git a/image_classification/HVT/configs/hvt_s0_patch16_224.yaml b/image_classification/HVT/configs/hvt_s0_patch16_224.yaml new file mode 100644 index 00000000..ac90d198 --- /dev/null +++ b/image_classification/HVT/configs/hvt_s0_patch16_224.yaml @@ -0,0 +1,26 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: hvt + NAME: hvt_base_patch16_224 + TRANS: + PATCH_SIZE: 16 + EMBED_DIM: 384 + MLP_RATIO: 4.0 + DEPTH: 12 + NUM_HEADS: 6 + QKV_BIAS: True + POOL_KERNEL_SIZE: 3 + POOL_STRIDE: 2 + POOL_BLOCK_WIDTH: 0 +TRAIN: + NUM_EPOCHS: 300 + WARMUP_EPOCHS: 5 + WEIGHT_DECAY: 0.025 + BASE_LR: 0.003 + WARMUP_START_LR: 1e-6 + END_LR: 5e-4 + ACCUM_ITER: 2 + + diff --git a/image_classification/HVT/configs/hvt_s1_patch16_224.yaml b/image_classification/HVT/configs/hvt_s1_patch16_224.yaml new file mode 100644 index 00000000..6f748af1 --- /dev/null +++ b/image_classification/HVT/configs/hvt_s1_patch16_224.yaml @@ -0,0 +1,26 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: hvt + NAME: hvt_base_patch16_224 + TRANS: + PATCH_SIZE: 16 + EMBED_DIM: 384 + MLP_RATIO: 4.0 + DEPTH: 12 + NUM_HEADS: 6 + QKV_BIAS: True + POOL_KERNEL_SIZE: 3 + POOL_STRIDE: 2 + POOL_BLOCK_WIDTH: 12 +TRAIN: + NUM_EPOCHS: 300 + WARMUP_EPOCHS: 5 + WEIGHT_DECAY: 0.025 + BASE_LR: 0.003 + WARMUP_START_LR: 1e-6 + END_LR: 5e-4 + ACCUM_ITER: 2 + + diff --git a/image_classification/HVT/configs/hvt_s2_patch16_224.yaml 
b/image_classification/HVT/configs/hvt_s2_patch16_224.yaml new file mode 100644 index 00000000..543d3f5a --- /dev/null +++ b/image_classification/HVT/configs/hvt_s2_patch16_224.yaml @@ -0,0 +1,26 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: hvt + NAME: hvt_base_patch16_224 + TRANS: + PATCH_SIZE: 16 + EMBED_DIM: 384 + MLP_RATIO: 4.0 + DEPTH: 12 + NUM_HEADS: 6 + QKV_BIAS: True + POOL_KERNEL_SIZE: 3 + POOL_STRIDE: 2 + POOL_BLOCK_WIDTH: 6 +TRAIN: + NUM_EPOCHS: 300 + WARMUP_EPOCHS: 5 + WEIGHT_DECAY: 0.025 + BASE_LR: 0.003 + WARMUP_START_LR: 1e-6 + END_LR: 5e-4 + ACCUM_ITER: 2 + + diff --git a/image_classification/HVT/configs/hvt_s3_patch16_224.yaml b/image_classification/HVT/configs/hvt_s3_patch16_224.yaml new file mode 100644 index 00000000..003f963c --- /dev/null +++ b/image_classification/HVT/configs/hvt_s3_patch16_224.yaml @@ -0,0 +1,26 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: hvt + NAME: hvt_base_patch16_224 + TRANS: + PATCH_SIZE: 16 + EMBED_DIM: 384 + MLP_RATIO: 4.0 + DEPTH: 12 + NUM_HEADS: 6 + QKV_BIAS: True + POOL_KERNEL_SIZE: 3 + POOL_STRIDE: 2 + POOL_BLOCK_WIDTH: 4 +TRAIN: + NUM_EPOCHS: 300 + WARMUP_EPOCHS: 5 + WEIGHT_DECAY: 0.025 + BASE_LR: 0.003 + WARMUP_START_LR: 1e-6 + END_LR: 5e-4 + ACCUM_ITER: 2 + + diff --git a/image_classification/HVT/configs/hvt_ti_1_patch16_224.yaml b/image_classification/HVT/configs/hvt_ti_1_patch16_224.yaml new file mode 100644 index 00000000..69c7f8f4 --- /dev/null +++ b/image_classification/HVT/configs/hvt_ti_1_patch16_224.yaml @@ -0,0 +1,25 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: hvt + NAME: hvt_base_patch16_224 + TRANS: + PATCH_SIZE: 16 + EMBED_DIM: 192 + MLP_RATIO: 4.0 + DEPTH: 12 + NUM_HEADS: 3 + QKV_BIAS: True + POOL_KERNEL_SIZE: 3 + POOL_BLOCK_WIDTH: 13 +TRAIN: + NUM_EPOCHS: 300 + WARMUP_EPOCHS: 5 + WEIGHT_DECAY: 0.3 + BASE_LR: 0.003 + WARMUP_START_LR: 1e-6 + END_LR: 5e-4 + ACCUM_ITER: 2 + + diff --git a/image_classification/HVT/configs/scale_hvt_ti_4_patch16_224.yaml b/image_classification/HVT/configs/scale_hvt_ti_4_patch16_224.yaml new file mode 100644 index 00000000..3ed48a2d --- /dev/null +++ b/image_classification/HVT/configs/scale_hvt_ti_4_patch16_224.yaml @@ -0,0 +1,26 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: hvt + NAME: hvt_base_patch16_224 + TRANS: + PATCH_SIZE: 16 + EMBED_DIM: 384 + MLP_RATIO: 4.0 + DEPTH: 12 + NUM_HEADS: 6 + QKV_BIAS: True + POOL_KERNEL_SIZE: 3 + POOL_STRIDE: 2 + POOL_BLOCK_WIDTH: 3 +TRAIN: + NUM_EPOCHS: 300 + WARMUP_EPOCHS: 5 + WEIGHT_DECAY: 0.025 + BASE_LR: 0.003 + WARMUP_START_LR: 1e-6 + END_LR: 5e-4 + ACCUM_ITER: 2 + + diff --git a/image_classification/HVT/datasets.py b/image_classification/HVT/datasets.py new file mode 100644 index 00000000..18448892 --- /dev/null +++ b/image_classification/HVT/datasets.py @@ -0,0 +1,220 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
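+# Expected ImageNet list-file format (one sample per line): "<image_path> <int_label>",
+# where <image_path> is relative to DATA.DATA_PATH, e.g. (illustrative)
+#   train/n01440764/n01440764_10026.JPEG 0
+# Both train_list.txt and val_list.txt must be present under DATA.DATA_PATH.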
+ +""" +Dataset related classes and methods for HVT training and validation +Cifar10, Cifar100 and ImageNet2012 are supported +""" + +import os +import math +from PIL import Image +from paddle.io import Dataset +from paddle.io import DataLoader +from paddle.io import DistributedBatchSampler +from paddle.vision import transforms +from paddle.vision import datasets +from paddle.vision import image_load +from augment import auto_augment_policy_original +from augment import AutoAugment +from augment import rand_augment_policy_original +from augment import RandAugment +from random_erasing import RandomErasing + + +class ImageNet2012Dataset(Dataset): + """Build ImageNet2012 dataset + + This class gets train/val imagenet datasets, which loads transfomed data and labels. + + Attributes: + file_folder: path where imagenet images are stored + transform: preprocessing ops to apply on image + img_path_list: list of full path of images in whole dataset + label_list: list of labels of whole dataset + """ + + def __init__(self, file_folder, mode="train", transform=None): + """Init ImageNet2012 Dataset with dataset file path, mode(train/val), and transform""" + super(ImageNet2012Dataset, self).__init__() + assert mode in ["train", "val"] + self.file_folder = file_folder + self.transform = transform + self.img_path_list = [] + self.label_list = [] + + if mode == "train": + self.list_file = os.path.join(self.file_folder, "train_list.txt") + else: + self.list_file = os.path.join(self.file_folder, "val_list.txt") + + with open(self.list_file, 'r') as infile: + for line in infile: + img_path = line.strip().split()[0] + img_label = int(line.strip().split()[1]) + self.img_path_list.append(os.path.join(self.file_folder, img_path)) + self.label_list.append(img_label) + print(f'----- Imagenet2012 image {mode} list len = {len(self.label_list)}') + + def __len__(self): + return len(self.label_list) + + def __getitem__(self, index): + data = image_load(self.img_path_list[index]).convert('RGB') + data = self.transform(data) + label = self.label_list[index] + + return data, label + + +def get_train_transforms(config): + """ Get training transforms + + For training, a RandomResizedCrop is applied, then normalization is applied with + [0.5, 0.5, 0.5] mean and std. The input pixel values must be rescaled to [0, 1.] 
+ Outputs is converted to tensor + + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + aug_op_list = [] + # STEP1: random crop and resize + aug_op_list.append( + transforms.RandomResizedCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE), + scale=(0.05, 1.0), interpolation='bicubic')) + # STEP2: auto_augment or color jitter + if config.TRAIN.AUTO_AUGMENT: + policy = auto_augment_policy_original() + auto_augment = AutoAugment(policy) + aug_op_list.append(auto_augment) + elif config.TRAIN.RAND_AUGMENT: + policy = rand_augment_policy_original() + rand_augment = RandAugment(policy) + aug_op_list.append(rand_augment) + else: + jitter = (float(config.TRAIN.COLOR_JITTER), ) * 3 + aug_op_list.append(transforms.ColorJitter(*jitter)) + # STEP3: other ops + aug_op_list.append(transforms.ToTensor()) + aug_op_list.append(transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, + std=config.DATA.IMAGENET_STD)) + # STEP4: random erasing + if config.TRAIN.RANDOM_ERASE_PROB > 0.: + random_erasing = RandomErasing(prob=config.TRAIN.RANDOM_ERASE_PROB, + mode=config.TRAIN.RANDOM_ERASE_MODE, + max_count=config.TRAIN.RANDOM_ERASE_COUNT, + num_splits=config.TRAIN.RANDOM_ERASE_SPLIT) + aug_op_list.append(random_erasing) + # Final: compose transforms and return + transforms_train = transforms.Compose(aug_op_list) + return transforms_train + + +def get_val_transforms(config): + """ Get training transforms + + For validation, image is first Resize then CenterCrop to image_size. + Then normalization is applied with [0.5, 0.5, 0.5] mean and std. + The input pixel values must be rescaled to [0, 1.] + Outputs is converted to tensor + + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + + scale_size = int(math.floor(config.DATA.IMAGE_SIZE / config.DATA.CROP_PCT)) + transforms_val = transforms.Compose([ + transforms.Resize(scale_size, interpolation='bicubic'), + transforms.CenterCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE)), + transforms.ToTensor(), + transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD), + ]) + return transforms_val + + +def get_dataset(config, mode='train'): + """ Get dataset from config and mode (train/val) + + Returns the related dataset object according to configs and mode(train/val) + + Args: + config: configs contains dataset related settings. 
see config.py for details + Returns: + dataset: dataset object + """ + + assert mode in ['train', 'val'] + if config.DATA.DATASET == "cifar10": + if mode == 'train': + dataset = datasets.Cifar10(mode=mode, transform=get_train_transforms(config)) + else: + mode = 'test' + dataset = datasets.Cifar10(mode=mode, transform=get_val_transforms(config)) + elif config.DATA.DATASET == "cifar100": + if mode == 'train': + dataset = datasets.Cifar100(mode=mode, transform=get_train_transforms(config)) + else: + mode = 'test' + dataset = datasets.Cifar100(mode=mode, transform=get_val_transforms(config)) + elif config.DATA.DATASET == "imagenet2012": + if mode == 'train': + dataset = ImageNet2012Dataset(config.DATA.DATA_PATH, + mode=mode, + transform=get_train_transforms(config)) + else: + dataset = ImageNet2012Dataset(config.DATA.DATA_PATH, + mode=mode, + transform=get_val_transforms(config)) + else: + raise NotImplementedError( + "[{config.DATA.DATASET}] Only cifar10, cifar100, imagenet2012 are supported now") + return dataset + + +def get_dataloader(config, dataset, mode='train', multi_process=False): + """Get dataloader with config, dataset, mode as input, allows multiGPU settings. + + Multi-GPU loader is implements as distributedBatchSampler. + + Args: + config: see config.py for details + dataset: paddle.io.dataset object + mode: train/val + multi_process: if True, use DistributedBatchSampler to support multi-processing + Returns: + dataloader: paddle.io.DataLoader object. + """ + + if mode == 'train': + batch_size = config.DATA.BATCH_SIZE + else: + batch_size = config.DATA.BATCH_SIZE_EVAL + + if multi_process is True: + sampler = DistributedBatchSampler(dataset, + batch_size=batch_size, + shuffle=(mode == 'train')) + dataloader = DataLoader(dataset, + batch_sampler=sampler, + num_workers=config.DATA.NUM_WORKERS) + else: + dataloader = DataLoader(dataset, + batch_size=batch_size, + num_workers=config.DATA.NUM_WORKERS, + shuffle=(mode == 'train')) + return dataloader diff --git a/image_classification/HVT/droppath.py b/image_classification/HVT/droppath.py new file mode 100644 index 00000000..d7ecf00c --- /dev/null +++ b/image_classification/HVT/droppath.py @@ -0,0 +1,61 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Droppath, reimplement from https://github.com/yueatsprograms/Stochastic_Depth +""" + +import numpy as np +import paddle +import paddle.nn as nn + + +class DropPath(nn.Layer): + """DropPath class""" + def __init__(self, drop_prob=None): + super(DropPath, self).__init__() + self.drop_prob = drop_prob + + def drop_path(self, inputs): + """drop path op + Args: + input: tensor with arbitrary shape + drop_prob: float number of drop path probability, default: 0.0 + training: bool, if current mode is training, default: False + Returns: + output: output tensor after drop path + """ + # if prob is 0 or eval mode, return original input + if self.drop_prob == 0. 
or not self.training: + return inputs + keep_prob = 1 - self.drop_prob + keep_prob = paddle.to_tensor(keep_prob, dtype='float32') + shape = (inputs.shape[0], ) + (1, ) * (inputs.ndim - 1) # shape=(N, 1, 1, 1) + random_tensor = keep_prob + paddle.rand(shape, dtype=inputs.dtype) + random_tensor = random_tensor.floor() # mask + output = inputs.divide(keep_prob) * random_tensor # divide is to keep same output expectation + return output + + def forward(self, inputs): + return self.drop_path(inputs) + + +#def main(): +# tmp = paddle.to_tensor(np.random.rand(8, 16, 8, 8), dtype='float32') +# dp = DropPath(0.5) +# out = dp(tmp) +# print(out) +# +#if __name__ == "__main__": +# main() diff --git a/image_classification/HVT/hvt.png b/image_classification/HVT/hvt.png new file mode 100644 index 00000000..a413e72b Binary files /dev/null and b/image_classification/HVT/hvt.png differ diff --git a/image_classification/HVT/hvt.py b/image_classification/HVT/hvt.py new file mode 100644 index 00000000..17c9cb00 --- /dev/null +++ b/image_classification/HVT/hvt.py @@ -0,0 +1,361 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Implement HVT +""" + +import math +import copy +import paddle +import paddle.nn as nn +from droppath import DropPath + + +class Identity(nn.Layer): + """ Identity layer + + The output of this layer is the input without any change. + Use this layer to avoid using 'if' condition in forward methods + """ + + def __init__(self): + super(Identity, self).__init__() + + def forward(self, x): + return x + + +class PatchEmbedding(nn.Layer): + """Patch Embeddings + + Then a proj (conv2d) layer is applied as the patch embedding. + + Args: + image_size: int, input image size, default: 224 + patch_size: int, patch size for patch embedding (k and stride for proj conv), default: 8 + in_channels: int, input channels, default: 3 + embed_dim: int, output dimension of patch embedding, default: 384 + """ + + def __init__(self, + image_size=224, + patch_size=16, + in_channels=3, + embed_dim=384): + super().__init__() + assert patch_size in [4, 8, 16] + + # define patch embeddings + self.proj = nn.Conv2D(in_channels, + embed_dim, + kernel_size=patch_size, + stride=patch_size) + # num patches + self.num_patches = (image_size // patch_size) * (image_size // patch_size) + + def forward(self, x): + x = self.proj(x) + x = x.flatten(2) + x = x.transpose([0, 2, 1]) + return x + + +class Mlp(nn.Layer): + """ MLP module + + Impl using nn.Linear and activation is GELU, dropout is applied. 
+ Ops: fc -> act -> dropout -> fc -> dropout + + Attributes: + fc1: nn.Linear + fc2: nn.Linear + act: GELU + dropout1: dropout after fc1 + dropout2: dropout after fc2 + """ + + def __init__(self, in_features, hidden_features, dropout=0.): + super(Mlp, self).__init__() + w_attr_1, b_attr_1 = self._init_weights() + self.fc1 = nn.Linear(in_features, + hidden_features, + weight_attr=w_attr_1, + bias_attr=b_attr_1) + + w_attr_2, b_attr_2 = self._init_weights() + self.fc2 = nn.Linear(hidden_features, + in_features, + weight_attr=w_attr_2, + bias_attr=b_attr_2) + self.act = nn.GELU() + self.dropout = nn.Dropout(dropout) + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + + def forward(self, x): + x = self.fc1(x) + x = self.act(x) + x = self.dropout(x) + x = self.fc2(x) + x = self.dropout(x) + return x + + +class Attention(nn.Layer): + """ Attention + + Regular Attention module same as ViT + + Args: + dim: int, all heads dimension + num_heads: int, num of heads + qkv_bias: bool, if True, qkv linear layer is using bias, default: False + qk_scale: float, if None, qk_scale is dim_head ** -0.5, default: None + attention_dropout: float, dropout rate for attention dropout, default: 0. + dropout: float, dropout rate for projection dropout, default: 0. + """ + + def __init__(self, + dim, + num_heads=8, + qkv_bias=False, + qk_scale=None, + attention_dropout=0., + dropout=0.): + super().__init__() + self.num_heads = num_heads + self.embed_dim = dim + self.dim_head = dim // num_heads + self.scale = qk_scale or self.dim_head ** -0.5 + + w_attr_1, b_attr_1 = self._init_weights() + self.qkv = nn.Linear(dim, + dim * 3, + weight_attr=w_attr_1, + bias_attr=b_attr_1 if qkv_bias else None) + self.attn_dropout = nn.Dropout(attention_dropout) + self.softmax = nn.Softmax(axis=-1) + w_attr_2, b_attr_2 = self._init_weights() + self.proj = nn.Linear(dim, + dim, + weight_attr=w_attr_2, + bias_attr=b_attr_2) + self.proj_dropout = nn.Dropout(dropout) + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + + def transpose_multihead(self, x): + new_shape = x.shape[:-1] + [self.num_heads, self.dim_head] + x = x.reshape(new_shape) + x = x.transpose([0, 2, 1, 3]) + return x + + def forward(self, x): + qkv = self.qkv(x).chunk(3, axis=-1) + q, k, v = map(self.transpose_multihead, qkv) + + attn = paddle.matmul(q, k, transpose_y=True) + attn = attn * self.scale + attn = self.softmax(attn) + attn = self.attn_dropout(attn) + + z = paddle.matmul(attn, v) + z = z.transpose([0, 2, 1, 3]) + + new_shape = z.shape[:-2] + [self.embed_dim] + z = z.reshape(new_shape) + z = self.proj(z) + z = self.proj_dropout(z) + + return z + + +class EncoderLayer(nn.Layer): + """Transformer Encoder Layer + + Transformer encoder module, same as ViT + + Args: + dim: int, all heads dimension + num_heads: int, num of heads + mlp_ratio: float, ratio to multiply with dim for mlp hidden feature dim, default: 4. + qkv_bias: bool, if True, qkv linear layer is using bias, default: False + qk_scale: float, if None, qk_scale is dim_head ** -0.5, default: None + attention_dropout: float, dropout rate for attention dropout, default: 0. + dropout: float, dropout rate for projection dropout, default: 0. 
+ """ + + def __init__(self, + seq_len, + dim, + num_heads, + mlp_ratio=4., + qkv_bias=False, + qk_scale=None, + downsample=None, + attention_dropout=0, + droppath=0.): + super().__init__() + w_attr_1, b_attr_1 = self._init_weights() + self.norm1 = nn.LayerNorm(dim, + weight_attr=w_attr_1, + bias_attr=b_attr_1, + epsilon=1e-6) + self.attn = Attention(dim, + num_heads=num_heads, + qkv_bias=qkv_bias, + qk_scale=qk_scale, + attention_dropout=attention_dropout) + self.drop_path = DropPath(droppath) if droppath > 0. else Identity() + w_attr_2, b_attr_2 = self._init_weights() + self.norm2 = nn.LayerNorm(dim, + weight_attr=w_attr_2, + bias_attr=b_attr_2, + epsilon=1e-6) + self.mlp = Mlp(in_features=dim, + hidden_features=int(dim * mlp_ratio)) + self.downsample = downsample + + if self.downsample: + self.pos_embed = paddle.create_parameter( + shape=[1, seq_len, dim], + dtype='float32', + default_initializer=nn.initializer.TruncatedNormal(std=.02)) + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(1.0)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + + def forward(self, x): + h = x + x = self.norm1(x) + x = self.attn(x) + x = self.drop_path(x) + x = h + x + + h = x + x = self.norm2(x) + x = self.mlp(x) + x = self.drop_path(x) + x = h + x + + if self.downsample is not None: + x = self.downsample(x.transpose([0, 2, 1])).transpose([0, 2, 1]) + x = x + self.pos_embed + + return x + + +class HVT(nn.Layer): + def __init__(self, + image_size=224, + in_channels=3, + num_classes=1000, + patch_size=16, + embed_dim=384, + num_heads=3, + depth=12, + mlp_ratio=4, + pool_block_width=6, + pool_kernel_size=3, + qkv_bias=True, + dropout=0., + attention_dropout=0., + droppath=0.): + super().__init__() + self.num_classes = num_classes + # patch embedding + self.patch_embed = PatchEmbedding(image_size=image_size, + patch_size=patch_size, + in_channels=in_channels, + embed_dim=embed_dim) + # positional embedding + self.pos_embed = paddle.create_parameter( + shape=[1, self.patch_embed.num_patches, embed_dim], + dtype='float32', + default_initializer=nn.initializer.TruncatedNormal(std=.02)) + + self.pos_dropout = nn.Dropout(dropout) + self.num_patches = (image_size//patch_size)*(image_size//patch_size) + seq_len = self.num_patches + + self.layers = nn.LayerList([]) + + for i in range(depth): + if pool_block_width == 0: + downsample = None + elif i == 0 or i % pool_block_width == 0: + seq_len = math.floor((seq_len - pool_kernel_size) / 2 + 1) + downsample = nn.MaxPool1D(kernel_size=pool_kernel_size, stride=2) + else: + downsample = None + self.layers.append( + copy.deepcopy(EncoderLayer(seq_len, + dim=embed_dim, + num_heads=num_heads, + mlp_ratio=mlp_ratio, + downsample=downsample, + qkv_bias=qkv_bias, + attention_dropout=attention_dropout, + droppath=droppath))) + + self.norm = nn.LayerNorm(embed_dim, epsilon=1e-6) + + self.head = nn.Linear(embed_dim, num_classes, bias_attr=True) + + def forward_features(self, x): + x = self.patch_embed(x) + x = x + self.pos_embed + x = self.pos_dropout(x) + + for layer in self.layers: + x = layer(x) + + x = self.norm(x) + x = x.mean(axis=1) + + return x + + def forward(self, x): + x = self.forward_features(x) + x = self.head(x) + return x + + +def build_hvt(config): + """build hvt model using config""" + model = HVT(image_size=config.DATA.IMAGE_SIZE, + in_channels=config.MODEL.TRANS.IN_CHANNELS, + num_classes=config.MODEL.NUM_CLASSES, + 
patch_size=config.MODEL.TRANS.PATCH_SIZE, + embed_dim=config.MODEL.TRANS.EMBED_DIM, + num_heads=config.MODEL.TRANS.NUM_HEADS, + depth=config.MODEL.TRANS.DEPTH, + mlp_ratio=config.MODEL.TRANS.MLP_RATIO, + qkv_bias=config.MODEL.TRANS.QKV_BIAS, + dropout=config.MODEL.DROPOUT, + pool_block_width=config.MODEL.TRANS.POOL_BLOCK_WIDTH, + pool_kernel_size=config.MODEL.TRANS.POOL_KERNEL_SIZE, + attention_dropout=config.MODEL.ATTENTION_DROPOUT, + droppath=config.MODEL.DROPPATH) + return model + diff --git a/image_classification/HVT/losses.py b/image_classification/HVT/losses.py new file mode 100644 index 00000000..04377eac --- /dev/null +++ b/image_classification/HVT/losses.py @@ -0,0 +1,144 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" Implement Loss functions """ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class LabelSmoothingCrossEntropyLoss(nn.Layer): + """ cross entropy loss for label smoothing + Args: + smoothing: float, label smoothing rate + x: tensor, predictions (default is before softmax) with shape [N, num_classes] as default + target: tensor, target label with shape [N] as default + weight: tensor, optional, a manual rescaling weight given to each class + reduction: str, optional, indicate how to average the loss by batch_size, + default is ``'mean'``, the candicates are ``'none'`` | ``'mean'`` | ``'sum'`` + axis: int, optional, the index of dimension to perform softmax calculations, + default is ``-1``, if `axis` is not -1 -> the shape of x and target may not be default + use_softmax: bool, optional, if `use_softmax` is ``False``, ``x`` should be after softmax, + default is ``True``, the candicates are ``True`` | ``False`` + name: str, optional, the name of the operator, default is ``None``, + for more information, please refer to :ref:`api_guide_Name`. 
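+    For example, with smoothing=0.1 and 1000 classes the smoothed target assigns
+    0.9 + 0.1/1000 to the true class and 0.1/1000 to every other class
+    (paddle.nn.functional.label_smooth with its default uniform prior).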
+ Return: + loss: float, cross entropy loss value + """ + def __init__(self, + smoothing=0.1, + weight=None, + reduction='mean', + axis=-1, + use_softmax=True, + name=None): + super().__init__() + assert 0 <= smoothing < 1.0 + self.smoothing = smoothing + self.weight = weight + self.reduction = reduction + self.axis = axis + self.use_softmax = use_softmax + self.name = name + + def forward(self, x, target): + target = paddle.nn.functional.one_hot(target, num_classes=x.shape[1]) + target = paddle.nn.functional.label_smooth(target, epsilon=self.smoothing) + loss = paddle.nn.functional.cross_entropy( + x, + target, + weight=self.weight, + reduction=self.reduction, + soft_label=True, + axis=self.axis, + use_softmax=self.use_softmax, + name=self.name) + return loss + + +class SoftTargetCrossEntropyLoss(nn.Layer): + """ cross entropy loss for soft target + Args: + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, soft target with shape [N, num_classes] + Returns: + loss: float, the mean loss value + """ + def __init__(self): + super().__init__() + + def forward(self, x, target): + loss = paddle.sum(-target * F.log_softmax(x, axis=-1), axis=-1) + return loss.mean() + + +class DistillationLoss(nn.Layer): + """Distillation loss function + This layer includes the orginal loss (criterion) and a extra + distillation loss (criterion), which computes the loss with + different type options, between current model and + a teacher model as its supervision. + + Args: + base_criterion: nn.Layer, the original criterion + teacher_model: nn.Layer, the teacher model as supervision + distillation_type: str, one of ['none', 'soft', 'hard'] + alpha: float, ratio of base loss (* (1-alpha)) + and distillation loss( * alpha) + tao: float, temperature in distillation + """ + def __init__(self, + base_criterion, + teacher_model, + distillation_type, + alpha, + tau): + super().__init__() + assert distillation_type in ['none', 'soft', 'hard'] + self.base_criterion = base_criterion + self.teacher_model = teacher_model + self.type = distillation_type + self.alpha = alpha + self.tau = tau + + def forward(self, inputs, outputs, targets): + """ + Args: + inputs: tensor, the orginal model inputs + outputs: tensor, the outputs of the model + outputds_kd: tensor, the distillation outputs of the model, + this is usually obtained by a separate branch + in the last layer of the model + targets: tensor, the labels for the base criterion + """ + outputs, outputs_kd = outputs[0], outputs[1] + base_loss = self.base_criterion(outputs, targets) + if self.type == 'none': + return base_loss + + with paddle.no_grad(): + teacher_outputs = self.teacher_model(inputs) + + if self.type == 'soft': + distillation_loss = F.kl_div( + F.log_softmax(outputs_kd / self.tau, axis=1), + F.log_softmax(teacher_outputs / self.tau, axis=1), + reduction='sum') * (self.tau * self.tau) / outputs_kd.numel() + elif self.type == 'hard': + distillation_loss = F.cross_entropy(outputs_kd, teacher_outputs.argmax(axis=1)) + + loss = base_loss * (1 - self.alpha) + distillation_loss * self.alpha + return loss + + diff --git a/image_classification/HVT/main_multi_gpu.py b/image_classification/HVT/main_multi_gpu.py new file mode 100644 index 00000000..e2c04793 --- /dev/null +++ b/image_classification/HVT/main_multi_gpu.py @@ -0,0 +1,606 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""HVT training/validation using multiple GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +import paddle.distributed as dist +from datasets import get_dataloader +from datasets import get_dataset +from utils import AverageMeter +from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn +from config import get_config +from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from model_ema import ModelEma +from hvt import build_hvt as build_model + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('HVT') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + model_ema=None, + mixup_fn=None, + amp=False, + local_logger=None, + master_logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + model_ema: ModelEma, model moving average instance + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: 
False + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + master_train_loss_meter.avg: float, average loss on all processes/gpus + master_train_acc_meter.avg: float, average top1 accuracy on all processes/gpus + train_time: float, training time + """ + model.train() + train_loss_meter = AverageMeter() + train_acc_meter = AverageMeter() + master_train_loss_meter = AverageMeter() + master_train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: + # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) # output[0]: class_token, output[1]: distill_token + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: + # full precision training + output = model(image) + loss = criterion(output, label) + loss.backward() + + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + if model_ema is not None and dist.get_rank() == 0: + model_ema.update(model) + + # average of output and kd_output, like model eval mode + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = paddle.to_tensor(image.shape[0]) + + # sync from other gpus for overall loss and acc + master_loss = loss.clone() + master_acc = acc.clone() + master_batch_size = batch_size.clone() + dist.all_reduce(master_loss) + dist.all_reduce(master_acc) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc = master_acc / dist.get_world_size() + master_train_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_train_acc_meter.update(master_acc.numpy()[0], master_batch_size.numpy()[0]) + + train_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + train_acc_meter.update(acc.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_train_loss_meter.avg:.4f}, " + + f"Avg Acc: {master_train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return (train_loss_meter.avg, + train_acc_meter.avg, + master_train_loss_meter.avg, + master_train_acc_meter.avg, + train_time) + + +def validate(dataloader, + model, + criterion, + total_batch, + debug_steps=100, + local_logger=None, + master_logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + total_epoch: int, total num 
of epoch, for logging + debug_steps: int, num of iters to log info, default: 100 + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + master_val_loss_meter.avg: float, average loss on all processes/gpus + master_val_acc1_meter.avg: float, average top1 accuracy on all processes/gpus + master_val_acc5_meter.avg: float, average top5 accuracy on all processes/gpus + val_time: float, validation time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = AverageMeter() + master_val_loss_meter = AverageMeter() + master_val_acc1_meter = AverageMeter() + master_val_acc5_meter = AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = paddle.to_tensor(image.shape[0]) + + master_loss = loss.clone() + master_acc1 = acc1.clone() + master_acc5 = acc5.clone() + master_batch_size = batch_size.clone() + + dist.all_reduce(master_loss) + dist.all_reduce(master_acc1) + dist.all_reduce(master_acc5) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc1 = master_acc1 / dist.get_world_size() + master_acc5 = master_acc5 / dist.get_world_size() + + master_val_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc1_meter.update(master_acc1.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc5_meter.update(master_acc5.numpy()[0], master_batch_size.numpy()[0]) + + val_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + val_acc1_meter.update(acc1.numpy()[0], batch_size.numpy()[0]) + val_acc5_meter.update(acc5.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {master_val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {master_val_acc5_meter.avg:.4f}") + val_time = time.time() - time_st + return (val_loss_meter.avg, + val_acc1_meter.avg, + val_acc5_meter.avg, + master_val_loss_meter.avg, + master_val_acc1_meter.avg, + master_val_acc5_meter.avg, + val_time) + + +def main_worker(*args): + # STEP 0: Preparation + config = args[0] + dist.init_parallel_env() + last_epoch = config.TRAIN.LAST_EPOCH + world_size = dist.get_world_size() + local_rank = dist.get_rank() + seed = config.SEED + local_rank + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + # logger for each process/gpu + local_logger = get_logger( + filename=os.path.join(config.SAVE, 'log_{}.txt'.format(local_rank)), + logger_name='local_logger') + # overall logger + if local_rank == 0: + master_logger = get_logger( + filename=os.path.join(config.SAVE, 'log.txt'), + logger_name='master_logger') + 
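+        # NOTE: every process writes its own log_{rank}.txt via local_logger; only rank 0
+        # additionally keeps log.txt (master_logger) with metrics all-reduced across GPUs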
master_logger.info(f'\n{config}') + else: + master_logger = None + local_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + if local_rank == 0: + master_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + + # STEP 1: Create model + model = build_model(config) + # define model ema + model_ema = None + if not config.EVAL and config.TRAIN.MODEL_EMA and local_rank == 0: + model_ema = ModelEma(model, decay=config.TRAIN.MODEL_EMA_DECAY) + model = paddle.DataParallel(model) + + # STEP 2: Create train and val dataloader + dataset_train, dataset_val = args[1], args[2] + # Create training dataloader + if not config.EVAL: + dataloader_train = get_dataloader(config, dataset_train, 'train', True) + total_batch_train = len(dataloader_train) + local_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + if local_rank == 0: + master_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + # Create validation dataloader + dataloader_val = get_dataloader(config, dataset_val, 'test', True) + total_batch_val = len(dataloader_val) + local_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + if local_rank == 0: + master_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH, + ) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + 
scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + local_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip, + #apply_decay_param_fun=get_exclude_from_weight_decay_fn([ + # 'absolute_pos_embed', 'relative_position_bias_table']), + ) + else: + local_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 6: Load pretrained model / load resumt model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + model.set_dict(model_state) + local_logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + if local_rank == 0: + master_logger.info( + f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') + optimizer.set_state_dict(opt_state) + local_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + if local_rank == 0: + master_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + # load ema model + if model_ema is not None and os.path.isfile(config.MODEL.RESUME + '-EMA.pdparams'): + model_ema_state = paddle.load(config.MODEL.RESUME + '-EMA.pdparams') + model_ema.module.set_state_dict(model_ema_state) + local_logger.info(f'----- Load model ema from {config.MODEL.RESUME}-EMA.pdparams') + if local_rank == 0: + 
master_logger.info(f'----- Load model ema from {config.MODEL.RESUME}-EMA.pdparams') + + # STEP 7: Validation (eval mode) + if config.EVAL: + local_logger.info('----- Start Validating') + if local_rank == 0: + master_logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 8: Start training and validation (train mode) + local_logger.info(f"Start training from epoch {last_epoch+1}.") + if local_rank == 0: + master_logger.info(f"Start training from epoch {last_epoch+1}.") + for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + local_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + if local_rank == 0: + master_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + train_loss, train_acc, avg_loss, avg_acc, train_time = train( + dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=total_batch_train, + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + model_ema=model_ema, + mixup_fn=mixup_fn, + amp=config.AMP, + local_logger=local_logger, + master_logger=master_logger) + + scheduler.step() + + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {avg_loss:.4f}, " + + f"Train Acc: {avg_acc:.4f}, " + + f"time: {train_time:.2f}") + + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + local_logger.info(f'----- Validation after Epoch: {epoch}') + if local_rank == 0: + master_logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if local_rank == 0: + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + 
'.pdopt') + master_logger.info(f"----- Save model: {model_path}.pdparams") + master_logger.info(f"----- Save optim: {model_path}.pdopt") + if model_ema is not None: + model_ema_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}-EMA") + paddle.save(model_ema.state_dict(), model_ema_path + '.pdparams') + master_logger.info(f"----- Save ema model: {model_ema_path}.pdparams") + + +def main(): + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + + # get dataset and start DDP + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + else: + dataset_train = None + dataset_val = get_dataset(config, mode='val') + config.NGPUS = len(paddle.static.cuda_places()) if config.NGPUS == -1 else config.NGPUS + dist.spawn(main_worker, args=(config, dataset_train, dataset_val, ), nprocs=config.NGPUS) + + +if __name__ == "__main__": + main() diff --git a/image_classification/HVT/main_single_gpu.py b/image_classification/HVT/main_single_gpu.py new file mode 100644 index 00000000..02b91ddd --- /dev/null +++ b/image_classification/HVT/main_single_gpu.py @@ -0,0 +1,448 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
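+# Example usage (illustrative; the exact yaml name and dataset string depend on the configs
+# shipped with this model, e.g. ./configs/hvt_s0_patch16_224.yaml as referenced by the
+# weight-porting script, and on the dataset names accepted by get_dataset):
+#   train: python main_single_gpu.py -cfg ./configs/hvt_s0_patch16_224.yaml \
+#              -dataset imagenet2012 -data_path /path/to/ILSVRC2012 -batch_size 32
+#   eval:  python main_single_gpu.py -cfg ./configs/hvt_s0_patch16_224.yaml \
+#              -dataset imagenet2012 -data_path /path/to/ILSVRC2012 \
+#              -eval -pretrained /path/to/hvt_weights   # path given without the .pdparams suffix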
+ +"""HVT training/validation using single GPU """ + +import sys +import os +import time +import logging +import copy +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from datasets import get_dataloader +from datasets import get_dataset +from utils import AverageMeter +from utils import WarmupCosineScheduler +from config import get_config +from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from hvt import build_hvt as build_model +from model_ema import ModelEma + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('HVT') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + model_ema=None, + mixup_fn=None, + amp=False, + logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + model_ema: ModelEma, model moving average instance + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + logger: logger for logging, default: None + Returns: + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + train_time: float, training time + """ + model.train() + train_loss_meter = AverageMeter() + train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] 
+ label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: + # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) # output[0]: class_token, output[1]: distill_token + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: + # full precision training + output = model(image) + loss = criterion(output, label) + loss.backward() + + if ((batch_id + 1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + if model_ema is not None: + model_ema.update(model) + + # average of output and kd_output, like model eval mode + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = image.shape[0] + train_loss_meter.update(loss.numpy()[0], batch_size) + train_acc_meter.update(acc.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return train_loss_meter.avg, train_acc_meter.avg, train_time + + +def validate(dataloader, model, criterion, total_batch, debug_steps=100, logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + logger: logger for logging, default: None + Returns: + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + val_time: float, valitaion time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = image.shape[0] + val_loss_meter.update(loss.numpy()[0], batch_size) + val_acc1_meter.update(acc1.numpy()[0], batch_size) + val_acc5_meter.update(acc5.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + + val_time = time.time() - time_st + return val_loss_meter.avg, val_acc1_meter.avg, val_acc5_meter.avg, val_time + + +def main(): + # STEP 0: Preparation + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + # set output folder + + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = 
'{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + last_epoch = config.TRAIN.LAST_EPOCH + seed = config.SEED + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + logger = get_logger(filename=os.path.join(config.SAVE, 'log.txt')) + logger.info(f'\n{config}') + + # STEP 1: Create model + model = build_model(config) + # define model ema + model_ema = None + if not config.EVAL and config.TRAIN.MODEL_EMA: + model_ema = ModelEma(model, decay=config.TRAIN.MODEL_EMA_DECAY) + + # STEP 2: Create train and val dataloader + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + dataloader_train = get_dataloader(config, dataset_train, 'train', False) + dataset_val = get_dataset(config, mode='val') + dataloader_val = get_dataloader(config, dataset_val, 'val', False) + + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from official code) + + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH, + ) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + logger.fatal(f"Unsupported 
Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip + ) + else: + logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 7: Load pretrained model or load resume model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + model.set_dict(model_state) + logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME + '.pdopt') + optimizer.set_state_dict(opt_state) + logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + # load ema model + if model_ema is not None and os.path.isfile(config.MODEL.RESUME + '-EMA.pdparams'): + model_ema_state = paddle.load(config.MODEL.RESUME + '-EMA.pdparams') + model_ema.module.set_state_dict(model_ema_state) + logger.info(f'----- Load model ema from {config.MODEL.RESUME}-EMA.pdparams') + + # STEP 8: Validation (eval mode) + if config.EVAL: + logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 9: Start training and validation (train mode) + logger.info(f"Start training from epoch {last_epoch+1}.") + for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + logger.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") + train_loss, train_acc, train_time = train(dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=len(dataloader_train), + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + model_ema=model_ema, + mixup_fn=mixup_fn, + amp=config.AMP, + logger=logger) + scheduler.step() + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + logger.info(f"----- Save model: {model_path}.pdparams") + logger.info(f"----- Save optim: {model_path}.pdopt") + if model_ema is not None: + model_ema_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}-EMA") + paddle.save(model_ema.state_dict(), model_ema_path + '.pdparams') + logger.info(f"----- Save ema model: {model_ema_path}.pdparams") + + +if __name__ == "__main__": + main() diff --git a/image_classification/HVT/mixup.py b/image_classification/HVT/mixup.py new file mode 100644 index 00000000..1d2db493 --- /dev/null +++ b/image_classification/HVT/mixup.py @@ -0,0 +1,225 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""mixup and cutmix for batch data""" +import numpy as np +import paddle + + +def rand_bbox(image_shape, lam, count=None): + """ CutMix bbox by lam value + Generate 1 random bbox by value lam. lam is the cut size rate. + The cut_size is computed by sqrt(1-lam) * image_size. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + count: int, number of bbox to generate + """ + image_h, image_w = image_shape[-2:] + cut_rate = np.sqrt(1. 
- lam) + cut_h = int(cut_rate * image_h) + cut_w = int(cut_rate * image_w) + + # get random bbox center + cy = np.random.randint(0, image_h, size=count) + cx = np.random.randint(0, image_w, size=count) + + # get bbox coords + bbox_x1 = np.clip(cx - cut_w // 2, 0, image_w) + bbox_y1 = np.clip(cy - cut_h // 2, 0, image_h) + bbox_x2 = np.clip(cx + cut_w // 2, 0, image_w) + bbox_y2 = np.clip(cy + cut_h // 2, 0, image_h) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # while in pytorch, it will return [] tensor + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def rand_bbox_minmax(image_shape, minmax, count=None): + """ CutMix bbox by min and max value + Generate 1 random bbox by min and max percentage values. + Minmax is a tuple/list of min and max percentage vlaues + applied to the image width and height. + + Args: + image_shape: tuple/list, image height and width + minmax: tuple/list, min and max percentage values of image size + count: int, number of bbox to generate + """ + assert len(minmax) == 2 + image_h, image_w = image_shape[-2:] + min_ratio = minmax[0] + max_ratio = minmax[1] + cut_h = np.random.randint(int(image_h * min_ratio), int(image_h * max_ratio), size=count) + cut_w = np.random.randint(int(image_w * min_ratio), int(image_w * max_ratio), size=count) + + bbox_x1 = np.random.randint(0, image_w - cut_w, size=count) + bbox_y1 = np.random.randint(0, image_h - cut_h, size=count) + bbox_x2 = bbox_x1 + cut_w + bbox_y2 = bbox_y1 + cut_h + + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def cutmix_generate_bbox_adjust_lam(image_shape, lam, minmax=None, correct_lam=True, count=None): + """Generate bbox and apply correction for lambda + If the mimmax is None, apply the standard cutmix by lam value, + If the minmax is set, apply the cutmix by min and max percentage values. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + minmax: tuple/list, min and max percentage values of image size + correct_lam: bool, if True, correct the lam value by the generated bbox + count: int, number of bbox to generate + """ + if minmax is not None: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox_minmax(image_shape, minmax, count) + else: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox(image_shape, lam, count) + + if correct_lam or minmax is not None: + image_h, image_w = image_shape[-2:] + bbox_area = (bbox_y2 - bbox_y1) * (bbox_x2 - bbox_x1) + lam = 1. - bbox_area / float(image_h * image_w) + return (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam + + +def one_hot(x, num_classes, on_value=1., off_value=0.): + """ Generate one-hot vector for label smoothing + Args: + x: tensor, contains label/class indices + num_classes: int, num of classes (len of the one-hot vector) + on_value: float, the vector value at label index, default=1. + off_value: float, the vector value at non-label indices, default=0. + Returns: + one_hot: tensor, tensor with on value at label index and off value + at non-label indices. + """ + x = x.reshape_([-1, 1]) + x_smoothed = paddle.full((x.shape[0], num_classes), fill_value=off_value) + for i in range(x.shape[0]): + x_smoothed[i, x[i]] = on_value + return x_smoothed + + +def mixup_one_hot(label, num_classes, lam=1., smoothing=0.): + """ mixup and label smoothing in batch + label smoothing is firstly applied, then + mixup is applied by mixing the bacth and its flip, + with a mixup rate. 
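+    For example, with lam=0.7 and smoothing=0., each target becomes
+    0.7 * one_hot(label) + 0.3 * one_hot(label flipped along the batch axis).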
+ + Args: + label: tensor, label tensor with shape [N], contains the class indices + num_classes: int, num of all classes + lam: float, mixup rate, default=1.0 + smoothing: float, label smoothing rate + """ + off_value = smoothing / num_classes + on_value = 1. - smoothing + off_value + y1 = one_hot(label, num_classes, on_value, off_value) + y2 = one_hot(label.flip(axis=[0]), num_classes, on_value, off_value) + return y2 * (1 - lam) + y1 * lam + + +class Mixup: + """Mixup class + Args: + mixup_alpha: float, mixup alpha for beta distribution, default=1.0, + cutmix_alpha: float, cutmix alpha for beta distribution, default=0.0, + cutmix_minmax: list/tuple, min and max value for cutmix ratio, default=None, + prob: float, if random prob < prob, do not use mixup, default=1.0, + switch_prob: float, prob of switching mixup and cutmix, default=0.5, + mode: string, mixup up, now only 'batch' is supported, default='batch', + correct_lam: bool, if True, apply correction of lam, default=True, + label_smoothing: float, label smoothing rate, default=0.1, + num_classes: int, num of classes, default=1000 + """ + def __init__(self, + mixup_alpha=1.0, + cutmix_alpha=0.0, + cutmix_minmax=None, + prob=1.0, + switch_prob=0.5, + mode='batch', + correct_lam=True, + label_smoothing=0.1, + num_classes=1000): + self.mixup_alpha = mixup_alpha + self.cutmix_alpha = cutmix_alpha + self.cutmix_minmax = cutmix_minmax + if cutmix_minmax is not None: + assert len(cutmix_minmax) == 2 + self.cutmix_alpha = 1.0 + self.mix_prob = prob + self.switch_prob = switch_prob + self.label_smoothing = label_smoothing + self.num_classes = num_classes + self.mode = mode + self.correct_lam = correct_lam + assert mode == 'batch', 'Now only batch mode is supported!' + + def __call__(self, x, target): + assert x.shape[0] % 2 == 0, "Batch size should be even" + lam = self._mix_batch(x) + target = mixup_one_hot(target, self.num_classes, lam, self.label_smoothing) + return x, target + + def get_params(self): + """Decide to use cutmix or regular mixup by sampling and + sample lambda for mixup + """ + lam = 1. + use_cutmix = False + use_mixup = np.random.rand() < self.mix_prob + if use_mixup: + if self.mixup_alpha > 0. and self.cutmix_alpha > 0.: + use_cutmix = np.random.rand() < self.switch_prob + alpha = self.cutmix_alpha if use_cutmix else self.mixup_alpha + lam_mix = np.random.beta(alpha, alpha) + elif self.mixup_alpha == 0. and self.cutmix_alpha > 0.: + use_cutmix=True + lam_mix = np.random.beta(self.cutmix_alpha, self.cutmix_alpha) + elif self.mixup_alpha > 0. 
and self.cutmix_alpha == 0.: + lam_mix = np.random.beta(self.mixup_alpha, self.mixup_alpha) + else: + raise ValueError('mixup_alpha and cutmix_alpha cannot be all 0') + lam = float(lam_mix) + return lam, use_cutmix + + def _mix_batch(self, x): + """mixup/cutmix by adding batch data and its flipped version""" + lam, use_cutmix = self.get_params() + if lam == 1.: + return lam + if use_cutmix: + (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam = cutmix_generate_bbox_adjust_lam( + x.shape, + lam, + minmax=self.cutmix_minmax, + correct_lam=self.correct_lam) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # but in pytorch, it will return [] tensor without errors + if int(bbox_x1) != int(bbox_x2) and int(bbox_y1) != int(bbox_y2): + x[:, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] = x.flip(axis=[0])[ + :, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] + else: + x_flipped = x.flip(axis=[0]) + x_flipped = x_flipped * (1 - lam) + x.set_value(x * (lam) + x_flipped) + return lam diff --git a/image_classification/HVT/model_ema.py b/image_classification/HVT/model_ema.py new file mode 100644 index 00000000..8a636765 --- /dev/null +++ b/image_classification/HVT/model_ema.py @@ -0,0 +1,61 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" Implement the Exponential Model Averaging +This is paddle hack from: +https://github.com/rwightman/pytorch-image-models/blob/master/timm/utils/model_ema.py +""" + +import copy +from collections import OrderedDict +import paddle +import paddle.nn as nn + + +class ModelEma: + """Model Ema + A moving average is kept of model weights and buffers. + Note that for multiple gpu, ema must be defined after mode init, + but before DataParallel. 
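+    E.g. in main_multi_gpu.py the ema is created as ModelEma(model, decay=config.TRAIN.MODEL_EMA_DECAY)
+    right after the model is built and before paddle.DataParallel(model), and
+    model_ema.update(model) is called after each training step (rank 0 only).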
+ + Args: + model: nn.Layer, original modela with learnable params + decay: float, decay rate for each update, default: 0.999 + """ + def __init__(self, model, decay=0.999): + self.module = copy.deepcopy(model) + self.module.eval() + self.decay = decay + + @paddle.no_grad() + def _update(self, model, update_fn): + # update ema model parameters by model parameters + for (_, ema_param), (_, model_param) in zip( + self.module.named_parameters(), model.named_parameters()): + ema_param.set_value(copy.deepcopy(update_fn(ema_param, model_param))) + + # update ema model buffers by model buffers + for (_, ema_buf), (_, model_buf) in zip( + self.module.named_buffers(), model.named_buffers()): + ema_buf.set_value(copy.deepcopy(update_fn(ema_buf, model_buf))) + + def update(self, model): + self._update(model, update_fn=lambda e, m: self.decay * e + (1 - self.decay) * m) + + def set(self, model): + self._update(model, update_fn=lambda e, m: m) + + def state_dict(self): + return self.module.state_dict() + diff --git a/image_classification/HVT/port_weights/load_hvt_s_0_pytorch_weights.py b/image_classification/HVT/port_weights/load_hvt_s_0_pytorch_weights.py new file mode 100644 index 00000000..3fa17ac5 --- /dev/null +++ b/image_classification/HVT/port_weights/load_hvt_s_0_pytorch_weights.py @@ -0,0 +1,173 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
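+
+# Porting note: this script expects the PyTorch HVT reference implementation to be
+# importable as `model_th.hvt_model` and the checkpoint file 'hvt_s_0.pth' to be
+# present in the working directory. When weights are copied, 2-D tensors are
+# transposed because paddle.nn.Linear stores its weight as
+# [in_features, out_features] while torch.nn.Linear uses [out_features, in_features].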
+ +import argparse +import numpy as np +import paddle +import torch +import timm +from hvt import build_hvt +from model_th import hvt_model +import os +from config import * +import json + +model_name = 'hvt_s0_patch16_224' +sz = int(model_name[-3::]) + +config = get_config(f'./configs/{model_name}.yaml') + + +def print_model_named_params(model): + print('----------------------------------') + for name, param in model.named_parameters(): + print(name, param.shape) + print('----------------------------------') + + +def print_model_named_buffers(model): + print('----------------------------------') + for name, param in model.named_buffers(): + print(name, param.shape) + print('----------------------------------') + + +def torch_to_paddle_mapping(): + mapping = [ + ('pos_embed', 'pos_embed'), + ('patch_embed.proj', f'patch_embed.proj') + ] + + num_layers = config.MODEL.TRANS.DEPTH + for idx in range(num_layers): + th_prefix = f'blocks.{idx}' + pp_prefix = f'layers.{idx}' + layer_mapping = [ + (f'{th_prefix}.norm1', f'{pp_prefix}.norm1'), + (f'{th_prefix}.attn.qkv', f'{pp_prefix}.attn.qkv'), + (f'{th_prefix}.attn.proj', f'{pp_prefix}.attn.proj'), + (f'{th_prefix}.norm2', f'{pp_prefix}.norm2'), + (f'{th_prefix}.mlp.fc1', f'{pp_prefix}.mlp.fc1'), + (f'{th_prefix}.mlp.fc2', f'{pp_prefix}.mlp.fc2') + ] + mapping.extend(layer_mapping) + + head_mapping = [ + ('norm', 'norm'), + ('head', 'head'), + ] + mapping.extend(head_mapping) + + return mapping + + + +def convert(torch_model, paddle_model): + def _set_value(th_name, pd_name): + th_shape = th_params[th_name].shape + pd_shape = tuple(pd_params[pd_name].shape) # paddle shape default type is list + #assert th_shape == pd_shape, f'{th_shape} != {pd_shape}' + print(f'set {th_name} {th_shape} to {pd_name} {pd_shape}') + value = th_params[th_name].cpu().data.numpy() + if len(value.shape) == 2: + value = value.transpose((1, 0)) + pd_params[pd_name].set_value(value) + + # 1. get paddle and torch model parameters + pd_params = {} + th_params = {} + for name, param in paddle_model.named_parameters(): + pd_params[name] = param + for name, param in paddle_model.named_buffers(): + pd_params[name] = param + + for name, param in torch_model.named_parameters(): + th_params[name] = param + for name, param in torch_model.named_buffers(): + th_params[name] = param + + # 2. get name mapping pairs + mapping = torch_to_paddle_mapping() + # 3. 
set torch param values to paddle params: may needs transpose on weights + for th_name, pd_name in mapping: + if th_name in th_params.keys(): # nn.Parameters + _set_value(th_name, pd_name) + else: # weight & bias + th_name_w = f'{th_name}.weight' + pd_name_w = f'{pd_name}.weight' + _set_value(th_name_w, pd_name_w) + + th_name_b = f'{th_name}.bias' + pd_name_b = f'{pd_name}.bias' + _set_value(th_name_b, pd_name_b) + + return paddle_model + + +def main(): + paddle.set_device('cpu') + paddle_model = build_hvt(config) + paddle_model.eval() + + # print(paddle_model) + # print_model_named_params(paddle_model) + # print_model_named_buffers(paddle_model) + + print('+++++++++++++++++++++++++++++++++++') + device = torch.device('cpu') + torch_model = hvt_model() + # print(torch_model) + print_model_named_params(torch_model) + print_model_named_buffers(torch_model) + + checkpoint = torch.load('hvt_s_0.pth')['model'] + for sub_item in checkpoint: + print(sub_item) + + torch_model.load_state_dict(checkpoint) + torch_model = torch_model.to(device) + torch_model.eval() + + + print('+++++++++++++++++++++++++++++++++++') + + # convert weights + paddle_model = convert(torch_model, paddle_model) + + # check correctness + x = np.random.randn(2, 3, sz, sz).astype('float32') + x_paddle = paddle.to_tensor(x) + x_torch = torch.Tensor(x).to(device) + + out_paddle = paddle_model(x_paddle) + out_paddle = out_paddle.cpu().numpy() + + out_torch = torch_model(x_torch) + out_torch = out_torch.detach().cpu().numpy() + + # for out_paddle,out_torch in zip(out_paddle_np,out_torch_np): + out_diff = np.allclose(out_torch, out_paddle, atol=1e-5) + print(out_diff) + print(np.sum(out_torch), np.sum(out_paddle)) + + assert np.allclose(out_torch, out_paddle, atol=1e-5) + + # save weights for paddle model + model_path = os.path.join(f'./{model_name}.pdparams') + paddle.save(paddle_model.state_dict(), model_path) + print('all done') + + +if __name__ == "__main__": + main() diff --git a/image_classification/HVT/port_weights/load_hvt_s_2_pytorch_weights.py b/image_classification/HVT/port_weights/load_hvt_s_2_pytorch_weights.py new file mode 100644 index 00000000..1234c66f --- /dev/null +++ b/image_classification/HVT/port_weights/load_hvt_s_2_pytorch_weights.py @@ -0,0 +1,174 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import numpy as np +import paddle +import torch +import timm +from hvt import build_hvt +from model_th import hvt_model +import os +from config import * +import json + +model_name = 'hvt_s2_patch16_224' +sz = int(model_name[-3::]) + +config = get_config(f'./configs/{model_name}.yaml') + + +def print_model_named_params(model): + print('----------------------------------') + for name, param in model.named_parameters(): + print(name, param.shape) + print('----------------------------------') + + +def print_model_named_buffers(model): + print('----------------------------------') + for name, param in model.named_buffers(): + print(name, param.shape) + print('----------------------------------') + + +def torch_to_paddle_mapping(): + mapping = [ + ('pos_embed', 'pos_embed'), + ('patch_embed.proj', f'patch_embed.proj'), + ('blocks.0.pos_embed', f'layers.0.pos_embed'), + ('blocks.6.pos_embed', f'layers.6.pos_embed') + ] + + num_layers = config.MODEL.TRANS.DEPTH + for idx in range(num_layers): + th_prefix = f'blocks.{idx}' + pp_prefix = f'layers.{idx}' + layer_mapping = [ + (f'{th_prefix}.norm1', f'{pp_prefix}.norm1'), + (f'{th_prefix}.attn.qkv', f'{pp_prefix}.attn.qkv'), + (f'{th_prefix}.attn.proj', f'{pp_prefix}.attn.proj'), + (f'{th_prefix}.norm2', f'{pp_prefix}.norm2'), + (f'{th_prefix}.mlp.fc1', f'{pp_prefix}.mlp.fc1'), + (f'{th_prefix}.mlp.fc2', f'{pp_prefix}.mlp.fc2') + ] + mapping.extend(layer_mapping) + + head_mapping = [ + ('norm', 'norm'), + ('head', 'head'), + ] + mapping.extend(head_mapping) + + return mapping + + + +def convert(torch_model, paddle_model): + def _set_value(th_name, pd_name): + th_shape = th_params[th_name].shape + pd_shape = tuple(pd_params[pd_name].shape) # paddle shape default type is list + #assert th_shape == pd_shape, f'{th_shape} != {pd_shape}' + print(f'set {th_name} {th_shape} to {pd_name} {pd_shape}') + value = th_params[th_name].cpu().data.numpy() + if len(value.shape) == 2: + value = value.transpose((1, 0)) + pd_params[pd_name].set_value(value) + + # 1. get paddle and torch model parameters + pd_params = {} + th_params = {} + for name, param in paddle_model.named_parameters(): + pd_params[name] = param + for name, param in paddle_model.named_buffers(): + pd_params[name] = param + + for name, param in torch_model.named_parameters(): + th_params[name] = param + for name, param in torch_model.named_buffers(): + th_params[name] = param + + # 2. get name mapping pairs + mapping = torch_to_paddle_mapping() + # 3. 
set torch param values to paddle params: may needs transpose on weights + for th_name, pd_name in mapping: + if th_name in th_params.keys(): # nn.Parameters + _set_value(th_name, pd_name) + else: # weight & bias + th_name_w = f'{th_name}.weight' + pd_name_w = f'{pd_name}.weight' + _set_value(th_name_w, pd_name_w) + + th_name_b = f'{th_name}.bias' + pd_name_b = f'{pd_name}.bias' + _set_value(th_name_b, pd_name_b) + + return paddle_model + + +def main(): + paddle.set_device('cpu') + paddle_model = build_hvt(config) + paddle_model.eval() + + # print(paddle_model) + # print_model_named_params(paddle_model) + # print_model_named_buffers(paddle_model) + + print('+++++++++++++++++++++++++++++++++++') + device = torch.device('cpu') + torch_model = hvt_model() + # print(torch_model) + # print_model_named_params(torch_model) + # print_model_named_buffers(torch_model) + + checkpoint = torch.load('hvt_s_2.pth')['model'] + for sub_item in checkpoint: + print(sub_item) + + torch_model.load_state_dict(checkpoint) + torch_model = torch_model.to(device) + torch_model.eval() + + print('+++++++++++++++++++++++++++++++++++') + + # convert weights + paddle_model = convert(torch_model, paddle_model) + + # check correctness + x = np.random.randn(2, 3, sz, sz).astype('float32') + x_paddle = paddle.to_tensor(x) + x_torch = torch.Tensor(x).to(device) + + out_paddle = paddle_model(x_paddle) + out_paddle = out_paddle.cpu().numpy() + + out_torch = torch_model(x_torch) + out_torch = out_torch.detach().cpu().numpy() + + # for out_paddle,out_torch in zip(out_paddle_np,out_torch_np): + out_diff = np.allclose(out_torch, out_paddle, atol=1e-5) + print(out_diff) + print(np.sum(out_torch), np.sum(out_paddle)) + + assert np.allclose(out_torch, out_paddle, atol=1e-5) + + # save weights for paddle model + model_path = os.path.join(f'./{model_name}.pdparams') + paddle.save(paddle_model.state_dict(), model_path) + print('all done') + + +if __name__ == "__main__": + main() diff --git a/image_classification/HVT/port_weights/load_hvt_s_3_pytorch_weights.py b/image_classification/HVT/port_weights/load_hvt_s_3_pytorch_weights.py new file mode 100644 index 00000000..74c75413 --- /dev/null +++ b/image_classification/HVT/port_weights/load_hvt_s_3_pytorch_weights.py @@ -0,0 +1,176 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import numpy as np +import paddle +import torch +import timm +from hvt import build_hvt +from model_th import hvt_model +import os +from config import * +import json + +model_name = 'hvt_s3_patch16_224' +sz = int(model_name[-3::]) + +config = get_config(f'./configs/{model_name}.yaml') + + +def print_model_named_params(model): + print('----------------------------------') + for name, param in model.named_parameters(): + print(name, param.shape) + print('----------------------------------') + + +def print_model_named_buffers(model): + print('----------------------------------') + for name, param in model.named_buffers(): + print(name, param.shape) + print('----------------------------------') + + +def torch_to_paddle_mapping(): + mapping = [ + ('pos_embed', 'pos_embed'), + ('patch_embed.proj', f'patch_embed.proj'), + ('blocks.0.pos_embed', f'layers.0.pos_embed'), + ('blocks.4.pos_embed', f'layers.4.pos_embed'), + ('blocks.8.pos_embed', f'layers.8.pos_embed') + ] + + num_layers = config.MODEL.TRANS.DEPTH + for idx in range(num_layers): + th_prefix = f'blocks.{idx}' + pp_prefix = f'layers.{idx}' + layer_mapping = [ + (f'{th_prefix}.norm1', f'{pp_prefix}.norm1'), + (f'{th_prefix}.attn.qkv', f'{pp_prefix}.attn.qkv'), + (f'{th_prefix}.attn.proj', f'{pp_prefix}.attn.proj'), + (f'{th_prefix}.norm2', f'{pp_prefix}.norm2'), + (f'{th_prefix}.mlp.fc1', f'{pp_prefix}.mlp.fc1'), + (f'{th_prefix}.mlp.fc2', f'{pp_prefix}.mlp.fc2') + ] + mapping.extend(layer_mapping) + + head_mapping = [ + ('norm', 'norm'), + ('head', 'head'), + ] + mapping.extend(head_mapping) + + return mapping + + + +def convert(torch_model, paddle_model): + def _set_value(th_name, pd_name): + th_shape = th_params[th_name].shape + pd_shape = tuple(pd_params[pd_name].shape) # paddle shape default type is list + #assert th_shape == pd_shape, f'{th_shape} != {pd_shape}' + print(f'set {th_name} {th_shape} to {pd_name} {pd_shape}') + value = th_params[th_name].cpu().data.numpy() + if len(value.shape) == 2: + value = value.transpose((1, 0)) + pd_params[pd_name].set_value(value) + + # 1. get paddle and torch model parameters + pd_params = {} + th_params = {} + for name, param in paddle_model.named_parameters(): + pd_params[name] = param + for name, param in paddle_model.named_buffers(): + pd_params[name] = param + + for name, param in torch_model.named_parameters(): + th_params[name] = param + for name, param in torch_model.named_buffers(): + th_params[name] = param + + # 2. get name mapping pairs + mapping = torch_to_paddle_mapping() + # 3. 
set torch param values to paddle params: may needs transpose on weights + for th_name, pd_name in mapping: + if th_name in th_params.keys(): # nn.Parameters + _set_value(th_name, pd_name) + else: # weight & bias + th_name_w = f'{th_name}.weight' + pd_name_w = f'{pd_name}.weight' + _set_value(th_name_w, pd_name_w) + + th_name_b = f'{th_name}.bias' + pd_name_b = f'{pd_name}.bias' + _set_value(th_name_b, pd_name_b) + + return paddle_model + + +def main(): + paddle.set_device('cpu') + paddle_model = build_hvt(config) + paddle_model.eval() + + # print(paddle_model) + # print_model_named_params(paddle_model) + # print_model_named_buffers(paddle_model) + + print('+++++++++++++++++++++++++++++++++++') + device = torch.device('cpu') + torch_model = hvt_model() + # print(torch_model) + # print_model_named_params(torch_model) + # print_model_named_buffers(torch_model) + + checkpoint = torch.load('hvt_s_3.pth')['model'] + for sub_item in checkpoint: + print(sub_item) + + torch_model.load_state_dict(checkpoint) + torch_model = torch_model.to(device) + torch_model.eval() + + + print('+++++++++++++++++++++++++++++++++++') + + # convert weights + paddle_model = convert(torch_model, paddle_model) + + # check correctness + x = np.random.randn(2, 3, sz, sz).astype('float32') + x_paddle = paddle.to_tensor(x) + x_torch = torch.Tensor(x).to(device) + + out_paddle = paddle_model(x_paddle) + out_paddle = out_paddle.cpu().numpy() + + out_torch = torch_model(x_torch) + out_torch = out_torch.detach().cpu().numpy() + + # for out_paddle,out_torch in zip(out_paddle_np,out_torch_np): + out_diff = np.allclose(out_torch, out_paddle, atol=1e-5) + print(out_diff) + print(np.sum(out_torch), np.sum(out_paddle)) + + assert np.allclose(out_torch, out_paddle, atol=1e-5) + + # save weights for paddle model + model_path = os.path.join(f'./{model_name}.pdparams') + paddle.save(paddle_model.state_dict(), model_path) + print('all done') + + +if __name__ == "__main__": + main() diff --git a/image_classification/HVT/port_weights/load_hvt_ti_1_pytorch_weights.py b/image_classification/HVT/port_weights/load_hvt_ti_1_pytorch_weights.py new file mode 100644 index 00000000..6b1c91e7 --- /dev/null +++ b/image_classification/HVT/port_weights/load_hvt_ti_1_pytorch_weights.py @@ -0,0 +1,174 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import numpy as np +import paddle +import torch +import timm +from hvt import build_hvt +from model_th import hvt_model +import os +from config import * +import json + +model_name = 'hvt_ti_1_patch16_224' +sz = int(model_name[-3::]) + +config = get_config(f'./configs/{model_name}.yaml') + + +def print_model_named_params(model): + print('----------------------------------') + for name, param in model.named_parameters(): + print(name, param.shape) + print('----------------------------------') + + +def print_model_named_buffers(model): + print('----------------------------------') + for name, param in model.named_buffers(): + print(name, param.shape) + print('----------------------------------') + + +def torch_to_paddle_mapping(): + mapping = [ + ('pos_embed', 'pos_embed'), + ('patch_embed.proj', f'patch_embed.proj'), + ('blocks.0.pos_embed', f'layers.0.pos_embed') + ] + + num_layers = config.MODEL.TRANS.DEPTH + for idx in range(num_layers): + th_prefix = f'blocks.{idx}' + pp_prefix = f'layers.{idx}' + layer_mapping = [ + (f'{th_prefix}.norm1', f'{pp_prefix}.norm1'), + (f'{th_prefix}.attn.qkv', f'{pp_prefix}.attn.qkv'), + (f'{th_prefix}.attn.proj', f'{pp_prefix}.attn.proj'), + (f'{th_prefix}.norm2', f'{pp_prefix}.norm2'), + (f'{th_prefix}.mlp.fc1', f'{pp_prefix}.mlp.fc1'), + (f'{th_prefix}.mlp.fc2', f'{pp_prefix}.mlp.fc2') + ] + mapping.extend(layer_mapping) + + head_mapping = [ + ('norm', 'norm'), + ('head', 'head'), + ] + mapping.extend(head_mapping) + + return mapping + + + +def convert(torch_model, paddle_model): + def _set_value(th_name, pd_name): + th_shape = th_params[th_name].shape + pd_shape = tuple(pd_params[pd_name].shape) # paddle shape default type is list + #assert th_shape == pd_shape, f'{th_shape} != {pd_shape}' + print(f'set {th_name} {th_shape} to {pd_name} {pd_shape}') + value = th_params[th_name].cpu().data.numpy() + if len(value.shape) == 2: + value = value.transpose((1, 0)) + pd_params[pd_name].set_value(value) + + # 1. get paddle and torch model parameters + pd_params = {} + th_params = {} + for name, param in paddle_model.named_parameters(): + pd_params[name] = param + for name, param in paddle_model.named_buffers(): + pd_params[name] = param + + for name, param in torch_model.named_parameters(): + th_params[name] = param + for name, param in torch_model.named_buffers(): + th_params[name] = param + + # 2. get name mapping pairs + mapping = torch_to_paddle_mapping() + # 3. 
set torch param values to paddle params: may needs transpose on weights + for th_name, pd_name in mapping: + if th_name in th_params.keys(): # nn.Parameters + _set_value(th_name, pd_name) + else: # weight & bias + th_name_w = f'{th_name}.weight' + pd_name_w = f'{pd_name}.weight' + _set_value(th_name_w, pd_name_w) + + th_name_b = f'{th_name}.bias' + pd_name_b = f'{pd_name}.bias' + _set_value(th_name_b, pd_name_b) + + return paddle_model + + +def main(): + paddle.set_device('cpu') + paddle_model = build_hvt(config) + paddle_model.eval() + + # print(paddle_model) + # print_model_named_params(paddle_model) + # print_model_named_buffers(paddle_model) + + print('+++++++++++++++++++++++++++++++++++') + device = torch.device('cpu') + torch_model = hvt_model() + # print(torch_model) + # print_model_named_params(torch_model) + # print_model_named_buffers(torch_model) + + checkpoint = torch.load('hvt_ti_1.pth')['model'] + for sub_item in checkpoint: + print(sub_item) + + torch_model.load_state_dict(checkpoint) + torch_model = torch_model.to(device) + torch_model.eval() + + + print('+++++++++++++++++++++++++++++++++++') + + # convert weights + paddle_model = convert(torch_model, paddle_model) + + # check correctness + x = np.random.randn(2, 3, sz, sz).astype('float32') + x_paddle = paddle.to_tensor(x) + x_torch = torch.Tensor(x).to(device) + + out_paddle = paddle_model(x_paddle) + out_paddle = out_paddle.cpu().numpy() + + out_torch = torch_model(x_torch) + out_torch = out_torch.detach().cpu().numpy() + + # for out_paddle,out_torch in zip(out_paddle_np,out_torch_np): + out_diff = np.allclose(out_torch, out_paddle, atol=1e-5) + print(out_diff) + print(np.sum(out_torch), np.sum(out_paddle)) + + assert np.allclose(out_torch, out_paddle, atol=1e-5) + + # save weights for paddle model + model_path = os.path.join(f'./{model_name}.pdparams') + paddle.save(paddle_model.state_dict(), model_path) + print('all done') + + +if __name__ == "__main__": + main() diff --git a/image_classification/HVT/port_weights/load_scale_hvt_ti_4_pytorch_weights.py b/image_classification/HVT/port_weights/load_scale_hvt_ti_4_pytorch_weights.py new file mode 100644 index 00000000..578b3843 --- /dev/null +++ b/image_classification/HVT/port_weights/load_scale_hvt_ti_4_pytorch_weights.py @@ -0,0 +1,177 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import numpy as np +import paddle +import torch +import timm +from hvt import build_hvt +from model_th import hvt_model +import os +from config import * +import json + +model_name = 'scale_hvt_ti_4_patch16_224' +sz = int(model_name[-3::]) + +config = get_config(f'./configs/{model_name}.yaml') + + +def print_model_named_params(model): + print('----------------------------------') + for name, param in model.named_parameters(): + print(name, param.shape) + print('----------------------------------') + + +def print_model_named_buffers(model): + print('----------------------------------') + for name, param in model.named_buffers(): + print(name, param.shape) + print('----------------------------------') + + +def torch_to_paddle_mapping(): + mapping = [ + ('pos_embed', 'pos_embed'), + ('patch_embed.proj', f'patch_embed.proj'), + ('blocks.0.pos_embed', f'layers.0.pos_embed'), + ('blocks.3.pos_embed', f'layers.3.pos_embed'), + ('blocks.6.pos_embed', f'layers.6.pos_embed'), + ('blocks.9.pos_embed', f'layers.9.pos_embed') + ] + + num_layers = config.MODEL.TRANS.DEPTH + for idx in range(num_layers): + th_prefix = f'blocks.{idx}' + pp_prefix = f'layers.{idx}' + layer_mapping = [ + (f'{th_prefix}.norm1', f'{pp_prefix}.norm1'), + (f'{th_prefix}.attn.qkv', f'{pp_prefix}.attn.qkv'), + (f'{th_prefix}.attn.proj', f'{pp_prefix}.attn.proj'), + (f'{th_prefix}.norm2', f'{pp_prefix}.norm2'), + (f'{th_prefix}.mlp.fc1', f'{pp_prefix}.mlp.fc1'), + (f'{th_prefix}.mlp.fc2', f'{pp_prefix}.mlp.fc2') + ] + mapping.extend(layer_mapping) + + head_mapping = [ + ('norm', 'norm'), + ('head', 'head'), + ] + mapping.extend(head_mapping) + + return mapping + + + +def convert(torch_model, paddle_model): + def _set_value(th_name, pd_name): + th_shape = th_params[th_name].shape + pd_shape = tuple(pd_params[pd_name].shape) # paddle shape default type is list + #assert th_shape == pd_shape, f'{th_shape} != {pd_shape}' + print(f'set {th_name} {th_shape} to {pd_name} {pd_shape}') + value = th_params[th_name].cpu().data.numpy() + if len(value.shape) == 2: + value = value.transpose((1, 0)) + pd_params[pd_name].set_value(value) + + # 1. get paddle and torch model parameters + pd_params = {} + th_params = {} + for name, param in paddle_model.named_parameters(): + pd_params[name] = param + for name, param in paddle_model.named_buffers(): + pd_params[name] = param + + for name, param in torch_model.named_parameters(): + th_params[name] = param + for name, param in torch_model.named_buffers(): + th_params[name] = param + + # 2. get name mapping pairs + mapping = torch_to_paddle_mapping() + # 3. 
set torch param values to paddle params: may needs transpose on weights + for th_name, pd_name in mapping: + if th_name in th_params.keys(): # nn.Parameters + _set_value(th_name, pd_name) + else: # weight & bias + th_name_w = f'{th_name}.weight' + pd_name_w = f'{pd_name}.weight' + _set_value(th_name_w, pd_name_w) + + th_name_b = f'{th_name}.bias' + pd_name_b = f'{pd_name}.bias' + _set_value(th_name_b, pd_name_b) + + return paddle_model + + +def main(): + paddle.set_device('cpu') + paddle_model = build_hvt(config) + paddle_model.eval() + + # print(paddle_model) + # print_model_named_params(paddle_model) + # print_model_named_buffers(paddle_model) + + print('+++++++++++++++++++++++++++++++++++') + device = torch.device('cpu') + torch_model = hvt_model() + # print(torch_model) + # print_model_named_params(torch_model) + # print_model_named_buffers(torch_model) + + checkpoint = torch.load('scale_hvt_ti_4.pth')['model'] + for sub_item in checkpoint: + print(sub_item) + + torch_model.load_state_dict(checkpoint) + torch_model = torch_model.to(device) + torch_model.eval() + + + print('+++++++++++++++++++++++++++++++++++') + + # convert weights + paddle_model = convert(torch_model, paddle_model) + + # check correctness + x = np.random.randn(2, 3, sz, sz).astype('float32') + x_paddle = paddle.to_tensor(x) + x_torch = torch.Tensor(x).to(device) + + out_paddle = paddle_model(x_paddle) + out_paddle = out_paddle.cpu().numpy() + + out_torch = torch_model(x_torch) + out_torch = out_torch.detach().cpu().numpy() + + # for out_paddle,out_torch in zip(out_paddle_np,out_torch_np): + out_diff = np.allclose(out_torch, out_paddle, atol=1e-5) + print(out_diff) + print(np.sum(out_torch), np.sum(out_paddle)) + + assert np.allclose(out_torch, out_paddle, atol=1e-5) + + # save weights for paddle model + model_path = os.path.join(f'./{model_name}.pdparams') + paddle.save(paddle_model.state_dict(), model_path) + print('all done') + + +if __name__ == "__main__": + main() diff --git a/image_classification/HVT/random_erasing.py b/image_classification/HVT/random_erasing.py new file mode 100644 index 00000000..31eea465 --- /dev/null +++ b/image_classification/HVT/random_erasing.py @@ -0,0 +1,118 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
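+
+# Usage sketch (an assumed example): RandomErasing is typically applied per sample
+# inside the training transform pipeline, e.g.
+#   random_erasing = RandomErasing(prob=0.25, mode='pixel', max_count=1)
+#   image = random_erasing(image)  # image: CHW float tensor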
+ +"""Random Erasing for image tensor""" + +import random +import math +import paddle + + +def _get_pixels(per_pixel, rand_color, patch_size, dtype="float32"): + if per_pixel: + return paddle.normal(shape=patch_size).astype(dtype) + if rand_color: + return paddle.normal(shape=(patch_size[0], 1, 1)).astype(dtype) + return paddle.zeros((patch_size[0], 1, 1)).astype(dtype) + + +class RandomErasing(object): + """ + Args: + prob: probability of performing random erasing + min_area: Minimum percentage of erased area wrt input image area + max_area: Maximum percentage of erased area wrt input image area + min_aspect: Minimum aspect ratio of earsed area + max_aspect: Maximum aspect ratio of earsed area + mode: pixel color mode, in ['const', 'rand', 'pixel'] + 'const' - erase block is constant valued 0 for all channels + 'rand' - erase block is valued random color (same per-channel) + 'pixel' - erase block is vauled random color per pixel + min_count: Minimum # of ereasing blocks per image. + max_count: Maximum # of ereasing blocks per image. Area per box is scaled by count + per-image count is randomly chosen between min_count to max_count + """ + def __init__(self, prob=0.5, min_area=0.02, max_area=1/3, min_aspect=0.3, max_aspect=None, + mode='const', min_count=1, max_count=None, num_splits=0): + self.prob = prob + self.min_area = min_area + self.max_area = max_area + max_aspect = max_aspect or 1 / min_aspect + self.log_aspect_ratio = (math.log(min_aspect), math.log(max_aspect)) + self.min_count = min_count + self.max_count = max_count or min_count + self.num_splits = num_splits + mode = mode.lower() + self.rand_color = False + self.per_pixel = False + if mode == "rand": + self.rand_color = True + elif mode == "pixel": + self.per_pixel = True + else: + assert not mode or mode == "const" + + def _erase(self, img, chan, img_h, img_w, dtype): + if random.random() > self.prob: + return + area = img_h * img_w + count = self.min_count if self.min_count == self.max_count else \ + random.randint(self.min_count, self.max_count) + for _ in range(count): + for attempt in range(10): + target_area = random.uniform(self.min_area, self.max_area) * area / count + aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio)) + h = int(round(math.sqrt(target_area * aspect_ratio))) + w = int(round(math.sqrt(target_area / aspect_ratio))) + if w < img_w and h < img_h: + top = random.randint(0, img_h - h) + left = random.randint(0, img_w - w) + img[:, top:top+h, left:left+w] = _get_pixels( + self.per_pixel, self.rand_color, (chan, h, w), + dtype=dtype) + break + + def __call__(self, input): + if len(input.shape) == 3: + self._erase(input, *input.shape, input.dtype) + else: + batch_size, chan, img_h, img_w = input.shape + batch_start = batch_size // self.num_splits if self.num_splits > 1 else 0 + for i in range(batch_start, batch_size): + self._erase(input[i], chan, img_h, img_w, input.dtype) + return input + + + +#def main(): +# re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='rand') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='const') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='pixel') +# import PIL.Image as Image +# import numpy as np +# paddle.set_device('cpu') +# img = paddle.to_tensor(np.asarray(Image.open('./lenna.png'))).astype('float32') +# img = img / 255.0 +# img = paddle.transpose(img, [2, 0, 1]) +# new_img = re(img) +# new_img = new_img * 255.0 +# new_img = paddle.transpose(new_img, [1, 2, 0]) +# new_img = new_img.cpu().numpy() +# new_img = 
Image.fromarray(new_img.astype('uint8')) +# new_img.save('./res.png') +# +# +# +#if __name__ == "__main__": +# main() diff --git a/image_classification/HVT/run_eval.sh b/image_classification/HVT/run_eval.sh new file mode 100644 index 00000000..dce22070 --- /dev/null +++ b/image_classification/HVT/run_eval.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg='./configs/hvt_s2_patch16_224.yaml' \ + -dataset='imagenet2012' \ + -batch_size=16 \ + -data_path='/dataset/imagenet' \ + -eval \ + -pretrained='./hvt_s2_patch16_224' diff --git a/image_classification/HVT/run_eval_multi.sh b/image_classification/HVT/run_eval_multi.sh new file mode 100644 index 00000000..7f757ad5 --- /dev/null +++ b/image_classification/HVT/run_eval_multi.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg='./configs/hvt_s2_patch16_224.yaml' \ + -dataset='imagenet2012' \ + -batch_size=16 \ + -data_path='/dataset/imagenet' \ + -eval \ + -pretrained='./hvt_s2_patch16_224' diff --git a/image_classification/HVT/run_train.sh b/image_classification/HVT/run_train.sh new file mode 100644 index 00000000..1ad833c5 --- /dev/null +++ b/image_classification/HVT/run_train.sh @@ -0,0 +1,6 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg='./configs/hvt_s2_patch16_224.yaml' \ + -dataset='imagenet2012' \ + -batch_size=16 \ + -data_path='/dataset/imagenet' diff --git a/image_classification/HVT/run_train_multi.sh b/image_classification/HVT/run_train_multi.sh new file mode 100644 index 00000000..364cc6d4 --- /dev/null +++ b/image_classification/HVT/run_train_multi.sh @@ -0,0 +1,6 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg='./configs/hvt_s2_patch16_224.yaml' \ + -dataset='imagenet2012' \ + -batch_size=16 \ + -data_path='/dataset/imagenet' diff --git a/image_classification/HVT/stats_define.py b/image_classification/HVT/stats_define.py new file mode 100644 index 00000000..d4c6bd88 --- /dev/null +++ b/image_classification/HVT/stats_define.py @@ -0,0 +1,61 @@ +import os +import glob +import paddle +from config import get_config +from model_pd import build_deit as build_model + +def count_gelu(layer, input, output): + activation_flops = 8 + x = input[0] + num = x.numel() + layer.total_ops += num * activation_flops + + +def count_softmax(layer, input, output): + softmax_flops = 5 # max/substract, exp, sum, divide + x = input[0] + num = x.numel() + layer.total_ops += num * softmax_flops + + +def count_layernorm(layer, input, output): + layer_norm_flops = 5 # get mean (sum), get variance (square and sum), scale(multiply) + x = input[0] + num = x.numel() + layer.total_ops += num * layer_norm_flops + + +cfg = './configs/deit_base_patch16_384.yaml' +input_size = (1, 3, 384, 384) +#input_size = (1, 3, 224, 224) +config = get_config(cfg) +model = build_model(config) + +custom_ops = {paddle.nn.GELU: count_gelu, + paddle.nn.LayerNorm: count_layernorm, + paddle.nn.Softmax: count_softmax, + } +print(os.path.basename(cfg)) +paddle.flops(model, + input_size=input_size, + custom_ops=custom_ops, + print_detail=False) + + +#for cfg in glob.glob('./configs/*.yaml'): +# #cfg = './configs/swin_base_patch4_window7_224.yaml' +# input_size = (1, 3, int(cfg[-8:-5]), int(cfg[-8:-5])) +# config = get_config(cfg) +# model = build_model(config) +# +# +# custom_ops = {paddle.nn.GELU: count_gelu, +# paddle.nn.LayerNorm: count_layernorm, +# paddle.nn.Softmax: count_softmax, +# } +# print(os.path.basename(cfg)) +# paddle.flops(model, +# input_size=input_size, +# 
custom_ops=custom_ops, +# print_detail=False) +# print('-----------') diff --git a/image_classification/HVT/utils.py b/image_classification/HVT/utils.py new file mode 100644 index 00000000..ff833c23 --- /dev/null +++ b/image_classification/HVT/utils.py @@ -0,0 +1,120 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""utils for HVT + +Contains AverageMeter for monitoring, get_exclude_from_decay_fn for training +and WarmupCosineScheduler for training + +""" + +import math +from paddle.optimizer.lr import LRScheduler + + +class AverageMeter(): + """ Meter for monitoring losses""" + def __init__(self): + self.avg = 0 + self.sum = 0 + self.cnt = 0 + self.reset() + + def reset(self): + """reset all values to zeros""" + self.avg = 0 + self.sum = 0 + self.cnt = 0 + + def update(self, val, n=1): + """update avg by val and n, where val is the avg of n values""" + self.sum += val * n + self.cnt += n + self.avg = self.sum / self.cnt + + + +def get_exclude_from_weight_decay_fn(exclude_list=[]): + """ Set params with no weight decay during the training + + For certain params, e.g., positional encoding in ViT, weight decay + may not needed during the learning, this method is used to find + these params. + + Args: + exclude_list: a list of params names which need to exclude + from weight decay. + Returns: + exclude_from_weight_decay_fn: a function returns True if param + will be excluded from weight decay + """ + if len(exclude_list) == 0: + exclude_from_weight_decay_fn = None + else: + def exclude_fn(param): + for name in exclude_list: + if param.endswith(name): + return False + return True + exclude_from_weight_decay_fn = exclude_fn + return exclude_from_weight_decay_fn + + +class WarmupCosineScheduler(LRScheduler): + """Warmup Cosine Scheduler + + First apply linear warmup, then apply cosine decay schedule. + Linearly increase learning rate from "warmup_start_lr" to "start_lr" over "warmup_epochs" + Cosinely decrease learning rate from "start_lr" to "end_lr" over remaining + "total_epochs - warmup_epochs" + + Attributes: + learning_rate: the starting learning rate (without warmup), not used here! 
+        warmup_start_lr: warmup starting learning rate
+        start_lr: the starting learning rate (without warmup)
+        end_lr: the ending learning rate after the whole loop
+        warmup_epochs: # of epochs for warmup
+        total_epochs: # of total epochs (include warmup)
+    """
+    def __init__(self,
+                 learning_rate,
+                 warmup_start_lr,
+                 start_lr,
+                 end_lr,
+                 warmup_epochs,
+                 total_epochs,
+                 cycles=0.5,
+                 last_epoch=-1,
+                 verbose=False):
+        """init WarmupCosineScheduler"""
+        self.warmup_epochs = warmup_epochs
+        self.total_epochs = total_epochs
+        self.warmup_start_lr = warmup_start_lr
+        self.start_lr = start_lr
+        self.end_lr = end_lr
+        self.cycles = cycles
+        super(WarmupCosineScheduler, self).__init__(learning_rate, last_epoch, verbose)
+
+    def get_lr(self):
+        """return lr value"""
+        if self.last_epoch < self.warmup_epochs:
+            val = (self.start_lr - self.warmup_start_lr) * float(
+                self.last_epoch) / float(self.warmup_epochs) + self.warmup_start_lr
+            return val
+
+        progress = float(self.last_epoch - self.warmup_epochs) / float(
+            max(1, self.total_epochs - self.warmup_epochs))
+        val = max(0.0, 0.5 * (1. + math.cos(math.pi * float(self.cycles) * 2.0 * progress)))
+        val = max(0.0, val * (self.start_lr - self.end_lr) + self.end_lr)
+        return val
diff --git a/image_classification/HVT/visualization_attn.png b/image_classification/HVT/visualization_attn.png
new file mode 100644
index 00000000..335e6f97
Binary files /dev/null and b/image_classification/HVT/visualization_attn.png differ
diff --git a/image_classification/HaloNet/README.md b/image_classification/HaloNet/README.md
new file mode 100644
index 00000000..39da1493
--- /dev/null
+++ b/image_classification/HaloNet/README.md
@@ -0,0 +1,165 @@
+# Scaling Local Self-Attention for Parameter Efficient Visual Backbones, [arxiv](https://arxiv.org/abs/2103.12731)
+
+PaddlePaddle training/validation code and pretrained models for **HaloNet**.
+
+The official pytorch implementation is N/A.
+
+This implementation is developed by [PaddleViT](https://github.com/BR-IDL/PaddleViT.git).
+

+<h4 align="center">HaloNet local self-attention architecture</h4>

+
+### Update
+* Update (2021-12-09): Initial code and ported weights are released.
+
+## Models Zoo
+| Model | Acc@1 | Acc@5 | #Params | FLOPs | Image Size | Crop_pct | Interpolation | Link |
+|----------------|----------|-------|---------|--------|------------|----------|---------------|--------------|
+| halonet26t | 79.10 | 94.31 | 12.5M | 3.2G | 256 | 0.95 | bicubic |[google](https://drive.google.com/file/d/1F_a1brftXXnPM39c30NYe32La9YZQ0mW/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1FSlSTuYMpwPJpi4Yz2nCTA)(ednv) |
+| halonet50ts | 81.65 | 95.61 | 22.8M | 5.1G | 256 | 0.94 | bicubic |[google](https://drive.google.com/file/d/12t85kJcPA377XePw6smch--ELMBo6p0Y/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1X4LM-sqoTKG7CrM5BNjcdA)(3j9e) |
+
+> *The results are evaluated on ImageNet2012 validation set.
+
+## Notebooks
+We provide a few notebooks in aistudio to help you get started:
+
+**\*(coming soon)\***
+
+## Requirements
+- Python>=3.6
+- yaml>=0.2.5
+- [PaddlePaddle](https://www.paddlepaddle.org.cn/documentation/docs/en/install/index_en.html)>=2.1.0
+- [yacs](https://github.com/rbgirshick/yacs)>=0.1.8
+
+## Data
+ImageNet2012 dataset is used in the following folder structure:
+```
+│imagenet/
+├──train/
+│  ├── n01440764
+│  │   ├── n01440764_10026.JPEG
+│  │   ├── n01440764_10027.JPEG
+│  │   ├── ......
+│  ├── ......
+├──val/
+│  ├── n01440764
+│  │   ├── ILSVRC2012_val_00000293.JPEG
+│  │   ├── ILSVRC2012_val_00002138.JPEG
+│  │   ├── ......
+│  ├── ......
+```
+
+## Usage
+To use the model with pretrained weights, download the `.pdparams` weight file and change related file paths in the following python scripts. The model config files are located in `./configs/`.
+
+For example, assume the downloaded weight file is stored in `./halonet_50ts_256.pdparams`, to use the `halonet_50ts_256` model in python:
+```python
+import paddle
+from config import get_config
+from halonet import build_halonet
+# config files in ./configs/
+config = get_config('./configs/halonet_50ts_256.yaml')
+# build model
+model = build_halonet(config)
+# load pretrained weights, .pdparams is NOT needed
+model_state_dict = paddle.load('./halonet_50ts_256')
+model.set_dict(model_state_dict)
+```
+
+## Evaluation
+To evaluate HaloNet model performance on ImageNet2012 with a single GPU, run the following script using command line:
+```shell
+sh run_eval.sh
+```
+or
+```shell
+CUDA_VISIBLE_DEVICES=0 \
+python main_single_gpu.py \
+  -cfg='./configs/halonet_50ts_256.yaml' \
+  -dataset='imagenet2012' \
+  -batch_size=16 \
+  -data_path='/dataset/imagenet' \
+  -eval \
+  -pretrained='./halonet_50ts_256'
+```
+
+ + +Run evaluation using multi-GPUs: + + + +```shell +sh run_eval_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg='./configs/halonet_50ts_256.yaml' \ + -dataset='imagenet2012' \ + -batch_size=16 \ + -data_path='/dataset/imagenet' \ + -eval \ + -pretrained='./halonet_50ts_256' +``` + +
+
+
+## Training
+To train the HaloNet model on ImageNet2012 with a single GPU, run the following script using command line:
+```shell
+sh run_train.sh
+```
+or
+```shell
+CUDA_VISIBLE_DEVICES=0 \
+python main_single_gpu.py \
+  -cfg='./configs/halonet_50ts_256.yaml' \
+  -dataset='imagenet2012' \
+  -batch_size=32 \
+  -data_path='/dataset/imagenet'
+```
+
+ + +Run training using multi-GPUs: + + + +```shell +sh run_train_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg='./configs/halonet_50ts_256.yaml' \ + -dataset='imagenet2012' \ + -batch_size=16 \ + -data_path='/dataset/imagenet' \ +``` + +
+ + +## Visualization Attention Map +**(coming soon)** + +## Reference +``` +@inproceedings{vaswani2021scaling, + title={Scaling local self-attention for parameter efficient visual backbones}, + author={Vaswani, Ashish and Ramachandran, Prajit and Srinivas, Aravind and Parmar, Niki and Hechtman, Blake and Shlens, Jonathon}, + booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition}, + pages={12894--12904}, + year={2021} +} +``` diff --git a/image_classification/HaloNet/__init__.py b/image_classification/HaloNet/__init__.py new file mode 100644 index 00000000..e2cbd538 --- /dev/null +++ b/image_classification/HaloNet/__init__.py @@ -0,0 +1 @@ +#init diff --git a/image_classification/HaloNet/augment.py b/image_classification/HaloNet/augment.py new file mode 100644 index 00000000..7a7f081c --- /dev/null +++ b/image_classification/HaloNet/augment.py @@ -0,0 +1,285 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Augmentation""" +""" Rand Augmentation """ +# reference: RandAugment: Practical automated data augmentation with a reduced search space +# https://arxiv.org/abs/1909.13719 + +""" Auto Augmentation """ +# reference: AutoAugment: Learning Augmentation Policies from Data +# https://arxiv.org/abs/1805.09501 + +import random +import numpy as np +from PIL import Image, ImageEnhance, ImageOps + + +def auto_augment_policy_original(): + """25 types of augment policies in original paper""" + policy = [ + [('Posterize', 0.4, 8), ('Rotate', 0.6, 9)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + [('Posterize', 0.6, 7), ('Posterize', 0.6, 6)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Equalize', 0.4, 4), ('Rotate', 0.8, 8)], + [('Solarize', 0.6, 3), ('Equalize', 0.6, 7)], + [('Posterize', 0.8, 5), ('Equalize', 1.0, 2)], + [('Rotate', 0.2, 3), ('Solarize', 0.6, 8)], + [('Equalize', 0.6, 8), ('Posterize', 0.4, 6)], + [('Rotate', 0.8, 8), ('Color', 0.4, 0)], + [('Rotate', 0.4, 9), ('Equalize', 0.6, 2)], + [('Equalize', 0.0, 7), ('Equalize', 0.8, 8)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Rotate', 0.8, 8), ('Color', 1.0, 2)], + [('Color', 0.8, 8), ('Solarize', 0.8, 7)], + [('Sharpness', 0.4, 7), ('Invert', 0.6, 8)], + [('ShearX', 0.6, 5), ('Equalize', 1.0, 9)], + [('Color', 0.4, 0), ('Equalize', 0.6, 3)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + ] + policy = [[SubPolicy(*args) for args in subpolicy] for subpolicy in policy] + return policy + + +def rand_augment_policy_original(magnitude_idx=9): + """ + 14 types of augment policies in original paper + Args: + magnitude_idx: M + """ + policy = [ + ('Posterize', 1, magnitude_idx), ('Rotate', 1, magnitude_idx), + ('Solarize', 1, 
magnitude_idx), ('AutoContrast', 1, magnitude_idx), + ('Equalize', 1, magnitude_idx), ('Contrast', 1, magnitude_idx), + ('Color', 1, magnitude_idx), ('Invert', 1, magnitude_idx), + ('Sharpness', 1, magnitude_idx), ('Brightness', 1, magnitude_idx), + ('ShearX', 1, magnitude_idx), ('ShearY', 1, magnitude_idx), + ('TranslateX', 1, magnitude_idx), ('TranslateY', 1, magnitude_idx), + ] + policy = [SubPolicy(*args) for args in policy] + return policy + + +class AutoAugment(): + """Auto Augment + Randomly choose a tuple of augment ops from a list of policy + Then apply the tuple of augment ops to input image + + Examples: + policy = auto_augment_policy_original() + augment = AutoAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy): + self.policy = policy + + def __call__(self, image, policy_idx=None): + if policy_idx is None: + policy_idx = random.randint(0, len(self.policy) - 1) + + sub_policy = self.policy[policy_idx] + for op in sub_policy: + image = op(image) + return image + + +class RandAugment(): + """Rand Augment + Randomly choose N augment ops from a list of K policies + Then apply the N ops to input image + + Examples: + policy = rand_augment_policy_original(magnitude_idx) + augment = RandAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy, num_layers=2): + """ + Args: + policy: list of SubPolicy + num_layers: int + """ + self.policy = policy + self.num_layers = num_layers + + def __call__(self, image): + selected_idx = np.random.choice(len(self.policy), self.num_layers) + + for policy_idx in selected_idx: + sub_policy = self.policy[policy_idx] + image = sub_policy(image) + return image + + +class SubPolicy: + """Subpolicy + Read augment name and magnitude, apply augment with probability + Args: + op_name: str, augment operation name + prob: float, if prob > random prob, apply augment + magnitude_idx: int, index of magnitude in preset magnitude ranges + """ + + def __init__(self, op_name, prob, magnitude_idx): + # ranges of operations' magnitude + ranges = { + 'ShearX': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'ShearY': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'TranslateX': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'TranslateY': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'Rotate': np.linspace(0, 30, 10), # [-30, 30] (by random negative) + 'Color': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Posterize': np.round(np.linspace(8, 4, 10), 0).astype(np.int), # [0, 4] + 'Solarize': np.linspace(256, 0, 10), # [0, 256] + 'Contrast': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Sharpness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Brightness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'AutoContrast': [0] * 10, # no range + 'Equalize': [0] * 10, # no range + 'Invert': [0] * 10, # no range + } + + # augmentation operations + # Lambda is not pickleable for DDP + # image_ops = { + # 'ShearX': lambda image, magnitude: shear_x(image, magnitude), + # 'ShearY': lambda image, magnitude: shear_y(image, magnitude), + # 'TranslateX': lambda image, magnitude: translate_x(image, magnitude), + # 'TranslateY': lambda image, magnitude: translate_y(image, magnitude), + # 'Rotate': lambda image, magnitude: rotate(image, magnitude), + # 'AutoContrast': lambda image, magnitude: auto_contrast(image, magnitude), + # 'Invert': lambda image, magnitude: invert(image, 
magnitude), + # 'Equalize': lambda image, magnitude: equalize(image, magnitude), + # 'Solarize': lambda image, magnitude: solarize(image, magnitude), + # 'Posterize': lambda image, magnitude: posterize(image, magnitude), + # 'Contrast': lambda image, magnitude: contrast(image, magnitude), + # 'Color': lambda image, magnitude: color(image, magnitude), + # 'Brightness': lambda image, magnitude: brightness(image, magnitude), + # 'Sharpness': lambda image, magnitude: sharpness(image, magnitude), + # } + image_ops = { + 'ShearX': shear_x, + 'ShearY': shear_y, + 'TranslateX': translate_x_relative, + 'TranslateY': translate_y_relative, + 'Rotate': rotate, + 'AutoContrast': auto_contrast, + 'Invert': invert, + 'Equalize': equalize, + 'Solarize': solarize, + 'Posterize': posterize, + 'Contrast': contrast, + 'Color': color, + 'Brightness': brightness, + 'Sharpness': sharpness, + } + + self.prob = prob + self.magnitude = ranges[op_name][magnitude_idx] + self.op = image_ops[op_name] + + def __call__(self, image): + if self.prob > random.random(): + image = self.op(image, self.magnitude) + return image + + +# PIL Image transforms +# https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.transform +def shear_x(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, factor, 0, 0, 1, 0), fillcolor=fillcolor) + + +def shear_y(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, factor, 1, 0), fillcolor=fillcolor) + + +def translate_x_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, pixels, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, pixels), fillcolor=fillcolor) + + +def translate_x_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, magnitude, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, magnitude), fillcolor=fillcolor) + + +def rotate(image, magnitude): + rot = image.convert("RGBA").rotate(magnitude) + return Image.composite(rot, + Image.new('RGBA', rot.size, (128,) * 4), + rot).convert(image.mode) + + +def auto_contrast(image, magnitude=None): + return ImageOps.autocontrast(image) + + +def invert(image, magnitude=None): + return ImageOps.invert(image) + + +def equalize(image, magnitude=None): + return ImageOps.equalize(image) + + +def solarize(image, magnitude): + return ImageOps.solarize(image, magnitude) + + +def posterize(image, magnitude): + return ImageOps.posterize(image, magnitude) + + +def contrast(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Contrast(image).enhance(1 + magnitude) + + +def color(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return 
ImageEnhance.Color(image).enhance(1 + magnitude) + + +def brightness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Brightness(image).enhance(1 + magnitude) + + +def sharpness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Sharpness(image).enhance(1 + magnitude) + diff --git a/image_classification/HaloNet/config.py b/image_classification/HaloNet/config.py new file mode 100755 index 00000000..9b33c1f4 --- /dev/null +++ b/image_classification/HaloNet/config.py @@ -0,0 +1,185 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Configuration + +Configuration for data, model archtecture, and training, etc. +Config can be set by .yaml file or by argparser(limited usage) + +""" + +import os +from yacs.config import CfgNode as CN +import yaml + +_C = CN() +_C.BASE = [''] + +# data settings +_C.DATA = CN() +_C.DATA.BATCH_SIZE = 8 #1024 batch_size for single GPU +_C.DATA.BATCH_SIZE_EVAL = 8 #1024 batch_size for single GPU +_C.DATA.DATA_PATH = '/dataset/imagenet/' # path to dataset +_C.DATA.DATASET = 'imagenet2012' # dataset name +_C.DATA.IMAGE_SIZE = 256 # input image size +_C.DATA.CROP_PCT = 0.94 # input image scale ratio, scale is applied before centercrop in eval mode +_C.DATA.NUM_WORKERS = 4 # number of data loading threads +_C.DATA.IMAGENET_MEAN = [0.485, 0.456, 0.406] # [0.5, 0.5, 0.5] +_C.DATA.IMAGENET_STD = [0.229, 0.224, 0.225] # [0.5, 0.5, 0.5] + +# model settings +_C.MODEL = CN() +_C.MODEL.TYPE = 'halo' +_C.MODEL.NAME = 'halonet_50ts' +_C.MODEL.RESUME = None +_C.MODEL.RESUME_EMA = None +_C.MODEL.PRETRAINED = None +_C.MODEL.NUM_CLASSES = 1000 +_C.MODEL.DROPOUT = 0.0 +_C.MODEL.DROPPATH = 0.0 +_C.MODEL.ATTENTION_DROPOUT = 0.0 +_C.MODEL.ACT = None +_C.MODEL.STAGE1_BLOCK = ['bottle', 'bottle', 'bottle'] +_C.MODEL.STAGE2_BLOCK = ['bottle', 'bottle', 'bottle', 'attn'] +_C.MODEL.STAGE3_BLOCK = ['bottle', 'attn', 'bottle', 'attn', 'bottle', 'attn'] +_C.MODEL.STAGE4_BLOCK = ['bottle', 'attn', 'bottle'] +_C.MODEL.CHANNEL = [64, 256, 512, 1024, 2048] +_C.MODEL.NUM_HEAD = [0, 4, 8, 8] +_C.MODEL.STRIDE = [1, 2, 2, 2] +_C.MODEL.DEPTH = [3, 4, 6, 3] +_C.MODEL.BLOCK_SIZE = 8 +_C.MODEL.HALO_SIZE = 3 +_C.MODEL.HIDDEN_CHANNEL = 1024 + +# training settings +_C.TRAIN = CN() +_C.TRAIN.LAST_EPOCH = 0 +_C.TRAIN.NUM_EPOCHS = 150 +_C.TRAIN.WARMUP_EPOCHS = 3 +_C.TRAIN.WEIGHT_DECAY = 0.00008 +_C.TRAIN.BASE_LR = 0.1 +_C.TRAIN.WARMUP_START_LR = 0.0002 +_C.TRAIN.END_LR = 0.0002 +_C.TRAIN.GRAD_CLIP = None +_C.TRAIN.ACCUM_ITER = 1 +_C.TRAIN.MODEL_EMA = True +_C.TRAIN.MODEL_EMA_DECAY = 0.99996 +_C.TRAIN.LINEAR_SCALED_LR = None + +_C.TRAIN.LR_SCHEDULER = CN() +_C.TRAIN.LR_SCHEDULER.NAME = 'warmupcosine' +_C.TRAIN.LR_SCHEDULER.MILESTONES = "30, 60, 90" # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_EPOCHS = 30 # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_RATE = 0.1 # only used in StepLRScheduler + +_C.TRAIN.OPTIMIZER = 
CN() +_C.TRAIN.OPTIMIZER.NAME = 'AdamW' +_C.TRAIN.OPTIMIZER.EPS = 1e-8 +_C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999) # for adamW +_C.TRAIN.OPTIMIZER.MOMENTUM = 0.9 + +# train augmentation +_C.TRAIN.MIXUP_ALPHA = 0.8 +_C.TRAIN.CUTMIX_ALPHA = 1.0 +_C.TRAIN.CUTMIX_MINMAX = None +_C.TRAIN.MIXUP_PROB = 1.0 +_C.TRAIN.MIXUP_SWITCH_PROB = 0.5 +_C.TRAIN.MIXUP_MODE = 'batch' + +_C.TRAIN.SMOOTHING = 0.1 +_C.TRAIN.COLOR_JITTER = 0.4 +_C.TRAIN.AUTO_AUGMENT = False #'rand-m9-mstd0.5-inc1' +_C.TRAIN.RAND_AUGMENT = True + +_C.TRAIN.RANDOM_ERASE_PROB = 0.25 +_C.TRAIN.RANDOM_ERASE_MODE = 'pixel' +_C.TRAIN.RANDOM_ERASE_COUNT = 1 +_C.TRAIN.RANDOM_ERASE_SPLIT = False + + +# misc +_C.SAVE = "./output" +_C.TAG = "default" +_C.SAVE_FREQ = 1 # freq to save chpt +_C.REPORT_FREQ = 50 # freq to logging info +_C.VALIDATE_FREQ = 10 # freq to do validation +_C.SEED = 0 +_C.EVAL = False # run evaluation only +_C.AMP = False # mix precision training +_C.LOCAL_RANK = 0 +_C.NGPUS = -1 + + +def _update_config_from_file(config, cfg_file): + config.defrost() + with open(cfg_file, 'r') as infile: + yaml_cfg = yaml.load(infile, Loader=yaml.FullLoader) + for cfg in yaml_cfg.setdefault('BASE', ['']): + if cfg: + _update_config_from_file( + config, os.path.join(os.path.dirname(cfg_file), cfg) + ) + print('merging config from {}'.format(cfg_file)) + config.merge_from_file(cfg_file) + config.freeze() + +def update_config(config, args): + """Update config by ArgumentParser + Args: + args: ArgumentParser contains options + Return: + config: updated config + """ + if args.cfg: + _update_config_from_file(config, args.cfg) + config.defrost() + if args.dataset: + config.DATA.DATASET = args.dataset + if args.batch_size: + config.DATA.BATCH_SIZE = args.batch_size + if args.image_size: + config.DATA.IMAGE_SIZE = args.image_size + if args.num_classes: + config.MODEL.NUM_CLASSES = args.num_classes + if args.data_path: + config.DATA.DATA_PATH = args.data_path + if args.output is not None: + config.SAVE = args.output + if args.ngpus: + config.NGPUS = args.ngpus + if args.eval: + config.EVAL = True + config.DATA.BATCH_SIZE_EVAL = args.batch_size + if args.pretrained: + config.MODEL.PRETRAINED = args.pretrained + if args.resume: + config.MODEL.RESUME = args.resume + if args.last_epoch: + config.TRAIN.LAST_EPOCH = args.last_epoch + if args.amp: # only during training + if config.EVAL is True: + config.AMP = False + else: + config.AMP = True + + #config.freeze() + return config + + +def get_config(cfg_file=None): + """Return a clone of config or load from yaml file""" + config = _C.clone() + if cfg_file: + _update_config_from_file(config, cfg_file) + return config diff --git a/image_classification/HaloNet/configs/halonet_26t_256.yaml b/image_classification/HaloNet/configs/halonet_26t_256.yaml new file mode 100755 index 00000000..104afed4 --- /dev/null +++ b/image_classification/HaloNet/configs/halonet_26t_256.yaml @@ -0,0 +1,32 @@ +DATA: + IMAGE_SIZE: 256 + CROP_PCT: 0.95 +MODEL: + TYPE: halo + NAME: halonet_26t + #PRETRAINED: halonet_26t_256 + ACT: relu + BLOCK_SIZE: 8 + HALO_SIZE: 2 + STAGE1_BLOCK: ['bottle','bottle'] + STAGE2_BLOCK: ['bottle','bottle'] + STAGE3_BLOCK: ['bottle','attn'] + STAGE4_BLOCK: ['attn', 'attn'] + CHANNEL: [64,256,512,1024,2048] + HIDDEN_CHANNEL: 1024 + NUM_HEAD: [0,0,8,8] + STRIDE: [1,2,2,2] + DEPTH: [2,2,2,2] + NUM_CLASSES: 1000 +TRAIN: + NUM_EPOCHS: 300 + WARMUP_EPOCHS: 3 + WEIGHT_DECAY: 0.3 + BASE_LR: 0.003 + WARMUP_START_LR: 1e-6 + END_LR: 5e-4 + ACCUM_ITER: 1 + + + + diff --git 
a/image_classification/HaloNet/configs/halonet_50ts_256.yaml b/image_classification/HaloNet/configs/halonet_50ts_256.yaml new file mode 100755 index 00000000..acfcbf80 --- /dev/null +++ b/image_classification/HaloNet/configs/halonet_50ts_256.yaml @@ -0,0 +1,32 @@ +DATA: + IMAGE_SIZE: 256 + CROP_PCT: 0.94 +MODEL: + TYPE: halo + NAME: halonet_50ts + ACT: silu + BLOCK_SIZE: 8 + HALO_SIZE: 3 + #PRETRAINED: halonet_50ts_256 + STAGE1_BLOCK: ['bottle','bottle','bottle'] + STAGE2_BLOCK: ['bottle','bottle','bottle','attn'] + STAGE3_BLOCK: ['bottle','attn','bottle','attn','bottle','attn'] + STAGE4_BLOCK: ['bottle', 'attn', 'bottle'] + CHANNEL: [64,256,512,1024,2048] + HIDDEN_CHANNEL: None + NUM_HEAD: [0,4,8,8] + STRIDE: [1,2,2,2] + DEPTH: [3,4,6,3] + NUM_CLASSES: 1000 +TRAIN: + NUM_EPOCHS: 300 + WARMUP_EPOCHS: 3 + WEIGHT_DECAY: 0.3 + BASE_LR: 0.003 + WARMUP_START_LR: 1e-6 + END_LR: 5e-4 + ACCUM_ITER: 1 + + + + diff --git a/image_classification/HaloNet/datasets.py b/image_classification/HaloNet/datasets.py new file mode 100755 index 00000000..1752a66d --- /dev/null +++ b/image_classification/HaloNet/datasets.py @@ -0,0 +1,221 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Dataset related classes and methods for ViT training and validation +Cifar10, Cifar100 and ImageNet2012 are supported +""" + +import os +import math +from PIL import Image +from paddle.io import Dataset +from paddle.io import DataLoader +from paddle.io import DistributedBatchSampler +from paddle.vision import transforms +from paddle.vision import datasets +from paddle.vision import image_load +from augment import auto_augment_policy_original +from augment import AutoAugment +from augment import rand_augment_policy_original +from augment import RandAugment +from random_erasing import RandomErasing + + +class ImageNet2012Dataset(Dataset): + """Build ImageNet2012 dataset + + This class gets train/val imagenet datasets, which loads transfomed data and labels. 
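+    The list files 'train_list.txt' (train) and 'val_list.txt' (val) under file_folder are
+    expected to contain one "<relative_image_path> <integer_label>" pair per line, which is
+    exactly what __init__ parses into img_path_list and label_list.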
+ + Attributes: + file_folder: path where imagenet images are stored + transform: preprocessing ops to apply on image + img_path_list: list of full path of images in whole dataset + label_list: list of labels of whole dataset + """ + + def __init__(self, file_folder, mode="train", transform=None): + """Init ImageNet2012 Dataset with dataset file path, mode(train/val), and transform""" + super(ImageNet2012Dataset, self).__init__() + assert mode in ["train", "val"] + self.file_folder = file_folder + self.transform = transform + self.img_path_list = [] + self.label_list = [] + + if mode == "train": + self.list_file = os.path.join(self.file_folder, "train_list.txt") + else: + self.list_file = os.path.join(self.file_folder, "val_list.txt") + + with open(self.list_file, 'r') as infile: + for line in infile: + img_path = line.strip().split()[0] + img_label = int(line.strip().split()[1]) + self.img_path_list.append(os.path.join(self.file_folder, img_path)) + self.label_list.append(img_label) + print(f'----- Imagenet2012 image {mode} list len = {len(self.label_list)}') + + def __len__(self): + return len(self.label_list) + + def __getitem__(self, index): + data = image_load(self.img_path_list[index]).convert('RGB') + data = self.transform(data) + label = self.label_list[index] + + return data, label + + +def get_train_transforms(config): + """ Get training transforms + + For training, a RandomResizedCrop is applied, then normalization is applied with + [0.5, 0.5, 0.5] mean and std. The input pixel values must be rescaled to [0, 1.] + Outputs is converted to tensor + + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + + aug_op_list = [] + # STEP1: random crop and resize + aug_op_list.append( + transforms.RandomResizedCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE), + scale=(0.05, 1.0), interpolation='bicubic')) + # STEP2: auto_augment or color jitter + if config.TRAIN.AUTO_AUGMENT: + policy = auto_augment_policy_original() + auto_augment = AutoAugment(policy) + aug_op_list.append(auto_augment) + elif config.TRAIN.RAND_AUGMENT: + policy = rand_augment_policy_original() + rand_augment = RandAugment(policy) + aug_op_list.append(rand_augment) + else: + jitter = (float(config.TRAIN.COLOR_JITTER), ) * 3 + aug_op_list.append(transforms.ColorJitter(*jitter)) + # STEP3: other ops + aug_op_list.append(transforms.ToTensor()) + aug_op_list.append(transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, + std=config.DATA.IMAGENET_STD)) + # STEP4: random erasing + if config.TRAIN.RANDOM_ERASE_PROB > 0.: + random_erasing = RandomErasing(prob=config.TRAIN.RANDOM_ERASE_PROB, + mode=config.TRAIN.RANDOM_ERASE_MODE, + max_count=config.TRAIN.RANDOM_ERASE_COUNT, + num_splits=config.TRAIN.RANDOM_ERASE_SPLIT) + aug_op_list.append(random_erasing) + # Final: compose transforms and return + transforms_train = transforms.Compose(aug_op_list) + return transforms_train + + +def get_val_transforms(config): + """ Get training transforms + + For validation, image is first Resize then CenterCrop to image_size. + Then normalization is applied with [0.5, 0.5, 0.5] mean and std. + The input pixel values must be rescaled to [0, 1.] 
+    Output is converted to a tensor.
+
+    Args:
+        config: configs contains IMAGE_SIZE, see config.py for details
+    Returns:
+        transforms_val: validation transforms
+    """
+
+    scale_size = int(math.floor(config.DATA.IMAGE_SIZE / config.DATA.CROP_PCT))
+    transforms_val = transforms.Compose([
+        transforms.Resize(scale_size, interpolation='bicubic'),
+        transforms.CenterCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE)),
+        transforms.ToTensor(),
+        transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD),
+    ])
+    return transforms_val
+
+
+def get_dataset(config, mode='train'):
+    """ Get dataset from config and mode (train/val)
+
+    Returns the related dataset object according to configs and mode(train/val)
+
+    Args:
+        config: configs contains dataset related settings. see config.py for details
+    Returns:
+        dataset: dataset object
+    """
+
+    assert mode in ['train', 'val']
+    if config.DATA.DATASET == "cifar10":
+        if mode == 'train':
+            dataset = datasets.Cifar10(mode=mode, transform=get_train_transforms(config))
+        else:
+            mode = 'test'
+            dataset = datasets.Cifar10(mode=mode, transform=get_val_transforms(config))
+    elif config.DATA.DATASET == "cifar100":
+        if mode == 'train':
+            dataset = datasets.Cifar100(mode=mode, transform=get_train_transforms(config))
+        else:
+            mode = 'test'
+            dataset = datasets.Cifar100(mode=mode, transform=get_val_transforms(config))
+    elif config.DATA.DATASET == "imagenet2012":
+        if mode == 'train':
+            dataset = ImageNet2012Dataset(config.DATA.DATA_PATH,
+                                          mode=mode,
+                                          transform=get_train_transforms(config))
+        else:
+            dataset = ImageNet2012Dataset(config.DATA.DATA_PATH,
+                                          mode=mode,
+                                          transform=get_val_transforms(config))
+    else:
+        raise NotImplementedError(
+            f"[{config.DATA.DATASET}] Only cifar10, cifar100, imagenet2012 are supported now")
+    return dataset
+
+
+def get_dataloader(config, dataset, mode='train', multi_process=False):
+    """Get dataloader with config, dataset, mode as input, allows multiGPU settings.
+
+    The multi-GPU loader is implemented with a DistributedBatchSampler.
+
+    Args:
+        config: see config.py for details
+        dataset: paddle.io.dataset object
+        mode: train/val
+        multi_process: if True, use DistributedBatchSampler to support multi-processing
+    Returns:
+        dataloader: paddle.io.DataLoader object.
+    """
+
+    if mode == 'train':
+        batch_size = config.DATA.BATCH_SIZE
+    else:
+        batch_size = config.DATA.BATCH_SIZE_EVAL
+
+    if multi_process is True:
+        sampler = DistributedBatchSampler(dataset,
+                                          batch_size=batch_size,
+                                          shuffle=(mode == 'train'))
+        dataloader = DataLoader(dataset,
+                                batch_sampler=sampler,
+                                num_workers=config.DATA.NUM_WORKERS)
+    else:
+        dataloader = DataLoader(dataset,
+                                batch_size=batch_size,
+                                num_workers=config.DATA.NUM_WORKERS,
+                                shuffle=(mode == 'train'))
+    return dataloader
diff --git a/image_classification/HaloNet/drop.py b/image_classification/HaloNet/drop.py new file mode 100755 index 00000000..5504a0b9 --- /dev/null +++ b/image_classification/HaloNet/drop.py @@ -0,0 +1,23 @@
+import paddle
+import paddle.nn as nn
+
+
+def drop_path(x, drop_prob=0., training=False):
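+    # Stochastic depth (DropPath): build a per-sample 0/1 mask by flooring
+    # keep_prob + U(0, 1), zero the dropped samples, and rescale the kept ones
+    # by 1 / keep_prob so the expected value of the output is unchanged.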
+    if drop_prob == 0. or not training:
+        return x
+    keep_prob = 1 - drop_prob
+    shape = (x.shape[0],) + (1,) * (x.ndim - 1)
+    random_tensor = keep_prob + paddle.rand(shape, dtype=x.dtype)
+    random_tensor = random_tensor.floor()
+    output = x / keep_prob * random_tensor
+    return output
+
+
+class DropPath(nn.Layer):
+    def __init__(self, drop_prob=None):
+        super(DropPath, self).__init__()
+        self.drop_prob = drop_prob
+
+    def forward(self, x):
+        return drop_path(x, self.drop_prob, self.training)
diff --git a/image_classification/HaloNet/halonet.py b/image_classification/HaloNet/halonet.py new file mode 100755 index 00000000..78b75c52 --- /dev/null +++ b/image_classification/HaloNet/halonet.py @@ -0,0 +1,680 @@
+# Copyright (c) 2021 PPViT Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+Implement Network Class for HaloNet
+"""
+
+import paddle
+from paddle import nn
+
+
+def make_divisible(v, divisor=8, min_value=None, round_limit=.9):
+    """ Round the channel dim v to the nearest multiple of divisor (not dropping below round_limit * v)
+    """
+    min_value = min_value or divisor
+    new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
+    # Make sure that round down does not go down by more than 10%.
+    if new_v < round_limit * v:
+        new_v += divisor
+    return new_v
+
+def init_weights():
+    """ init Linear weight
+    """
+    weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=.02))
+    bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0))
+    return weight_attr, bias_attr
+
+
+class Identity(nn.Layer):
+    """ Identity layer
+
+    The output of this layer is the input without any change.
+ Use this layer to avoid if condition in some forward methods + + """ + def __init__(self): + super().__init__() + + def forward(self, x): + return x + + +class ConvBnAct(nn.Layer): + """ Build layer contain: conv - bn - act + """ + def __init__(self, + in_channels, + out_channels, + kernel_size, + stride, + padding, + act=None, + bias_attr=False, + groups=1): + super().__init__() + self.conv = nn.Conv2D(in_channels=in_channels, + out_channels=out_channels, + kernel_size=kernel_size, + stride=stride, + padding=padding, + groups=groups, + weight_attr=paddle.ParamAttr( + initializer=nn.initializer.KaimingUniform()), + bias_attr=bias_attr) + self.bn = nn.BatchNorm2D(out_channels) + + self.act = ActLayer(act) + + def forward(self, inputs): + out = self.conv(inputs) + out = self.bn(out) + out = self.act(out) + return out + + +class BatchNormAct2d(nn.Layer): + """ Build layer contain: bn-act + """ + def __init__(self, chs, act=None): + super().__init__() + self.bn = nn.BatchNorm2D(chs) + self.act = ActLayer(act) + + def forward(self, inputs): + out = self.bn(inputs) + out = self.act(out) + return out + + +class ActLayer(nn.Layer): + """ Build Activation Layer according to act type + """ + def __init__(self, act): + super().__init__() + if act == 'silu': + self.act = nn.Silu() + elif act == 'relu': + self.act = nn.ReLU() + else: + self.act = Identity() + + def forward(self, x): + out = self.act(x) + return out + + +class SelectAdaptivePool2d(nn.Layer): + """ Selectable global pooling layer with dynamic input kernel size + """ + def __init__(self, output_size=1, pool_type='avg', flatten=False): + super().__init__() + # convert other false values to empty string for consistent TS typing + self.pool_type = pool_type or '' + self.flatten = nn.Flatten(1) if flatten else Identity() + if pool_type == '': + self.pool = Identity() + elif pool_type == 'avg': + self.pool = nn.AdaptiveAvgPool2D(output_size) + else: + assert False, 'Invalid pool type: %s' % pool_type + + def forward(self, x): + x = self.pool(x) + x = self.flatten(x) + return x + + +class Stem(nn.Layer): + def __init__(self, act): + super().__init__() + + self.conv1 = ConvBnAct(3, 24, kernel_size=3, stride=2, padding=1, act=act) + self.conv2 = ConvBnAct(24, 32, kernel_size=3, stride=1, padding=1, act=act) + self.conv3 = ConvBnAct(32, 64, kernel_size=3, stride=1, padding=1, act=act) + self.pool = nn.MaxPool2D(kernel_size=3, stride=2, padding=1, ceil_mode=False) + + def forward(self, x): + x = self.conv1(x) + x = self.conv2(x) + x = self.conv3(x) + x = self.pool(x) + return x + + +def rel_logits_1d(q, rel_k, permute_mask): + """ Compute relative logits along one dimension + :param q: [batch,H,W,dim] + :param rel_k: [2*window-1,dim] + :param permute_mask: permute output axis according to this + """ + + B, H, W, _ = q.shape + rel_size = rel_k.shape[0] + win_size = (rel_size+1)//2 + + rel_k = paddle.transpose(rel_k, [1, 0]) + x = (q@rel_k) + x = x.reshape([-1, W, rel_size]) + + # pad to shift from relative to absolute indexing + x_pad = paddle.nn.functional.pad(x, [0, 1],data_format='NCL') + x_pad = x_pad.flatten(1) + x_pad = x_pad.unsqueeze(1) + x_pad = paddle.nn.functional.pad(x_pad, [0, rel_size - W], data_format='NCL') + x_pad = x_pad.squeeze() + + # reshape adn slice out the padded elements + x_pad = x_pad.reshape([-1, W+1, rel_size]) #[25088,9,27] + x = x_pad[:, :W, win_size-1:] # [25088,8,14] + + # reshape and tile + x = x.reshape([B, H, 1, W, win_size]) + x = x.expand([-1, -1, win_size, -1, -1]) + x = paddle.transpose(x, permute_mask) + + 
return x + + +class RelPosEmb(nn.Layer): + """ Relative Position Embedding + """ + def __init__(self, + block_size, + win_size, + dim_head, + ): + """ + :param block_size (int): block size + :param win_size (int): neighbourhood window size + :param dim_head (int): attention head dim + :param scale (float): scale factor (for init) + """ + super().__init__() + + self.block_size = block_size + self.dim_head = dim_head + + self.rel_height = paddle.create_parameter( + shape=[(2 * win_size - 1), dim_head], + dtype='float32', + default_initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + + self.rel_width = paddle.create_parameter( + shape=[(2 * win_size - 1), dim_head], + dtype='float32', + default_initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + + def forward(self, q): + B,BB,HW,_ = q.shape + + # relative logits in width dimension + q = q.reshape([-1,self.block_size,self.block_size,self.dim_head]) + rel_logits_w = rel_logits_1d(q,self.rel_width,permute_mask=[0,1,3,2,4]) + + # relative logits in height dimension + q = paddle.transpose(q,[0,2,1,3]) + rel_logits_h = rel_logits_1d(q,self.rel_height,permute_mask=[0,3,1,4,2]) + + rel_logits = rel_logits_h+rel_logits_w + rel_logits = rel_logits.reshape([B,BB,HW,-1]) + + return rel_logits + + +class HaloAttention(nn.Layer): + """ + The internal dimensions of the attention module are controlled by + the interaction of several arguments. + the output dimension : dim_out + the value(v) dimension : dim_out//num_heads + the query(q) and key(k) dimensions are determined by : + * num_heads*dim_head + * num_heads*(dim_out*attn_ratio//num_heads) + the ratio of q and k relative to the output : attn_ratio + + Args: + dim (int): input dimension to the module + dim_out (int): output dimension of the module, same as dim if not set + feat_size (Tuple[int, int]): size of input feature_map (not used, for arg compat with bottle/lambda) + stride: output stride of the module, query downscaled if > 1 (default: 1). + num_heads: parallel attention heads (default: 8). + dim_head: dimension of query and key heads, calculated from dim_out * attn_ratio // num_heads if not set + block_size (int): size of blocks. (default: 8) + halo_size (int): size of halo overlap. (default: 3) + qk_ratio (float): ratio of q and k dimensions to output dimension when dim_head not set. 
(default: 1.0) + qkv_bias (bool) : add bias to q, k, and v projections + avg_down (bool): use average pool downsample instead of strided query blocks + scale_pos_embed (bool): scale the position embedding as well as Q @ K + """ + def __init__(self, + dim, + dim_out=None, + feat_size=None, + stride=1, + num_heads=8, + dim_head=None, + block_size=8, + halo_size=3, + qk_ratio=1.0, + qkv_bias=False, + avg_down=False, + scale_pos_embed=False): + + super().__init__() + dim_out = dim_out or dim + assert dim_out % num_heads == 0 + self.stride = stride + self.num_heads = num_heads + self.dim_head_qk = make_divisible(dim_out * qk_ratio, divisor=8) // num_heads + self.dim_head_v = dim_out // self.num_heads + self.dim_out_qk = num_heads * self.dim_head_qk + self.dim_out_v = num_heads * self.dim_head_v + self.scale = self.dim_head_qk ** -0.5 + self.scale_pos_embed = scale_pos_embed + self.block_size = self.block_size_ds = block_size + self.halo_size = halo_size + self.win_size = block_size + halo_size * 2 # neighbourhood window size + self.block_stride = stride + use_avg_pool = False + if stride > 1: + use_avg_pool = avg_down or block_size % stride != 0 + self.block_stride = stride + self.block_size_ds = self.block_size // self.block_stride + self.q = nn.Conv2D(dim, + self.dim_out_qk, + 1, + stride=self.block_stride, + bias_attr=qkv_bias, + weight_attr=paddle.ParamAttr(initializer=nn.initializer.KaimingUniform())) + self.kv = nn.Conv2D(dim, self.dim_out_qk + self.dim_out_v, 1, bias_attr=qkv_bias) + self.pos_embed = RelPosEmb( + block_size=self.block_size_ds, win_size=self.win_size, dim_head=self.dim_head_qk) + self.pool = nn.AvgPool2D(2, 2) if use_avg_pool else Identity() + + def forward(self, x): + B, C, H, W = x.shape + assert H % self.block_size == 0 and W % self.block_size == 0, 'fmap dimensions must be divisible by the block size' + num_h_blocks = H//self.block_size + num_w_blocks = W//self.block_size + num_blocks = num_h_blocks * num_w_blocks + + q = self.q(x) + # unfold + q = q.reshape([-1,self.dim_head_qk,num_h_blocks,self.block_size_ds,num_w_blocks,self.block_size_ds]) + q = paddle.transpose(q,[0,1,3,5,2,4]) + q = q.reshape([B*self.num_heads,self.dim_head_qk,-1,num_blocks]) + q = paddle.transpose(q,[0,3,2,1]) # B*num_heads,num_blocks,block_size**2, dim_head + kv = self.kv(x) + + # # generate overlap windows for kv---solution 1 unfold.unfold + # kv = paddle.nn.functional.pad(kv, [self.halo_size, self.halo_size, self.halo_size, self.halo_size]) # [bs,dim_out,pad_H,pad_W] + # kv = self.kv_unfold(kv) # dimension=2 + + # # another function to generate overlap windows for kv---solution 2 is xla + # WW = self.win_size ** 2 + # pw = paddle.eye(WW, dtype=x.dtype).reshape([WW, 1, self.win_size, self.win_size]) + # kv = paddle.nn.functional.conv2d(kv.reshape([-1, 1, H, W]), pw, stride=self.block_size, padding=self.halo_size) + + # the other function to generate overlap between windows for kv --- solution 3 + kv_unfold = nn.Unfold([self.win_size,self.win_size], strides=self.block_size, paddings=self.halo_size) + kv = kv_unfold(kv) + + kv = kv.reshape([B * self.num_heads, self.dim_head_qk + self.dim_head_v, -1, num_blocks,]) + kv = paddle.transpose(kv,[0, 3, 2, 1]) + + k, v = paddle.split(kv, [self.dim_head_qk, self.dim_head_v], axis=-1) + k = paddle.transpose(k,[0,1,3,2]) + + if self.scale_pos_embed: + attn = (q@k + self.pos_embed(q)) * self.scale + else: + pos_embed_q = self.pos_embed(q) + part_1 = (q @ k) * self.scale + attn = part_1 + pos_embed_q + # attn: B * num_heads, num_blocks, block_size ** 2, 
win_size ** 2 + softmax_fn = nn.layer.Softmax(-1) + attn = softmax_fn(attn) + attn = attn @ v + + out = paddle.transpose(attn,[0,3,2,1]) # B * num_heads, dim_head_v, block_size ** 2, num_blocks + # fold + out = out.reshape([-1, self.block_size_ds, self.block_size_ds, num_h_blocks, num_w_blocks]) + out = paddle.transpose(out,[0, 3, 1, 4, 2]) + out = out.reshape( + [B, self.dim_out_v, H // self.block_stride, W // self.block_stride]) + # B, dim_out, H // block_stride, W // block_stride + out = self.pool(out) + return out + + +class BottleneckBlock(nn.Layer): + """ ResNet-like Bottleneck Block - 1x1 - kxk - 1x1 + """ + def __init__(self, + in_chs, + out_chs, + stride, + act, + downsample=None, + shortcut=None, + ): + super().__init__() + + self.stride = stride + mid_chs = out_chs//4 + + self.conv1_1x1 = ConvBnAct(in_chs, + mid_chs, + kernel_size=1, + stride=1, + padding=0, + act=act) + self.conv2_kxk = ConvBnAct(mid_chs, + mid_chs, + kernel_size=3, + stride=self.stride, + padding=1, + act=act) + self.conv2b_kxk = Identity() + self.conv3_1x1 = ConvBnAct(mid_chs, + out_chs, + kernel_size=1, + stride=1, + padding=0) + + self.attn = Identity() + self.attn_last = Identity() + self.shortcut = shortcut + + if self.shortcut: + if downsample: + self.creat_shortcut = ConvBnAct(in_chs, + out_chs, + kernel_size=1, + stride=self.stride, + padding=0) + else: + self.creat_shortcut = ConvBnAct(in_chs, + out_chs, + kernel_size=1, + stride=1, + padding=0) + + self.Identity = Identity() + self.act = ActLayer(act) + + def forward(self, x): + h = x + x = self.conv1_1x1(x) + x = self.conv2_kxk(x) + x = self.conv2b_kxk(x) + x = self.attn(x) + x = self.conv3_1x1(x) + out = self.attn_last(x) + if self.shortcut: + h = self.creat_shortcut(h) + else: + h = self.Identity(h) + out = out + h + out = self.act(out) + return out + + +class SelfAttnBlock(nn.Layer): + """ ResNet-like Bottleneck Block - 1x1 -kxk - self attn -1x1 + """ + def __init__(self, + chs, + num_heads, + block_size, + halo_size, + act, + stride=None, + shortcut=None, + hidden_chs=None, + ): + super().__init__() + mid_chs = chs//4 + + if hidden_chs is None: + out_chs = chs + else: + out_chs = hidden_chs + + if stride is None: + self.stride = 1 + else: + self.stride = stride + + self.conv1_1x1 = ConvBnAct(out_chs, mid_chs, kernel_size=1, stride=1, padding=0,act=act) + self.conv2_kxk = Identity() + self.conv3_1x1 = ConvBnAct(mid_chs, chs, kernel_size=1, stride=1, padding=0) + + self.self_attn = HaloAttention(mid_chs, + dim_out=mid_chs, + block_size=block_size, + halo_size=halo_size, + num_heads=num_heads, + stride=self.stride) + self.post_attn = BatchNormAct2d(mid_chs,act=act) + + self.shortcut = shortcut + if self.shortcut: + self.creat_shortcut = ConvBnAct(out_chs, + chs, + kernel_size=1, + stride=self.stride, + padding=0) + self.Identity = Identity() + self.act = ActLayer(act=act) + + def forward(self, x): + h = x + out = self.conv1_1x1(x) + out = self.self_attn(out) + out = self.post_attn(out) + out = self.conv3_1x1(out) + if self.shortcut: + h = self.creat_shortcut(h) + else: + h = self.Identity(h) + out = out + h + out = self.act(out) + return out + + +class HaloStage(nn.Layer): + """ Stage layers for HaloNet. Stage layers contains a number of Blocks. 
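+    Each entry of block_types selects the block built at that depth: 'bottle' creates a
+    BottleneckBlock and 'attn' creates a SelfAttnBlock (attention blocks are only added
+    when num_head > 0 for the stage).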
+ """ + def __init__(self, + block_types, + block_size, + halo_size, + depth, + channel, + out_channel, + stride, + num_head, + act, + hidden_chs=None, + downsample=None, + ): + super().__init__() + + self.depth = depth + + blocks = [] + + for idx in range(depth): + if idx == 0: + shortcut = True + in_channel = channel + if downsample is None: + self.down = False + else: + self.down = downsample + block_stride = stride + self.hidden = hidden_chs + else: + stride = 1 + shortcut = False + in_channel = out_channel + self.down = False + block_stride = 1 + self.hidden = None + + block_type = block_types[idx] + if block_type == 'bottle': + blocks.append( + BottleneckBlock( + in_chs=in_channel, + out_chs=out_channel, + stride=block_stride, + shortcut=shortcut, + downsample=self.down, + act=act, + ) + ) + + if block_type == 'attn': + if num_head > 0: + blocks.append( + SelfAttnBlock( + chs=out_channel, + stride=stride, + num_heads=num_head, + block_size=block_size, + halo_size=halo_size, + hidden_chs=self.hidden, + shortcut=shortcut, + act=act, + ) + ) + + self.blocks = nn.LayerList(blocks) + + def forward(self, x): + for stage in self.blocks: + x = stage(x) + return x + + +class HaloNet(nn.Layer): + """ Define main structure of HaloNet: stem - blocks - head + """ + def __init__(self, + depth_list, + block_size, + halo_size, + stage1_block, + stage2_block, + stage3_block, + stage4_block, + chs_list, + num_heads, + num_classes, + stride_list, + hidden_chs, + act, + ): + super().__init__() + self.stem = Stem(act) + self.stage1 = HaloStage( + block_types=stage1_block, + block_size=block_size, + halo_size=halo_size, + depth=depth_list[0], + channel=chs_list[0], + out_channel=chs_list[1], + stride=stride_list[0], + num_head=num_heads[0], + hidden_chs=hidden_chs, + act=act, + ) + self.stage2 = HaloStage( + block_types=stage2_block, + block_size=block_size, + halo_size=halo_size, + depth=depth_list[1], + channel=chs_list[1], + out_channel=chs_list[2], + stride=stride_list[1], + num_head=num_heads[1], + hidden_chs=hidden_chs, + act=act, + downsample=True) + self.stage3 = HaloStage( + block_types=stage3_block, + block_size=block_size, + halo_size=halo_size, + depth=depth_list[2], + channel=chs_list[2], + out_channel=chs_list[3], + stride=stride_list[2], + num_head=num_heads[2], + hidden_chs=hidden_chs, + act=act, + downsample=True) + self.stage4 = HaloStage( + block_types=stage4_block, + block_size=block_size, + halo_size=halo_size, + depth=depth_list[3], + channel=chs_list[3], + out_channel=chs_list[4], + stride=stride_list[3], + num_head=num_heads[3], + hidden_chs=hidden_chs, + act=act, + downsample=True) + + w_attr_1, b_attr_1 = init_weights() + self.classifier = nn.Sequential( + SelectAdaptivePool2d(flatten=True), + nn.Linear(chs_list[4], num_classes, weight_attr=w_attr_1, bias_attr=b_attr_1), + Identity() + ) + + def forward(self, x): + x = self.stem(x) + out_stage1 = self.stage1(x) + out_stage2 = self.stage2(out_stage1) + out_stage3 = self.stage3(out_stage2) + out_stage4 = self.stage4(out_stage3) + out = self.classifier(out_stage4) + return out + + +def build_halonet(config): + """ Build HaloNet by reading options in config object + :param config: config instance contains setting options + :return: HaloNet model + """ + model = HaloNet(depth_list=config.MODEL.DEPTH, + stage1_block=config.MODEL.STAGE1_BLOCK, + stage2_block=config.MODEL.STAGE2_BLOCK, + stage3_block=config.MODEL.STAGE3_BLOCK, + stage4_block=config.MODEL.STAGE4_BLOCK, + chs_list=config.MODEL.CHANNEL, + num_heads=config.MODEL.NUM_HEAD, + 
num_classes=config.MODEL.NUM_CLASSES, + stride_list=config.MODEL.STRIDE, + block_size=config.MODEL.BLOCK_SIZE, + halo_size=config.MODEL.HALO_SIZE, + hidden_chs=config.MODEL.HIDDEN_CHANNEL, + act=config.MODEL.ACT, + ) + return model diff --git a/image_classification/HaloNet/img1.png b/image_classification/HaloNet/img1.png new file mode 100644 index 00000000..69fd047a Binary files /dev/null and b/image_classification/HaloNet/img1.png differ diff --git a/image_classification/HaloNet/losses.py b/image_classification/HaloNet/losses.py new file mode 100644 index 00000000..082467a3 --- /dev/null +++ b/image_classification/HaloNet/losses.py @@ -0,0 +1,123 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" Implement Loss functions """ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class LabelSmoothingCrossEntropyLoss(nn.Layer): + """ cross entropy loss for label smoothing + Args: + smoothing: float, smoothing rate + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, target label with shape [N] + Return: + loss: float, cross entropy loss value + """ + def __init__(self, smoothing=0.1): + super().__init__() + assert 0 <= smoothing < 1.0 + self.smoothing = smoothing + self.confidence = 1 - smoothing + + def forward(self, x, target): + log_probs = F.log_softmax(x) # [N, num_classes] + # target_index is used to get prob for each of the N samples + target_index = paddle.zeros([x.shape[0], 2], dtype='int64') # [N, 2] + target_index[:, 0] = paddle.arange(x.shape[0]) + target_index[:, 1] = target + + nll_loss = -log_probs.gather_nd(index=target_index) # index: [N] + smooth_loss = -log_probs.mean(axis=-1) + loss = self.confidence * nll_loss + self.smoothing * smooth_loss + return loss.mean() + + +class SoftTargetCrossEntropyLoss(nn.Layer): + """ cross entropy loss for soft target + Args: + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, soft target with shape [N, num_classes] + Returns: + loss: float, the mean loss value + """ + def __init__(self): + super().__init__() + + def forward(self, x, target): + loss = paddle.sum(-target * F.log_softmax(x, axis=-1), axis=-1) + return loss.mean() + + +class DistillationLoss(nn.Layer): + """Distillation loss function + This layer includes the orginal loss (criterion) and a extra + distillation loss (criterion), which computes the loss with + different type options, between current model and + a teacher model as its supervision. 
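+    Note: the HaloNet training scripts in this diff do not import DistillationLoss;
+    it appears to be included for consistency with the other PaddleViT classification models.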
+ + Args: + base_criterion: nn.Layer, the original criterion + teacher_model: nn.Layer, the teacher model as supervision + distillation_type: str, one of ['none', 'soft', 'hard'] + alpha: float, ratio of base loss (* (1-alpha)) + and distillation loss( * alpha) + tao: float, temperature in distillation + """ + def __init__(self, + base_criterion, + teacher_model, + distillation_type, + alpha, + tau): + super().__init__() + assert distillation_type in ['none', 'soft', 'hard'] + self.base_criterion = base_criterion + self.teacher_model = teacher_model + self.type = distillation_type + self.alpha = alpha + self.tau = tau + + def forward(self, inputs, outputs, targets): + """ + Args: + inputs: tensor, the orginal model inputs + outputs: tensor, the outputs of the model + outputds_kd: tensor, the distillation outputs of the model, + this is usually obtained by a separate branch + in the last layer of the model + targets: tensor, the labels for the base criterion + """ + outputs, outputs_kd = outputs[0], outputs[1] + base_loss = self.base_criterion(outputs, targets) + if self.type == 'none': + return base_loss + + with paddle.no_grad(): + teacher_outputs = self.teacher_model(inputs) + + if self.type == 'soft': + distillation_loss = F.kl_div( + F.log_softmax(outputs_kd / self.tau, axis=1), + F.log_softmax(teacher_outputs / self.tau, axis=1), + reduction='sum') * (self.tau * self.tau) / outputs_kd.numel() + elif self.type == 'hard': + distillation_loss = F.cross_entropy(outputs_kd, teacher_outputs.argmax(axis=1)) + + loss = base_loss * (1 - self.alpha) + distillation_loss * self.alpha + return loss + + diff --git a/image_classification/HaloNet/main_multi_gpu.py b/image_classification/HaloNet/main_multi_gpu.py new file mode 100755 index 00000000..6090bc78 --- /dev/null +++ b/image_classification/HaloNet/main_multi_gpu.py @@ -0,0 +1,584 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
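+
+# Typical launches (a sketch; the dataset path and weight-file name below are placeholders,
+# while the flags match get_arguments() in this file):
+#   train: python main_multi_gpu.py -cfg ./configs/halonet_50ts_256.yaml \
+#              -dataset imagenet2012 -data_path /path/to/imagenet -batch_size 32
+#   eval:  python main_multi_gpu.py -cfg ./configs/halonet_50ts_256.yaml \
+#              -dataset imagenet2012 -data_path /path/to/imagenet -eval \
+#              -pretrained ./halonet_50ts_256   # pretrained path is given without '.pdparams'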
+ +"""HaloNet training/validation using multiple GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +import paddle.distributed as dist +from datasets import get_dataloader +from datasets import get_dataset +from utils import AverageMeter +from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn +from config import get_config +from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from halonet import build_halonet as build_model + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('HaloNet') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + mixup_fn=None, + amp=False, + local_logger=None, + master_logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + master_train_loss_meter.avg: float, average loss on all processes/gpus + master_train_acc_meter.avg: float, average top1 accuracy on all processes/gpus + train_time: float, training time + """ + model.train() + 
train_loss_meter = AverageMeter() + train_acc_meter = AverageMeter() + master_train_loss_meter = AverageMeter() + master_train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id + 1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + # NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + # loss = loss / accum_iter + loss.backward() + + if ((batch_id + 1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = paddle.to_tensor(image.shape[0]) + + # sync from other gpus for overall loss and acc + master_loss = loss.clone() + master_acc = acc.clone() + master_batch_size = batch_size.clone() + dist.all_reduce(master_loss) + dist.all_reduce(master_acc) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc = master_acc / dist.get_world_size() + master_train_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_train_acc_meter.update(master_acc.numpy()[0], master_batch_size.numpy()[0]) + + train_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + train_acc_meter.update(acc.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_train_loss_meter.avg:.4f}, " + + f"Avg Acc: {master_train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return (train_loss_meter.avg, + train_acc_meter.avg, + master_train_loss_meter.avg, + master_train_acc_meter.avg, + train_time) + + +def validate(dataloader, + model, + criterion, + total_batch, + debug_steps=100, + local_logger=None, + master_logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + total_epoch: int, total num of epoch, for logging + debug_steps: int, num of iters to log info, default: 100 + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + master_val_loss_meter.avg: float, average loss on all 
processes/gpus + master_val_acc1_meter.avg: float, average top1 accuracy on all processes/gpus + master_val_acc5_meter.avg: float, average top5 accuracy on all processes/gpus + val_time: float, validation time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = AverageMeter() + master_val_loss_meter = AverageMeter() + master_val_acc1_meter = AverageMeter() + master_val_acc5_meter = AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = paddle.to_tensor(image.shape[0]) + + master_loss = loss.clone() + master_acc1 = acc1.clone() + master_acc5 = acc5.clone() + master_batch_size = batch_size.clone() + + dist.all_reduce(master_loss) + dist.all_reduce(master_acc1) + dist.all_reduce(master_acc5) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc1 = master_acc1 / dist.get_world_size() + master_acc5 = master_acc5 / dist.get_world_size() + + master_val_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc1_meter.update(master_acc1.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc5_meter.update(master_acc5.numpy()[0], master_batch_size.numpy()[0]) + + val_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + val_acc1_meter.update(acc1.numpy()[0], batch_size.numpy()[0]) + val_acc5_meter.update(acc5.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {master_val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {master_val_acc5_meter.avg:.4f}") + val_time = time.time() - time_st + return (val_loss_meter.avg, + val_acc1_meter.avg, + val_acc5_meter.avg, + master_val_loss_meter.avg, + master_val_acc1_meter.avg, + master_val_acc5_meter.avg, + val_time) + + +def main_worker(*args): + # STEP 0: Preparation + config = args[0] + dist.init_parallel_env() + last_epoch = config.TRAIN.LAST_EPOCH + world_size = dist.get_world_size() + local_rank = dist.get_rank() + seed = config.SEED + local_rank + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + # logger for each process/gpu + local_logger = get_logger( + filename=os.path.join(config.SAVE, 'log_{}.txt'.format(local_rank)), + logger_name='local_logger') + # overall logger + if local_rank == 0: + master_logger = get_logger( + filename=os.path.join(config.SAVE, 'log.txt'), + logger_name='master_logger') + master_logger.info(f'\n{config}') + else: + master_logger = None + local_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + if local_rank == 0: + master_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + + # STEP 1: Create model + model = build_model(config) + model = paddle.DataParallel(model) + + # STEP 2: Create train and val dataloader + dataset_train, dataset_val = args[1], args[2] + # Create training dataloader + if 
not config.EVAL:
+        dataloader_train = get_dataloader(config, dataset_train, 'train', True)
+        total_batch_train = len(dataloader_train)
+        local_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}')
+        if local_rank == 0:
+            master_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}')
+    # Create validation dataloader
+    dataloader_val = get_dataloader(config, dataset_val, 'test', True)
+    total_batch_val = len(dataloader_val)
+    local_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}')
+    if local_rank == 0:
+        master_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}')
+
+    # STEP 3: Define Mixup function
+    mixup_fn = None
+    if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None:
+        mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA,
+                         cutmix_alpha=config.TRAIN.CUTMIX_ALPHA,
+                         cutmix_minmax=config.TRAIN.CUTMIX_MINMAX,
+                         prob=config.TRAIN.MIXUP_PROB,
+                         switch_prob=config.TRAIN.MIXUP_SWITCH_PROB,
+                         mode=config.TRAIN.MIXUP_MODE,
+                         label_smoothing=config.TRAIN.SMOOTHING,
+                         num_classes=config.MODEL.NUM_CLASSES)
+
+    # STEP 4: Define criterion
+    if config.TRAIN.MIXUP_PROB > 0.:
+        criterion = SoftTargetCrossEntropyLoss()
+    elif config.TRAIN.SMOOTHING:
+        criterion = LabelSmoothingCrossEntropyLoss()
+    else:
+        criterion = nn.CrossEntropyLoss()
+    # only use cross entropy for val
+    criterion_val = nn.CrossEntropyLoss()
+
+    # STEP 5: Define optimizer and lr_scheduler
+    # scale lr according to batch size and world size (adapted from the Swin Transformer training code)
+    if config.TRAIN.LINEAR_SCALED_LR is not None:
+        linear_scaled_lr = (
+            config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR
+        linear_scaled_warmup_start_lr = (
+            config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR
+        linear_scaled_end_lr = (
+            config.TRAIN.END_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR
+
+        if config.TRAIN.ACCUM_ITER > 1:
+            linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER
+            linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER
+            linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER
+
+        config.TRAIN.BASE_LR = linear_scaled_lr
+        config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr
+        config.TRAIN.END_LR = linear_scaled_end_lr
+
+    scheduler = None
+    if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine":
+        scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR,
+                                          warmup_start_lr=config.TRAIN.WARMUP_START_LR,
+                                          start_lr=config.TRAIN.BASE_LR,
+                                          end_lr=config.TRAIN.END_LR,
+                                          warmup_epochs=config.TRAIN.WARMUP_EPOCHS,
+                                          total_epochs=config.TRAIN.NUM_EPOCHS,
+                                          last_epoch=config.TRAIN.LAST_EPOCH,
+                                          )
+    elif config.TRAIN.LR_SCHEDULER.NAME == "cosine":
+        scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR,
+                                                             T_max=config.TRAIN.NUM_EPOCHS,
+                                                             last_epoch=last_epoch)
+    elif config.TRAIN.LR_SCHEDULER.NAME == "multi-step":
+        milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")]
+        scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR,
+                                                       milestones=milestones,
+                                                       gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE,
+                                                       last_epoch=last_epoch)
+    else:
+        local_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.")
+        if local_rank == 0:
+            master_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.")
+        raise
NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip, + apply_decay_param_fun=get_exclude_from_weight_decay_fn([ + 'absolute_pos_embed', 'relative_position_bias_table']), + ) + else: + local_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 6: Load pretrained model / load resumt model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED + '.pdparams') + model.set_dict(model_state) + local_logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + if local_rank == 0: + master_logger.info( + f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME + '.pdopt') + optimizer.set_state_dict(opt_state) + local_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + if local_rank == 0: + master_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + + # STEP 7: Validation (eval mode) + if config.EVAL: + local_logger.info('----- Start Validating') + if local_rank == 0: + master_logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 8: Start training and validation (train mode) + local_logger.info(f"Start training from epoch {last_epoch + 1}.") + if local_rank == 0: + 
master_logger.info(f"Start training from epoch {last_epoch + 1}.") + for epoch in range(last_epoch + 1, config.TRAIN.NUM_EPOCHS + 1): + # train + local_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + if local_rank == 0: + master_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + train_loss, train_acc, avg_loss, avg_acc, train_time = train( + dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=total_batch_train, + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + mixup_fn=mixup_fn, + amp=config.AMP, + local_logger=local_logger, + master_logger=master_logger) + + scheduler.step() + + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {avg_loss:.4f}, " + + f"Train Acc: {avg_acc:.4f}, " + + f"time: {train_time:.2f}") + + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + local_logger.info(f'----- Validation after Epoch: {epoch}') + if local_rank == 0: + master_logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if local_rank == 0: + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + master_logger.info(f"----- Save model: {model_path}.pdparams") + master_logger.info(f"----- Save optim: {model_path}.pdopt") + + +def main(): + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + + # get dataset and start DDP + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + else: + dataset_train = None + dataset_val = get_dataset(config, mode='val') + config.NGPUS = len(paddle.static.cuda_places()) if config.NGPUS == -1 else config.NGPUS + dist.spawn(main_worker, args=(config, dataset_train, dataset_val,), nprocs=config.NGPUS) + + +if __name__ == "__main__": + main() diff --git 
a/image_classification/HaloNet/main_single_gpu.py b/image_classification/HaloNet/main_single_gpu.py new file mode 100755 index 00000000..0edb7149 --- /dev/null +++ b/image_classification/HaloNet/main_single_gpu.py @@ -0,0 +1,427 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""HaloNet training/validation using single GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from datasets import get_dataloader +from datasets import get_dataset +from utils import AverageMeter +from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn +from config import get_config +from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from halonet import build_halonet as build_model + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('HaloNet') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + mixup_fn=None, + amp=False, + logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: 
int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + logger: logger for logging, default: None + Returns: + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + train_time: float, training time + """ + model.train() + train_loss_meter = AverageMeter() + train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id + 1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + # NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + # loss = loss / accum_iter + loss.backward() + + if ((batch_id + 1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = image.shape[0] + train_loss_meter.update(loss.numpy()[0], batch_size) + train_acc_meter.update(acc.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return train_loss_meter.avg, train_acc_meter.avg, train_time + + +def validate(dataloader, model, criterion, total_batch, debug_steps=100, logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + logger: logger for logging, default: None + Returns: + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + val_time: float, valitaion time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = image.shape[0] + val_loss_meter.update(loss.numpy()[0], batch_size) + 
val_acc1_meter.update(acc1.numpy()[0], batch_size) + val_acc5_meter.update(acc5.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + + val_time = time.time() - time_st + return val_loss_meter.avg, val_acc1_meter.avg, val_acc5_meter.avg, val_time + + +def main(): + # STEP 0: Preparation + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + last_epoch = config.TRAIN.LAST_EPOCH + seed = config.SEED + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + logger = get_logger(filename=os.path.join(config.SAVE, 'log.txt')) + logger.info(f'\n{config}') + + # STEP 1: Create model + model = build_model(config) + + # STEP 2: Create train and val dataloader + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + dataloader_train = get_dataloader(config, dataset_train, 'train', False) + dataset_val = get_dataset(config, mode='val') + dataloader_val = get_dataloader(config, dataset_val, 'val', False) + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 6: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from official code) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + 
warmup_epochs=config.TRAIN.WARMUP_EPOCHS,
+                                          total_epochs=config.TRAIN.NUM_EPOCHS,
+                                          last_epoch=config.TRAIN.LAST_EPOCH,
+                                          )
+    elif config.TRAIN.LR_SCHEDULER.NAME == "cosine":
+        scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR,
+                                                             T_max=config.TRAIN.NUM_EPOCHS,
+                                                             last_epoch=last_epoch)
+    elif config.TRAIN.LR_SCHEDULER.NAME == "multi-step":
+        milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")]
+        scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR,
+                                                       milestones=milestones,
+                                                       gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE,
+                                                       last_epoch=last_epoch)
+    else:
+        logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.")
+        raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.")
+
+    if config.TRAIN.OPTIMIZER.NAME == "SGD":
+        if config.TRAIN.GRAD_CLIP:
+            clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP)
+        else:
+            clip = None
+        optimizer = paddle.optimizer.Momentum(
+            parameters=model.parameters(),
+            learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR,
+            weight_decay=config.TRAIN.WEIGHT_DECAY,
+            momentum=config.TRAIN.OPTIMIZER.MOMENTUM,
+            grad_clip=clip)
+    elif config.TRAIN.OPTIMIZER.NAME == "AdamW":
+        if config.TRAIN.GRAD_CLIP:
+            clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP)
+        else:
+            clip = None
+        optimizer = paddle.optimizer.AdamW(
+            parameters=model.parameters(),
+            learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR,
+            beta1=config.TRAIN.OPTIMIZER.BETAS[0],
+            beta2=config.TRAIN.OPTIMIZER.BETAS[1],
+            weight_decay=config.TRAIN.WEIGHT_DECAY,
+            epsilon=config.TRAIN.OPTIMIZER.EPS,
+            grad_clip=clip,
+            apply_decay_param_fun=get_exclude_from_weight_decay_fn([
+                'absolute_pos_embed', 'relative_position_bias_table']),
+            )
+    else:
+        logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.")
+        raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.")
+
+    # STEP 6: Load pretrained model or load resume model and optimizer states
+    if config.MODEL.PRETRAINED:
+        if (config.MODEL.PRETRAINED).endswith('.pdparams'):
+            raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams')
+        assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True
+        model_state = paddle.load(config.MODEL.PRETRAINED + '.pdparams')
+        model.set_dict(model_state)
+        logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}")
+
+    if config.MODEL.RESUME:
+        assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True
+        assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True
+        model_state = paddle.load(config.MODEL.RESUME + '.pdparams')
+        model.set_dict(model_state)
+        opt_state = paddle.load(config.MODEL.RESUME + '.pdopt')
+        optimizer.set_state_dict(opt_state)
+        logger.info(
+            f"----- Resume: Load model and optimizer from {config.MODEL.RESUME}")
+
+    # STEP 7: Validation (eval mode)
+    if config.EVAL:
+        logger.info('----- Start Validating')
+        val_loss, val_acc1, val_acc5, val_time = validate(
+            dataloader=dataloader_val,
+            model=model,
+            criterion=criterion_val,
+            total_batch=len(dataloader_val),
+            debug_steps=config.REPORT_FREQ,
+            logger=logger)
+        logger.info(f"Validation Loss: {val_loss:.4f}, " +
+                    f"Validation Acc@1: {val_acc1:.4f}, " +
+                    f"Validation Acc@5: {val_acc5:.4f}, " +
+                    f"time: {val_time:.2f}")
+        return
+
+    # STEP 8: Start training and validation (train mode)
+    logger.info(f"Start training from epoch {last_epoch + 1}.")
+    for epoch in range(last_epoch + 1,
config.TRAIN.NUM_EPOCHS + 1): + # train + logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + train_loss, train_acc, train_time = train(dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=len(dataloader_train), + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + mixup_fn=mixup_fn, + amp=config.AMP, + logger=logger) + scheduler.step() + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + logger.info(f"----- Save model: {model_path}.pdparams") + logger.info(f"----- Save optim: {model_path}.pdopt") + + +if __name__ == "__main__": + main() diff --git a/image_classification/HaloNet/mixup.py b/image_classification/HaloNet/mixup.py new file mode 100644 index 00000000..1d2db493 --- /dev/null +++ b/image_classification/HaloNet/mixup.py @@ -0,0 +1,225 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""mixup and cutmix for batch data""" +import numpy as np +import paddle + + +def rand_bbox(image_shape, lam, count=None): + """ CutMix bbox by lam value + Generate 1 random bbox by value lam. lam is the cut size rate. + The cut_size is computed by sqrt(1-lam) * image_size. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + count: int, number of bbox to generate + """ + image_h, image_w = image_shape[-2:] + cut_rate = np.sqrt(1. 
- lam) + cut_h = int(cut_rate * image_h) + cut_w = int(cut_rate * image_w) + + # get random bbox center + cy = np.random.randint(0, image_h, size=count) + cx = np.random.randint(0, image_w, size=count) + + # get bbox coords + bbox_x1 = np.clip(cx - cut_w // 2, 0, image_w) + bbox_y1 = np.clip(cy - cut_h // 2, 0, image_h) + bbox_x2 = np.clip(cx + cut_w // 2, 0, image_w) + bbox_y2 = np.clip(cy + cut_h // 2, 0, image_h) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # while in pytorch, it will return [] tensor + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def rand_bbox_minmax(image_shape, minmax, count=None): + """ CutMix bbox by min and max value + Generate 1 random bbox by min and max percentage values. + Minmax is a tuple/list of min and max percentage vlaues + applied to the image width and height. + + Args: + image_shape: tuple/list, image height and width + minmax: tuple/list, min and max percentage values of image size + count: int, number of bbox to generate + """ + assert len(minmax) == 2 + image_h, image_w = image_shape[-2:] + min_ratio = minmax[0] + max_ratio = minmax[1] + cut_h = np.random.randint(int(image_h * min_ratio), int(image_h * max_ratio), size=count) + cut_w = np.random.randint(int(image_w * min_ratio), int(image_w * max_ratio), size=count) + + bbox_x1 = np.random.randint(0, image_w - cut_w, size=count) + bbox_y1 = np.random.randint(0, image_h - cut_h, size=count) + bbox_x2 = bbox_x1 + cut_w + bbox_y2 = bbox_y1 + cut_h + + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def cutmix_generate_bbox_adjust_lam(image_shape, lam, minmax=None, correct_lam=True, count=None): + """Generate bbox and apply correction for lambda + If the mimmax is None, apply the standard cutmix by lam value, + If the minmax is set, apply the cutmix by min and max percentage values. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + minmax: tuple/list, min and max percentage values of image size + correct_lam: bool, if True, correct the lam value by the generated bbox + count: int, number of bbox to generate + """ + if minmax is not None: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox_minmax(image_shape, minmax, count) + else: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox(image_shape, lam, count) + + if correct_lam or minmax is not None: + image_h, image_w = image_shape[-2:] + bbox_area = (bbox_y2 - bbox_y1) * (bbox_x2 - bbox_x1) + lam = 1. - bbox_area / float(image_h * image_w) + return (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam + + +def one_hot(x, num_classes, on_value=1., off_value=0.): + """ Generate one-hot vector for label smoothing + Args: + x: tensor, contains label/class indices + num_classes: int, num of classes (len of the one-hot vector) + on_value: float, the vector value at label index, default=1. + off_value: float, the vector value at non-label indices, default=0. + Returns: + one_hot: tensor, tensor with on value at label index and off value + at non-label indices. + """ + x = x.reshape_([-1, 1]) + x_smoothed = paddle.full((x.shape[0], num_classes), fill_value=off_value) + for i in range(x.shape[0]): + x_smoothed[i, x[i]] = on_value + return x_smoothed + + +def mixup_one_hot(label, num_classes, lam=1., smoothing=0.): + """ mixup and label smoothing in batch + label smoothing is firstly applied, then + mixup is applied by mixing the bacth and its flip, + with a mixup rate. 
+ + Args: + label: tensor, label tensor with shape [N], contains the class indices + num_classes: int, num of all classes + lam: float, mixup rate, default=1.0 + smoothing: float, label smoothing rate + """ + off_value = smoothing / num_classes + on_value = 1. - smoothing + off_value + y1 = one_hot(label, num_classes, on_value, off_value) + y2 = one_hot(label.flip(axis=[0]), num_classes, on_value, off_value) + return y2 * (1 - lam) + y1 * lam + + +class Mixup: + """Mixup class + Args: + mixup_alpha: float, mixup alpha for beta distribution, default=1.0, + cutmix_alpha: float, cutmix alpha for beta distribution, default=0.0, + cutmix_minmax: list/tuple, min and max value for cutmix ratio, default=None, + prob: float, if random prob < prob, do not use mixup, default=1.0, + switch_prob: float, prob of switching mixup and cutmix, default=0.5, + mode: string, mixup up, now only 'batch' is supported, default='batch', + correct_lam: bool, if True, apply correction of lam, default=True, + label_smoothing: float, label smoothing rate, default=0.1, + num_classes: int, num of classes, default=1000 + """ + def __init__(self, + mixup_alpha=1.0, + cutmix_alpha=0.0, + cutmix_minmax=None, + prob=1.0, + switch_prob=0.5, + mode='batch', + correct_lam=True, + label_smoothing=0.1, + num_classes=1000): + self.mixup_alpha = mixup_alpha + self.cutmix_alpha = cutmix_alpha + self.cutmix_minmax = cutmix_minmax + if cutmix_minmax is not None: + assert len(cutmix_minmax) == 2 + self.cutmix_alpha = 1.0 + self.mix_prob = prob + self.switch_prob = switch_prob + self.label_smoothing = label_smoothing + self.num_classes = num_classes + self.mode = mode + self.correct_lam = correct_lam + assert mode == 'batch', 'Now only batch mode is supported!' + + def __call__(self, x, target): + assert x.shape[0] % 2 == 0, "Batch size should be even" + lam = self._mix_batch(x) + target = mixup_one_hot(target, self.num_classes, lam, self.label_smoothing) + return x, target + + def get_params(self): + """Decide to use cutmix or regular mixup by sampling and + sample lambda for mixup + """ + lam = 1. + use_cutmix = False + use_mixup = np.random.rand() < self.mix_prob + if use_mixup: + if self.mixup_alpha > 0. and self.cutmix_alpha > 0.: + use_cutmix = np.random.rand() < self.switch_prob + alpha = self.cutmix_alpha if use_cutmix else self.mixup_alpha + lam_mix = np.random.beta(alpha, alpha) + elif self.mixup_alpha == 0. and self.cutmix_alpha > 0.: + use_cutmix=True + lam_mix = np.random.beta(self.cutmix_alpha, self.cutmix_alpha) + elif self.mixup_alpha > 0. 
and self.cutmix_alpha == 0.: + lam_mix = np.random.beta(self.mixup_alpha, self.mixup_alpha) + else: + raise ValueError('mixup_alpha and cutmix_alpha cannot be all 0') + lam = float(lam_mix) + return lam, use_cutmix + + def _mix_batch(self, x): + """mixup/cutmix by adding batch data and its flipped version""" + lam, use_cutmix = self.get_params() + if lam == 1.: + return lam + if use_cutmix: + (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam = cutmix_generate_bbox_adjust_lam( + x.shape, + lam, + minmax=self.cutmix_minmax, + correct_lam=self.correct_lam) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # but in pytorch, it will return [] tensor without errors + if int(bbox_x1) != int(bbox_x2) and int(bbox_y1) != int(bbox_y2): + x[:, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] = x.flip(axis=[0])[ + :, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] + else: + x_flipped = x.flip(axis=[0]) + x_flipped = x_flipped * (1 - lam) + x.set_value(x * (lam) + x_flipped) + return lam diff --git a/image_classification/HaloNet/port_weights/__init__.py b/image_classification/HaloNet/port_weights/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/image_classification/HaloNet/port_weights/load_halonet_pytorch_weights.py b/image_classification/HaloNet/port_weights/load_halonet_pytorch_weights.py new file mode 100755 index 00000000..39c073e5 --- /dev/null +++ b/image_classification/HaloNet/port_weights/load_halonet_pytorch_weights.py @@ -0,0 +1,207 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
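+
+"""Port HaloNet weights from timm (PyTorch) to PaddleViT (PaddlePaddle)
+
+This script builds the Paddle HaloNet from a config yaml, creates the matching
+timm model, copies parameters and BN buffers according to the name mapping in
+torch_to_paddle_mapping() (2-D weights are transposed, relative position
+embeddings are copied as-is), checks that both models produce numerically close
+outputs on a random input, and saves the converted weights as a .pdparams file.
+"""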
+ +import argparse +import numpy as np +import paddle +import torch +import timm +from halonet import build_halonet +import os +from config import * +import json + +# model_name = 'halonet_50ts_256' +model_name = 'halonet_26t_256' +sz = int(model_name[-3::]) + +config = get_config(f'./configs/{model_name}.yaml') + + +def print_model_named_params(model): + print('----------------------------------') + for name, param in model.named_parameters(): + print(name, param.shape) + print('----------------------------------') + + +def print_model_named_buffers(model): + print('----------------------------------') + for name, param in model.named_buffers(): + print(name, param.shape) + print('----------------------------------') + + +def torch_to_paddle_mapping(): + mapping = [ + ('head.fc','classifier.1'), + ('stem.conv1.conv', 'stem.conv1.conv'), + ('stem.conv1.bn', 'stem.conv1.bn'), + ('stem.conv2.conv', 'stem.conv2.conv'), + ('stem.conv2.bn', 'stem.conv2.bn'), + ('stem.conv3.conv', 'stem.conv3.conv'), + ('stem.conv3.bn', 'stem.conv3.bn'), + ] + + # torch 'layers' to paddle 'stages' + # depths = config.MODEL.TRANS.STAGE_DEPTHS + depths = [3,4,6,3] + num_stages = len(depths) + for stage_idx in range(num_stages): + th_s_prefix = f'stages.{stage_idx}' + pp_s_prefix = f'stage{stage_idx+1}.blocks' + for block_idx in range(depths[stage_idx]): + th_b_prefix = f'{th_s_prefix}.{block_idx}' + pp_b_prefix = f'{pp_s_prefix}.{block_idx}' + layer_mapping = [ + (f'{th_b_prefix}.conv1_1x1.conv', f'{pp_b_prefix}.conv1_1x1.conv'), + (f'{th_b_prefix}.conv1_1x1.bn', f'{pp_b_prefix}.conv1_1x1.bn'), + (f'{th_b_prefix}.conv2_kxk.conv', f'{pp_b_prefix}.conv2_kxk.conv'), + (f'{th_b_prefix}.conv2_kxk.bn', f'{pp_b_prefix}.conv2_kxk.bn'), + (f'{th_b_prefix}.conv3_1x1.conv', f'{pp_b_prefix}.conv3_1x1.conv'), + (f'{th_b_prefix}.conv3_1x1.bn', f'{pp_b_prefix}.conv3_1x1.bn'), + (f'{th_b_prefix}.shortcut.conv', f'{pp_b_prefix}.creat_shortcut.conv'), + (f'{th_b_prefix}.shortcut.bn', f'{pp_b_prefix}.creat_shortcut.bn'), + (f'{th_b_prefix}.self_attn.q', f'{pp_b_prefix}.self_attn.q'), + (f'{th_b_prefix}.self_attn.kv', f'{pp_b_prefix}.self_attn.kv'), + (f'{th_b_prefix}.self_attn.pos_embed.height_rel', f'{pp_b_prefix}.self_attn.pos_embed.rel_height'), + (f'{th_b_prefix}.self_attn.pos_embed.width_rel', f'{pp_b_prefix}.self_attn.pos_embed.rel_width'), + (f'{th_b_prefix}.post_attn', f'{pp_b_prefix}.post_attn.bn'), + ] + mapping.extend(layer_mapping) + + return mapping + + + +def convert(torch_model, paddle_model): + new_pd_params = [] + def _set_value(th_name, pd_name, no_transpose=False): + if (pd_name == 'classifier.1.weight'): + th_shape = th_params[th_name].shape + pd_shape = tuple(pd_params[pd_name].shape) + if(th_name in th_params.keys() and pd_name in pd_params.keys()): + th_shape = th_params[th_name].shape + pd_shape = tuple(pd_params[pd_name].shape) # paddle shape default type is list + # assert th_shape == pd_shape, f'{th_shape} != {pd_shape}' + print(f'set {th_name} {th_shape} to {pd_name} {pd_shape}') + value = th_params[th_name].data.numpy() + if len(value.shape) == 2: + if not no_transpose: + value = value.transpose((1, 0)) + pd_params[pd_name].set_value(value) + new_pd_params.append(pd_name) + else: + print('%s not in th_params'%(th_name)) + print('%s not in pd_params'%(pd_name)) + + # 1. 
get paddle and torch model parameters + pd_params = {} + th_params = {} + for name, param in paddle_model.named_parameters(): + pd_params[name] = param + for name, param in torch_model.named_parameters(): + th_params[name] = param + for name, param in paddle_model.named_buffers(): + pd_params[name] = param + for name, param in torch_model.named_buffers(): + th_params[name] = param + + + # 2. get name mapping pairs + mapping = torch_to_paddle_mapping() + + # 3. set torch param values to paddle params: may needs transpose on weights + for th_name, pd_name in mapping: + if th_name in th_params.keys(): # nn.Parameters + if th_name.endswith('height_rel'): + _set_value(th_name, pd_name, no_transpose=True) + elif th_name.endswith('width_rel'): + _set_value(th_name, pd_name, no_transpose=True) + else: + _set_value(th_name, pd_name) + else: # weight & bias + if f'{th_name}.weight' in th_params.keys(): + th_name_w = f'{th_name}.weight' + pd_name_w = f'{pd_name}.weight' + _set_value(th_name_w, pd_name_w) + + if f'{th_name}.bias' in th_params.keys(): + th_name_b = f'{th_name}.bias' + pd_name_b = f'{pd_name}.bias' + _set_value(th_name_b, pd_name_b) + + if f'{th_name}.running_mean' in th_params.keys(): + th_name_b = f'{th_name}.running_mean' + pd_name_b = f'{pd_name}._mean' + _set_value(th_name_b, pd_name_b) + + if f'{th_name}.running_var' in th_params.keys(): + th_name_b = f'{th_name}.running_var' + pd_name_b = f'{pd_name}._variance' + _set_value(th_name_b, pd_name_b) + + return paddle_model,new_pd_params + + +def main(): + paddle.set_device('cpu') + paddle_model = build_halonet(config) + paddle_model.eval() + + print(paddle_model) + print_model_named_params(paddle_model) + print_model_named_buffers(paddle_model) + + print('+++++++++++++++++++++++++++++++++++') + device = torch.device('cpu') + # torch_model = timm.create_model('halonet50ts',pretrained=True) + torch_model = timm.create_model('halonet26t', pretrained=True) + torch_model = torch_model.to(device) + torch_model.eval() + + print(torch_model) + print_model_named_params(torch_model) + print_model_named_buffers(torch_model) + print('+++++++++++++++++++++++++++++++++++') + + # convert weights + paddle_model,new_pd_params_list = convert(torch_model, paddle_model) + + # check correctness + x = np.random.randn(2, 3, sz, sz).astype('float32') + x_paddle = paddle.to_tensor(x) + x_torch = torch.Tensor(x).to(device) + + out_paddle = paddle_model(x_paddle) + out_paddle = out_paddle.cpu().numpy() + + out_torch = torch_model(x_torch) + out_torch = out_torch.detach().cpu().numpy() + + # for out_paddle,out_torch in zip(out_paddle_np,out_torch_np): + out_diff = np.allclose(out_torch, out_paddle, atol=1e-5) + print(out_diff) + print(np.sum(out_torch),np.sum(out_paddle)) + + assert np.allclose(out_torch, out_paddle, atol = 1e-5) + + # save weights for paddle model + model_path = os.path.join(f'./{model_name}.pdparams') + paddle.save(paddle_model.state_dict(), model_path) + print('all done') + + +if __name__ == "__main__": + main() diff --git a/image_classification/HaloNet/random_erasing.py b/image_classification/HaloNet/random_erasing.py new file mode 100644 index 00000000..31eea465 --- /dev/null +++ b/image_classification/HaloNet/random_erasing.py @@ -0,0 +1,118 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Random Erasing for image tensor""" + +import random +import math +import paddle + + +def _get_pixels(per_pixel, rand_color, patch_size, dtype="float32"): + if per_pixel: + return paddle.normal(shape=patch_size).astype(dtype) + if rand_color: + return paddle.normal(shape=(patch_size[0], 1, 1)).astype(dtype) + return paddle.zeros((patch_size[0], 1, 1)).astype(dtype) + + +class RandomErasing(object): + """ + Args: + prob: probability of performing random erasing + min_area: Minimum percentage of erased area wrt input image area + max_area: Maximum percentage of erased area wrt input image area + min_aspect: Minimum aspect ratio of earsed area + max_aspect: Maximum aspect ratio of earsed area + mode: pixel color mode, in ['const', 'rand', 'pixel'] + 'const' - erase block is constant valued 0 for all channels + 'rand' - erase block is valued random color (same per-channel) + 'pixel' - erase block is vauled random color per pixel + min_count: Minimum # of ereasing blocks per image. + max_count: Maximum # of ereasing blocks per image. Area per box is scaled by count + per-image count is randomly chosen between min_count to max_count + """ + def __init__(self, prob=0.5, min_area=0.02, max_area=1/3, min_aspect=0.3, max_aspect=None, + mode='const', min_count=1, max_count=None, num_splits=0): + self.prob = prob + self.min_area = min_area + self.max_area = max_area + max_aspect = max_aspect or 1 / min_aspect + self.log_aspect_ratio = (math.log(min_aspect), math.log(max_aspect)) + self.min_count = min_count + self.max_count = max_count or min_count + self.num_splits = num_splits + mode = mode.lower() + self.rand_color = False + self.per_pixel = False + if mode == "rand": + self.rand_color = True + elif mode == "pixel": + self.per_pixel = True + else: + assert not mode or mode == "const" + + def _erase(self, img, chan, img_h, img_w, dtype): + if random.random() > self.prob: + return + area = img_h * img_w + count = self.min_count if self.min_count == self.max_count else \ + random.randint(self.min_count, self.max_count) + for _ in range(count): + for attempt in range(10): + target_area = random.uniform(self.min_area, self.max_area) * area / count + aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio)) + h = int(round(math.sqrt(target_area * aspect_ratio))) + w = int(round(math.sqrt(target_area / aspect_ratio))) + if w < img_w and h < img_h: + top = random.randint(0, img_h - h) + left = random.randint(0, img_w - w) + img[:, top:top+h, left:left+w] = _get_pixels( + self.per_pixel, self.rand_color, (chan, h, w), + dtype=dtype) + break + + def __call__(self, input): + if len(input.shape) == 3: + self._erase(input, *input.shape, input.dtype) + else: + batch_size, chan, img_h, img_w = input.shape + batch_start = batch_size // self.num_splits if self.num_splits > 1 else 0 + for i in range(batch_start, batch_size): + self._erase(input[i], chan, img_h, img_w, input.dtype) + return input + + + +#def main(): +# re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='rand') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='const') +# #re = RandomErasing(prob=1.0, 
min_area=0.2, max_area=0.6, mode='pixel') +# import PIL.Image as Image +# import numpy as np +# paddle.set_device('cpu') +# img = paddle.to_tensor(np.asarray(Image.open('./lenna.png'))).astype('float32') +# img = img / 255.0 +# img = paddle.transpose(img, [2, 0, 1]) +# new_img = re(img) +# new_img = new_img * 255.0 +# new_img = paddle.transpose(new_img, [1, 2, 0]) +# new_img = new_img.cpu().numpy() +# new_img = Image.fromarray(new_img.astype('uint8')) +# new_img.save('./res.png') +# +# +# +#if __name__ == "__main__": +# main() diff --git a/image_classification/HaloNet/run_eval.sh b/image_classification/HaloNet/run_eval.sh new file mode 100755 index 00000000..81685bc7 --- /dev/null +++ b/image_classification/HaloNet/run_eval.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg='./configs/halonet_50ts_256.yaml' \ + -dataset='imagenet2012' \ + -batch_size=16 \ + -data_path='/dataset/imagenet' \ + -eval \ + -pretrained='./halonet_50ts_256' diff --git a/image_classification/HaloNet/run_eval_multi.sh b/image_classification/HaloNet/run_eval_multi.sh new file mode 100755 index 00000000..87e79ca4 --- /dev/null +++ b/image_classification/HaloNet/run_eval_multi.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg='./configs/halonet_50ts_256.yaml' \ + -dataset='imagenet2012' \ + -batch_size=256 \ + -data_path='/dataset/imagenet' \ + -eval \ + -pretrained='./halonet_50ts_256' diff --git a/image_classification/HaloNet/run_train.sh b/image_classification/HaloNet/run_train.sh new file mode 100755 index 00000000..f7596587 --- /dev/null +++ b/image_classification/HaloNet/run_train.sh @@ -0,0 +1,6 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg='./configs/halonet_26t_256.yaml' \ + -dataset='imagenet2012' \ + -batch_size=16 \ + -data_path='/dataset/imagenet' \ diff --git a/image_classification/HaloNet/run_train_multi.sh b/image_classification/HaloNet/run_train_multi.sh new file mode 100755 index 00000000..54dfa85c --- /dev/null +++ b/image_classification/HaloNet/run_train_multi.sh @@ -0,0 +1,7 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg='./configs/halonet_50ts_256.yaml' \ + -dataset='imagenet2012' \ + -batch_size=16 \ + -data_path='/dataset/imagenet' \ +# -amp diff --git a/image_classification/HaloNet/utils.py b/image_classification/HaloNet/utils.py new file mode 100755 index 00000000..44800527 --- /dev/null +++ b/image_classification/HaloNet/utils.py @@ -0,0 +1,120 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +"""utils for ViT + +Contains AverageMeter for monitoring, get_exclude_from_decay_fn for training +and WarmupCosineScheduler for training + +""" + +import math +from paddle.optimizer.lr import LRScheduler + + +class AverageMeter(): + """ Meter for monitoring losses""" + def __init__(self): + self.avg = 0 + self.sum = 0 + self.cnt = 0 + self.reset() + + def reset(self): + """reset all values to zeros""" + self.avg = 0 + self.sum = 0 + self.cnt = 0 + + def update(self, val, n=1): + """update avg by val and n, where val is the avg of n values""" + self.sum += val * n + self.cnt += n + self.avg = self.sum / self.cnt + + + +def get_exclude_from_weight_decay_fn(exclude_list=[]): + """ Set params with no weight decay during the training + + For certain params, e.g., positional encoding in ViT, weight decay + may not needed during the learning, this method is used to find + these params. + + Args: + exclude_list: a list of params names which need to exclude + from weight decay. + Returns: + exclude_from_weight_decay_fn: a function returns True if param + will be excluded from weight decay + """ + if len(exclude_list) == 0: + exclude_from_weight_decay_fn = None + else: + def exclude_fn(param): + for name in exclude_list: + if param.endswith(name): + return False + return True + exclude_from_weight_decay_fn = exclude_fn + return exclude_from_weight_decay_fn + + +class WarmupCosineScheduler(LRScheduler): + """Warmup Cosine Scheduler + + First apply linear warmup, then apply cosine decay schedule. + Linearly increase learning rate from "warmup_start_lr" to "start_lr" over "warmup_epochs" + Cosinely decrease learning rate from "start_lr" to "end_lr" over remaining + "total_epochs - warmup_epochs" + + Attributes: + learning_rate: the starting learning rate (without warmup), not used here! + warmup_start_lr: warmup starting learning rate + start_lr: the starting learning rate (without warmup) + end_lr: the ending learning rate after whole loop + warmup_epochs: # of epochs for warmup + total_epochs: # of total epochs (include warmup) + """ + def __init__(self, + learning_rate, + warmup_start_lr, + start_lr, + end_lr, + warmup_epochs, + total_epochs, + cycles=0.5, + last_epoch=-1, + verbose=False): + """init WarmupCosineScheduler """ + self.warmup_epochs = warmup_epochs + self.total_epochs = total_epochs + self.warmup_start_lr = warmup_start_lr + self.start_lr = start_lr + self.end_lr = end_lr + self.cycles = cycles + super(WarmupCosineScheduler, self).__init__(learning_rate, last_epoch, verbose) + + def get_lr(self): + """ return lr value """ + if self.last_epoch < self.warmup_epochs: + val = (self.start_lr - self.warmup_start_lr) * float( + self.last_epoch)/float(self.warmup_epochs) + self.warmup_start_lr + return val + + progress = float(self.last_epoch - self.warmup_epochs) / float( + max(1, self.total_epochs - self.warmup_epochs)) + val = max(0.0, 0.5 * (1. + math.cos(math.pi * float(self.cycles) * 2.0 * progress))) + val = max(0.0, val * (self.start_lr - self.end_lr) + self.end_lr) + return val diff --git a/image_classification/MAE/README.md b/image_classification/MAE/README.md new file mode 100644 index 00000000..8db9f25b --- /dev/null +++ b/image_classification/MAE/README.md @@ -0,0 +1,174 @@ +# TODO: This README should be modified +# An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, [arxiv](https://arxiv.org/abs/2010.11929) + +PaddlePaddle training/validation code and pretrained models for **ViT**. 
+ +The official TF implementation is [here](https://github.com/google-research/vision_transformer). + +This implementation is developed by [PaddleViT](https://github.com/BR-IDL/PaddleViT.git). + + +

+<p align="center">ViT Model Overview</p>

+ + +### Update +- Update (2021-09-27): More weights are uploaded. +- Update (2021-08-11): Code is released and ported weights are uploaded. + +## Models Zoo +| Model | Acc@1 | Acc@5 | #Params | FLOPs | Image Size | Crop_pct | Interpolation | Link | +|-------------------------------|-------|-------|---------|--------|------------|----------|---------------|--------------| +| vit_base_patch32_224 | 80.68 | 95.61 | 88.2M | 4.4G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1DPEhEuu9sDdcmOPukQbR7ZcHq2bxx9cr/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1ppOLj5SWlJmA-NjoLCoYIw)(ubyr) | +| vit_base_patch32_384 | 83.35 | 96.84 | 88.2M | 12.7G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1nCOSwrDiFBFmTkLEThYwjL9SfyzkKoaf/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1jxnL00ocpmdiPM4fOu4lpg)(3c2f) | +| vit_base_patch16_224 | 84.58 | 97.30 | 86.4M | 17.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/13D9FqU4ISsGxWXURgKW9eLOBV-pYPr-L/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1ms3o2fHMQpIoVqnEHitRtA)(qv4n) | +| vit_base_patch16_384 | 85.99 | 98.00 | 86.4M | 49.8G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1kWKaAgneDx0QsECxtf7EnUdUZej6vSFT/view?usp=sharing)/[baidu](https://pan.baidu.com/s/15ggLdiL98RPcz__SXorrXA)(wsum) | +| vit_large_patch16_224 | 85.81 | 97.82 | 304.1M | 59.9G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1jgwtmtp_cDWEhZE-FuWhs7lCdpqhAMft/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1HRxUJAwEiKgrWnJSjHyU0A)(1bgk) | +| vit_large_patch16_384 | 87.08 | 98.30 | 304.1M | 175.9G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1zfw5mdiIm-mPxxQddBFxt0xX-IR-PF2U/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1KvxfIpMeitgXAUZGr5HV8A)(5t91) | +| vit_large_patch32_384 | 81.51 | 96.09 | 306.5M | 44.4G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1Py1EX3E35jL7DComW-29Usg9788BB26j/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1W8sUs0pObOGpohP4vsT05w)(ieg3) | +| | | | | | | | | | + +> *The results are evaluated on ImageNet2012 validation set. + +## Notebooks +We provide a few notebooks in aistudio to help you get started: + +**\*(coming soon)\*** + + +## Requirements +- Python>=3.6 +- yaml>=0.2.5 +- [PaddlePaddle](https://www.paddlepaddle.org.cn/documentation/docs/en/install/index_en.html)>=2.1.0 +- [yacs](https://github.com/rbgirshick/yacs)>=0.1.8 + +## Data +ImageNet2012 dataset is used in the following folder structure: +``` +│imagenet/ +├──train/ +│ ├── n01440764 +│ │ ├── n01440764_10026.JPEG +│ │ ├── n01440764_10027.JPEG +│ │ ├── ...... +│ ├── ...... +├──val/ +│ ├── n01440764 +│ │ ├── ILSVRC2012_val_00000293.JPEG +│ │ ├── ILSVRC2012_val_00002138.JPEG +│ │ ├── ...... +│ ├── ...... +``` + +## Usage +To use the model with pretrained weights, download the `.pdparam` weight file and change related file paths in the following python scripts. The model config files are located in `./configs/`. 
+ +For example, assume the downloaded weight file is stored in `./vit_base_patch16_224.pdparams`, to use the `vit_base_patch16_224` model in python: +```python +from config import get_config +from transformer import build_vit as build_model +# config files in ./configs/ +config = get_config('./configs/vit_base_patch16_224.yaml') +# build model +model = build_model(config) +# load pretrained weights, .pdparams is NOT needed +model_state_dict = paddle.load('./vit_base_patch16_224.pdparams') +model.set_dict(model_state_dict) +``` + +## Evaluation +To evaluate ViT model performance on ImageNet2012 with a single GPU, run the following script using command line: +```shell +sh run_eval.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg='./configs/vit_base_patch16_224.yaml' \ + -dataset='imagenet2012' \ + -batch_size=16 \ + -data_path='/dataset/imagenet' \ + -eval \ + -pretrained='./vit_base_patch16_224.pdparams' +``` + +
+ + +Run evaluation using multi-GPUs: + + + +```shell +sh run_eval_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg='./configs/vit_base_patch16_224.yaml' \ + -dataset='imagenet2012' \ + -batch_size=16 \ + -data_path='/dataset/imagenet' \ + -eval \ + -pretrained='./vit_base_patch16_224.pdparams' +``` + +
+ + +## Training +To train the ViT model on ImageNet2012 with single GPU, run the following script using command line: +```shell +sh run_train.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg='./configs/vit_base_patch16_224.yaml' \ + -dataset='imagenet2012' \ + -batch_size=32 \ + -data_path='/dataset/imagenet' \ +``` + + +
+ + +Run training using multi-GPUs: + + + +```shell +sh run_train_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg='./configs/vit_base_patch16_224.yaml' \ + -dataset='imagenet2012' \ + -batch_size=16 \ + -data_path='/dataset/imagenet' \ +``` + +
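+
+## Resume Training
+The argument parser handled by `config.py` also accepts `-resume` and `-last_epoch`. A minimal sketch for resuming an interrupted run is shown below, assuming the MAE training scripts expose the same command-line flags as the other PaddleViT classification models; the checkpoint prefix is a placeholder and should point to your own saved `.pdparams`/`.pdopt` pair (the extension is NOT needed):
+```shell
+CUDA_VISIBLE_DEVICES=0 \
+python main_single_gpu.py \
+    -cfg='./configs/vit_base_patch16_224_finetune.yaml' \
+    -dataset='imagenet2012' \
+    -batch_size=32 \
+    -data_path='/dataset/imagenet' \
+    -resume='./output/train-xxx/MAE-Epoch-10-Loss-xxx' \
+    -last_epoch=10
+```
+Mixed-precision training can be requested with the `-amp` flag; `config.py` only honors it during training and disables it in eval mode.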
+ + + +## Visualization Attention Map +**(coming soon)** + +## Reference +``` +@article{dosovitskiy2020image, + title={An image is worth 16x16 words: Transformers for image recognition at scale}, + author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and others}, + journal={arXiv preprint arXiv:2010.11929}, + year={2020} +} +``` diff --git a/image_classification/MAE/augment.py b/image_classification/MAE/augment.py new file mode 100644 index 00000000..7a7f081c --- /dev/null +++ b/image_classification/MAE/augment.py @@ -0,0 +1,285 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Augmentation""" +""" Rand Augmentation """ +# reference: RandAugment: Practical automated data augmentation with a reduced search space +# https://arxiv.org/abs/1909.13719 + +""" Auto Augmentation """ +# reference: AutoAugment: Learning Augmentation Policies from Data +# https://arxiv.org/abs/1805.09501 + +import random +import numpy as np +from PIL import Image, ImageEnhance, ImageOps + + +def auto_augment_policy_original(): + """25 types of augment policies in original paper""" + policy = [ + [('Posterize', 0.4, 8), ('Rotate', 0.6, 9)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + [('Posterize', 0.6, 7), ('Posterize', 0.6, 6)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Equalize', 0.4, 4), ('Rotate', 0.8, 8)], + [('Solarize', 0.6, 3), ('Equalize', 0.6, 7)], + [('Posterize', 0.8, 5), ('Equalize', 1.0, 2)], + [('Rotate', 0.2, 3), ('Solarize', 0.6, 8)], + [('Equalize', 0.6, 8), ('Posterize', 0.4, 6)], + [('Rotate', 0.8, 8), ('Color', 0.4, 0)], + [('Rotate', 0.4, 9), ('Equalize', 0.6, 2)], + [('Equalize', 0.0, 7), ('Equalize', 0.8, 8)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Rotate', 0.8, 8), ('Color', 1.0, 2)], + [('Color', 0.8, 8), ('Solarize', 0.8, 7)], + [('Sharpness', 0.4, 7), ('Invert', 0.6, 8)], + [('ShearX', 0.6, 5), ('Equalize', 1.0, 9)], + [('Color', 0.4, 0), ('Equalize', 0.6, 3)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + ] + policy = [[SubPolicy(*args) for args in subpolicy] for subpolicy in policy] + return policy + + +def rand_augment_policy_original(magnitude_idx=9): + """ + 14 types of augment policies in original paper + Args: + magnitude_idx: M + """ + policy = [ + ('Posterize', 1, magnitude_idx), ('Rotate', 1, magnitude_idx), + ('Solarize', 1, magnitude_idx), ('AutoContrast', 1, magnitude_idx), + ('Equalize', 1, magnitude_idx), ('Contrast', 1, magnitude_idx), + ('Color', 1, magnitude_idx), ('Invert', 1, magnitude_idx), + ('Sharpness', 1, magnitude_idx), 
('Brightness', 1, magnitude_idx), + ('ShearX', 1, magnitude_idx), ('ShearY', 1, magnitude_idx), + ('TranslateX', 1, magnitude_idx), ('TranslateY', 1, magnitude_idx), + ] + policy = [SubPolicy(*args) for args in policy] + return policy + + +class AutoAugment(): + """Auto Augment + Randomly choose a tuple of augment ops from a list of policy + Then apply the tuple of augment ops to input image + + Examples: + policy = auto_augment_policy_original() + augment = AutoAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy): + self.policy = policy + + def __call__(self, image, policy_idx=None): + if policy_idx is None: + policy_idx = random.randint(0, len(self.policy) - 1) + + sub_policy = self.policy[policy_idx] + for op in sub_policy: + image = op(image) + return image + + +class RandAugment(): + """Rand Augment + Randomly choose N augment ops from a list of K policies + Then apply the N ops to input image + + Examples: + policy = rand_augment_policy_original(magnitude_idx) + augment = RandAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy, num_layers=2): + """ + Args: + policy: list of SubPolicy + num_layers: int + """ + self.policy = policy + self.num_layers = num_layers + + def __call__(self, image): + selected_idx = np.random.choice(len(self.policy), self.num_layers) + + for policy_idx in selected_idx: + sub_policy = self.policy[policy_idx] + image = sub_policy(image) + return image + + +class SubPolicy: + """Subpolicy + Read augment name and magnitude, apply augment with probability + Args: + op_name: str, augment operation name + prob: float, if prob > random prob, apply augment + magnitude_idx: int, index of magnitude in preset magnitude ranges + """ + + def __init__(self, op_name, prob, magnitude_idx): + # ranges of operations' magnitude + ranges = { + 'ShearX': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'ShearY': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'TranslateX': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'TranslateY': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'Rotate': np.linspace(0, 30, 10), # [-30, 30] (by random negative) + 'Color': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Posterize': np.round(np.linspace(8, 4, 10), 0).astype(np.int), # [0, 4] + 'Solarize': np.linspace(256, 0, 10), # [0, 256] + 'Contrast': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Sharpness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Brightness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'AutoContrast': [0] * 10, # no range + 'Equalize': [0] * 10, # no range + 'Invert': [0] * 10, # no range + } + + # augmentation operations + # Lambda is not pickleable for DDP + # image_ops = { + # 'ShearX': lambda image, magnitude: shear_x(image, magnitude), + # 'ShearY': lambda image, magnitude: shear_y(image, magnitude), + # 'TranslateX': lambda image, magnitude: translate_x(image, magnitude), + # 'TranslateY': lambda image, magnitude: translate_y(image, magnitude), + # 'Rotate': lambda image, magnitude: rotate(image, magnitude), + # 'AutoContrast': lambda image, magnitude: auto_contrast(image, magnitude), + # 'Invert': lambda image, magnitude: invert(image, magnitude), + # 'Equalize': lambda image, magnitude: equalize(image, magnitude), + # 'Solarize': lambda image, magnitude: solarize(image, magnitude), + # 'Posterize': lambda image, magnitude: posterize(image, magnitude), + # 
'Contrast': lambda image, magnitude: contrast(image, magnitude), + # 'Color': lambda image, magnitude: color(image, magnitude), + # 'Brightness': lambda image, magnitude: brightness(image, magnitude), + # 'Sharpness': lambda image, magnitude: sharpness(image, magnitude), + # } + image_ops = { + 'ShearX': shear_x, + 'ShearY': shear_y, + 'TranslateX': translate_x_relative, + 'TranslateY': translate_y_relative, + 'Rotate': rotate, + 'AutoContrast': auto_contrast, + 'Invert': invert, + 'Equalize': equalize, + 'Solarize': solarize, + 'Posterize': posterize, + 'Contrast': contrast, + 'Color': color, + 'Brightness': brightness, + 'Sharpness': sharpness, + } + + self.prob = prob + self.magnitude = ranges[op_name][magnitude_idx] + self.op = image_ops[op_name] + + def __call__(self, image): + if self.prob > random.random(): + image = self.op(image, self.magnitude) + return image + + +# PIL Image transforms +# https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.transform +def shear_x(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, factor, 0, 0, 1, 0), fillcolor=fillcolor) + + +def shear_y(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, factor, 1, 0), fillcolor=fillcolor) + + +def translate_x_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, pixels, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, pixels), fillcolor=fillcolor) + + +def translate_x_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, magnitude, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, magnitude), fillcolor=fillcolor) + + +def rotate(image, magnitude): + rot = image.convert("RGBA").rotate(magnitude) + return Image.composite(rot, + Image.new('RGBA', rot.size, (128,) * 4), + rot).convert(image.mode) + + +def auto_contrast(image, magnitude=None): + return ImageOps.autocontrast(image) + + +def invert(image, magnitude=None): + return ImageOps.invert(image) + + +def equalize(image, magnitude=None): + return ImageOps.equalize(image) + + +def solarize(image, magnitude): + return ImageOps.solarize(image, magnitude) + + +def posterize(image, magnitude): + return ImageOps.posterize(image, magnitude) + + +def contrast(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Contrast(image).enhance(1 + magnitude) + + +def color(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Color(image).enhance(1 + magnitude) + + +def brightness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Brightness(image).enhance(1 + magnitude) + + 
+def sharpness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Sharpness(image).enhance(1 + magnitude) + diff --git a/image_classification/MAE/config.py b/image_classification/MAE/config.py new file mode 100644 index 00000000..2cec91aa --- /dev/null +++ b/image_classification/MAE/config.py @@ -0,0 +1,184 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Configuration + +Configuration for data, model archtecture, and training, etc. +Config can be set by .yaml file or by argparser(limited usage) + + +""" +import os +from yacs.config import CfgNode as CN +import yaml + +_C = CN() +_C.BASE = [''] + +# data settings +_C.DATA = CN() +_C.DATA.BATCH_SIZE = 256 # 256 # train batch_size for single GPU +_C.DATA.BATCH_SIZE_EVAL = 8 # 64 # val batch_size for single GPU +_C.DATA.DATA_PATH = '/dataset/imagenet/' # path to dataset +_C.DATA.DATASET = 'imagenet2012' # dataset name +_C.DATA.IMAGE_SIZE = 224 # input image size: 224 for pretrain, 384 for finetune +# input image scale ratio, scale is applied before centercrop in eval mode +_C.DATA.CROP_PCT = 0.875 +_C.DATA.NUM_WORKERS = 4 # number of data loading threads +_C.DATA.IMAGENET_MEAN = [0.485, 0.456, 0.406] # [0.5, 0.5, 0.5] +_C.DATA.IMAGENET_STD = [0.229, 0.224, 0.225] # [0.5, 0.5, 0.5] + +# model settings +_C.MODEL = CN() +_C.MODEL.TYPE = 'MAE' +_C.MODEL.NAME = 'MAE' +_C.MODEL.RESUME = None +_C.MODEL.PRETRAINED = None +_C.MODEL.NUM_CLASSES = 1000 +_C.MODEL.DROPOUT = 0.0 +_C.MODEL.DROPPATH = 0.0 +_C.MODEL.ATTENTION_DROPOUT = 0.0 +_C.MODEL.MAE_PRETRAIN = True + +# transformer settings +_C.MODEL.TRANS = CN() +_C.MODEL.TRANS.PATCH_SIZE = 16 +_C.MODEL.TRANS.MLP_RATIO = 4.0 +_C.MODEL.TRANS.QKV_BIAS = True +_C.MODEL.TRANS.MASK_RATIO = 0.75 +_C.MODEL.TRANS.ENCODER = CN() +_C.MODEL.TRANS.ENCODER.DEPTH = 12 +_C.MODEL.TRANS.ENCODER.EMBED_DIM = 768 +_C.MODEL.TRANS.ENCODER.NUM_HEADS = 12 +_C.MODEL.TRANS.DECODER = CN() +_C.MODEL.TRANS.DECODER.DEPTH = 8 +_C.MODEL.TRANS.DECODER.EMBED_DIM = 512 +_C.MODEL.TRANS.DECODER.NUM_HEADS = 8 + + +# training settings (for Vit-L/16 pretrain) +_C.TRAIN = CN() +_C.TRAIN.LAST_EPOCH = 0 +_C.TRAIN.NUM_EPOCHS = 800 +_C.TRAIN.WARMUP_EPOCHS = 40 +_C.TRAIN.WEIGHT_DECAY = 0.05 +_C.TRAIN.BASE_LR = 1.5e-4 +_C.TRAIN.WARMUP_START_LR = 1e-6 # 0.0 +_C.TRAIN.END_LR = 0.0 +_C.TRAIN.GRAD_CLIP = None +_C.TRAIN.ACCUM_ITER = 1 +_C.TRAIN.LINEAR_SCALED_LR = 256 +_C.TRAIN.NORMALIZE_TARGET = True + +# train augmentation (only for finetune) +_C.TRAIN.SMOOTHING = 0.1 +_C.TRAIN.RAND_AUGMENT = False +_C.TRAIN.RAND_AUGMENT_LAYERS = 9 +_C.TRAIN.RAND_AUGMENT_MAGNITUDE = 5 # scale from 0 to 10 +_C.TRAIN.MIXUP_ALPHA = 0.8 +_C.TRAIN.MIXUP_PROB = 1.0 +_C.TRAIN.MIXUP_SWITCH_PROB = 0.5 +_C.TRAIN.MIXUP_MODE = 'batch' +_C.TRAIN.CUTMIX_ALPHA = 1.0 +_C.TRAIN.CUTMIX_MINMAX = None + +_C.TRAIN.LR_SCHEDULER = CN() +_C.TRAIN.LR_SCHEDULER.NAME = 'warmupcosine' +_C.TRAIN.LR_SCHEDULER.MILESTONES = "30, 60, 90" # only used in StepLRScheduler 
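A quick aside on the training defaults above: following the usual MAE recipe, the training scripts later in this diff rescale `BASE_LR` linearly with the effective global batch size relative to `LINEAR_SCALED_LR`, and multiply by `ACCUM_ITER` when gradient accumulation is used. A worked example with the pretraining defaults (the 8-GPU world size is an assumption, not part of the patch):

```python
# Linear LR scaling as applied in main_worker() of the training scripts below.
base_lr = 1.5e-4          # TRAIN.BASE_LR
batch_size = 256          # DATA.BATCH_SIZE (per GPU)
world_size = 8            # assumed number of GPUs
accum_iter = 2            # TRAIN.ACCUM_ITER
reference_batch = 256     # TRAIN.LINEAR_SCALED_LR

effective_batch = batch_size * world_size * accum_iter                        # 4096
scaled_lr = base_lr * batch_size * world_size / reference_batch * accum_iter  # 0.0024
```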
+_C.TRAIN.LR_SCHEDULER.DECAY_EPOCHS = 30 # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_RATE = 0.1 # only used in StepLRScheduler + +_C.TRAIN.OPTIMIZER = CN() +_C.TRAIN.OPTIMIZER.NAME = 'AdamW' +_C.TRAIN.OPTIMIZER.EPS = 1e-8 +_C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.95) # same as MAE paper, for adamW +_C.TRAIN.OPTIMIZER.MOMENTUM = 0.9 + + +# misc +_C.SAVE = "./output" +_C.TAG = "default" +_C.SAVE_FREQ = 1 # freq to save chpt +_C.REPORT_FREQ = 100 # freq to logging info +_C.VALIDATE_FREQ = 100 # freq to do validation +_C.SEED = 0 +_C.EVAL = False # run evaluation only +_C.AMP = False # mix precision training +_C.LOCAL_RANK = 0 +_C.NGPUS = -1 + + +def _update_config_from_file(config, cfg_file): + config.defrost() + with open(cfg_file, 'r') as infile: + yaml_cfg = yaml.load(infile, Loader=yaml.FullLoader) + for cfg in yaml_cfg.setdefault('BASE', ['']): + if cfg: + _update_config_from_file( + config, os.path.join(os.path.dirname(cfg_file), cfg) + ) + print('merging config from {}'.format(cfg_file)) + config.merge_from_file(cfg_file) + config.freeze() + + +def update_config(config, args): + """Update config by ArgumentParser + Args: + args: ArgumentParser contains options + Return: + config: updated config + """ + if args.cfg: + _update_config_from_file(config, args.cfg) + config.defrost() + if args.dataset: + config.DATA.DATASET = args.dataset + if args.batch_size: + config.DATA.BATCH_SIZE = args.batch_size + if args.image_size: + config.DATA.IMAGE_SIZE = args.image_size + if args.data_path: + config.DATA.DATA_PATH = args.data_path + if args.output is not None: + config.SAVE = args.output + if args.ngpus: + config.NGPUS = args.ngpus + if args.eval: + config.EVAL = True + config.DATA.BATCH_SIZE_EVAL = args.batch_size + if args.pretrained: + config.MODEL.PRETRAINED = args.pretrained + if args.mae_pretrain: + config.MODEL.MAE_PRETRAIN = args.mae_pretrain + if args.resume: + config.MODEL.RESUME = args.resume + if args.last_epoch: + config.TRAIN.LAST_EPOCH = args.last_epoch + if args.amp: # only during training + if config.EVAL is True: + config.AMP = False + else: + config.AMP = True + + # config.freeze() + return config + + +def get_config(cfg_file=None): + """Return a clone of config or load from yaml file""" + config = _C.clone() + if cfg_file: + _update_config_from_file(config, cfg_file) + return config diff --git a/image_classification/MAE/configs/vit_base_patch16_224_finetune.yaml b/image_classification/MAE/configs/vit_base_patch16_224_finetune.yaml new file mode 100644 index 00000000..9cee1446 --- /dev/null +++ b/image_classification/MAE/configs/vit_base_patch16_224_finetune.yaml @@ -0,0 +1,42 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: MAE + NAME: vit_base_patch16_224 + DROPPATH: 0.1 + TRANS: + PATCH_SIZE: 16 + MLP_RATIO: 4.0 + QKV_BIAS: true + MASK_RATIO: 0.75 + ENCODER: + EMBED_DIM: 768 + DEPTH: 12 + NUM_HEADS: 12 + +TRAIN: + NUM_EPOCHS: 100 + WARMUP_EPOCHS: 5 + WEIGHT_DECAY: 0.05 + BASE_LR: 1e-3 + WARMUP_START_LR: 1e-6 + ACCUM_ITER: 2 # the total batch size should be 1024 + + LR_SCHEDULER: + NAME: 'warmupcosine' + + OPTIMIZER: + NAME: 'AdamW' + BETAS: (0.9, 0.999) + + SMOOTHING: 0.1 + RAND_AUGMENT: True + RAND_AUGMENT_LAYERS: 9 + RAND_AUGMENT_MAGNITUDE: 5 + MIXUP_ALPHA: 0.8 + MIXUP_PROB: 1.0 + MIXUP_SWITCH_PROB: 0.5 + MIXUP_MODE: 'batch' + CUTMIX_ALPHA: 1.0 + CUTMIX_MINMAX: None \ No newline at end of file diff --git a/image_classification/MAE/configs/vit_base_patch16_224_pretrain.yaml b/image_classification/MAE/configs/vit_base_patch16_224_pretrain.yaml new 
file mode 100644 index 00000000..5eb52f39 --- /dev/null +++ b/image_classification/MAE/configs/vit_base_patch16_224_pretrain.yaml @@ -0,0 +1,36 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: MAE + NAME: vit_base_patch16_224 + DROPPATH: 0.0 + MAE_PRETRAIN: True + TRANS: + PATCH_SIZE: 16 + MLP_RATIO: 4.0 + QKV_BIAS: true + MASK_RATIO: 0.75 + ENCODER: + EMBED_DIM: 768 + DEPTH: 12 + NUM_HEADS: 12 + DECODER: + EMBED_DIM: 512 + DEPTH: 8 + NUM_HEADS: 8 +TRAIN: + NUM_EPOCHS: 800 + WARMUP_EPOCHS: 40 + WEIGHT_DECAY: 0.05 + BASE_LR: 1.5e-4 + WARMUP_START_LR: 1e-6 + GRAD_CLIP: None + ACCUM_ITER: 2 # the total batch size should be 4096 + + LR_SCHEDULER: + NAME: 'warmupcosine' + + OPTIMIZER: + NAME: 'AdamW' + BETAS: (0.9, 0.95) diff --git a/image_classification/MAE/configs/vit_base_patch16_224_pretrain_dec1.yaml b/image_classification/MAE/configs/vit_base_patch16_224_pretrain_dec1.yaml new file mode 100644 index 00000000..c4284444 --- /dev/null +++ b/image_classification/MAE/configs/vit_base_patch16_224_pretrain_dec1.yaml @@ -0,0 +1,37 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: MAE + NAME: vit_base_patch16_224_dec1 + DROPPATH: 0.0 + MAE_PRETRAIN: True + TRANS: + PATCH_SIZE: 16 + MLP_RATIO: 4.0 + QKV_BIAS: true + MASK_RATIO: 0.75 + ENCODER: + EMBED_DIM: 768 + DEPTH: 12 + NUM_HEADS: 12 + DECODER: + EMBED_DIM: 512 + DEPTH: 1 + NUM_HEADS: 8 +TRAIN: + NUM_EPOCHS: 800 + WARMUP_EPOCHS: 40 + WEIGHT_DECAY: 0.05 + BASE_LR: 1.5e-4 + WARMUP_START_LR: 1e-6 + GRAD_CLIP: None + ACCUM_ITER: 2 # 8gpus only have 2048 batch size, the total batch size should be 4096 + LINEAR_SCALED_LR: None + + LR_SCHEDULER: + NAME: 'warmupcosine' + + OPTIMIZER: + NAME: 'AdamW' + BETAS: (0.9, 0.95) diff --git a/image_classification/MAE/configs/vit_large_patch16_224_finetune.yaml b/image_classification/MAE/configs/vit_large_patch16_224_finetune.yaml new file mode 100644 index 00000000..11136830 --- /dev/null +++ b/image_classification/MAE/configs/vit_large_patch16_224_finetune.yaml @@ -0,0 +1,42 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: MAE + NAME: vit_large_patch16_224 + DROPPATH: 0.1 + TRANS: + PATCH_SIZE: 16 + MLP_RATIO: 4.0 + QKV_BIAS: true + MASK_RATIO: 0.75 + ENCODER: + EMBED_DIM: 768 + DEPTH: 12 + NUM_HEADS: 12 + +TRAIN: + NUM_EPOCHS: 50 + WARMUP_EPOCHS: 5 + WEIGHT_DECAY: 0.05 + BASE_LR: 1e-3 + WARMUP_START_LR: 1e-6 + ACCUM_ITER: 2 # the total batch size should be 1024 + + LR_SCHEDULER: + NAME: 'warmupcosine' + + OPTIMIZER: + NAME: 'AdamW' + BETAS: (0.9, 0.999) + + SMOOTHING: 0.1 + RAND_AUGMENT: True + RAND_AUGMENT_LAYERS: 9 + RAND_AUGMENT_MAGNITUDE: 5 + MIXUP_ALPHA: 0.8 + MIXUP_PROB: 1.0 + MIXUP_SWITCH_PROB: 0.5 + MIXUP_MODE: 'batch' + CUTMIX_ALPHA: 1.0 + CUTMIX_MINMAX: None \ No newline at end of file diff --git a/image_classification/MAE/configs/vit_large_patch16_224_pretrain.yaml b/image_classification/MAE/configs/vit_large_patch16_224_pretrain.yaml new file mode 100644 index 00000000..04b5e086 --- /dev/null +++ b/image_classification/MAE/configs/vit_large_patch16_224_pretrain.yaml @@ -0,0 +1,36 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: MAE + NAME: vit_large_patch16_224 + DROPPATH: 0.0 + MAE_PRETRAIN: True + TRANS: + PATCH_SIZE: 16 + MLP_RATIO: 4.0 + QKV_BIAS: true + MASK_RATIO: 0.75 + ENCODER: + EMBED_DIM: 768 + DEPTH: 12 + NUM_HEADS: 12 + DECODER: + EMBED_DIM: 512 + DEPTH: 8 + NUM_HEADS: 8 +TRAIN: + NUM_EPOCHS: 800 + WARMUP_EPOCHS: 40 + WEIGHT_DECAY: 0.05 + BASE_LR: 1.5e-4 + WARMUP_START_LR: 1e-6 + GRAD_CLIP: None + ACCUM_ITER: 2 # the total batch size 
should be 4096 + + LR_SCHEDULER: + NAME: 'warmupcosine' + + OPTIMIZER: + NAME: 'AdamW' + BETAS: (0.9, 0.95) \ No newline at end of file diff --git a/image_classification/MAE/datasets.py b/image_classification/MAE/datasets.py new file mode 100644 index 00000000..1d6c17d3 --- /dev/null +++ b/image_classification/MAE/datasets.py @@ -0,0 +1,245 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Dataset related classes and methods for ViT training and validation +Cifar10, Cifar100 and ImageNet2012 are supported +""" + +import os +import math +from PIL import Image +from paddle.io import Dataset +from paddle.io import DataLoader +from paddle.io import DistributedBatchSampler +from paddle.vision import transforms +from paddle.vision import datasets +from paddle.vision import image_load +from augment import auto_augment_policy_original +from augment import AutoAugment +from augment import rand_augment_policy_original +from augment import RandAugment +from masking_generator import RandomMaskingGenerator +from transforms import RandomHorizontalFlip +from random_erasing import RandomErasing + +class ImageNet2012Dataset(Dataset): + """Build ImageNet2012 dataset + + This class gets train/val imagenet datasets, which loads transfomed data and labels. 
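(Aside, not part of the patch: the loader below expects plain-text list files `train_list.txt` / `val_list.txt` in the dataset root, with one `relative/image/path label` pair per line, e.g. `train/n01440764/n01440764_10026.JPEG 0`; the example path is illustrative and labels are parsed as integers.)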
+ + Attributes: + file_folder: path where imagenet images are stored + transform: preprocessing ops to apply on image + img_path_list: list of full path of images in whole dataset + label_list: list of labels of whole dataset + """ + + def __init__(self, file_folder, mode="train", transform=None): + """Init ImageNet2012 Dataset with dataset file path, mode(train/val), and transform""" + super(ImageNet2012Dataset, self).__init__() + assert mode in ["train", "val"] + self.file_folder = file_folder + + if isinstance(transform, tuple): + # training: transform = [transform, mask_generator] + self.transform = transform[0] + self.mask_generator = transform[1] # if mae finetune, mask_generator is None + else: + # val: transform = transform + self.transform = transform + self.mask_generator = None + self.img_path_list = [] + self.label_list = [] + + if mode == "train": + self.list_file = os.path.join(self.file_folder, "train_list.txt") + else: + self.list_file = os.path.join(self.file_folder, "val_list.txt") + + with open(self.list_file, 'r') as infile: + for line in infile: + img_path = line.strip().split()[0] + img_label = int(line.strip().split()[1]) + self.img_path_list.append(os.path.join(self.file_folder, img_path)) + self.label_list.append(img_label) + print(f'----- Imagenet2012 image {mode} list len = {len(self.label_list)}') + + def __len__(self): + return len(self.label_list) + + def __getitem__(self, index): + data = image_load(self.img_path_list[index]).convert('RGB') + data = self.transform(data) + if self.mask_generator is not None: + mask = self.mask_generator() + else: + mask = None + + if mask is None: + label = self.label_list[index] + return data, label + + return data, mask + + +def get_train_transforms(config): + """ Get training transforms + + For training, a RandomResizedCrop is applied, then normalization is applied with + [0.5, 0.5, 0.5] mean and std. The input pixel values must be rescaled to [0, 1.] 
+ Outputs is converted to tensor + + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + + aug_op_list = [] + # STEP1: random crop and resize + aug_op_list.append( + transforms.RandomResizedCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE), + scale=(0.05, 1.0), interpolation='bicubic')) + # STEP2: auto_augment or color jitter + if config.TRAIN.AUTO_AUGMENT: + policy = auto_augment_policy_original() + auto_augment = AutoAugment(policy) + aug_op_list.append(auto_augment) + elif config.TRAIN.RAND_AUGMENT: + policy = rand_augment_policy_original() + rand_augment = RandAugment(policy) + aug_op_list.append(rand_augment) + else: + jitter = (float(config.TRAIN.COLOR_JITTER), ) * 3 + aug_op_list.append(transforms.ColorJitter(*jitter)) + # STEP3: other ops + aug_op_list.append(transforms.ToTensor()) + aug_op_list.append(transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, + std=config.DATA.IMAGENET_STD)) + # STEP4: random erasing + if config.TRAIN.RANDOM_ERASE_PROB > 0.: + random_erasing = RandomErasing(prob=config.TRAIN.RANDOM_ERASE_PROB, + mode=config.TRAIN.RANDOM_ERASE_MODE, + max_count=config.TRAIN.RANDOM_ERASE_COUNT, + num_splits=config.TRAIN.RANDOM_ERASE_SPLIT) + aug_op_list.append(random_erasing) + # Final: compose transforms and return + transforms_train = transforms.Compose(aug_op_list) + + if config.MODEL.MAE_PRETRAIN: + # for MAE pretraining + mask_generator = RandomMaskingGenerator( + input_size=config.DATA.IMAGE_SIZE // config.MODEL.TRANS.PATCH_SIZE, + mask_ratio=config.MODEL.TRANS.MASK_RATIO) + else: + mask_generator = None + + return (transforms_train, mask_generator) + + +def get_val_transforms(config): + """ Get training transforms + + For validation, image is first Resize then CenterCrop to image_size. + Then normalization is applied with [0.5, 0.5, 0.5] mean and std. + The input pixel values must be rescaled to [0, 1.] + Outputs is converted to tensor + + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + + scale_size = int(math.floor(config.DATA.IMAGE_SIZE / config.DATA.CROP_PCT)) + transforms_val = transforms.Compose([ + transforms.Resize(scale_size, 'bicubic'), # single int for resize shorter side of image + transforms.CenterCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE)), + transforms.ToTensor(), + transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD), + ]) + return transforms_val + + +def get_dataset(config, mode='train'): + """ Get dataset from config and mode (train/val) + + Returns the related dataset object according to configs and mode(train/val) + + Args: + config: configs contains dataset related settings. 
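A short end-to-end sketch of the data pipeline in this file (not part of the patch). It assumes the extra `TRAIN.AUTO_AUGMENT`, `TRAIN.COLOR_JITTER` and `TRAIN.RANDOM_ERASE_*` keys read by `get_train_transforms` are present in the config, since they are not among the defaults shown in config.py above; the dataset path is a placeholder:

```python
from config import get_config
from datasets import get_dataset, get_dataloader

# Pretraining config from this PR: MODEL.MAE_PRETRAIN is True,
# so each sample is an (image, mask) pair rather than (image, label).
config = get_config('./configs/vit_base_patch16_224_pretrain.yaml')
config.defrost()
config.DATA.DATA_PATH = '/path/to/imagenet'

dataset = get_dataset(config, mode='train')
dataloader = get_dataloader(config, dataset, mode='train', multi_process=False)
images, masks = next(iter(dataloader))   # masks come from RandomMaskingGenerator
```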
see config.py for details + Returns: + dataset: dataset object + """ + assert mode in ['train', 'val'] + if config.DATA.DATASET == "cifar10": + if mode == 'train': + dataset = datasets.Cifar10(mode=mode, transform=get_train_transforms(config)) + else: + mode = 'test' + dataset = datasets.Cifar10(mode=mode, transform=get_val_transforms(config)) + elif config.DATA.DATASET == "cifar100": + if mode == 'train': + dataset = datasets.Cifar100(mode=mode, transform=get_train_transforms(config)) + else: + mode = 'test' + dataset = datasets.Cifar100(mode=mode, transform=get_val_transforms(config)) + elif config.DATA.DATASET == "imagenet2012": + if mode == 'train': + dataset = ImageNet2012Dataset(config.DATA.DATA_PATH, + mode=mode, + transform=get_train_transforms(config)) + else: + dataset = ImageNet2012Dataset(config.DATA.DATA_PATH, + mode=mode, + transform=get_val_transforms(config)) + else: + raise NotImplementedError( + "[{config.DATA.DATASET}] Only cifar10, cifar100, imagenet2012 are supported now") + return dataset + + +def get_dataloader(config, dataset, mode='train', multi_process=False): + """Get dataloader with config, dataset, mode as input, allows multiGPU settings. + + Multi-GPU loader is implements as distributedBatchSampler. + + Args: + config: see config.py for details + dataset: paddle.io.dataset object + mode: train/val + multi_process: if True, use DistributedBatchSampler to support multi-processing + Returns: + dataloader: paddle.io.DataLoader object. + """ + + if mode == 'train': + batch_size = config.DATA.BATCH_SIZE + else: + batch_size = config.DATA.BATCH_SIZE_EVAL + + if multi_process is True: + sampler = DistributedBatchSampler(dataset, + batch_size=batch_size, + shuffle=(mode == 'train')) + dataloader = DataLoader(dataset, + batch_sampler=sampler, + num_workers=config.DATA.NUM_WORKERS) + else: + dataloader = DataLoader(dataset, + batch_size=batch_size, + num_workers=config.DATA.NUM_WORKERS, + shuffle=(mode == 'train')) + return dataloader diff --git a/image_classification/MAE/droppath.py b/image_classification/MAE/droppath.py new file mode 100644 index 00000000..25b8d5ff --- /dev/null +++ b/image_classification/MAE/droppath.py @@ -0,0 +1,60 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Droppath, reimplement from https://github.com/yueatsprograms/Stochastic_Depth +""" + +import paddle +import paddle.nn as nn + + +class DropPath(nn.Layer): + """DropPath class""" + def __init__(self, drop_prob=None): + super(DropPath, self).__init__() + self.drop_prob = drop_prob + + def drop_path(self, inputs): + """drop path op + Args: + input: tensor with arbitrary shape + drop_prob: float number of drop path probability, default: 0.0 + training: bool, if current mode is training, default: False + Returns: + output: output tensor after drop path + """ + # if prob is 0 or eval mode, return original input + if self.drop_prob == 0. 
or not self.training: + return inputs + keep_prob = 1 - self.drop_prob + keep_prob = paddle.to_tensor(keep_prob, dtype='float32') + shape = (inputs.shape[0], ) + (1, ) * (inputs.ndim - 1) # shape=(N, 1, 1, 1) + random_tensor = keep_prob + paddle.rand(shape, dtype=inputs.dtype) + random_tensor = random_tensor.floor() # mask + output = inputs.divide(keep_prob) * random_tensor #divide is to keep same output expectation + return output + + def forward(self, inputs): + return self.drop_path(inputs) + + +#def main(): +# tmp = paddle.to_tensor(np.random.rand(8, 16, 8, 8), dtype='float32') +# dp = DropPath(0.5) +# out = dp(tmp) +# print(out) +# +#if __name__ == "__main__": +# main() diff --git a/image_classification/MAE/losses.py b/image_classification/MAE/losses.py new file mode 100644 index 00000000..f67780a2 --- /dev/null +++ b/image_classification/MAE/losses.py @@ -0,0 +1,121 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" Implement Loss functions """ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class LabelSmoothingCrossEntropyLoss(nn.Layer): + """ cross entropy loss for label smoothing + Args: + smoothing: float, smoothing rate + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, target label with shape [N] + Return: + loss: float, cross entropy loss value + """ + def __init__(self, smoothing=0.1): + super().__init__() + assert 0 <= smoothing < 1.0 + self.smoothing = smoothing + self.confidence = 1 - smoothing + + def forward(self, x, target): + log_probs = F.log_softmax(x) # [N, num_classes] + # target_index is used to get prob for each of the N samples + target_index = paddle.zeros([x.shape[0], 2], dtype='int64') # [N, 2] + target_index[:, 0] = paddle.arange(x.shape[0]) + target_index[:, 1] = target + + nll_loss = -log_probs.gather_nd(index=target_index) # index: [N] + smooth_loss = -log_probs.mean(axis=-1) + loss = self.confidence * nll_loss + self.smoothing * smooth_loss + return loss.mean() + + +class SoftTargetCrossEntropyLoss(nn.Layer): + """ cross entropy loss for soft target + Args: + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, soft target with shape [N, num_classes] + Returns: + loss: float, the mean loss value + """ + def __init__(self): + super().__init__() + + def forward(self, x, target): + loss = paddle.sum(-target * F.log_softmax(x, axis=-1), axis=-1) + return loss.mean() + + +class DistillationLoss(nn.Layer): + """Distillation loss function + This layer includes the orginal loss (criterion) and a extra + distillation loss (criterion), which computes the loss with + different type options, between current model and + a teacher model as its supervision. 
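Stepping back to the two classification criteria defined just above (used by the finetuning scripts later in this diff): `LabelSmoothingCrossEntropyLoss` computes `(1 - smoothing) * NLL(target class) + smoothing * mean(-log_probs)`, while `SoftTargetCrossEntropyLoss` expects a full per-class probability vector, e.g. the mixed labels produced by Mixup/CutMix. A minimal usage sketch (not part of the patch):

```python
import paddle
import paddle.nn.functional as F
from losses import LabelSmoothingCrossEntropyLoss, SoftTargetCrossEntropyLoss

logits = paddle.randn([4, 10])              # [N, num_classes], pre-softmax scores
hard_labels = paddle.randint(0, 10, [4])    # [N], int64 class ids
loss_smooth = LabelSmoothingCrossEntropyLoss(smoothing=0.1)(logits, hard_labels)

soft_labels = F.softmax(paddle.randn([4, 10]), axis=-1)   # e.g. output of Mixup
loss_soft = SoftTargetCrossEntropyLoss()(logits, soft_labels)
```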
+ + Args: + base_criterion: nn.Layer, the original criterion + teacher_model: nn.Layer, the teacher model as supervision + distillation_type: str, one of ['none', 'soft', 'hard'] + alpha: float, ratio of base loss (* (1-alpha)) + and distillation loss( * alpha) + tao: float, temperature in distillation + """ + def __init__(self, + base_criterion, + teacher_model, + distillation_type, + alpha, + tau): + super().__init__() + assert distillation_type in ['none', 'soft', 'hard'] + self.base_criterion = base_criterion + self.teacher_model = teacher_model + self.type = distillation_type + self.alpha = alpha + self.tau = tau + + def forward(self, inputs, outputs, targets): + """ + Args: + inputs: tensor, the orginal model inputs + outputs: tensor, the outputs of the model + outputds_kd: tensor, the distillation outputs of the model, + this is usually obtained by a separate branch + in the last layer of the model + targets: tensor, the labels for the base criterion + """ + outputs, outputs_kd = outputs[0], outputs[1] + base_loss = self.base_criterion(outputs, targets) + if self.type == 'none': + return base_loss + + with paddle.no_grad(): + teacher_outputs = self.teacher_model(inputs) + + if self.type == 'soft': + distillation_loss = F.kl_div( + F.log_softmax(outputs_kd / self.tau, axis=1), + F.log_softmax(teacher_outputs / self.tau, axis=1), + reduction='sum') * (self.tau * self.tau) / outputs_kd.numel() + elif self.type == 'hard': + distillation_loss = F.cross_entropy(outputs_kd, teacher_outputs.argmax(axis=1)) + + loss = base_loss * (1 - self.alpha) + distillation_loss * self.alpha + return loss diff --git a/image_classification/MAE/main_multi_gpu_finetune.py b/image_classification/MAE/main_multi_gpu_finetune.py new file mode 100644 index 00000000..a6ace004 --- /dev/null +++ b/image_classification/MAE/main_multi_gpu_finetune.py @@ -0,0 +1,580 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
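Usage note (illustrative, not part of the patch): this script is launched directly and spawns one worker per GPU via `paddle.distributed.spawn`, e.g. `python main_multi_gpu_finetune.py -cfg ./configs/vit_base_patch16_224_finetune.yaml -dataset imagenet2012 -data_path /path/to/imagenet -pretrained ./output/mae_pretrained -ngpus 8`. The `-pretrained` and `-resume` arguments take the checkpoint path without the `.pdparams` extension; the script appends it when loading.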
+ +"""MAE finetuning/validation using multiple GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +import paddle.distributed as dist +from datasets import get_dataloader +from datasets import get_dataset +from transformer import build_mae_finetune as build_model +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from utils import AverageMeter +from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn +from config import get_config +from config import update_config +from mixup import Mixup + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('ViT') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-mae_pretrain', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + mixup_fn=None, + amp=False, + local_logger=None, + master_logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + master_train_loss_meter.avg: float, average loss on all processes/gpus + master_train_acc_meter.avg: float, average top1 accuracy on all processes/gpus + train_time: float, training time + """ + model.train() + 
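    # (Not part of the patch) Flow of the loop below: every batch runs a forward and a
    # backward pass, but optimizer.step()/clear_grad() only fire every `accum_iter`
    # batches (or on the last batch), so gradients are accumulated across iterations.
    # Under AMP, the loss is scaled before backward and scaler.minimize() unscales the
    # gradients and performs the optimizer step.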
train_loss_meter = AverageMeter() + train_acc_meter = AverageMeter() + master_train_loss_meter = AverageMeter() + master_train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + # NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + # loss = loss / accum_iter + loss.backward() + + if ((batch_id + 1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = paddle.to_tensor(image.shape[0]) + + # sync from other gpus for overall loss and acc + master_loss = loss.clone() + master_acc = acc.clone() + master_batch_size = batch_size.clone() + dist.all_reduce(master_loss) + dist.all_reduce(master_acc) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc = master_acc / dist.get_world_size() + master_train_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_train_acc_meter.update(master_acc.numpy()[0], master_batch_size.numpy()[0]) + + train_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + train_acc_meter.update(acc.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_train_loss_meter.avg:.4f}, " + + f"Avg Acc: {master_train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return (train_loss_meter.avg, + train_acc_meter.avg, + master_train_loss_meter.avg, + master_train_acc_meter.avg, + train_time) + + +def validate(dataloader, + model, + criterion, + total_batch, + debug_steps=100, + local_logger=None, + master_logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + total_epoch: int, total num of epoch, for logging + debug_steps: int, num of iters to log info, default: 100 + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + master_val_loss_meter.avg: float, average loss on all 
processes/gpus + master_val_acc1_meter.avg: float, average top1 accuracy on all processes/gpus + master_val_acc5_meter.avg: float, average top5 accuracy on all processes/gpus + val_time: float, validation time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = AverageMeter() + master_val_loss_meter = AverageMeter() + master_val_acc1_meter = AverageMeter() + master_val_acc5_meter = AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = paddle.to_tensor(image.shape[0]) + + master_loss = loss.clone() + master_acc1 = acc1.clone() + master_acc5 = acc5.clone() + master_batch_size = batch_size.clone() + + dist.all_reduce(master_loss) + dist.all_reduce(master_acc1) + dist.all_reduce(master_acc5) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc1 = master_acc1 / dist.get_world_size() + master_acc5 = master_acc5 / dist.get_world_size() + + master_val_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc1_meter.update(master_acc1.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc5_meter.update(master_acc5.numpy()[0], master_batch_size.numpy()[0]) + + val_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + val_acc1_meter.update(acc1.numpy()[0], batch_size.numpy()[0]) + val_acc5_meter.update(acc5.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {master_val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {master_val_acc5_meter.avg:.4f}") + val_time = time.time() - time_st + return (val_loss_meter.avg, + val_acc1_meter.avg, + val_acc5_meter.avg, + master_val_loss_meter.avg, + master_val_acc1_meter.avg, + master_val_acc5_meter.avg, + val_time) + + +def main_worker(*args): + # STEP 0: Preparation + config = args[0] + dist.init_parallel_env() + last_epoch = config.TRAIN.LAST_EPOCH + world_size = dist.get_world_size() + local_rank = dist.get_rank() + seed = config.SEED + local_rank + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + # logger for each process/gpu + local_logger = get_logger( + filename=os.path.join(config.SAVE, 'log_{}.txt'.format(local_rank)), + logger_name='local_logger') + # overall logger + if local_rank == 0: + master_logger = get_logger( + filename=os.path.join(config.SAVE, 'log.txt'), + logger_name='master_logger') + master_logger.info(f'\n{config}') + else: + master_logger = None + local_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + if local_rank == 0: + master_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + + # STEP 1: Create model + model = build_model(config) + model = paddle.DataParallel(model) + + # STEP 2: Create train and val dataloader + dataset_train, dataset_val = args[1], args[2] + dataloader_train = 
get_dataloader(config, dataset_train, 'train', True) + dataloader_val = get_dataloader(config, dataset_val, 'test', True) + total_batch_train = len(dataloader_train) + total_batch_val = len(dataloader_val) + local_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + local_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + if local_rank == 0: + master_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + master_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH, + ) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) + for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + local_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if 
config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip, + #apply_decay_param_fun=get_exclude_from_weight_decay_fn(['pos_embed', 'cls_token']), + ) + else: + local_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 6: Load pretrained model / load resumt model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError( + f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + model.set_dict(model_state) + local_logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + if local_rank == 0: + master_logger.info( + f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') + optimizer.set_state_dict(opt_state) + local_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + if local_rank == 0: + master_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + + # STEP 7: Validation (eval mode) + if config.EVAL: + local_logger.info('----- Start Validating') + if local_rank == 0: + master_logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 8: Start training and validation (train mode) + local_logger.info(f"Start training from epoch {last_epoch+1}.") + if local_rank == 0: + master_logger.info(f"Start training from epoch {last_epoch+1}.") + for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + local_logger.info(f"Now training epoch 
{epoch}. LR={optimizer.get_lr():.6f}") + if local_rank == 0: + master_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + train_loss, train_acc, avg_loss, avg_acc, train_time = train( + dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=total_batch_train, + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + mixup_fn=mixup_fn, + amp=config.AMP, + local_logger=local_logger, + master_logger=master_logger) + + scheduler.step() + + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {avg_loss:.4f}, " + + f"Train Acc: {avg_acc:.4f}, " + + f"time: {train_time:.2f}") + + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + local_logger.info(f'----- Validation after Epoch: {epoch}') + if local_rank == 0: + master_logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if local_rank == 0: + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + local_logger.info(f"----- Save model: {model_path}.pdparams") + local_logger.info(f"----- Save optim: {model_path}.pdopt") + if local_rank == 0: + master_logger.info(f"----- Save model: {model_path}.pdparams") + master_logger.info(f"----- Save optim: {model_path}.pdopt") + + +def main(): + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + + # get dataset and start DDP + dataset_train = get_dataset(config, mode='train') + dataset_val = get_dataset(config, mode='val') + config.NGPUS = len(paddle.static.cuda_places()) if config.NGPUS == -1 else config.NGPUS + dist.spawn(main_worker, args=(config, dataset_train, dataset_val, ), nprocs=config.NGPUS) + + +if __name__ == "__main__": + main() diff --git a/image_classification/MAE/main_multi_gpu_pretrain.py b/image_classification/MAE/main_multi_gpu_pretrain.py new file 
mode 100644 index 00000000..d1789ddf --- /dev/null +++ b/image_classification/MAE/main_multi_gpu_pretrain.py @@ -0,0 +1,417 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""MEA pre-training using multiple GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +import paddle.distributed as dist +from datasets import get_dataloader +from datasets import get_dataset +from transformer import build_mae_pretrain as build_model +from utils import AverageMeter +from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn +from config import get_config +from config import update_config + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('MAE') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-mae_pretrain', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + patch_size, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + normalize_target=True, + debug_steps=100, + accum_iter=1, + amp=False, + local_logger=None, + master_logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + patch_size: int/tuple, image patch size + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + normalize_target: bool, if True, tokens are normalized by itself, default: True + total_batch: int, total num of batches 
for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + master_train_loss_meter.avg: float, average loss on all processes/gpus + master_train_acc_meter.avg: float, average top1 accuracy on all processes/gpus + train_time: float, training time + """ + model.train() + train_loss_meter = AverageMeter() + master_train_loss_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + for batch_id, data in enumerate(dataloader): + images = data[0] + masks = paddle.to_tensor(data[1], dtype='bool') + + with paddle.no_grad(): + mean = paddle.to_tensor([0.485, 0.456, 0.406]).reshape([1, 3, 1, 1]) + std = paddle.to_tensor([0.229, 0.224, 0.225]).reshape([1, 3, 1, 1]) + unnorm_images = images * std + mean + B, C, H, W = images.shape + if normalize_target: + images_patch = unnorm_images.reshape([B, C, H//patch_size, patch_size, W//patch_size, patch_size]) + images_patch = images_patch.transpose([0, 2, 4, 3, 5, 1]) + images_patch = unnorm_images.reshape([B, -1, patch_size * patch_size, C]) + images_patch = (images_patch - images_patch.mean(axis=-2, keepdim=True)) / ( + images_patch.var(axis=-2, keepdim=True).sqrt() + 1e-6) + images_patch = images_patch.flatten(-2) + else: + images_patch = unnorm_images.reshape([B, C, H//patch_size, patch_size, W//patch_size, patch_size]) + images_patch = images_patch.transpose([0, 2, 4, 3, 5, 1]) + images_patch = unnorm_images.reshape([B, -1, patch_size * patch_size, C]) + images_patch = images_patch.flatten(-2) + + B, _, C = images_patch.shape + labels = images_patch[masks[:, 1:]].reshape([B, -1, C]) + + if amp is True: + with paddle.amp.auto_cast(): + reconstructed_patches = model(images, masks) + loss = criterion(reconstructed_patches, labels) + scaled = scaler.scale(loss) + scaled.backward() + + if ((batch_id + 1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: + reconstructed_patches = model(images, masks) + loss = criterion(reconstructed_patches, labels) + # NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + # loss = loss / accum_iter + loss.backward() + + if ((batch_id + 1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + batch_size = paddle.to_tensor(images.shape[0]) + + # sync from other gpus for overall loss and acc + master_loss = loss.clone() + master_batch_size = batch_size.clone() + dist.all_reduce(master_loss) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_train_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + + train_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}") + if master_logger and 
dist.get_rank() == 0: + master_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_train_loss_meter.avg:.4f}") + + train_time = time.time() - time_st + return (train_loss_meter.avg, + master_train_loss_meter.avg, + train_time) + + +def main_worker(*args): + # STEP 0: Preparation + config = args[0] + dist.init_parallel_env() + last_epoch = config.TRAIN.LAST_EPOCH + world_size = dist.get_world_size() + local_rank = dist.get_rank() + seed = config.SEED + local_rank + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + # logger for each process/gpu + local_logger = get_logger( + filename=os.path.join(config.SAVE, 'log_{}.txt'.format(local_rank)), + logger_name='local_logger') + # overall logger + if local_rank == 0: + master_logger = get_logger( + filename=os.path.join(config.SAVE, 'log.txt'), + logger_name='master_logger') + master_logger.info(f'\n{config}') + else: + master_logger = None + local_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + if local_rank == 0: + master_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + + # STEP 1: Create model + model = build_model(config) + model = paddle.DataParallel(model) + + # STEP 2: Create train and val dataloader + dataset_train = args[1] + dataloader_train = get_dataloader(config, dataset_train, 'train', True) + total_batch_train = len(dataloader_train) + local_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + if local_rank == 0: + master_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + + # STEP 3: Define criterion + criterion = nn.MSELoss() + + # STEP 4: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH, + ) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) + for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + 
gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + local_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip, + #apply_decay_param_fun=get_exclude_from_weight_decay_fn(['pos_embed', 'cls_token']), + ) + else: + local_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 5: Load pretrained model / load resumt model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError( + f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + model.set_dict(model_state) + local_logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + if local_rank == 0: + master_logger.info( + f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME + '.pdopt') + optimizer.set_state_dict(opt_state) + local_logger.info( + f"----- Resume: Load model and optmizer from {config.MODEL.RESUME}") + if local_rank == 0: + master_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + + # STEP 6: Start training (train mode) + local_logger.info(f"Start training from epoch {last_epoch+1}.") + if local_rank == 0: + master_logger.info(f"Start training from epoch {last_epoch+1}.") + for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + local_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + if local_rank == 0: + master_logger.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") + + train_loss,avg_loss, train_time = train( + dataloader=dataloader_train, + patch_size=config.MODEL.TRANS.PATCH_SIZE, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=total_batch_train, + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + amp=config.AMP, + local_logger=local_logger, + master_logger=master_logger) + + scheduler.step() + + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"time: {train_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {avg_loss:.4f}, " + + f"time: {train_time:.2f}") + + # model save + if local_rank == 0: + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + local_logger.info(f"----- Save model: {model_path}.pdparams") + local_logger.info(f"----- Save optim: {model_path}.pdopt") + if local_rank == 0: + master_logger.info(f"----- Save model: {model_path}.pdparams") + master_logger.info(f"----- Save optim: {model_path}.pdopt") + + +def main(): + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + + # get dataset and start DDP + dataset_train = get_dataset(config, mode='train') + config.NGPUS = len(paddle.static.cuda_places()) if config.NGPUS == -1 else config.NGPUS + dist.spawn(main_worker, args=(config, dataset_train, ), nprocs=config.NGPUS) + + +if __name__ == "__main__": + main() diff --git a/image_classification/MAE/main_single_gpu_finetune.py b/image_classification/MAE/main_single_gpu_finetune.py new file mode 100644 index 00000000..ea267943 --- /dev/null +++ b/image_classification/MAE/main_single_gpu_finetune.py @@ -0,0 +1,403 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +"""ViT finetuning/validation using single GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from datasets import get_dataloader +from datasets import get_dataset +from transformer import build_mae_finetune as build_model +from utils import AverageMeter +from utils import WarmupCosineScheduler +from config import get_config +from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('ViT') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-mae_pretrain', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + mixup_fn=None, + amp=False, + logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + logger: logger for logging, default: None + Returns: + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + train_time: float, training time + """ + model.train() + train_loss_meter = AverageMeter() + train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, 
label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + + if ((batch_id + 1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + + else: + output = model(image) + loss = criterion(output, label) + # NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + # loss = loss / accum_iter + loss.backward() + + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = image.shape[0] + train_loss_meter.update(loss.numpy()[0], batch_size) + train_acc_meter.update(acc.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return train_loss_meter.avg, train_acc_meter.avg, train_time + + +def validate(dataloader, model, criterion, total_batch, debug_steps=100, logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + logger: logger for logging, default: None + Returns: + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + val_time: float, valitaion time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = image.shape[0] + val_loss_meter.update(loss.numpy()[0], batch_size) + val_acc1_meter.update(acc1.numpy()[0], batch_size) + val_acc5_meter.update(acc5.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + + val_time = time.time() - time_st + return val_loss_meter.avg, val_acc1_meter.avg, val_acc5_meter.avg, val_time + + +def main(): + # 0. 
Preparation + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + last_epoch = config.TRAIN.LAST_EPOCH + seed = config.SEED + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + logger = get_logger(filename=os.path.join(config.SAVE, 'log.txt')) + logger.info(f'\n{config}') + + # 1. Create model + model = build_model(config) + # 2. Create train dataloader + dataset_train = get_dataset(config, mode='train') + dataset_val = get_dataset(config, mode='val') + dataloader_train = get_dataloader(config, dataset_train, 'train', False) + dataloader_val = get_dataloader(config, dataset_val, 'val', False) + # 3. Define Mixup function and criterion + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + # 4. Define lr_scheduler + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH, + ) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + # 5. 
Define optimizer + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip) + else: + logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + # 6. Load pretrained model or load resume model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + model.set_dict(model_state) + logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME + '.pdopt') + optimizer.set_state_dict(opt_state) + logger.info( + f"----- Resume: Load model and optmizer from {config.MODEL.RESUME}") + + # STEP 7: Validation (eval mode) + if config.EVAL: + logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 8: Start training and validation (train mode) + logger.info(f"Start training from epoch {last_epoch+1}.") + for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + logger.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") + train_loss, train_acc, train_time = train(dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=len(dataloader_train), + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + mixup_fn=mixup_fn, + amp=config.AMP, + logger=logger) + scheduler.step() + + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + logger.info(f"----- Save model: {model_path}.pdparams") + logger.info(f"----- Save optim: {model_path}.pdopt") + + +if __name__ == "__main__": + main() diff --git a/image_classification/MAE/main_single_gpu_pretrain.py b/image_classification/MAE/main_single_gpu_pretrain.py new file mode 100644 index 00000000..cf315a42 --- /dev/null +++ b/image_classification/MAE/main_single_gpu_pretrain.py @@ -0,0 +1,308 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +"""MAE pre-training using single GPU, this is just a demo, we recommand using multi-gpu version""" + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from datasets import get_dataloader +from datasets import get_dataset +from transformer import build_mae_pretrain as build_model +from utils import AverageMeter +from utils import WarmupCosineScheduler +from config import get_config +from config import update_config + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('MAE') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-mae_pretrain', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + patch_size, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + normalize_target=True, + debug_steps=100, + accum_iter=1, + amp=False, + logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + amp: bool, if True, use mix precision training, default: False + logger: logger for logging, default: None + Returns: + train_loss_meter.avg: float, average loss on current process/gpu + train_time: float, training time + """ + model.train() + train_loss_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + for batch_id, data in enumerate(dataloader): + images = data[0] + masks = paddle.to_tensor(data[1], dtype='bool') + + with paddle.no_grad(): + mean = paddle.to_tensor([0.485, 0.456, 0.406]).reshape([1, 3, 1, 1]) + std = paddle.to_tensor([0.229, 0.224, 0.225]).reshape([1, 3, 1, 1]) + unnorm_images = images * std + mean + B, C, H, W = images.shape + if normalize_target: + images_patch = 
unnorm_images.reshape([B, C, H // patch_size, patch_size, W // patch_size, patch_size]) + images_patch = images_patch.transpose([0, 2, 4, 3, 5, 1]) + images_patch = images_patch.reshape([B, -1, patch_size * patch_size, C]) + images_patch = (images_patch - images_patch.mean(axis=-2, keepdim=True)) / ( + images_patch.var(axis=-2, keepdim=True).sqrt() + 1e-6) + images_patch = images_patch.flatten(-2) + else: + images_patch = unnorm_images.reshape([B, C, H//patch_size, patch_size, W//patch_size, patch_size]) + images_patch = images_patch.transpose([0, 2, 4, 3, 5, 1]) + images_patch = images_patch.reshape([B, -1, patch_size * patch_size, C]) + images_patch = images_patch.flatten(-2) + + B, _, C = images_patch.shape + labels = images_patch[masks[:, 1:]].reshape([B, -1, C]) + + if amp is True: + with paddle.amp.auto_cast(): + reconstructed_patches = model(images, masks) + loss = criterion(reconstructed_patches, labels) + scaled = scaler.scale(loss) + scaled.backward() + + if ((batch_id + 1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: + reconstructed_patches = model(images, masks) + loss = criterion(reconstructed_patches, labels) + # NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + # loss = loss / accum_iter + loss.backward() + + if ((batch_id + 1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + batch_size = images.shape[0] + train_loss_meter.update(loss.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}") + + train_time = time.time() - time_st + return train_loss_meter.avg, train_time + + +def main(): + # 0. Preparation + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + last_epoch = config.TRAIN.LAST_EPOCH + seed = config.SEED + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + logger = get_logger(filename=os.path.join(config.SAVE, 'log.txt')) + logger.info(f'\n{config}') + + # 1. Create model + model = build_model(config) + # 2. Create train dataloader + dataset_train = get_dataset(config, mode='train') + dataloader_train = get_dataloader(config, dataset_train, 'train', False) + # 3. Define criterion + criterion = nn.MSELoss() + # 4. 
Define lr_scheduler + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH, + ) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + # 5. Define optimizer + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip) + else: + logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + # 6. Load pretrained model or load resume model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + model.set_dict(model_state) + logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME + '.pdopt') + optimizer.set_state_dict(opt_state) + logger.info( + f"----- Resume: Load model and optmizer from {config.MODEL.RESUME}") + + # 7. Start training and validation + logging.info(f"Start training from epoch {last_epoch + 1}.") + for epoch in range(last_epoch + 1, config.TRAIN.NUM_EPOCHS + 1): + # train + logging.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") + train_loss, train_time = train(dataloader=dataloader_train, + patch_size=config.MODEL.TRANS.PATCH_SIZE, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=len(dataloader_train), + normalize_target=config.TRAIN.NORMALIZE_TARGET, + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + amp=config.AMP, + logger=logger) + scheduler.step() + + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"time: {train_time:.2f}") + # validation + # No need to do validation during pretraining + + # model save + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + logger.info(f"----- Save model: {model_path}.pdparams") + logger.info(f"----- Save optim: {model_path}.pdopt") + + +if __name__ == "__main__": + main() diff --git a/image_classification/MAE/masking_generator.py b/image_classification/MAE/masking_generator.py new file mode 100644 index 00000000..9271dd4e --- /dev/null +++ b/image_classification/MAE/masking_generator.py @@ -0,0 +1,50 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +random mask generator for MAE pretraining +""" + +import random +import math +import numpy as np + +class RandomMaskingGenerator: + def __init__(self, input_size, mask_ratio, with_cls_token=True): + if not isinstance(input_size, tuple): + input_size = (input_size, ) * 2 + + self.height = input_size[0] + self.width = input_size[1] + self.num_patches = self.height * self.width + self.num_mask = int(mask_ratio * self.num_patches) + self.with_cls_token = with_cls_token + + def __call__(self): + mask = np.hstack([np.zeros(self.num_patches - self.num_mask), + np.ones(self.num_mask)]) + np.random.shuffle(mask) + if self.with_cls_token: + mask = np.insert(mask, 0, 0) + return mask + + +#def main(): +# rmg = RandomMaskingGenerator(input_size=32, mask_ratio=0.75) +# mask = rmg() +# for v in mask: +# print(v, end=', ') +# +#if __name__ == "__main__": +# main() diff --git a/image_classification/MAE/mixup.py b/image_classification/MAE/mixup.py new file mode 100644 index 00000000..1d2db493 --- /dev/null +++ b/image_classification/MAE/mixup.py @@ -0,0 +1,225 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""mixup and cutmix for batch data""" +import numpy as np +import paddle + + +def rand_bbox(image_shape, lam, count=None): + """ CutMix bbox by lam value + Generate 1 random bbox by value lam. lam is the cut size rate. + The cut_size is computed by sqrt(1-lam) * image_size. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + count: int, number of bbox to generate + """ + image_h, image_w = image_shape[-2:] + cut_rate = np.sqrt(1. - lam) + cut_h = int(cut_rate * image_h) + cut_w = int(cut_rate * image_w) + + # get random bbox center + cy = np.random.randint(0, image_h, size=count) + cx = np.random.randint(0, image_w, size=count) + + # get bbox coords + bbox_x1 = np.clip(cx - cut_w // 2, 0, image_w) + bbox_y1 = np.clip(cy - cut_h // 2, 0, image_h) + bbox_x2 = np.clip(cx + cut_w // 2, 0, image_w) + bbox_y2 = np.clip(cy + cut_h // 2, 0, image_h) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # while in pytorch, it will return [] tensor + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def rand_bbox_minmax(image_shape, minmax, count=None): + """ CutMix bbox by min and max value + Generate 1 random bbox by min and max percentage values. + Minmax is a tuple/list of min and max percentage vlaues + applied to the image width and height. + + Args: + image_shape: tuple/list, image height and width + minmax: tuple/list, min and max percentage values of image size + count: int, number of bbox to generate + """ + assert len(minmax) == 2 + image_h, image_w = image_shape[-2:] + min_ratio = minmax[0] + max_ratio = minmax[1] + cut_h = np.random.randint(int(image_h * min_ratio), int(image_h * max_ratio), size=count) + cut_w = np.random.randint(int(image_w * min_ratio), int(image_w * max_ratio), size=count) + + bbox_x1 = np.random.randint(0, image_w - cut_w, size=count) + bbox_y1 = np.random.randint(0, image_h - cut_h, size=count) + bbox_x2 = bbox_x1 + cut_w + bbox_y2 = bbox_y1 + cut_h + + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def cutmix_generate_bbox_adjust_lam(image_shape, lam, minmax=None, correct_lam=True, count=None): + """Generate bbox and apply correction for lambda + If the mimmax is None, apply the standard cutmix by lam value, + If the minmax is set, apply the cutmix by min and max percentage values. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + minmax: tuple/list, min and max percentage values of image size + correct_lam: bool, if True, correct the lam value by the generated bbox + count: int, number of bbox to generate + """ + if minmax is not None: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox_minmax(image_shape, minmax, count) + else: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox(image_shape, lam, count) + + if correct_lam or minmax is not None: + image_h, image_w = image_shape[-2:] + bbox_area = (bbox_y2 - bbox_y1) * (bbox_x2 - bbox_x1) + lam = 1. 
- bbox_area / float(image_h * image_w) + return (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam + + +def one_hot(x, num_classes, on_value=1., off_value=0.): + """ Generate one-hot vector for label smoothing + Args: + x: tensor, contains label/class indices + num_classes: int, num of classes (len of the one-hot vector) + on_value: float, the vector value at label index, default=1. + off_value: float, the vector value at non-label indices, default=0. + Returns: + one_hot: tensor, tensor with on value at label index and off value + at non-label indices. + """ + x = x.reshape_([-1, 1]) + x_smoothed = paddle.full((x.shape[0], num_classes), fill_value=off_value) + for i in range(x.shape[0]): + x_smoothed[i, x[i]] = on_value + return x_smoothed + + +def mixup_one_hot(label, num_classes, lam=1., smoothing=0.): + """ mixup and label smoothing in batch + label smoothing is firstly applied, then + mixup is applied by mixing the bacth and its flip, + with a mixup rate. + + Args: + label: tensor, label tensor with shape [N], contains the class indices + num_classes: int, num of all classes + lam: float, mixup rate, default=1.0 + smoothing: float, label smoothing rate + """ + off_value = smoothing / num_classes + on_value = 1. - smoothing + off_value + y1 = one_hot(label, num_classes, on_value, off_value) + y2 = one_hot(label.flip(axis=[0]), num_classes, on_value, off_value) + return y2 * (1 - lam) + y1 * lam + + +class Mixup: + """Mixup class + Args: + mixup_alpha: float, mixup alpha for beta distribution, default=1.0, + cutmix_alpha: float, cutmix alpha for beta distribution, default=0.0, + cutmix_minmax: list/tuple, min and max value for cutmix ratio, default=None, + prob: float, if random prob < prob, do not use mixup, default=1.0, + switch_prob: float, prob of switching mixup and cutmix, default=0.5, + mode: string, mixup up, now only 'batch' is supported, default='batch', + correct_lam: bool, if True, apply correction of lam, default=True, + label_smoothing: float, label smoothing rate, default=0.1, + num_classes: int, num of classes, default=1000 + """ + def __init__(self, + mixup_alpha=1.0, + cutmix_alpha=0.0, + cutmix_minmax=None, + prob=1.0, + switch_prob=0.5, + mode='batch', + correct_lam=True, + label_smoothing=0.1, + num_classes=1000): + self.mixup_alpha = mixup_alpha + self.cutmix_alpha = cutmix_alpha + self.cutmix_minmax = cutmix_minmax + if cutmix_minmax is not None: + assert len(cutmix_minmax) == 2 + self.cutmix_alpha = 1.0 + self.mix_prob = prob + self.switch_prob = switch_prob + self.label_smoothing = label_smoothing + self.num_classes = num_classes + self.mode = mode + self.correct_lam = correct_lam + assert mode == 'batch', 'Now only batch mode is supported!' + + def __call__(self, x, target): + assert x.shape[0] % 2 == 0, "Batch size should be even" + lam = self._mix_batch(x) + target = mixup_one_hot(target, self.num_classes, lam, self.label_smoothing) + return x, target + + def get_params(self): + """Decide to use cutmix or regular mixup by sampling and + sample lambda for mixup + """ + lam = 1. + use_cutmix = False + use_mixup = np.random.rand() < self.mix_prob + if use_mixup: + if self.mixup_alpha > 0. and self.cutmix_alpha > 0.: + use_cutmix = np.random.rand() < self.switch_prob + alpha = self.cutmix_alpha if use_cutmix else self.mixup_alpha + lam_mix = np.random.beta(alpha, alpha) + elif self.mixup_alpha == 0. and self.cutmix_alpha > 0.: + use_cutmix=True + lam_mix = np.random.beta(self.cutmix_alpha, self.cutmix_alpha) + elif self.mixup_alpha > 0. 
and self.cutmix_alpha == 0.: + lam_mix = np.random.beta(self.mixup_alpha, self.mixup_alpha) + else: + raise ValueError('mixup_alpha and cutmix_alpha cannot be all 0') + lam = float(lam_mix) + return lam, use_cutmix + + def _mix_batch(self, x): + """mixup/cutmix by adding batch data and its flipped version""" + lam, use_cutmix = self.get_params() + if lam == 1.: + return lam + if use_cutmix: + (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam = cutmix_generate_bbox_adjust_lam( + x.shape, + lam, + minmax=self.cutmix_minmax, + correct_lam=self.correct_lam) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # but in pytorch, it will return [] tensor without errors + if int(bbox_x1) != int(bbox_x2) and int(bbox_y1) != int(bbox_y2): + x[:, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] = x.flip(axis=[0])[ + :, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] + else: + x_flipped = x.flip(axis=[0]) + x_flipped = x_flipped * (1 - lam) + x.set_value(x * (lam) + x_flipped) + return lam diff --git a/image_classification/MAE/nohup.out b/image_classification/MAE/nohup.out new file mode 100644 index 00000000..6e00dda7 --- /dev/null +++ b/image_classification/MAE/nohup.out @@ -0,0 +1,9507 @@ +Traceback (most recent call last): + File "main_multi_gpu_pretrain.py", line 24, in + import paddle + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/__init__.py", line 25, in + from .fluid import monkey_patch_variable + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/__init__.py", line 45, in + from . import dataset + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dataset.py", line 19, in + from ..utils import deprecated + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/utils/__init__.py", line 26, in + from . import download # noqa: F401 + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/utils/download.py", line 23, in + import requests + File "/opt/conda/envs/py36/lib/python3.6/site-packages/requests/__init__.py", line 112, in + from . import utils + File "/opt/conda/envs/py36/lib/python3.6/site-packages/requests/utils.py", line 24, in + from . import certs + File "", line 971, in _find_and_load + File "", line 955, in _find_and_load_unlocked + File "", line 665, in _load_unlocked + File "", line 674, in exec_module + File "", line 764, in get_code + File "", line 833, in get_data +KeyboardInterrupt +merging config from ./configs/vit_base_patch16_224_pretrain_dec1.yaml +----- Imagenet2012 image train list len = 1281167 +server not ready, wait 3 sec to retry... +not ready endpoints:['127.0.0.1:30053', '127.0.0.1:54949', '127.0.0.1:41862', '127.0.0.1:28777', '127.0.0.1:55177', '127.0.0.1:18423', '127.0.0.1:46681'] +I1219 16:59:41.631045 23562 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:30053 successful. +server not ready, wait 3 sec to retry... +not ready endpoints:['127.0.0.1:54949', '127.0.0.1:41862', '127.0.0.1:28777', '127.0.0.1:55177', '127.0.0.1:18423', '127.0.0.1:46681'] +I1219 16:59:44.247634 23580 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:54949 successful. +server not ready, wait 3 sec to retry... +not ready endpoints:['127.0.0.1:41862', '127.0.0.1:28777', '127.0.0.1:55177', '127.0.0.1:18423', '127.0.0.1:46681'] +I1219 16:59:46.636570 23595 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:41862 successful. +server not ready, wait 3 sec to retry... 
+not ready endpoints:['127.0.0.1:28777', '127.0.0.1:55177', '127.0.0.1:18423', '127.0.0.1:46681'] +I1219 16:59:48.816335 23610 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:28777 successful. +server not ready, wait 3 sec to retry... +not ready endpoints:['127.0.0.1:55177', '127.0.0.1:18423', '127.0.0.1:46681'] +I1219 16:59:51.517431 23627 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:55177 successful. +I1219 16:59:53.801396 23642 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:18423 successful. +server not ready, wait 3 sec to retry... +not ready endpoints:['127.0.0.1:46681'] +I1219 16:59:56.182962 23659 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:46681 successful. +I1219 16:59:56.935767 23580 nccl_context.cc:74] init nccl context nranks: 8 local rank: 2 gpu id: 2 ring id: 0 +I1219 16:59:56.935765 23562 nccl_context.cc:74] init nccl context nranks: 8 local rank: 1 gpu id: 1 ring id: 0 +I1219 16:59:56.935781 23627 nccl_context.cc:74] init nccl context nranks: 8 local rank: 5 gpu id: 5 ring id: 0 +I1219 16:59:56.935775 23595 nccl_context.cc:74] init nccl context nranks: 8 local rank: 3 gpu id: 3 ring id: 0 +I1219 16:59:56.935791 23642 nccl_context.cc:74] init nccl context nranks: 8 local rank: 6 gpu id: 6 ring id: 0 +I1219 16:59:56.935806 23610 nccl_context.cc:74] init nccl context nranks: 8 local rank: 4 gpu id: 4 ring id: 0 +I1219 16:59:56.935818 23659 nccl_context.cc:74] init nccl context nranks: 8 local rank: 7 gpu id: 7 ring id: 0 +I1219 16:59:56.935837 23545 nccl_context.cc:74] init nccl context nranks: 8 local rank: 0 gpu id: 0 ring id: 0 +W1219 17:00:00.904070 23545 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1219 17:00:00.904078 23562 device_context.cc:447] Please NOTE: device: 1, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1219 17:00:00.904153 23595 device_context.cc:447] Please NOTE: device: 3, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1219 17:00:00.904173 23610 device_context.cc:447] Please NOTE: device: 4, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1219 17:00:00.904186 23659 device_context.cc:447] Please NOTE: device: 7, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1219 17:00:00.904246 23642 device_context.cc:447] Please NOTE: device: 6, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1219 17:00:00.904264 23627 device_context.cc:447] Please NOTE: device: 5, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1219 17:00:00.906248 23580 device_context.cc:447] Please NOTE: device: 2, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1219 17:00:00.957355 23562 device_context.cc:465] device: 1, cuDNN Version: 7.6. +W1219 17:00:00.957355 23659 device_context.cc:465] device: 7, cuDNN Version: 7.6. +W1219 17:00:00.957358 23595 device_context.cc:465] device: 3, cuDNN Version: 7.6. +W1219 17:00:00.957360 23545 device_context.cc:465] device: 0, cuDNN Version: 7.6. +W1219 17:00:00.957374 23610 device_context.cc:465] device: 4, cuDNN Version: 7.6. +W1219 17:00:00.957383 23642 device_context.cc:465] device: 6, cuDNN Version: 7.6. +W1219 17:00:00.957394 23580 device_context.cc:465] device: 2, cuDNN Version: 7.6. +W1219 17:00:00.957394 23627 device_context.cc:465] device: 5, cuDNN Version: 7.6. 
+INFO:local_logger:----- world_size = 8, local_rank = 6 +INFO:local_logger:----- world_size = 8, local_rank = 3 +INFO:master_logger: +AMP: False +BASE: [''] +DATA: + BATCH_SIZE: 256 + BATCH_SIZE_EVAL: 8 + CROP_PCT: 0.875 + DATASET: imagenet2012 + DATA_PATH: /dataset/imagenet + IMAGE_SIZE: 224 + NUM_WORKERS: 4 +EVAL: False +LOCAL_RANK: 0 +MODEL: + ATTENTION_DROPOUT: 0.1 + DROPOUT: 0.1 + DROPPATH: 0.0 + MAE_PRETRAIN: True + NAME: vit_base_patch16_224_dec1 + NUM_CLASSES: 1000 + PRETRAINED: None + RESUME: None + TRANS: + DECODER: + DEPTH: 1 + EMBED_DIM: 512 + NUM_HEADS: 8 + ENCODER: + DEPTH: 12 + EMBED_DIM: 768 + NUM_HEADS: 12 + MASK_RATIO: 0.75 + MLP_RATIO: 4.0 + PATCH_SIZE: 16 + QKV_BIAS: True + TYPE: MAE +NGPUS: 8 +REPORT_FREQ: 100 +SAVE: ./output/train-20211219-16-59-32 +SAVE_FREQ: 1 +SEED: 0 +TAG: default +TRAIN: + ACCUM_ITER: 2 + BASE_LR: 0.00015 + CUTMIX_ALPHA: 1.0 + CUTMIX_MINMAX: None + END_LR: 0.0005 + GRAD_CLIP: 1 + LAST_EPOCH: 0 + LINEAR_SCALED_LR: None + LR_SCHEDULER: + DECAY_EPOCHS: 30 + DECAY_RATE: 0.1 + MILESTONES: 30, 60, 90 + NAME: warmupcosine + MIXUP_ALPHA: 0.8 + MIXUP_MODE: batch + MIXUP_PROB: 1.0 + MIXUP_SWITCH_PROB: 0.5 + NORMALIZE_TARGET: True + NUM_EPOCHS: 800 + OPTIMIZER: + BETAS: (0.9, 0.95) + EPS: 1e-08 + MOMENTUM: 0.9 + NAME: AdamW + RAND_AUGMENT: False + RAND_AUGMENT_LAYERS: 9 + RAND_AUGMENT_MAGNITUDE: 5 + SMOOTHING: 0.1 + WARMUP_EPOCHS: 40 + WARMUP_START_LR: 1e-06 + WEIGHT_DECAY: 0.05 +VALIDATE_FREQ: 100 +INFO:local_logger:----- world_size = 8, local_rank = 0 +INFO:master_logger:----- world_size = 8, local_rank = 0 +INFO:local_logger:----- world_size = 8, local_rank = 7 +INFO:local_logger:----- world_size = 8, local_rank = 5 +INFO:local_logger:----- world_size = 8, local_rank = 1 +INFO:local_logger:----- world_size = 8, local_rank = 2 +INFO:local_logger:----- world_size = 8, local_rank = 4 +INFO:local_logger:----- Total # of train batch (single gpu): 626 +INFO:master_logger:----- Total # of train batch (single gpu): 626 +INFO:local_logger:----- Total # of train batch (single gpu): 626 +INFO:local_logger:----- Total # of train batch (single gpu): 626 +INFO:local_logger:Start training from epoch 1. +INFO:master_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000005 +INFO:master_logger:Now training epoch 1. LR=0.000005 +INFO:local_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000005 +INFO:local_logger:Start training from epoch 1. +INFO:local_logger:----- Total # of train batch (single gpu): 626 +INFO:local_logger:Now training epoch 1. LR=0.000005 +INFO:local_logger:----- Total # of train batch (single gpu): 626 +INFO:local_logger:----- Total # of train batch (single gpu): 626 +INFO:local_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000005 +INFO:local_logger:----- Total # of train batch (single gpu): 626 +INFO:local_logger:Start training from epoch 1. +INFO:local_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000005 +INFO:local_logger:----- Total # of train batch (single gpu): 626 +INFO:local_logger:Now training epoch 1. LR=0.000005 +INFO:local_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000005 +INFO:local_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000005 +ERROR: Unexpected BUS error encountered in DataLoader worker. 
This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough +ERROR: Unexpected BUS error encountered in DataLoader worker. This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough +ERROR: Unexpected BUS error encountered in DataLoader worker. This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough +ERROR: Unexpected BUS error encountered in DataLoader worker. This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough +ERROR: Unexpected BUS error encountered in DataLoader worker. This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough +ERROR: Unexpected BUS error encountered in DataLoader worker. This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough +ERROR: Unexpected BUS error encountered in DataLoader worker. This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough +ERROR: Unexpected BUS error encountered in DataLoader worker. This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough +ERROR: Unexpected BUS error encountered in DataLoader worker. This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough +ERROR: Unexpected BUS error encountered in DataLoader worker. This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough +ERROR: Unexpected BUS error encountered in DataLoader worker. This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough +ERROR: Unexpected BUS error encountered in DataLoader worker. This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough +ERROR: Unexpected BUS error encountered in DataLoader worker. This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough +ERROR: Unexpected BUS error encountered in DataLoader worker. This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough +ERROR: Unexpected BUS error encountered in DataLoader worker. This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough +ERROR: Unexpected BUS error encountered in DataLoader worker. This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough +ERROR: Unexpected BUS error encountered in DataLoader worker. This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough +ERROR: Unexpected BUS error encountered in DataLoader worker. 
This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough +ERROR: Unexpected BUS error encountered in DataLoader worker. This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough +ERROR: Unexpected BUS error encountered in DataLoader worker. This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough +ERROR: Unexpected BUS error encountered in DataLoader worker. This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough +ERROR: Unexpected BUS error encountered in DataLoader worker. This might be caused by insufficient shared memory (shm), please check whether use_shared_memory is set and storage space in /dev/shm is enough +Exception in thread Thread-1: +Traceback (most recent call last): + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 583, in _get_data + data = self._data_queue.get(timeout=self._timeout) + File "/opt/conda/envs/py36/lib/python3.6/multiprocessing/queues.py", line 105, in get + raise Empty +queue.Empty + +During handling of the above exception, another exception occurred: + +Traceback (most recent call last): + File "/opt/conda/envs/py36/lib/python3.6/threading.py", line 916, in _bootstrap_inner + self.run() + File "/opt/conda/envs/py36/lib/python3.6/threading.py", line 864, in run + self._target(*self._args, **self._kwargs) + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 505, in _thread_loop + batch = self._get_data() + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 599, in _get_data + "pids: {}".format(len(failed_workers), pids)) +RuntimeError: DataLoader 1 workers exit unexpectedly, pids: 23832 + + + +-------------------------------------- +C++ Traceback (most recent call last): +-------------------------------------- +No stack trace in paddle, may be caused by external reasons. + +---------------------- +Error Message Summary: +---------------------- +FatalError: `Termination signal` is detected by the operating system. 
+  [TimeInfo: *** Aborted at 1639904442 (unix time) try "date -d @1639904442" if you are using GNU date ***]
+  [SignalInfo: *** SIGTERM (@0x5be5) received by PID 23545 (TID 0x7f5dda7df700) from PID 23525 ***]
+
+/opt/conda/envs/py36/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 20 leaked semaphores to clean up at shutdown
+  len(cache))
+Traceback (most recent call last):
+  File "main_multi_gpu_pretrain.py", line 416, in <module>
+    main()
+  File "main_multi_gpu_pretrain.py", line 412, in main
+    dist.spawn(main_worker, args=(config, dataset_train, ), nprocs=config.NGPUS)
+  File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/distributed/spawn.py", line 502, in spawn
+    while not context.join():
+  File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/distributed/spawn.py", line 312, in join
+    self._throw_exception(error_index)
+  File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/distributed/spawn.py", line 330, in _throw_exception
+    raise Exception(msg)
+Exception:
+
+----------------------------------------------
+Process 3 terminated with the following error:
+----------------------------------------------
+
+Traceback (most recent call last):
+  File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/distributed/spawn.py", line 261, in _func_wrapper
+    result = func(*args)
+  File "/workspace/ppvit_github/PaddleViT_raw/PaddleViT/image_classification/MAE/main_multi_gpu_pretrain.py", line 368, in main_worker
+    master_logger=master_logger)
+  File "/workspace/ppvit_github/PaddleViT_raw/PaddleViT/image_classification/MAE/main_multi_gpu_pretrain.py", line 157, in train
+    reconstructed_patches = model(images, masks)
+  File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 914, in __call__
+    outputs = self.forward(*inputs, **kwargs)
+  File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/parallel.py", line 695, in forward
+    outputs = self._layers(*inputs, **kwargs)
+  File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 914, in __call__
+    outputs = self.forward(*inputs, **kwargs)
+  File "/workspace/ppvit_github/PaddleViT_raw/PaddleViT/image_classification/MAE/transformer.py", line 537, in forward
+    enc_out = self.encoder(no_mask_x)
+  File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 914, in __call__
+    outputs = self.forward(*inputs, **kwargs)
+  File "/workspace/ppvit_github/PaddleViT_raw/PaddleViT/image_classification/MAE/transformer.py", line 364, in forward
+    x = layer(x)
+  File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 914, in __call__
+    outputs = self.forward(*inputs, **kwargs)
+  File "/workspace/ppvit_github/PaddleViT_raw/PaddleViT/image_classification/MAE/transformer.py", line 310, in forward
+    x = self.mlp(x)
+  File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 914, in __call__
+    outputs = self.forward(*inputs, **kwargs)
+  File "/workspace/ppvit_github/PaddleViT_raw/PaddleViT/image_classification/MAE/transformer.py", line 245, in forward
+    x = self.fc1(x)
+  File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 914, in __call__
+    outputs = self.forward(*inputs, **kwargs)
+  File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/nn/layer/common.py", line 172, in forward
+    x=input, weight=self.weight, bias=self.bias, name=self.name)
"/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/nn/functional/common.py", line 1474, in linear + False) + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/multiprocess_utils.py", line 134, in __handler__ + core._throw_error_if_process_failed() +SystemError: (Fatal) DataLoader process (pid 1. If run DataLoader by DataLoader.from_generator(...), queue capacity is set by from_generator(..., capacity=xx, ...). + 2. If run DataLoader by DataLoader(dataset, ...), queue capacity is set as 2 times of the max value of num_workers and len(places). + 3. If run by DataLoader(dataset, ..., use_shared_memory=True), set use_shared_memory=False for not using shared memory.) exited is killed by signal: 23723. + It may be caused by insufficient shared storage space. This problem usually occurs when using docker as a development environment. + Please use command `df -h` to check the storage space of `/dev/shm`. Shared storage space needs to be greater than (DataLoader Num * DataLoader queue capacity * 1 batch data size). + You can solve this problem by increasing the shared storage space or reducing the queue capacity appropriately. +Bus error (at /paddle/paddle/fluid/imperative/data_loader.cc:177) + + +merging config from ./configs/vit_base_patch16_224_pretrain_dec1.yaml +----- Imagenet2012 image train list len = 1281167 +server not ready, wait 3 sec to retry... +not ready endpoints:['127.0.0.1:58819', '127.0.0.1:34756', '127.0.0.1:44071', '127.0.0.1:12661', '127.0.0.1:44311', '127.0.0.1:14139', '127.0.0.1:51679'] +I1219 17:02:09.309500 24382 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:58819 successful. +server not ready, wait 3 sec to retry... +not ready endpoints:['127.0.0.1:34756', '127.0.0.1:44071', '127.0.0.1:12661', '127.0.0.1:44311', '127.0.0.1:14139', '127.0.0.1:51679'] +I1219 17:02:11.901250 24397 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:34756 successful. +server not ready, wait 3 sec to retry... +not ready endpoints:['127.0.0.1:44071', '127.0.0.1:12661', '127.0.0.1:44311', '127.0.0.1:14139', '127.0.0.1:51679'] +I1219 17:02:14.341609 24414 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:44071 successful. +server not ready, wait 3 sec to retry... +not ready endpoints:['127.0.0.1:12661', '127.0.0.1:44311', '127.0.0.1:14139', '127.0.0.1:51679'] +I1219 17:02:17.001890 24429 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:12661 successful. +server not ready, wait 3 sec to retry... +not ready endpoints:['127.0.0.1:44311', '127.0.0.1:14139', '127.0.0.1:51679'] +I1219 17:02:19.379423 24447 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:44311 successful. +server not ready, wait 3 sec to retry... +not ready endpoints:['127.0.0.1:14139', '127.0.0.1:51679'] +I1219 17:02:22.029084 24463 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:14139 successful. +I1219 17:02:24.569348 24481 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:51679 successful. 
+I1219 17:02:24.931157 24382 nccl_context.cc:74] init nccl context nranks: 8 local rank: 1 gpu id: 1 ring id: 0 +I1219 17:02:24.931161 24397 nccl_context.cc:74] init nccl context nranks: 8 local rank: 2 gpu id: 2 ring id: 0 +I1219 17:02:24.931192 24414 nccl_context.cc:74] init nccl context nranks: 8 local rank: 3 gpu id: 3 ring id: 0 +I1219 17:02:24.931200 24429 nccl_context.cc:74] init nccl context nranks: 8 local rank: 4 gpu id: 4 ring id: 0 +I1219 17:02:24.931208 24447 nccl_context.cc:74] init nccl context nranks: 8 local rank: 5 gpu id: 5 ring id: 0 +I1219 17:02:24.931213 24463 nccl_context.cc:74] init nccl context nranks: 8 local rank: 6 gpu id: 6 ring id: 0 +I1219 17:02:24.931216 24481 nccl_context.cc:74] init nccl context nranks: 8 local rank: 7 gpu id: 7 ring id: 0 +I1219 17:02:24.931238 24365 nccl_context.cc:74] init nccl context nranks: 8 local rank: 0 gpu id: 0 ring id: 0 +W1219 17:02:28.374552 24365 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1219 17:02:28.374681 24397 device_context.cc:447] Please NOTE: device: 2, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1219 17:02:28.374711 24414 device_context.cc:447] Please NOTE: device: 3, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1219 17:02:28.374712 24429 device_context.cc:447] Please NOTE: device: 4, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1219 17:02:28.374729 24447 device_context.cc:447] Please NOTE: device: 5, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1219 17:02:28.374773 24382 device_context.cc:447] Please NOTE: device: 1, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1219 17:02:28.374810 24463 device_context.cc:447] Please NOTE: device: 6, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1219 17:02:28.376953 24481 device_context.cc:447] Please NOTE: device: 7, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1219 17:02:28.382552 24414 device_context.cc:465] device: 3, cuDNN Version: 7.6. +W1219 17:02:28.382556 24365 device_context.cc:465] device: 0, cuDNN Version: 7.6. +W1219 17:02:28.382561 24447 device_context.cc:465] device: 5, cuDNN Version: 7.6. +W1219 17:02:28.382565 24397 device_context.cc:465] device: 2, cuDNN Version: 7.6. +W1219 17:02:28.382582 24463 device_context.cc:465] device: 6, cuDNN Version: 7.6. +W1219 17:02:28.382568 24429 device_context.cc:465] device: 4, cuDNN Version: 7.6. +W1219 17:02:28.382580 24382 device_context.cc:465] device: 1, cuDNN Version: 7.6. +W1219 17:02:28.382681 24481 device_context.cc:465] device: 7, cuDNN Version: 7.6. 
+INFO:local_logger:----- world_size = 8, local_rank = 1 +INFO:local_logger:----- world_size = 8, local_rank = 5 +INFO:local_logger:----- world_size = 8, local_rank = 3 +INFO:local_logger:----- world_size = 8, local_rank = 2 +INFO:local_logger:----- world_size = 8, local_rank = 7 +INFO:local_logger:----- world_size = 8, local_rank = 6 +INFO:master_logger: +AMP: False +BASE: [''] +DATA: + BATCH_SIZE: 256 + BATCH_SIZE_EVAL: 8 + CROP_PCT: 0.875 + DATASET: imagenet2012 + DATA_PATH: /dataset/imagenet + IMAGE_SIZE: 224 + NUM_WORKERS: 4 +EVAL: False +LOCAL_RANK: 0 +MODEL: + ATTENTION_DROPOUT: 0.1 + DROPOUT: 0.1 + DROPPATH: 0.0 + MAE_PRETRAIN: True + NAME: vit_base_patch16_224_dec1 + NUM_CLASSES: 1000 + PRETRAINED: None + RESUME: None + TRANS: + DECODER: + DEPTH: 1 + EMBED_DIM: 512 + NUM_HEADS: 8 + ENCODER: + DEPTH: 12 + EMBED_DIM: 768 + NUM_HEADS: 12 + MASK_RATIO: 0.75 + MLP_RATIO: 4.0 + PATCH_SIZE: 16 + QKV_BIAS: True + TYPE: MAE +NGPUS: 8 +REPORT_FREQ: 100 +SAVE: ./output/train-20211219-17-02-00 +SAVE_FREQ: 1 +SEED: 0 +TAG: default +TRAIN: + ACCUM_ITER: 2 + BASE_LR: 0.00015 + CUTMIX_ALPHA: 1.0 + CUTMIX_MINMAX: None + END_LR: 0.0005 + GRAD_CLIP: 1 + LAST_EPOCH: 0 + LINEAR_SCALED_LR: None + LR_SCHEDULER: + DECAY_EPOCHS: 30 + DECAY_RATE: 0.1 + MILESTONES: 30, 60, 90 + NAME: warmupcosine + MIXUP_ALPHA: 0.8 + MIXUP_MODE: batch + MIXUP_PROB: 1.0 + MIXUP_SWITCH_PROB: 0.5 + NORMALIZE_TARGET: True + NUM_EPOCHS: 800 + OPTIMIZER: + BETAS: (0.9, 0.95) + EPS: 1e-08 + MOMENTUM: 0.9 + NAME: AdamW + RAND_AUGMENT: False + RAND_AUGMENT_LAYERS: 9 + RAND_AUGMENT_MAGNITUDE: 5 + SMOOTHING: 0.1 + WARMUP_EPOCHS: 40 + WARMUP_START_LR: 1e-06 + WEIGHT_DECAY: 0.05 +VALIDATE_FREQ: 100 +INFO:local_logger:----- world_size = 8, local_rank = 0 +INFO:master_logger:----- world_size = 8, local_rank = 0 +INFO:local_logger:----- world_size = 8, local_rank = 4 +INFO:local_logger:----- Total # of train batch (single gpu): 626 +INFO:local_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000005 +INFO:local_logger:----- Total # of train batch (single gpu): 626 +INFO:local_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000005 +INFO:local_logger:----- Total # of train batch (single gpu): 626 +INFO:master_logger:----- Total # of train batch (single gpu): 626 +INFO:local_logger:Start training from epoch 1. +INFO:master_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000005 +INFO:master_logger:Now training epoch 1. LR=0.000005 +INFO:local_logger:----- Total # of train batch (single gpu): 626 +INFO:local_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000005 +INFO:local_logger:----- Total # of train batch (single gpu): 626 +INFO:local_logger:----- Total # of train batch (single gpu): 626 +INFO:local_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000005 +INFO:local_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000005 +INFO:local_logger:----- Total # of train batch (single gpu): 626 +INFO:local_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000005 +INFO:local_logger:----- Total # of train batch (single gpu): 626 +INFO:local_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. 
LR=0.000005 +INFO:local_logger:Epoch[001/800], Step[0000/0626], Avg Loss: 1.1452 +INFO:local_logger:Epoch[001/800], Step[0000/0626], Avg Loss: 1.1431 +INFO:local_logger:Epoch[001/800], Step[0000/0626], Avg Loss: 1.1469 +INFO:local_logger:Epoch[001/800], Step[0000/0626], Avg Loss: 1.1481 +INFO:local_logger:Epoch[001/800], Step[0000/0626], Avg Loss: 1.1408 +INFO:local_logger:Epoch[001/800], Step[0000/0626], Avg Loss: 1.1501 +INFO:local_logger:Epoch[001/800], Step[0000/0626], Avg Loss: 1.1475 +INFO:local_logger:Epoch[001/800], Step[0000/0626], Avg Loss: 1.1440 +INFO:master_logger:Epoch[001/800], Step[0000/0626], Avg Loss: 1.1457
+
+
+--------------------------------------
+C++ Traceback (most recent call last):
+--------------------------------------
+No stack trace in paddle, may be caused by external reasons.
+
+----------------------
+Error Message Summary:
+----------------------
+FatalError: `Termination signal` is detected by the operating system.
+  [TimeInfo: *** Aborted at 1639904603 (unix time) try "date -d @1639904603" if you are using GNU date ***]
+  [SignalInfo: *** SIGTERM (@0x5f17) received by PID 24365 (TID 0x7f5d5ca46700) from PID 24343 ***]
+
+Traceback (most recent call last):
+  File "main_multi_gpu_pretrain.py", line 416, in <module>
+    main()
+  File "main_multi_gpu_pretrain.py", line 412, in main
+    dist.spawn(main_worker, args=(config, dataset_train, ), nprocs=config.NGPUS)
+  File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/distributed/spawn.py", line 502, in spawn
+    while not context.join():
+  File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/distributed/spawn.py", line 312, in join
+    self._throw_exception(error_index)
+  File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/distributed/spawn.py", line 330, in _throw_exception
+    raise Exception(msg)
+Exception:
+
+----------------------------------------------
+Process 1 terminated with the following error:
+----------------------------------------------
+
+Traceback (most recent call last):
+  File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/distributed/spawn.py", line 261, in _func_wrapper
+    result = func(*args)
+  File "/workspace/ppvit_github/PaddleViT_raw/PaddleViT/image_classification/MAE/main_multi_gpu_pretrain.py", line 368, in main_worker
+    master_logger=master_logger)
+  File "/workspace/ppvit_github/PaddleViT_raw/PaddleViT/image_classification/MAE/main_multi_gpu_pretrain.py", line 163, in train
+    loss.backward()
+  File "/opt/conda/envs/py36/lib/python3.6/site-packages/decorator.py", line 232, in fun
+    return caller(func, *(extras + args), **kw)
+  File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/wrapped_decorator.py", line 25, in __impl__
+    return wrapped_func(*args, **kwargs)
+  File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/framework.py", line 229, in __impl__
+    return func(*args, **kwargs)
+  File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/fluid/dygraph/varbase_patch_methods.py", line 239, in backward
+    framework._dygraph_tracer())
+OSError: (External) ResourceExhaustedError:
+
+Out of memory error on GPU 1. Cannot allocate 394.000244MB memory on GPU 1, 15.719788GB memory has been allocated and available memory is only 63.437500MB.
+
+Please check whether there is any other process using GPU 1.
+1. If yes, please stop them, or start PaddlePaddle on another GPU.
+2. If no, please decrease the batch size of your model.
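The runtime's suggestion here is a smaller batch; the retry that follows in this log also switches `AMP` from `False` to `True`, which roughly halves activation memory by running the forward pass in float16. Below is a minimal sketch of one mixed-precision training step in Paddle 2.x, using a toy layer and batch rather than the repository's actual training loop.

```python
import paddle
import paddle.nn as nn

# Toy stand-ins; the real model and data come from the MAE pretraining script.
model = nn.Linear(768, 768)
optimizer = paddle.optimizer.AdamW(parameters=model.parameters(), learning_rate=1.5e-4)
scaler = paddle.amp.GradScaler(init_loss_scaling=1024)

x = paddle.randn([32, 768])            # smaller batch than the 256 used in the failing run
with paddle.amp.auto_cast():           # run the forward pass in float16 where it is safe
    loss = model(x).mean()
scaled = scaler.scale(loss)            # scale the loss so fp16 gradients do not underflow
scaled.backward()
scaler.minimize(optimizer, scaled)     # unscale gradients and apply the optimizer step
optimizer.clear_grad()
```

Raising `TRAIN.ACCUM_ITER` (already 2 in this config) while lowering `DATA.BATCH_SIZE` is another way to cut peak memory without changing the effective batch size.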
+ + (at /paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:79) + (at /paddle/paddle/fluid/imperative/basic_engine.cc:568) + + +merging config from ./configs/vit_base_patch16_224_pretrain_dec1.yaml +----- Imagenet2012 image train list len = 1281167 +server not ready, wait 3 sec to retry... +not ready endpoints:['127.0.0.1:45480', '127.0.0.1:58605', '127.0.0.1:23406', '127.0.0.1:16014', '127.0.0.1:60086', '127.0.0.1:60603', '127.0.0.1:46782'] +I1219 17:07:49.286090 25456 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:45480 successful. +server not ready, wait 3 sec to retry... +not ready endpoints:['127.0.0.1:58605', '127.0.0.1:23406', '127.0.0.1:16014', '127.0.0.1:60086', '127.0.0.1:60603', '127.0.0.1:46782'] +I1219 17:07:51.690086 25473 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:58605 successful. +server not ready, wait 3 sec to retry... +not ready endpoints:['127.0.0.1:23406', '127.0.0.1:16014', '127.0.0.1:60086', '127.0.0.1:60603', '127.0.0.1:46782'] +I1219 17:07:54.058967 25488 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:23406 successful. +server not ready, wait 3 sec to retry... +not ready endpoints:['127.0.0.1:16014', '127.0.0.1:60086', '127.0.0.1:60603', '127.0.0.1:46782'] +I1219 17:07:57.064612 25503 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:16014 successful. +server not ready, wait 3 sec to retry... +not ready endpoints:['127.0.0.1:60086', '127.0.0.1:60603', '127.0.0.1:46782'] +I1219 17:07:59.496040 25520 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:60086 successful. +server not ready, wait 3 sec to retry... +not ready endpoints:['127.0.0.1:60603', '127.0.0.1:46782'] +I1219 17:08:02.203279 25537 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:60603 successful. +I1219 17:08:04.597697 25554 gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:46782 successful. 
+I1219 17:08:05.017540 25473 nccl_context.cc:74] init nccl context nranks: 8 local rank: 2 gpu id: 2 ring id: 0 +I1219 17:08:05.017537 25456 nccl_context.cc:74] init nccl context nranks: 8 local rank: 1 gpu id: 1 ring id: 0 +I1219 17:08:05.017560 25488 nccl_context.cc:74] init nccl context nranks: 8 local rank: 3 gpu id: 3 ring id: 0 +I1219 17:08:05.017565 25537 nccl_context.cc:74] init nccl context nranks: 8 local rank: 6 gpu id: 6 ring id: 0 +I1219 17:08:05.017578 25503 nccl_context.cc:74] init nccl context nranks: 8 local rank: 4 gpu id: 4 ring id: 0 +I1219 17:08:05.017585 25520 nccl_context.cc:74] init nccl context nranks: 8 local rank: 5 gpu id: 5 ring id: 0 +I1219 17:08:05.017601 25554 nccl_context.cc:74] init nccl context nranks: 8 local rank: 7 gpu id: 7 ring id: 0 +I1219 17:08:05.017613 25441 nccl_context.cc:74] init nccl context nranks: 8 local rank: 0 gpu id: 0 ring id: 0 +W1219 17:08:09.206136 25441 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1219 17:08:09.206564 25456 device_context.cc:447] Please NOTE: device: 1, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1219 17:08:09.206579 25554 device_context.cc:447] Please NOTE: device: 7, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1219 17:08:09.206670 25488 device_context.cc:447] Please NOTE: device: 3, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1219 17:08:09.206694 25520 device_context.cc:447] Please NOTE: device: 5, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1219 17:08:09.206728 25503 device_context.cc:447] Please NOTE: device: 4, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1219 17:08:09.209081 25537 device_context.cc:447] Please NOTE: device: 6, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1219 17:08:09.209785 25473 device_context.cc:447] Please NOTE: device: 2, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W1219 17:08:09.212059 25456 device_context.cc:465] device: 1, cuDNN Version: 7.6. +W1219 17:08:09.212066 25554 device_context.cc:465] device: 7, cuDNN Version: 7.6. +W1219 17:08:09.212080 25503 device_context.cc:465] device: 4, cuDNN Version: 7.6. +W1219 17:08:09.212086 25520 device_context.cc:465] device: 5, cuDNN Version: 7.6. +W1219 17:08:09.212086 25488 device_context.cc:465] device: 3, cuDNN Version: 7.6. +W1219 17:08:09.212239 25441 device_context.cc:465] device: 0, cuDNN Version: 7.6. +W1219 17:08:09.213409 25537 device_context.cc:465] device: 6, cuDNN Version: 7.6. +W1219 17:08:09.214195 25473 device_context.cc:465] device: 2, cuDNN Version: 7.6. 
+INFO:local_logger:----- world_size = 8, local_rank = 4 +INFO:local_logger:----- world_size = 8, local_rank = 1 +INFO:local_logger:----- world_size = 8, local_rank = 2 +INFO:master_logger: +AMP: True +BASE: [''] +DATA: + BATCH_SIZE: 256 + BATCH_SIZE_EVAL: 8 + CROP_PCT: 0.875 + DATASET: imagenet2012 + DATA_PATH: /dataset/imagenet + IMAGE_SIZE: 224 + NUM_WORKERS: 2 +EVAL: False +LOCAL_RANK: 0 +MODEL: + ATTENTION_DROPOUT: 0.0 + DROPOUT: 0.0 + DROPPATH: 0.0 + MAE_PRETRAIN: True + NAME: vit_base_patch16_224_dec1 + NUM_CLASSES: 1000 + PRETRAINED: None + RESUME: None + TRANS: + DECODER: + DEPTH: 1 + EMBED_DIM: 512 + NUM_HEADS: 8 + ENCODER: + DEPTH: 12 + EMBED_DIM: 768 + NUM_HEADS: 12 + MASK_RATIO: 0.75 + MLP_RATIO: 4.0 + PATCH_SIZE: 16 + QKV_BIAS: True + TYPE: MAE +NGPUS: 8 +REPORT_FREQ: 100 +SAVE: ./output/train-20211219-17-07-40 +SAVE_FREQ: 1 +SEED: 0 +TAG: default +TRAIN: + ACCUM_ITER: 2 + BASE_LR: 0.00015 + CUTMIX_ALPHA: 1.0 + CUTMIX_MINMAX: None + END_LR: 0.0005 + GRAD_CLIP: 1 + LAST_EPOCH: 0 + LINEAR_SCALED_LR: None + LR_SCHEDULER: + DECAY_EPOCHS: 30 + DECAY_RATE: 0.1 + MILESTONES: 30, 60, 90 + NAME: warmupcosine + MIXUP_ALPHA: 0.8 + MIXUP_MODE: batch + MIXUP_PROB: 1.0 + MIXUP_SWITCH_PROB: 0.5 + NORMALIZE_TARGET: True + NUM_EPOCHS: 800 + OPTIMIZER: + BETAS: (0.9, 0.95) + EPS: 1e-08 + MOMENTUM: 0.9 + NAME: AdamW + RAND_AUGMENT: False + RAND_AUGMENT_LAYERS: 9 + RAND_AUGMENT_MAGNITUDE: 5 + SMOOTHING: 0.1 + WARMUP_EPOCHS: 40 + WARMUP_START_LR: 1e-06 + WEIGHT_DECAY: 0.05 +VALIDATE_FREQ: 100 +INFO:local_logger:----- world_size = 8, local_rank = 0 +INFO:master_logger:----- world_size = 8, local_rank = 0 +INFO:local_logger:----- world_size = 8, local_rank = 6 +INFO:local_logger:----- world_size = 8, local_rank = 5 +INFO:local_logger:----- world_size = 8, local_rank = 7 +INFO:local_logger:----- world_size = 8, local_rank = 3 +INFO:local_logger:----- Total # of train batch (single gpu): 626 +INFO:local_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000005 +INFO:local_logger:----- Total # of train batch (single gpu): 626 +INFO:local_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000005 +INFO:local_logger:----- Total # of train batch (single gpu): 626 +INFO:local_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000005 +INFO:local_logger:----- Total # of train batch (single gpu): 626 +INFO:master_logger:----- Total # of train batch (single gpu): 626 +INFO:local_logger:Start training from epoch 1. +INFO:master_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000005 +INFO:master_logger:Now training epoch 1. LR=0.000005 +INFO:local_logger:----- Total # of train batch (single gpu): 626 +INFO:local_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000005 +INFO:local_logger:----- Total # of train batch (single gpu): 626 +INFO:local_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000005 +INFO:local_logger:----- Total # of train batch (single gpu): 626 +INFO:local_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. LR=0.000005 +INFO:local_logger:----- Total # of train batch (single gpu): 626 +INFO:local_logger:Start training from epoch 1. +INFO:local_logger:Now training epoch 1. 
LR=0.000005 +INFO:local_logger:Epoch[001/800], Step[0000/0626], Avg Loss: 1.1468 +INFO:local_logger:Epoch[001/800], Step[0000/0626], Avg Loss: 1.1446 +INFO:local_logger:Epoch[001/800], Step[0000/0626], Avg Loss: 1.1495 +INFO:local_logger:Epoch[001/800], Step[0000/0626], Avg Loss: 1.1428 +INFO:local_logger:Epoch[001/800], Step[0000/0626], Avg Loss: 1.1450 +INFO:local_logger:Epoch[001/800], Step[0000/0626], Avg Loss: 1.1461 +INFO:master_logger:Epoch[001/800], Step[0000/0626], Avg Loss: 1.1454 +INFO:local_logger:Epoch[001/800], Step[0000/0626], Avg Loss: 1.1459 +INFO:local_logger:Epoch[001/800], Step[0000/0626], Avg Loss: 1.1427 +INFO:local_logger:Epoch[001/800], Step[0100/0626], Avg Loss: 1.1136 +INFO:local_logger:Epoch[001/800], Step[0100/0626], Avg Loss: 1.1140 +INFO:local_logger:Epoch[001/800], Step[0100/0626], Avg Loss: 1.1137 +INFO:local_logger:Epoch[001/800], Step[0100/0626], Avg Loss: 1.1132 +INFO:local_logger:Epoch[001/800], Step[0100/0626], Avg Loss: 1.1132 +INFO:master_logger:Epoch[001/800], Step[0100/0626], Avg Loss: 1.1136 +INFO:local_logger:Epoch[001/800], Step[0100/0626], Avg Loss: 1.1135 +INFO:local_logger:Epoch[001/800], Step[0100/0626], Avg Loss: 1.1138 +INFO:local_logger:Epoch[001/800], Step[0100/0626], Avg Loss: 1.1139 +INFO:local_logger:Epoch[001/800], Step[0200/0626], Avg Loss: 1.0903 +INFO:local_logger:Epoch[001/800], Step[0200/0626], Avg Loss: 1.0904 +INFO:local_logger:Epoch[001/800], Step[0200/0626], Avg Loss: 1.0904 +INFO:local_logger:Epoch[001/800], Step[0200/0626], Avg Loss: 1.0908 +INFO:local_logger:Epoch[001/800], Step[0200/0626], Avg Loss: 1.0903 +INFO:local_logger:Epoch[001/800], Step[0200/0626], Avg Loss: 1.0900 +INFO:local_logger:Epoch[001/800], Step[0200/0626], Avg Loss: 1.0904 +INFO:local_logger:Epoch[001/800], Step[0200/0626], Avg Loss: 1.0902 +INFO:master_logger:Epoch[001/800], Step[0200/0626], Avg Loss: 1.0904 +INFO:local_logger:Epoch[001/800], Step[0300/0626], Avg Loss: 1.0723 +INFO:local_logger:Epoch[001/800], Step[0300/0626], Avg Loss: 1.0717 +INFO:local_logger:Epoch[001/800], Step[0300/0626], Avg Loss: 1.0718 +INFO:local_logger:Epoch[001/800], Step[0300/0626], Avg Loss: 1.0716 +INFO:local_logger:Epoch[001/800], Step[0300/0626], Avg Loss: 1.0719 +INFO:master_logger:Epoch[001/800], Step[0300/0626], Avg Loss: 1.0719 +INFO:local_logger:Epoch[001/800], Step[0300/0626], Avg Loss: 1.0718 +INFO:local_logger:Epoch[001/800], Step[0300/0626], Avg Loss: 1.0720 +INFO:local_logger:Epoch[001/800], Step[0300/0626], Avg Loss: 1.0720 +INFO:local_logger:Epoch[001/800], Step[0400/0626], Avg Loss: 1.0576 +INFO:local_logger:Epoch[001/800], Step[0400/0626], Avg Loss: 1.0572 +INFO:local_logger:Epoch[001/800], Step[0400/0626], Avg Loss: 1.0572 +INFO:local_logger:Epoch[001/800], Step[0400/0626], Avg Loss: 1.0570 +INFO:local_logger:Epoch[001/800], Step[0400/0626], Avg Loss: 1.0573 +INFO:local_logger:Epoch[001/800], Step[0400/0626], Avg Loss: 1.0570 +INFO:local_logger:Epoch[001/800], Step[0400/0626], Avg Loss: 1.0573 +INFO:master_logger:Epoch[001/800], Step[0400/0626], Avg Loss: 1.0572 +INFO:local_logger:Epoch[001/800], Step[0400/0626], Avg Loss: 1.0574 +INFO:local_logger:Epoch[001/800], Step[0500/0626], Avg Loss: 1.0461 +INFO:local_logger:Epoch[001/800], Step[0500/0626], Avg Loss: 1.0459 +INFO:local_logger:Epoch[001/800], Step[0500/0626], Avg Loss: 1.0459 +INFO:local_logger:Epoch[001/800], Step[0500/0626], Avg Loss: 1.0461 +INFO:local_logger:Epoch[001/800], Step[0500/0626], Avg Loss: 1.0457 +INFO:local_logger:Epoch[001/800], Step[0500/0626], Avg Loss: 1.0461 
+INFO:master_logger:Epoch[001/800], Step[0500/0626], Avg Loss: 1.0460 +INFO:local_logger:Epoch[001/800], Step[0500/0626], Avg Loss: 1.0463 +INFO:local_logger:Epoch[001/800], Step[0500/0626], Avg Loss: 1.0461 +INFO:local_logger:Epoch[001/800], Step[0600/0626], Avg Loss: 1.0374 +INFO:local_logger:Epoch[001/800], Step[0600/0626], Avg Loss: 1.0374 +INFO:local_logger:Epoch[001/800], Step[0600/0626], Avg Loss: 1.0375 +INFO:local_logger:Epoch[001/800], Step[0600/0626], Avg Loss: 1.0375 +INFO:master_logger:Epoch[001/800], Step[0600/0626], Avg Loss: 1.0375 +INFO:local_logger:Epoch[001/800], Step[0600/0626], Avg Loss: 1.0372 +INFO:local_logger:Epoch[001/800], Step[0600/0626], Avg Loss: 1.0377 +INFO:local_logger:Epoch[001/800], Step[0600/0626], Avg Loss: 1.0379 +INFO:local_logger:Epoch[001/800], Step[0600/0626], Avg Loss: 1.0374 +INFO:local_logger:----- Epoch[001/800], Train Loss: 1.0359, time: 934.80 +INFO:local_logger:Now training epoch 2. LR=0.000008 +INFO:local_logger:----- Epoch[001/800], Train Loss: 1.0356, time: 934.81 +INFO:local_logger:Now training epoch 2. LR=0.000008 +INFO:local_logger:----- Epoch[001/800], Train Loss: 1.0354, time: 934.86 +INFO:local_logger:Now training epoch 2. LR=0.000008 +INFO:local_logger:----- Epoch[001/800], Train Loss: 1.0361, time: 934.98 +INFO:local_logger:Now training epoch 2. LR=0.000008 +INFO:local_logger:----- Epoch[001/800], Train Loss: 1.0358, time: 935.03 +INFO:master_logger:----- Epoch[001/800], Train Loss: 1.0357, time: 935.03 +INFO:local_logger:----- Epoch[001/800], Train Loss: 1.0358, time: 935.07 +INFO:local_logger:Now training epoch 2. LR=0.000008 +INFO:local_logger:----- Epoch[001/800], Train Loss: 1.0356, time: 935.07 +INFO:local_logger:Now training epoch 2. LR=0.000008 +INFO:local_logger:----- Epoch[001/800], Train Loss: 1.0357, time: 935.09 +INFO:local_logger:Now training epoch 2. LR=0.000008 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-1-Loss-1.0357822933105671.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-1-Loss-1.0357822933105671.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-1-Loss-1.0357822933105671.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-1-Loss-1.0357822933105671.pdopt +INFO:local_logger:Now training epoch 2. LR=0.000008 +INFO:master_logger:Now training epoch 2. 
LR=0.000008 +INFO:local_logger:Epoch[002/800], Step[0000/0626], Avg Loss: 0.9953 +INFO:master_logger:Epoch[002/800], Step[0000/0626], Avg Loss: 0.9905 +INFO:local_logger:Epoch[002/800], Step[0000/0626], Avg Loss: 0.9836 +INFO:local_logger:Epoch[002/800], Step[0000/0626], Avg Loss: 0.9941 +INFO:local_logger:Epoch[002/800], Step[0000/0626], Avg Loss: 0.9887 +INFO:local_logger:Epoch[002/800], Step[0000/0626], Avg Loss: 0.9872 +INFO:local_logger:Epoch[002/800], Step[0000/0626], Avg Loss: 0.9919 +INFO:local_logger:Epoch[002/800], Step[0000/0626], Avg Loss: 0.9949 +INFO:local_logger:Epoch[002/800], Step[0000/0626], Avg Loss: 0.9885 +INFO:local_logger:Epoch[002/800], Step[0100/0626], Avg Loss: 0.9896 +INFO:local_logger:Epoch[002/800], Step[0100/0626], Avg Loss: 0.9894 +INFO:local_logger:Epoch[002/800], Step[0100/0626], Avg Loss: 0.9900 +INFO:local_logger:Epoch[002/800], Step[0100/0626], Avg Loss: 0.9895 +INFO:local_logger:Epoch[002/800], Step[0100/0626], Avg Loss: 0.9901 +INFO:local_logger:Epoch[002/800], Step[0100/0626], Avg Loss: 0.9887 +INFO:local_logger:Epoch[002/800], Step[0100/0626], Avg Loss: 0.9897 +INFO:master_logger:Epoch[002/800], Step[0100/0626], Avg Loss: 0.9896 +INFO:local_logger:Epoch[002/800], Step[0100/0626], Avg Loss: 0.9900 +INFO:local_logger:Epoch[002/800], Step[0200/0626], Avg Loss: 0.9880 +INFO:local_logger:Epoch[002/800], Step[0200/0626], Avg Loss: 0.9889 +INFO:local_logger:Epoch[002/800], Step[0200/0626], Avg Loss: 0.9887 +INFO:local_logger:Epoch[002/800], Step[0200/0626], Avg Loss: 0.9883 +INFO:local_logger:Epoch[002/800], Step[0200/0626], Avg Loss: 0.9887 +INFO:local_logger:Epoch[002/800], Step[0200/0626], Avg Loss: 0.9887 +INFO:local_logger:Epoch[002/800], Step[0200/0626], Avg Loss: 0.9883 +INFO:master_logger:Epoch[002/800], Step[0200/0626], Avg Loss: 0.9885 +INFO:local_logger:Epoch[002/800], Step[0200/0626], Avg Loss: 0.9883 +INFO:local_logger:Epoch[002/800], Step[0300/0626], Avg Loss: 0.9878 +INFO:local_logger:Epoch[002/800], Step[0300/0626], Avg Loss: 0.9874 +INFO:local_logger:Epoch[002/800], Step[0300/0626], Avg Loss: 0.9873 +INFO:local_logger:Epoch[002/800], Step[0300/0626], Avg Loss: 0.9875 +INFO:master_logger:Epoch[002/800], Step[0300/0626], Avg Loss: 0.9876 +INFO:local_logger:Epoch[002/800], Step[0300/0626], Avg Loss: 0.9877 +INFO:local_logger:Epoch[002/800], Step[0300/0626], Avg Loss: 0.9880 +INFO:local_logger:Epoch[002/800], Step[0300/0626], Avg Loss: 0.9878 +INFO:local_logger:Epoch[002/800], Step[0300/0626], Avg Loss: 0.9872 +INFO:local_logger:Epoch[002/800], Step[0400/0626], Avg Loss: 0.9872 +INFO:local_logger:Epoch[002/800], Step[0400/0626], Avg Loss: 0.9870 +INFO:local_logger:Epoch[002/800], Step[0400/0626], Avg Loss: 0.9867 +INFO:local_logger:Epoch[002/800], Step[0400/0626], Avg Loss: 0.9867 +INFO:local_logger:Epoch[002/800], Step[0400/0626], Avg Loss: 0.9870 +INFO:local_logger:Epoch[002/800], Step[0400/0626], Avg Loss: 0.9871 +INFO:local_logger:Epoch[002/800], Step[0400/0626], Avg Loss: 0.9870 +INFO:local_logger:Epoch[002/800], Step[0400/0626], Avg Loss: 0.9868 +INFO:master_logger:Epoch[002/800], Step[0400/0626], Avg Loss: 0.9869 +INFO:local_logger:Epoch[002/800], Step[0500/0626], Avg Loss: 0.9862 +INFO:local_logger:Epoch[002/800], Step[0500/0626], Avg Loss: 0.9865 +INFO:local_logger:Epoch[002/800], Step[0500/0626], Avg Loss: 0.9861 +INFO:local_logger:Epoch[002/800], Step[0500/0626], Avg Loss: 0.9864 +INFO:local_logger:Epoch[002/800], Step[0500/0626], Avg Loss: 0.9863 +INFO:local_logger:Epoch[002/800], Step[0500/0626], Avg Loss: 0.9861 
+INFO:local_logger:Epoch[002/800], Step[0500/0626], Avg Loss: 0.9862 +INFO:local_logger:Epoch[002/800], Step[0500/0626], Avg Loss: 0.9863 +INFO:master_logger:Epoch[002/800], Step[0500/0626], Avg Loss: 0.9863 +INFO:local_logger:Epoch[002/800], Step[0600/0626], Avg Loss: 0.9856 +INFO:local_logger:Epoch[002/800], Step[0600/0626], Avg Loss: 0.9858 +INFO:local_logger:Epoch[002/800], Step[0600/0626], Avg Loss: 0.9858 +INFO:local_logger:Epoch[002/800], Step[0600/0626], Avg Loss: 0.9855 +INFO:local_logger:Epoch[002/800], Step[0600/0626], Avg Loss: 0.9855 +INFO:local_logger:Epoch[002/800], Step[0600/0626], Avg Loss: 0.9856 +INFO:master_logger:Epoch[002/800], Step[0600/0626], Avg Loss: 0.9856 +INFO:local_logger:Epoch[002/800], Step[0600/0626], Avg Loss: 0.9856 +INFO:local_logger:Epoch[002/800], Step[0600/0626], Avg Loss: 0.9856 +INFO:local_logger:----- Epoch[002/800], Train Loss: 0.9857, time: 891.36 +INFO:local_logger:Now training epoch 3. LR=0.000012 +INFO:local_logger:----- Epoch[002/800], Train Loss: 0.9855, time: 891.28 +INFO:local_logger:Now training epoch 3. LR=0.000012 +INFO:local_logger:----- Epoch[002/800], Train Loss: 0.9853, time: 891.70 +INFO:local_logger:Now training epoch 3. LR=0.000012 +INFO:local_logger:----- Epoch[002/800], Train Loss: 0.9855, time: 891.46 +INFO:local_logger:Now training epoch 3. LR=0.000012 +INFO:local_logger:----- Epoch[002/800], Train Loss: 0.9853, time: 891.66 +INFO:local_logger:Now training epoch 3. LR=0.000012 +INFO:local_logger:----- Epoch[002/800], Train Loss: 0.9855, time: 891.47 +INFO:local_logger:Now training epoch 3. LR=0.000012 +INFO:local_logger:----- Epoch[002/800], Train Loss: 0.9857, time: 891.56 +INFO:local_logger:Now training epoch 3. LR=0.000012 +INFO:local_logger:----- Epoch[002/800], Train Loss: 0.9854, time: 887.62 +INFO:master_logger:----- Epoch[002/800], Train Loss: 0.9855, time: 887.62 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-2-Loss-0.9854484576284688.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-2-Loss-0.9854484576284688.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-2-Loss-0.9854484576284688.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-2-Loss-0.9854484576284688.pdopt +INFO:local_logger:Now training epoch 3. LR=0.000012 +INFO:master_logger:Now training epoch 3. 
LR=0.000012 +INFO:local_logger:Epoch[003/800], Step[0000/0626], Avg Loss: 0.9859 +INFO:local_logger:Epoch[003/800], Step[0000/0626], Avg Loss: 0.9784 +INFO:local_logger:Epoch[003/800], Step[0000/0626], Avg Loss: 0.9751 +INFO:master_logger:Epoch[003/800], Step[0000/0626], Avg Loss: 0.9809 +INFO:local_logger:Epoch[003/800], Step[0000/0626], Avg Loss: 0.9834 +INFO:local_logger:Epoch[003/800], Step[0000/0626], Avg Loss: 0.9795 +INFO:local_logger:Epoch[003/800], Step[0000/0626], Avg Loss: 0.9809 +INFO:local_logger:Epoch[003/800], Step[0000/0626], Avg Loss: 0.9833 +INFO:local_logger:Epoch[003/800], Step[0000/0626], Avg Loss: 0.9810 +INFO:local_logger:Epoch[003/800], Step[0100/0626], Avg Loss: 0.9816 +INFO:local_logger:Epoch[003/800], Step[0100/0626], Avg Loss: 0.9810 +INFO:local_logger:Epoch[003/800], Step[0100/0626], Avg Loss: 0.9814 +INFO:local_logger:Epoch[003/800], Step[0100/0626], Avg Loss: 0.9810 +INFO:master_logger:Epoch[003/800], Step[0100/0626], Avg Loss: 0.9813 +INFO:local_logger:Epoch[003/800], Step[0100/0626], Avg Loss: 0.9813 +INFO:local_logger:Epoch[003/800], Step[0100/0626], Avg Loss: 0.9814 +INFO:local_logger:Epoch[003/800], Step[0100/0626], Avg Loss: 0.9813 +INFO:local_logger:Epoch[003/800], Step[0100/0626], Avg Loss: 0.9814 +INFO:local_logger:Epoch[003/800], Step[0200/0626], Avg Loss: 0.9807 +INFO:local_logger:Epoch[003/800], Step[0200/0626], Avg Loss: 0.9808 +INFO:local_logger:Epoch[003/800], Step[0200/0626], Avg Loss: 0.9808 +INFO:local_logger:Epoch[003/800], Step[0200/0626], Avg Loss: 0.9806 +INFO:local_logger:Epoch[003/800], Step[0200/0626], Avg Loss: 0.9806 +INFO:local_logger:Epoch[003/800], Step[0200/0626], Avg Loss: 0.9804 +INFO:local_logger:Epoch[003/800], Step[0200/0626], Avg Loss: 0.9804 +INFO:local_logger:Epoch[003/800], Step[0200/0626], Avg Loss: 0.9804 +INFO:master_logger:Epoch[003/800], Step[0200/0626], Avg Loss: 0.9806 +INFO:local_logger:Epoch[003/800], Step[0300/0626], Avg Loss: 0.9797 +INFO:local_logger:Epoch[003/800], Step[0300/0626], Avg Loss: 0.9799 +INFO:local_logger:Epoch[003/800], Step[0300/0626], Avg Loss: 0.9799 +INFO:local_logger:Epoch[003/800], Step[0300/0626], Avg Loss: 0.9802 +INFO:local_logger:Epoch[003/800], Step[0300/0626], Avg Loss: 0.9797 +INFO:master_logger:Epoch[003/800], Step[0300/0626], Avg Loss: 0.9799 +INFO:local_logger:Epoch[003/800], Step[0300/0626], Avg Loss: 0.9798 +INFO:local_logger:Epoch[003/800], Step[0300/0626], Avg Loss: 0.9799 +INFO:local_logger:Epoch[003/800], Step[0300/0626], Avg Loss: 0.9798 +INFO:local_logger:Epoch[003/800], Step[0400/0626], Avg Loss: 0.9791 +INFO:local_logger:Epoch[003/800], Step[0400/0626], Avg Loss: 0.9790 +INFO:local_logger:Epoch[003/800], Step[0400/0626], Avg Loss: 0.9793 +INFO:local_logger:Epoch[003/800], Step[0400/0626], Avg Loss: 0.9789 +INFO:local_logger:Epoch[003/800], Step[0400/0626], Avg Loss: 0.9789 +INFO:master_logger:Epoch[003/800], Step[0400/0626], Avg Loss: 0.9790 +INFO:local_logger:Epoch[003/800], Step[0400/0626], Avg Loss: 0.9789 +INFO:local_logger:Epoch[003/800], Step[0400/0626], Avg Loss: 0.9789 +INFO:local_logger:Epoch[003/800], Step[0400/0626], Avg Loss: 0.9791 +INFO:local_logger:Epoch[003/800], Step[0500/0626], Avg Loss: 0.9780 +INFO:local_logger:Epoch[003/800], Step[0500/0626], Avg Loss: 0.9782 +INFO:local_logger:Epoch[003/800], Step[0500/0626], Avg Loss: 0.9782 +INFO:local_logger:Epoch[003/800], Step[0500/0626], Avg Loss: 0.9783 +INFO:master_logger:Epoch[003/800], Step[0500/0626], Avg Loss: 0.9782 +INFO:local_logger:Epoch[003/800], Step[0500/0626], Avg Loss: 0.9781 
+INFO:local_logger:Epoch[003/800], Step[0500/0626], Avg Loss: 0.9786 +INFO:local_logger:Epoch[003/800], Step[0500/0626], Avg Loss: 0.9783 +INFO:local_logger:Epoch[003/800], Step[0500/0626], Avg Loss: 0.9781 +INFO:local_logger:Epoch[003/800], Step[0600/0626], Avg Loss: 0.9776 +INFO:local_logger:Epoch[003/800], Step[0600/0626], Avg Loss: 0.9776 +INFO:local_logger:Epoch[003/800], Step[0600/0626], Avg Loss: 0.9774 +INFO:local_logger:Epoch[003/800], Step[0600/0626], Avg Loss: 0.9774 +INFO:local_logger:Epoch[003/800], Step[0600/0626], Avg Loss: 0.9778 +INFO:master_logger:Epoch[003/800], Step[0600/0626], Avg Loss: 0.9775 +INFO:local_logger:Epoch[003/800], Step[0600/0626], Avg Loss: 0.9774 +INFO:local_logger:Epoch[003/800], Step[0600/0626], Avg Loss: 0.9773 +INFO:local_logger:Epoch[003/800], Step[0600/0626], Avg Loss: 0.9773 +INFO:local_logger:----- Epoch[003/800], Train Loss: 0.9774, time: 893.09 +INFO:local_logger:Now training epoch 4. LR=0.000016 +INFO:local_logger:----- Epoch[003/800], Train Loss: 0.9772, time: 893.23 +INFO:local_logger:Now training epoch 4. LR=0.000016 +INFO:local_logger:----- Epoch[003/800], Train Loss: 0.9776, time: 893.27 +INFO:local_logger:Now training epoch 4. LR=0.000016 +INFO:local_logger:----- Epoch[003/800], Train Loss: 0.9771, time: 893.31 +INFO:local_logger:Now training epoch 4. LR=0.000016 +INFO:local_logger:----- Epoch[003/800], Train Loss: 0.9772, time: 893.74 +INFO:local_logger:Now training epoch 4. LR=0.000016 +INFO:local_logger:----- Epoch[003/800], Train Loss: 0.9772, time: 889.63 +INFO:master_logger:----- Epoch[003/800], Train Loss: 0.9773, time: 889.63 +INFO:local_logger:----- Epoch[003/800], Train Loss: 0.9773, time: 893.40 +INFO:local_logger:Now training epoch 4. LR=0.000016 +INFO:local_logger:----- Epoch[003/800], Train Loss: 0.9775, time: 893.56 +INFO:local_logger:Now training epoch 4. LR=0.000016 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-3-Loss-0.9772286424963117.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-3-Loss-0.9772286424963117.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-3-Loss-0.9772286424963117.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-3-Loss-0.9772286424963117.pdopt +INFO:local_logger:Now training epoch 4. LR=0.000016 +INFO:master_logger:Now training epoch 4. 
LR=0.000016 +INFO:local_logger:Epoch[004/800], Step[0000/0626], Avg Loss: 0.9778 +INFO:local_logger:Epoch[004/800], Step[0000/0626], Avg Loss: 0.9751 +INFO:local_logger:Epoch[004/800], Step[0000/0626], Avg Loss: 0.9713 +INFO:master_logger:Epoch[004/800], Step[0000/0626], Avg Loss: 0.9734 +INFO:local_logger:Epoch[004/800], Step[0000/0626], Avg Loss: 0.9753 +INFO:local_logger:Epoch[004/800], Step[0000/0626], Avg Loss: 0.9753 +INFO:local_logger:Epoch[004/800], Step[0000/0626], Avg Loss: 0.9704 +INFO:local_logger:Epoch[004/800], Step[0000/0626], Avg Loss: 0.9683 +INFO:local_logger:Epoch[004/800], Step[0000/0626], Avg Loss: 0.9740 +INFO:local_logger:Epoch[004/800], Step[0100/0626], Avg Loss: 0.9727 +INFO:local_logger:Epoch[004/800], Step[0100/0626], Avg Loss: 0.9724 +INFO:local_logger:Epoch[004/800], Step[0100/0626], Avg Loss: 0.9730 +INFO:master_logger:Epoch[004/800], Step[0100/0626], Avg Loss: 0.9728 +INFO:local_logger:Epoch[004/800], Step[0100/0626], Avg Loss: 0.9731 +INFO:local_logger:Epoch[004/800], Step[0100/0626], Avg Loss: 0.9730 +INFO:local_logger:Epoch[004/800], Step[0100/0626], Avg Loss: 0.9729 +INFO:local_logger:Epoch[004/800], Step[0100/0626], Avg Loss: 0.9730 +INFO:local_logger:Epoch[004/800], Step[0100/0626], Avg Loss: 0.9726 +INFO:local_logger:Epoch[004/800], Step[0200/0626], Avg Loss: 0.9724 +INFO:local_logger:Epoch[004/800], Step[0200/0626], Avg Loss: 0.9725 +INFO:local_logger:Epoch[004/800], Step[0200/0626], Avg Loss: 0.9721 +INFO:local_logger:Epoch[004/800], Step[0200/0626], Avg Loss: 0.9721 +INFO:local_logger:Epoch[004/800], Step[0200/0626], Avg Loss: 0.9721 +INFO:local_logger:Epoch[004/800], Step[0200/0626], Avg Loss: 0.9720 +INFO:local_logger:Epoch[004/800], Step[0200/0626], Avg Loss: 0.9722 +INFO:local_logger:Epoch[004/800], Step[0200/0626], Avg Loss: 0.9724 +INFO:master_logger:Epoch[004/800], Step[0200/0626], Avg Loss: 0.9722 +INFO:local_logger:Epoch[004/800], Step[0300/0626], Avg Loss: 0.9715 +INFO:local_logger:Epoch[004/800], Step[0300/0626], Avg Loss: 0.9717 +INFO:local_logger:Epoch[004/800], Step[0300/0626], Avg Loss: 0.9717 +INFO:local_logger:Epoch[004/800], Step[0300/0626], Avg Loss: 0.9720 +INFO:master_logger:Epoch[004/800], Step[0300/0626], Avg Loss: 0.9717 +INFO:local_logger:Epoch[004/800], Step[0300/0626], Avg Loss: 0.9712 +INFO:local_logger:Epoch[004/800], Step[0300/0626], Avg Loss: 0.9718 +INFO:local_logger:Epoch[004/800], Step[0300/0626], Avg Loss: 0.9718 +INFO:local_logger:Epoch[004/800], Step[0300/0626], Avg Loss: 0.9716 +INFO:local_logger:Epoch[004/800], Step[0400/0626], Avg Loss: 0.9712 +INFO:local_logger:Epoch[004/800], Step[0400/0626], Avg Loss: 0.9711 +INFO:local_logger:Epoch[004/800], Step[0400/0626], Avg Loss: 0.9711 +INFO:local_logger:Epoch[004/800], Step[0400/0626], Avg Loss: 0.9715 +INFO:local_logger:Epoch[004/800], Step[0400/0626], Avg Loss: 0.9712 +INFO:local_logger:Epoch[004/800], Step[0400/0626], Avg Loss: 0.9709 +INFO:master_logger:Epoch[004/800], Step[0400/0626], Avg Loss: 0.9712 +INFO:local_logger:Epoch[004/800], Step[0400/0626], Avg Loss: 0.9714 +INFO:local_logger:Epoch[004/800], Step[0400/0626], Avg Loss: 0.9714 +INFO:local_logger:Epoch[004/800], Step[0500/0626], Avg Loss: 0.9707 +INFO:local_logger:Epoch[004/800], Step[0500/0626], Avg Loss: 0.9706 +INFO:local_logger:Epoch[004/800], Step[0500/0626], Avg Loss: 0.9709 +INFO:local_logger:Epoch[004/800], Step[0500/0626], Avg Loss: 0.9709 +INFO:local_logger:Epoch[004/800], Step[0500/0626], Avg Loss: 0.9707 +INFO:local_logger:Epoch[004/800], Step[0500/0626], Avg Loss: 0.9708 
+INFO:local_logger:Epoch[004/800], Step[0500/0626], Avg Loss: 0.9705 +INFO:master_logger:Epoch[004/800], Step[0500/0626], Avg Loss: 0.9707 +INFO:local_logger:Epoch[004/800], Step[0500/0626], Avg Loss: 0.9706 +INFO:local_logger:Epoch[004/800], Step[0600/0626], Avg Loss: 0.9701 +INFO:local_logger:Epoch[004/800], Step[0600/0626], Avg Loss: 0.9704 +INFO:local_logger:Epoch[004/800], Step[0600/0626], Avg Loss: 0.9703 +INFO:local_logger:Epoch[004/800], Step[0600/0626], Avg Loss: 0.9701 +INFO:local_logger:Epoch[004/800], Step[0600/0626], Avg Loss: 0.9703 +INFO:local_logger:Epoch[004/800], Step[0600/0626], Avg Loss: 0.9701 +INFO:local_logger:Epoch[004/800], Step[0600/0626], Avg Loss: 0.9704 +INFO:master_logger:Epoch[004/800], Step[0600/0626], Avg Loss: 0.9703 +INFO:local_logger:Epoch[004/800], Step[0600/0626], Avg Loss: 0.9704 +INFO:local_logger:----- Epoch[004/800], Train Loss: 0.9702, time: 854.73 +INFO:local_logger:Now training epoch 5. LR=0.000020 +INFO:local_logger:----- Epoch[004/800], Train Loss: 0.9703, time: 851.06 +INFO:master_logger:----- Epoch[004/800], Train Loss: 0.9702, time: 851.06 +INFO:local_logger:----- Epoch[004/800], Train Loss: 0.9703, time: 854.82 +INFO:local_logger:Now training epoch 5. LR=0.000020 +INFO:local_logger:----- Epoch[004/800], Train Loss: 0.9700, time: 855.11 +INFO:local_logger:Now training epoch 5. LR=0.000020 +INFO:local_logger:----- Epoch[004/800], Train Loss: 0.9700, time: 855.36 +INFO:local_logger:Now training epoch 5. LR=0.000020 +INFO:local_logger:----- Epoch[004/800], Train Loss: 0.9703, time: 855.48 +INFO:local_logger:----- Epoch[004/800], Train Loss: 0.9700, time: 855.31 +INFO:local_logger:Now training epoch 5. LR=0.000020 +INFO:local_logger:Now training epoch 5. LR=0.000020 +INFO:local_logger:----- Epoch[004/800], Train Loss: 0.9702, time: 855.19 +INFO:local_logger:Now training epoch 5. LR=0.000020 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-4-Loss-0.97028241060033.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-4-Loss-0.97028241060033.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-4-Loss-0.97028241060033.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-4-Loss-0.97028241060033.pdopt +INFO:local_logger:Now training epoch 5. LR=0.000020 +INFO:master_logger:Now training epoch 5. 
LR=0.000020 +INFO:local_logger:Epoch[005/800], Step[0000/0626], Avg Loss: 0.9655 +INFO:local_logger:Epoch[005/800], Step[0000/0626], Avg Loss: 0.9667 +INFO:local_logger:Epoch[005/800], Step[0000/0626], Avg Loss: 0.9651 +INFO:master_logger:Epoch[005/800], Step[0000/0626], Avg Loss: 0.9667 +INFO:local_logger:Epoch[005/800], Step[0000/0626], Avg Loss: 0.9671 +INFO:local_logger:Epoch[005/800], Step[0000/0626], Avg Loss: 0.9619 +INFO:local_logger:Epoch[005/800], Step[0000/0626], Avg Loss: 0.9712 +INFO:local_logger:Epoch[005/800], Step[0000/0626], Avg Loss: 0.9685 +INFO:local_logger:Epoch[005/800], Step[0000/0626], Avg Loss: 0.9674 +INFO:local_logger:Epoch[005/800], Step[0100/0626], Avg Loss: 0.9675 +INFO:local_logger:Epoch[005/800], Step[0100/0626], Avg Loss: 0.9674 +INFO:local_logger:Epoch[005/800], Step[0100/0626], Avg Loss: 0.9672 +INFO:local_logger:Epoch[005/800], Step[0100/0626], Avg Loss: 0.9682 +INFO:local_logger:Epoch[005/800], Step[0100/0626], Avg Loss: 0.9673 +INFO:local_logger:Epoch[005/800], Step[0100/0626], Avg Loss: 0.9671 +INFO:local_logger:Epoch[005/800], Step[0100/0626], Avg Loss: 0.9679 +INFO:master_logger:Epoch[005/800], Step[0100/0626], Avg Loss: 0.9675 +INFO:local_logger:Epoch[005/800], Step[0100/0626], Avg Loss: 0.9672 +INFO:local_logger:Epoch[005/800], Step[0200/0626], Avg Loss: 0.9670 +INFO:local_logger:Epoch[005/800], Step[0200/0626], Avg Loss: 0.9665 +INFO:local_logger:Epoch[005/800], Step[0200/0626], Avg Loss: 0.9669 +INFO:local_logger:Epoch[005/800], Step[0200/0626], Avg Loss: 0.9669 +INFO:local_logger:Epoch[005/800], Step[0200/0626], Avg Loss: 0.9666 +INFO:local_logger:Epoch[005/800], Step[0200/0626], Avg Loss: 0.9673 +INFO:local_logger:Epoch[005/800], Step[0200/0626], Avg Loss: 0.9672 +INFO:local_logger:Epoch[005/800], Step[0200/0626], Avg Loss: 0.9671 +INFO:master_logger:Epoch[005/800], Step[0200/0626], Avg Loss: 0.9669 +INFO:local_logger:Epoch[005/800], Step[0300/0626], Avg Loss: 0.9661 +INFO:local_logger:Epoch[005/800], Step[0300/0626], Avg Loss: 0.9663 +INFO:local_logger:Epoch[005/800], Step[0300/0626], Avg Loss: 0.9665 +INFO:local_logger:Epoch[005/800], Step[0300/0626], Avg Loss: 0.9665 +INFO:local_logger:Epoch[005/800], Step[0300/0626], Avg Loss: 0.9664 +INFO:local_logger:Epoch[005/800], Step[0300/0626], Avg Loss: 0.9667 +INFO:master_logger:Epoch[005/800], Step[0300/0626], Avg Loss: 0.9665 +INFO:local_logger:Epoch[005/800], Step[0300/0626], Avg Loss: 0.9668 +INFO:local_logger:Epoch[005/800], Step[0300/0626], Avg Loss: 0.9665 +INFO:local_logger:Epoch[005/800], Step[0400/0626], Avg Loss: 0.9661 +INFO:master_logger:Epoch[005/800], Step[0400/0626], Avg Loss: 0.9660 +INFO:local_logger:Epoch[005/800], Step[0400/0626], Avg Loss: 0.9662 +INFO:local_logger:Epoch[005/800], Step[0400/0626], Avg Loss: 0.9660 +INFO:local_logger:Epoch[005/800], Step[0400/0626], Avg Loss: 0.9661 +INFO:local_logger:Epoch[005/800], Step[0400/0626], Avg Loss: 0.9658 +INFO:local_logger:Epoch[005/800], Step[0400/0626], Avg Loss: 0.9660 +INFO:local_logger:Epoch[005/800], Step[0400/0626], Avg Loss: 0.9658 +INFO:local_logger:Epoch[005/800], Step[0400/0626], Avg Loss: 0.9660 +INFO:local_logger:Epoch[005/800], Step[0500/0626], Avg Loss: 0.9655 +INFO:local_logger:Epoch[005/800], Step[0500/0626], Avg Loss: 0.9655 +INFO:local_logger:Epoch[005/800], Step[0500/0626], Avg Loss: 0.9657 +INFO:local_logger:Epoch[005/800], Step[0500/0626], Avg Loss: 0.9657 +INFO:local_logger:Epoch[005/800], Step[0500/0626], Avg Loss: 0.9656 +INFO:local_logger:Epoch[005/800], Step[0500/0626], Avg Loss: 0.9656 
+INFO:local_logger:Epoch[005/800], Step[0500/0626], Avg Loss: 0.9657 +INFO:master_logger:Epoch[005/800], Step[0500/0626], Avg Loss: 0.9656 +INFO:local_logger:Epoch[005/800], Step[0500/0626], Avg Loss: 0.9654 +INFO:local_logger:Epoch[005/800], Step[0600/0626], Avg Loss: 0.9651 +INFO:local_logger:Epoch[005/800], Step[0600/0626], Avg Loss: 0.9653 +INFO:local_logger:Epoch[005/800], Step[0600/0626], Avg Loss: 0.9653 +INFO:local_logger:Epoch[005/800], Step[0600/0626], Avg Loss: 0.9654 +INFO:local_logger:Epoch[005/800], Step[0600/0626], Avg Loss: 0.9652 +INFO:local_logger:Epoch[005/800], Step[0600/0626], Avg Loss: 0.9652 +INFO:local_logger:Epoch[005/800], Step[0600/0626], Avg Loss: 0.9649 +INFO:master_logger:Epoch[005/800], Step[0600/0626], Avg Loss: 0.9652 +INFO:local_logger:Epoch[005/800], Step[0600/0626], Avg Loss: 0.9651 +INFO:local_logger:----- Epoch[005/800], Train Loss: 0.9648, time: 889.02 +INFO:local_logger:Now training epoch 6. LR=0.000023 +INFO:local_logger:----- Epoch[005/800], Train Loss: 0.9651, time: 889.10 +INFO:local_logger:Now training epoch 6. LR=0.000023 +INFO:local_logger:----- Epoch[005/800], Train Loss: 0.9652, time: 889.53 +INFO:local_logger:Now training epoch 6. LR=0.000023 +INFO:local_logger:----- Epoch[005/800], Train Loss: 0.9652, time: 885.85 +INFO:master_logger:----- Epoch[005/800], Train Loss: 0.9651, time: 885.85 +INFO:local_logger:----- Epoch[005/800], Train Loss: 0.9651, time: 889.20 +INFO:local_logger:Now training epoch 6. LR=0.000023 +INFO:local_logger:----- Epoch[005/800], Train Loss: 0.9650, time: 889.56 +INFO:local_logger:Now training epoch 6. LR=0.000023 +INFO:local_logger:----- Epoch[005/800], Train Loss: 0.9650, time: 889.67 +INFO:local_logger:Now training epoch 6. LR=0.000023 +INFO:local_logger:----- Epoch[005/800], Train Loss: 0.9653, time: 890.15 +INFO:local_logger:Now training epoch 6. LR=0.000023 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-5-Loss-0.9652042168475674.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-5-Loss-0.9652042168475674.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-5-Loss-0.9652042168475674.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-5-Loss-0.9652042168475674.pdopt +INFO:local_logger:Now training epoch 6. LR=0.000023 +INFO:master_logger:Now training epoch 6. 
+INFO:master_logger:Epoch[006/800], Step[0000/0626], Avg Loss: 0.9604
+INFO:master_logger:Epoch[006/800], Step[0100/0626], Avg Loss: 0.9626
+INFO:master_logger:Epoch[006/800], Step[0200/0626], Avg Loss: 0.9622
+INFO:master_logger:Epoch[006/800], Step[0300/0626], Avg Loss: 0.9618
+INFO:master_logger:Epoch[006/800], Step[0400/0626], Avg Loss: 0.9613
+INFO:master_logger:Epoch[006/800], Step[0500/0626], Avg Loss: 0.9610
+INFO:master_logger:Epoch[006/800], Step[0600/0626], Avg Loss: 0.9606
+INFO:master_logger:----- Epoch[006/800], Train Loss: 0.9605, time: 857.72
+INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-6-Loss-0.9604088297024008.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-6-Loss-0.9604088297024008.pdopt
+INFO:master_logger:Now training epoch 7. LR=0.000027
+INFO:master_logger:Epoch[007/800], Step[0000/0626], Avg Loss: 0.9581
+INFO:master_logger:Epoch[007/800], Step[0100/0626], Avg Loss: 0.9582
+INFO:master_logger:Epoch[007/800], Step[0200/0626], Avg Loss: 0.9578
+INFO:master_logger:Epoch[007/800], Step[0300/0626], Avg Loss: 0.9573
+INFO:master_logger:Epoch[007/800], Step[0400/0626], Avg Loss: 0.9568
+INFO:master_logger:Epoch[007/800], Step[0500/0626], Avg Loss: 0.9564
+INFO:master_logger:Epoch[007/800], Step[0600/0626], Avg Loss: 0.9560
+INFO:master_logger:----- Epoch[007/800], Train Loss: 0.9559, time: 885.04
+INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-7-Loss-0.9557424400537671.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-7-Loss-0.9557424400537671.pdopt
+INFO:master_logger:Now training epoch 8. LR=0.000031
+INFO:master_logger:Epoch[008/800], Step[0000/0626], Avg Loss: 0.9506
+INFO:master_logger:Epoch[008/800], Step[0100/0626], Avg Loss: 0.9532
+INFO:master_logger:Epoch[008/800], Step[0200/0626], Avg Loss: 0.9527
+INFO:master_logger:Epoch[008/800], Step[0300/0626], Avg Loss: 0.9523
+INFO:master_logger:Epoch[008/800], Step[0400/0626], Avg Loss: 0.9517
+INFO:master_logger:Epoch[008/800], Step[0500/0626], Avg Loss: 0.9512
+INFO:master_logger:Epoch[008/800], Step[0600/0626], Avg Loss: 0.9506
+INFO:master_logger:----- Epoch[008/800], Train Loss: 0.9505, time: 852.20
+INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-8-Loss-0.950418085337367.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-8-Loss-0.950418085337367.pdopt
+INFO:master_logger:Now training epoch 9. LR=0.000035
+INFO:master_logger:Epoch[009/800], Step[0000/0626], Avg Loss: 0.9494
+INFO:master_logger:Epoch[009/800], Step[0100/0626], Avg Loss: 0.9466
+INFO:master_logger:Epoch[009/800], Step[0200/0626], Avg Loss: 0.9462
+INFO:master_logger:Epoch[009/800], Step[0300/0626], Avg Loss: 0.9452
+INFO:master_logger:Epoch[009/800], Step[0400/0626], Avg Loss: 0.9444
+INFO:master_logger:Epoch[009/800], Step[0500/0626], Avg Loss: 0.9435
+INFO:master_logger:Epoch[009/800], Step[0600/0626], Avg Loss: 0.9426
+INFO:master_logger:----- Epoch[009/800], Train Loss: 0.9424, time: 886.67
+INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-9-Loss-0.9425096387053156.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-9-Loss-0.9425096387053156.pdopt
+INFO:master_logger:Now training epoch 10. LR=0.000038
+INFO:master_logger:Epoch[010/800], Step[0000/0626], Avg Loss: 0.9385
+INFO:master_logger:Epoch[010/800], Step[0100/0626], Avg Loss: 0.9362
+INFO:master_logger:Epoch[010/800], Step[0200/0626], Avg Loss: 0.9357
+INFO:master_logger:Epoch[010/800], Step[0300/0626], Avg Loss: 0.9348
+INFO:master_logger:Epoch[010/800], Step[0400/0626], Avg Loss: 0.9338
+INFO:master_logger:Epoch[010/800], Step[0500/0626], Avg Loss: 0.9329
+INFO:master_logger:Epoch[010/800], Step[0600/0626], Avg Loss: 0.9321
+INFO:master_logger:----- Epoch[010/800], Train Loss: 0.9318, time: 854.67
+INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-10-Loss-0.9318290638491608.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-10-Loss-0.9318290638491608.pdopt
+INFO:master_logger:Now training epoch 11. LR=0.000042
+INFO:master_logger:Epoch[011/800], Step[0000/0626], Avg Loss: 0.9253
+INFO:master_logger:Epoch[011/800], Step[0100/0626], Avg Loss: 0.9254
+INFO:master_logger:Epoch[011/800], Step[0200/0626], Avg Loss: 0.9252
+INFO:master_logger:Epoch[011/800], Step[0300/0626], Avg Loss: 0.9242
+INFO:master_logger:Epoch[011/800], Step[0400/0626], Avg Loss: 0.9232
+INFO:master_logger:Epoch[011/800], Step[0500/0626], Avg Loss: 0.9222
+INFO:master_logger:Epoch[011/800], Step[0600/0626], Avg Loss: 0.9211
+INFO:master_logger:----- Epoch[011/800], Train Loss: 0.9209, time: 884.67
+INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-11-Loss-0.9209032249693648.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-11-Loss-0.9209032249693648.pdopt
+INFO:master_logger:Now training epoch 12. LR=0.000046
+INFO:master_logger:Epoch[012/800], Step[0000/0626], Avg Loss: 0.9125
+INFO:master_logger:Epoch[012/800], Step[0100/0626], Avg Loss: 0.9149
+INFO:master_logger:Epoch[012/800], Step[0200/0626], Avg Loss: 0.9142
+INFO:master_logger:Epoch[012/800], Step[0300/0626], Avg Loss: 0.9130
+INFO:master_logger:Epoch[012/800], Step[0400/0626], Avg Loss: 0.9119
+INFO:master_logger:Epoch[012/800], Step[0500/0626], Avg Loss: 0.9111
+INFO:master_logger:Epoch[012/800], Step[0600/0626], Avg Loss: 0.9102
+INFO:master_logger:----- Epoch[012/800], Train Loss: 0.9101, time: 847.34
+INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-12-Loss-0.9097320030754859.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-12-Loss-0.9097320030754859.pdopt
+INFO:master_logger:Now training epoch 13. LR=0.000049
+INFO:master_logger:Epoch[013/800], Step[0000/0626], Avg Loss: 0.9072
+INFO:master_logger:Epoch[013/800], Step[0100/0626], Avg Loss: 0.9044
+INFO:master_logger:Epoch[013/800], Step[0200/0626], Avg Loss: 0.9039
+INFO:master_logger:Epoch[013/800], Step[0300/0626], Avg Loss: 0.9028
+INFO:master_logger:Epoch[013/800], Step[0400/0626], Avg Loss: 0.9019
+INFO:master_logger:Epoch[013/800], Step[0500/0626], Avg Loss: 0.9013
+INFO:master_logger:Epoch[013/800], Step[0600/0626], Avg Loss: 0.9003
+INFO:master_logger:----- Epoch[013/800], Train Loss: 0.9000, time: 879.83
+INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-13-Loss-0.9000374903566999.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-13-Loss-0.9000374903566999.pdopt
+INFO:master_logger:Now training epoch 14. LR=0.000053
+INFO:master_logger:Epoch[014/800], Step[0000/0626], Avg Loss: 0.8921
+INFO:master_logger:Epoch[014/800], Step[0100/0626], Avg Loss: 0.8951
+INFO:master_logger:Epoch[014/800], Step[0200/0626], Avg Loss: 0.8931
+INFO:master_logger:Epoch[014/800], Step[0300/0626], Avg Loss: 0.8922
+INFO:master_logger:Epoch[014/800], Step[0400/0626], Avg Loss: 0.8910
+INFO:master_logger:Epoch[014/800], Step[0500/0626], Avg Loss: 0.8903
+INFO:master_logger:Epoch[014/800], Step[0600/0626], Avg Loss: 0.8896
+INFO:master_logger:----- Epoch[014/800], Train Loss: 0.8893, time: 842.90
+INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-14-Loss-0.8892871914493445.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-14-Loss-0.8892871914493445.pdopt
+INFO:master_logger:Now training epoch 15. LR=0.000057
+INFO:master_logger:Epoch[015/800], Step[0000/0626], Avg Loss: 0.8812
+INFO:master_logger:Epoch[015/800], Step[0100/0626], Avg Loss: 0.8860
+INFO:master_logger:Epoch[015/800], Step[0200/0626], Avg Loss: 0.8846
+INFO:master_logger:Epoch[015/800], Step[0300/0626], Avg Loss: 0.8831
+INFO:master_logger:Epoch[015/800], Step[0400/0626], Avg Loss: 0.8824
+INFO:master_logger:Epoch[015/800], Step[0500/0626], Avg Loss: 0.8816
+INFO:master_logger:Epoch[015/800], Step[0600/0626], Avg Loss: 0.8806
+INFO:master_logger:----- Epoch[015/800], Train Loss: 0.8804, time: 893.90
+INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-15-Loss-0.8804958925234925.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-15-Loss-0.8804958925234925.pdopt
+INFO:master_logger:Now training epoch 16. LR=0.000061
+INFO:master_logger:Epoch[016/800], Step[0000/0626], Avg Loss: 0.8774
+INFO:master_logger:Epoch[016/800], Step[0100/0626], Avg Loss: 0.8738
+INFO:master_logger:Epoch[016/800], Step[0200/0626], Avg Loss: 0.8728
+INFO:master_logger:Epoch[016/800], Step[0300/0626], Avg Loss: 0.8719
+INFO:master_logger:Epoch[016/800], Step[0400/0626], Avg Loss: 0.8712
+INFO:master_logger:Epoch[016/800], Step[0500/0626], Avg Loss: 0.8707
+INFO:master_logger:Epoch[016/800], Step[0600/0626], Avg Loss: 0.8698
+INFO:master_logger:----- Epoch[016/800], Train Loss: 0.8696, time: 859.59
+INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-16-Loss-0.8694310493630203.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-16-Loss-0.8694310493630203.pdopt
+INFO:master_logger:Now training epoch 17. LR=0.000064
LR=0.000064 +INFO:local_logger:Epoch[017/800], Step[0000/0626], Avg Loss: 0.8663 +INFO:local_logger:Epoch[017/800], Step[0000/0626], Avg Loss: 0.8709 +INFO:local_logger:Epoch[017/800], Step[0000/0626], Avg Loss: 0.8677 +INFO:local_logger:Epoch[017/800], Step[0000/0626], Avg Loss: 0.8526 +INFO:local_logger:Epoch[017/800], Step[0000/0626], Avg Loss: 0.8679 +INFO:local_logger:Epoch[017/800], Step[0000/0626], Avg Loss: 0.8632 +INFO:master_logger:Epoch[017/800], Step[0000/0626], Avg Loss: 0.8658 +INFO:local_logger:Epoch[017/800], Step[0000/0626], Avg Loss: 0.8670 +INFO:local_logger:Epoch[017/800], Step[0000/0626], Avg Loss: 0.8705 +INFO:local_logger:Epoch[017/800], Step[0100/0626], Avg Loss: 0.8666 +INFO:local_logger:Epoch[017/800], Step[0100/0626], Avg Loss: 0.8669 +INFO:local_logger:Epoch[017/800], Step[0100/0626], Avg Loss: 0.8674 +INFO:local_logger:Epoch[017/800], Step[0100/0626], Avg Loss: 0.8667 +INFO:master_logger:Epoch[017/800], Step[0100/0626], Avg Loss: 0.8668 +INFO:local_logger:Epoch[017/800], Step[0100/0626], Avg Loss: 0.8662 +INFO:local_logger:Epoch[017/800], Step[0100/0626], Avg Loss: 0.8672 +INFO:local_logger:Epoch[017/800], Step[0100/0626], Avg Loss: 0.8665 +INFO:local_logger:Epoch[017/800], Step[0100/0626], Avg Loss: 0.8673 +INFO:local_logger:Epoch[017/800], Step[0200/0626], Avg Loss: 0.8654 +INFO:local_logger:Epoch[017/800], Step[0200/0626], Avg Loss: 0.8660 +INFO:local_logger:Epoch[017/800], Step[0200/0626], Avg Loss: 0.8658 +INFO:local_logger:Epoch[017/800], Step[0200/0626], Avg Loss: 0.8659 +INFO:local_logger:Epoch[017/800], Step[0200/0626], Avg Loss: 0.8654 +INFO:local_logger:Epoch[017/800], Step[0200/0626], Avg Loss: 0.8658 +INFO:master_logger:Epoch[017/800], Step[0200/0626], Avg Loss: 0.8657 +INFO:local_logger:Epoch[017/800], Step[0200/0626], Avg Loss: 0.8656 +INFO:local_logger:Epoch[017/800], Step[0200/0626], Avg Loss: 0.8654 +INFO:local_logger:Epoch[017/800], Step[0300/0626], Avg Loss: 0.8647 +INFO:local_logger:Epoch[017/800], Step[0300/0626], Avg Loss: 0.8648 +INFO:local_logger:Epoch[017/800], Step[0300/0626], Avg Loss: 0.8646 +INFO:local_logger:Epoch[017/800], Step[0300/0626], Avg Loss: 0.8649 +INFO:master_logger:Epoch[017/800], Step[0300/0626], Avg Loss: 0.8647 +INFO:local_logger:Epoch[017/800], Step[0300/0626], Avg Loss: 0.8652 +INFO:local_logger:Epoch[017/800], Step[0300/0626], Avg Loss: 0.8646 +INFO:local_logger:Epoch[017/800], Step[0300/0626], Avg Loss: 0.8642 +INFO:local_logger:Epoch[017/800], Step[0300/0626], Avg Loss: 0.8648 +INFO:local_logger:Epoch[017/800], Step[0400/0626], Avg Loss: 0.8629 +INFO:local_logger:Epoch[017/800], Step[0400/0626], Avg Loss: 0.8639 +INFO:local_logger:Epoch[017/800], Step[0400/0626], Avg Loss: 0.8636 +INFO:local_logger:Epoch[017/800], Step[0400/0626], Avg Loss: 0.8635 +INFO:local_logger:Epoch[017/800], Step[0400/0626], Avg Loss: 0.8634 +INFO:local_logger:Epoch[017/800], Step[0400/0626], Avg Loss: 0.8634 +INFO:local_logger:Epoch[017/800], Step[0400/0626], Avg Loss: 0.8635 +INFO:local_logger:Epoch[017/800], Step[0400/0626], Avg Loss: 0.8633 +INFO:master_logger:Epoch[017/800], Step[0400/0626], Avg Loss: 0.8634 +INFO:local_logger:Epoch[017/800], Step[0500/0626], Avg Loss: 0.8619 +INFO:local_logger:Epoch[017/800], Step[0500/0626], Avg Loss: 0.8628 +INFO:local_logger:Epoch[017/800], Step[0500/0626], Avg Loss: 0.8626 +INFO:local_logger:Epoch[017/800], Step[0500/0626], Avg Loss: 0.8624 +INFO:local_logger:Epoch[017/800], Step[0500/0626], Avg Loss: 0.8622 +INFO:local_logger:Epoch[017/800], Step[0500/0626], Avg Loss: 0.8624 
+INFO:master_logger:Epoch[017/800], Step[0500/0626], Avg Loss: 0.8624 +INFO:local_logger:Epoch[017/800], Step[0500/0626], Avg Loss: 0.8625 +INFO:local_logger:Epoch[017/800], Step[0500/0626], Avg Loss: 0.8625 +INFO:local_logger:Epoch[017/800], Step[0600/0626], Avg Loss: 0.8615 +INFO:local_logger:Epoch[017/800], Step[0600/0626], Avg Loss: 0.8619 +INFO:local_logger:Epoch[017/800], Step[0600/0626], Avg Loss: 0.8620 +INFO:local_logger:Epoch[017/800], Step[0600/0626], Avg Loss: 0.8613 +INFO:local_logger:Epoch[017/800], Step[0600/0626], Avg Loss: 0.8618 +INFO:local_logger:Epoch[017/800], Step[0600/0626], Avg Loss: 0.8618 +INFO:local_logger:Epoch[017/800], Step[0600/0626], Avg Loss: 0.8616 +INFO:master_logger:Epoch[017/800], Step[0600/0626], Avg Loss: 0.8617 +INFO:local_logger:Epoch[017/800], Step[0600/0626], Avg Loss: 0.8619 +INFO:local_logger:----- Epoch[017/800], Train Loss: 0.8617, time: 890.30 +INFO:local_logger:Now training epoch 18. LR=0.000068 +INFO:local_logger:----- Epoch[017/800], Train Loss: 0.8616, time: 890.30 +INFO:local_logger:Now training epoch 18. LR=0.000068 +INFO:local_logger:----- Epoch[017/800], Train Loss: 0.8617, time: 890.31 +INFO:local_logger:Now training epoch 18. LR=0.000068 +INFO:local_logger:----- Epoch[017/800], Train Loss: 0.8610, time: 890.96 +INFO:local_logger:Now training epoch 18. LR=0.000068 +INFO:local_logger:----- Epoch[017/800], Train Loss: 0.8614, time: 887.14 +INFO:master_logger:----- Epoch[017/800], Train Loss: 0.8615, time: 887.14 +INFO:local_logger:----- Epoch[017/800], Train Loss: 0.8616, time: 891.29 +INFO:local_logger:Now training epoch 18. LR=0.000068 +INFO:local_logger:----- Epoch[017/800], Train Loss: 0.8617, time: 890.99 +INFO:local_logger:Now training epoch 18. LR=0.000068 +INFO:local_logger:----- Epoch[017/800], Train Loss: 0.8614, time: 892.15 +INFO:local_logger:Now training epoch 18. LR=0.000068 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-17-Loss-0.8613511298173326.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-17-Loss-0.8613511298173326.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-17-Loss-0.8613511298173326.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-17-Loss-0.8613511298173326.pdopt +INFO:local_logger:Now training epoch 18. LR=0.000068 +INFO:master_logger:Now training epoch 18. 
LR=0.000068 +INFO:local_logger:Epoch[018/800], Step[0000/0626], Avg Loss: 0.8610 +INFO:master_logger:Epoch[018/800], Step[0000/0626], Avg Loss: 0.8573 +INFO:local_logger:Epoch[018/800], Step[0000/0626], Avg Loss: 0.8499 +INFO:local_logger:Epoch[018/800], Step[0000/0626], Avg Loss: 0.8529 +INFO:local_logger:Epoch[018/800], Step[0000/0626], Avg Loss: 0.8567 +INFO:local_logger:Epoch[018/800], Step[0000/0626], Avg Loss: 0.8551 +INFO:local_logger:Epoch[018/800], Step[0000/0626], Avg Loss: 0.8589 +INFO:local_logger:Epoch[018/800], Step[0000/0626], Avg Loss: 0.8601 +INFO:local_logger:Epoch[018/800], Step[0000/0626], Avg Loss: 0.8641 +INFO:local_logger:Epoch[018/800], Step[0100/0626], Avg Loss: 0.8555 +INFO:local_logger:Epoch[018/800], Step[0100/0626], Avg Loss: 0.8553 +INFO:local_logger:Epoch[018/800], Step[0100/0626], Avg Loss: 0.8547 +INFO:local_logger:Epoch[018/800], Step[0100/0626], Avg Loss: 0.8552 +INFO:local_logger:Epoch[018/800], Step[0100/0626], Avg Loss: 0.8543 +INFO:local_logger:Epoch[018/800], Step[0100/0626], Avg Loss: 0.8543 +INFO:master_logger:Epoch[018/800], Step[0100/0626], Avg Loss: 0.8547 +INFO:local_logger:Epoch[018/800], Step[0100/0626], Avg Loss: 0.8542 +INFO:local_logger:Epoch[018/800], Step[0100/0626], Avg Loss: 0.8543 +INFO:local_logger:Epoch[018/800], Step[0200/0626], Avg Loss: 0.8544 +INFO:local_logger:Epoch[018/800], Step[0200/0626], Avg Loss: 0.8543 +INFO:local_logger:Epoch[018/800], Step[0200/0626], Avg Loss: 0.8543 +INFO:local_logger:Epoch[018/800], Step[0200/0626], Avg Loss: 0.8541 +INFO:local_logger:Epoch[018/800], Step[0200/0626], Avg Loss: 0.8537 +INFO:master_logger:Epoch[018/800], Step[0200/0626], Avg Loss: 0.8541 +INFO:local_logger:Epoch[018/800], Step[0200/0626], Avg Loss: 0.8544 +INFO:local_logger:Epoch[018/800], Step[0200/0626], Avg Loss: 0.8539 +INFO:local_logger:Epoch[018/800], Step[0200/0626], Avg Loss: 0.8541 +INFO:local_logger:Epoch[018/800], Step[0300/0626], Avg Loss: 0.8534 +INFO:local_logger:Epoch[018/800], Step[0300/0626], Avg Loss: 0.8534 +INFO:local_logger:Epoch[018/800], Step[0300/0626], Avg Loss: 0.8536 +INFO:local_logger:Epoch[018/800], Step[0300/0626], Avg Loss: 0.8535 +INFO:local_logger:Epoch[018/800], Step[0300/0626], Avg Loss: 0.8540 +INFO:local_logger:Epoch[018/800], Step[0300/0626], Avg Loss: 0.8535 +INFO:master_logger:Epoch[018/800], Step[0300/0626], Avg Loss: 0.8537 +INFO:local_logger:Epoch[018/800], Step[0300/0626], Avg Loss: 0.8544 +INFO:local_logger:Epoch[018/800], Step[0300/0626], Avg Loss: 0.8538 +INFO:local_logger:Epoch[018/800], Step[0400/0626], Avg Loss: 0.8536 +INFO:local_logger:Epoch[018/800], Step[0400/0626], Avg Loss: 0.8534 +INFO:local_logger:Epoch[018/800], Step[0400/0626], Avg Loss: 0.8534 +INFO:local_logger:Epoch[018/800], Step[0400/0626], Avg Loss: 0.8532 +INFO:local_logger:Epoch[018/800], Step[0400/0626], Avg Loss: 0.8542 +INFO:local_logger:Epoch[018/800], Step[0400/0626], Avg Loss: 0.8532 +INFO:local_logger:Epoch[018/800], Step[0400/0626], Avg Loss: 0.8537 +INFO:master_logger:Epoch[018/800], Step[0400/0626], Avg Loss: 0.8535 +INFO:local_logger:Epoch[018/800], Step[0400/0626], Avg Loss: 0.8530 +INFO:local_logger:Epoch[018/800], Step[0500/0626], Avg Loss: 0.8533 +INFO:local_logger:Epoch[018/800], Step[0500/0626], Avg Loss: 0.8536 +INFO:local_logger:Epoch[018/800], Step[0500/0626], Avg Loss: 0.8529 +INFO:local_logger:Epoch[018/800], Step[0500/0626], Avg Loss: 0.8531 +INFO:local_logger:Epoch[018/800], Step[0500/0626], Avg Loss: 0.8534 +INFO:local_logger:Epoch[018/800], Step[0500/0626], Avg Loss: 0.8529 
+INFO:local_logger:Epoch[018/800], Step[0500/0626], Avg Loss: 0.8534 +INFO:master_logger:Epoch[018/800], Step[0500/0626], Avg Loss: 0.8533 +INFO:local_logger:Epoch[018/800], Step[0500/0626], Avg Loss: 0.8541 +INFO:local_logger:Epoch[018/800], Step[0600/0626], Avg Loss: 0.8533 +INFO:local_logger:Epoch[018/800], Step[0600/0626], Avg Loss: 0.8525 +INFO:local_logger:Epoch[018/800], Step[0600/0626], Avg Loss: 0.8531 +INFO:local_logger:Epoch[018/800], Step[0600/0626], Avg Loss: 0.8529 +INFO:local_logger:Epoch[018/800], Step[0600/0626], Avg Loss: 0.8528 +INFO:local_logger:Epoch[018/800], Step[0600/0626], Avg Loss: 0.8525 +INFO:local_logger:Epoch[018/800], Step[0600/0626], Avg Loss: 0.8536 +INFO:local_logger:Epoch[018/800], Step[0600/0626], Avg Loss: 0.8528 +INFO:master_logger:Epoch[018/800], Step[0600/0626], Avg Loss: 0.8529 +INFO:local_logger:----- Epoch[018/800], Train Loss: 0.8532, time: 859.28 +INFO:local_logger:Now training epoch 19. LR=0.000072 +INFO:local_logger:----- Epoch[018/800], Train Loss: 0.8524, time: 859.95 +INFO:local_logger:Now training epoch 19. LR=0.000072 +INFO:local_logger:----- Epoch[018/800], Train Loss: 0.8527, time: 855.56 +INFO:local_logger:----- Epoch[018/800], Train Loss: 0.8523, time: 859.27 +INFO:local_logger:----- Epoch[018/800], Train Loss: 0.8527, time: 859.94 +INFO:master_logger:----- Epoch[018/800], Train Loss: 0.8528, time: 855.56 +INFO:local_logger:Now training epoch 19. LR=0.000072 +INFO:local_logger:Now training epoch 19. LR=0.000072 +INFO:local_logger:----- Epoch[018/800], Train Loss: 0.8530, time: 859.29 +INFO:local_logger:Now training epoch 19. LR=0.000072 +INFO:local_logger:----- Epoch[018/800], Train Loss: 0.8528, time: 859.96 +INFO:local_logger:Now training epoch 19. LR=0.000072 +INFO:local_logger:----- Epoch[018/800], Train Loss: 0.8534, time: 859.27 +INFO:local_logger:Now training epoch 19. LR=0.000072 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-18-Loss-0.8526818839083388.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-18-Loss-0.8526818839083388.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-18-Loss-0.8526818839083388.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-18-Loss-0.8526818839083388.pdopt +INFO:local_logger:Now training epoch 19. LR=0.000072 +INFO:master_logger:Now training epoch 19. 
LR=0.000072 +INFO:local_logger:Epoch[019/800], Step[0000/0626], Avg Loss: 0.8466 +INFO:local_logger:Epoch[019/800], Step[0000/0626], Avg Loss: 0.8442 +INFO:local_logger:Epoch[019/800], Step[0000/0626], Avg Loss: 0.8470 +INFO:local_logger:Epoch[019/800], Step[0000/0626], Avg Loss: 0.8424 +INFO:local_logger:Epoch[019/800], Step[0000/0626], Avg Loss: 0.8531 +INFO:local_logger:Epoch[019/800], Step[0000/0626], Avg Loss: 0.8522 +INFO:master_logger:Epoch[019/800], Step[0000/0626], Avg Loss: 0.8474 +INFO:local_logger:Epoch[019/800], Step[0000/0626], Avg Loss: 0.8452 +INFO:local_logger:Epoch[019/800], Step[0000/0626], Avg Loss: 0.8487 +INFO:local_logger:Epoch[019/800], Step[0100/0626], Avg Loss: 0.8481 +INFO:local_logger:Epoch[019/800], Step[0100/0626], Avg Loss: 0.8485 +INFO:local_logger:Epoch[019/800], Step[0100/0626], Avg Loss: 0.8477 +INFO:local_logger:Epoch[019/800], Step[0100/0626], Avg Loss: 0.8475 +INFO:local_logger:Epoch[019/800], Step[0100/0626], Avg Loss: 0.8477 +INFO:local_logger:Epoch[019/800], Step[0100/0626], Avg Loss: 0.8473 +INFO:local_logger:Epoch[019/800], Step[0100/0626], Avg Loss: 0.8491 +INFO:master_logger:Epoch[019/800], Step[0100/0626], Avg Loss: 0.8480 +INFO:local_logger:Epoch[019/800], Step[0100/0626], Avg Loss: 0.8481 +INFO:local_logger:Epoch[019/800], Step[0200/0626], Avg Loss: 0.8476 +INFO:local_logger:Epoch[019/800], Step[0200/0626], Avg Loss: 0.8477 +INFO:local_logger:Epoch[019/800], Step[0200/0626], Avg Loss: 0.8476 +INFO:local_logger:Epoch[019/800], Step[0200/0626], Avg Loss: 0.8481 +INFO:local_logger:Epoch[019/800], Step[0200/0626], Avg Loss: 0.8480 +INFO:master_logger:Epoch[019/800], Step[0200/0626], Avg Loss: 0.8478 +INFO:local_logger:Epoch[019/800], Step[0200/0626], Avg Loss: 0.8483 +INFO:local_logger:Epoch[019/800], Step[0200/0626], Avg Loss: 0.8476 +INFO:local_logger:Epoch[019/800], Step[0200/0626], Avg Loss: 0.8478 +INFO:local_logger:Epoch[019/800], Step[0300/0626], Avg Loss: 0.8470 +INFO:local_logger:Epoch[019/800], Step[0300/0626], Avg Loss: 0.8465 +INFO:master_logger:Epoch[019/800], Step[0300/0626], Avg Loss: 0.8468 +INFO:local_logger:Epoch[019/800], Step[0300/0626], Avg Loss: 0.8469 +INFO:local_logger:Epoch[019/800], Step[0300/0626], Avg Loss: 0.8470 +INFO:local_logger:Epoch[019/800], Step[0300/0626], Avg Loss: 0.8470 +INFO:local_logger:Epoch[019/800], Step[0300/0626], Avg Loss: 0.8465 +INFO:local_logger:Epoch[019/800], Step[0300/0626], Avg Loss: 0.8468 +INFO:local_logger:Epoch[019/800], Step[0300/0626], Avg Loss: 0.8467 +INFO:local_logger:Epoch[019/800], Step[0400/0626], Avg Loss: 0.8458 +INFO:local_logger:Epoch[019/800], Step[0400/0626], Avg Loss: 0.8460 +INFO:local_logger:Epoch[019/800], Step[0400/0626], Avg Loss: 0.8460 +INFO:local_logger:Epoch[019/800], Step[0400/0626], Avg Loss: 0.8464 +INFO:local_logger:Epoch[019/800], Step[0400/0626], Avg Loss: 0.8461 +INFO:local_logger:Epoch[019/800], Step[0400/0626], Avg Loss: 0.8460 +INFO:local_logger:Epoch[019/800], Step[0400/0626], Avg Loss: 0.8465 +INFO:local_logger:Epoch[019/800], Step[0400/0626], Avg Loss: 0.8463 +INFO:master_logger:Epoch[019/800], Step[0400/0626], Avg Loss: 0.8461 +INFO:local_logger:Epoch[019/800], Step[0500/0626], Avg Loss: 0.8452 +INFO:local_logger:Epoch[019/800], Step[0500/0626], Avg Loss: 0.8454 +INFO:local_logger:Epoch[019/800], Step[0500/0626], Avg Loss: 0.8450 +INFO:local_logger:Epoch[019/800], Step[0500/0626], Avg Loss: 0.8455 +INFO:local_logger:Epoch[019/800], Step[0500/0626], Avg Loss: 0.8451 +INFO:local_logger:Epoch[019/800], Step[0500/0626], Avg Loss: 0.8452 
+INFO:local_logger:Epoch[019/800], Step[0500/0626], Avg Loss: 0.8452 +INFO:local_logger:Epoch[019/800], Step[0500/0626], Avg Loss: 0.8455 +INFO:master_logger:Epoch[019/800], Step[0500/0626], Avg Loss: 0.8452 +INFO:local_logger:Epoch[019/800], Step[0600/0626], Avg Loss: 0.8445 +INFO:local_logger:Epoch[019/800], Step[0600/0626], Avg Loss: 0.8442 +INFO:local_logger:Epoch[019/800], Step[0600/0626], Avg Loss: 0.8446 +INFO:local_logger:Epoch[019/800], Step[0600/0626], Avg Loss: 0.8445 +INFO:local_logger:Epoch[019/800], Step[0600/0626], Avg Loss: 0.8447 +INFO:local_logger:Epoch[019/800], Step[0600/0626], Avg Loss: 0.8445 +INFO:master_logger:Epoch[019/800], Step[0600/0626], Avg Loss: 0.8445 +INFO:local_logger:Epoch[019/800], Step[0600/0626], Avg Loss: 0.8443 +INFO:local_logger:Epoch[019/800], Step[0600/0626], Avg Loss: 0.8445 +INFO:local_logger:----- Epoch[019/800], Train Loss: 0.8446, time: 880.98 +INFO:master_logger:----- Epoch[019/800], Train Loss: 0.8443, time: 880.98 +INFO:local_logger:----- Epoch[019/800], Train Loss: 0.8443, time: 885.40 +INFO:local_logger:Now training epoch 20. LR=0.000075 +INFO:local_logger:----- Epoch[019/800], Train Loss: 0.8443, time: 885.43 +INFO:local_logger:Now training epoch 20. LR=0.000075 +INFO:local_logger:----- Epoch[019/800], Train Loss: 0.8444, time: 885.46 +INFO:local_logger:Now training epoch 20. LR=0.000075 +INFO:local_logger:----- Epoch[019/800], Train Loss: 0.8441, time: 885.49 +INFO:local_logger:Now training epoch 20. LR=0.000075 +INFO:local_logger:----- Epoch[019/800], Train Loss: 0.8441, time: 885.53 +INFO:local_logger:Now training epoch 20. LR=0.000075 +INFO:local_logger:----- Epoch[019/800], Train Loss: 0.8443, time: 885.54 +INFO:local_logger:Now training epoch 20. LR=0.000075 +INFO:local_logger:----- Epoch[019/800], Train Loss: 0.8446, time: 885.53 +INFO:local_logger:Now training epoch 20. LR=0.000075 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-19-Loss-0.8445631699389794.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-19-Loss-0.8445631699389794.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-19-Loss-0.8445631699389794.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-19-Loss-0.8445631699389794.pdopt +INFO:local_logger:Now training epoch 20. LR=0.000075 +INFO:master_logger:Now training epoch 20. 
LR=0.000075 +INFO:local_logger:Epoch[020/800], Step[0000/0626], Avg Loss: 0.8395 +INFO:local_logger:Epoch[020/800], Step[0000/0626], Avg Loss: 0.8579 +INFO:local_logger:Epoch[020/800], Step[0000/0626], Avg Loss: 0.8337 +INFO:master_logger:Epoch[020/800], Step[0000/0626], Avg Loss: 0.8394 +INFO:local_logger:Epoch[020/800], Step[0000/0626], Avg Loss: 0.8377 +INFO:local_logger:Epoch[020/800], Step[0000/0626], Avg Loss: 0.8425 +INFO:local_logger:Epoch[020/800], Step[0000/0626], Avg Loss: 0.8297 +INFO:local_logger:Epoch[020/800], Step[0000/0626], Avg Loss: 0.8371 +INFO:local_logger:Epoch[020/800], Step[0000/0626], Avg Loss: 0.8374 +INFO:local_logger:Epoch[020/800], Step[0100/0626], Avg Loss: 0.8389 +INFO:local_logger:Epoch[020/800], Step[0100/0626], Avg Loss: 0.8399 +INFO:local_logger:Epoch[020/800], Step[0100/0626], Avg Loss: 0.8385 +INFO:local_logger:Epoch[020/800], Step[0100/0626], Avg Loss: 0.8402 +INFO:local_logger:Epoch[020/800], Step[0100/0626], Avg Loss: 0.8398 +INFO:local_logger:Epoch[020/800], Step[0100/0626], Avg Loss: 0.8399 +INFO:master_logger:Epoch[020/800], Step[0100/0626], Avg Loss: 0.8396 +INFO:local_logger:Epoch[020/800], Step[0100/0626], Avg Loss: 0.8398 +INFO:local_logger:Epoch[020/800], Step[0100/0626], Avg Loss: 0.8400 +INFO:local_logger:Epoch[020/800], Step[0200/0626], Avg Loss: 0.8402 +INFO:local_logger:Epoch[020/800], Step[0200/0626], Avg Loss: 0.8410 +INFO:local_logger:Epoch[020/800], Step[0200/0626], Avg Loss: 0.8400 +INFO:local_logger:Epoch[020/800], Step[0200/0626], Avg Loss: 0.8406 +INFO:local_logger:Epoch[020/800], Step[0200/0626], Avg Loss: 0.8409 +INFO:master_logger:Epoch[020/800], Step[0200/0626], Avg Loss: 0.8408 +INFO:local_logger:Epoch[020/800], Step[0200/0626], Avg Loss: 0.8403 +INFO:local_logger:Epoch[020/800], Step[0200/0626], Avg Loss: 0.8416 +INFO:local_logger:Epoch[020/800], Step[0200/0626], Avg Loss: 0.8415 +INFO:local_logger:Epoch[020/800], Step[0300/0626], Avg Loss: 0.8399 +INFO:local_logger:Epoch[020/800], Step[0300/0626], Avg Loss: 0.8411 +INFO:local_logger:Epoch[020/800], Step[0300/0626], Avg Loss: 0.8404 +INFO:local_logger:Epoch[020/800], Step[0300/0626], Avg Loss: 0.8406 +INFO:local_logger:Epoch[020/800], Step[0300/0626], Avg Loss: 0.8406 +INFO:master_logger:Epoch[020/800], Step[0300/0626], Avg Loss: 0.8406 +INFO:local_logger:Epoch[020/800], Step[0300/0626], Avg Loss: 0.8403 +INFO:local_logger:Epoch[020/800], Step[0300/0626], Avg Loss: 0.8415 +INFO:local_logger:Epoch[020/800], Step[0300/0626], Avg Loss: 0.8403 +INFO:local_logger:Epoch[020/800], Step[0400/0626], Avg Loss: 0.8397 +INFO:local_logger:Epoch[020/800], Step[0400/0626], Avg Loss: 0.8397 +INFO:local_logger:Epoch[020/800], Step[0400/0626], Avg Loss: 0.8393 +INFO:local_logger:Epoch[020/800], Step[0400/0626], Avg Loss: 0.8400 +INFO:local_logger:Epoch[020/800], Step[0400/0626], Avg Loss: 0.8395 +INFO:local_logger:Epoch[020/800], Step[0400/0626], Avg Loss: 0.8400 +INFO:local_logger:Epoch[020/800], Step[0400/0626], Avg Loss: 0.8397 +INFO:master_logger:Epoch[020/800], Step[0400/0626], Avg Loss: 0.8398 +INFO:local_logger:Epoch[020/800], Step[0400/0626], Avg Loss: 0.8404 +INFO:local_logger:Epoch[020/800], Step[0500/0626], Avg Loss: 0.8384 +INFO:local_logger:Epoch[020/800], Step[0500/0626], Avg Loss: 0.8386 +INFO:local_logger:Epoch[020/800], Step[0500/0626], Avg Loss: 0.8384 +INFO:local_logger:Epoch[020/800], Step[0500/0626], Avg Loss: 0.8388 +INFO:local_logger:Epoch[020/800], Step[0500/0626], Avg Loss: 0.8383 +INFO:local_logger:Epoch[020/800], Step[0500/0626], Avg Loss: 0.8387 
+INFO:local_logger:Epoch[020/800], Step[0500/0626], Avg Loss: 0.8384 +INFO:master_logger:Epoch[020/800], Step[0500/0626], Avg Loss: 0.8386 +INFO:local_logger:Epoch[020/800], Step[0500/0626], Avg Loss: 0.8391 +INFO:local_logger:Epoch[020/800], Step[0600/0626], Avg Loss: 0.8387 +INFO:local_logger:Epoch[020/800], Step[0600/0626], Avg Loss: 0.8378 +INFO:local_logger:Epoch[020/800], Step[0600/0626], Avg Loss: 0.8380 +INFO:local_logger:Epoch[020/800], Step[0600/0626], Avg Loss: 0.8382 +INFO:local_logger:Epoch[020/800], Step[0600/0626], Avg Loss: 0.8383 +INFO:local_logger:Epoch[020/800], Step[0600/0626], Avg Loss: 0.8381 +INFO:master_logger:Epoch[020/800], Step[0600/0626], Avg Loss: 0.8382 +INFO:local_logger:Epoch[020/800], Step[0600/0626], Avg Loss: 0.8380 +INFO:local_logger:Epoch[020/800], Step[0600/0626], Avg Loss: 0.8383 +INFO:local_logger:----- Epoch[020/800], Train Loss: 0.8379, time: 856.21 +INFO:local_logger:Now training epoch 21. LR=0.000079 +INFO:local_logger:----- Epoch[020/800], Train Loss: 0.8378, time: 856.64 +INFO:local_logger:----- Epoch[020/800], Train Loss: 0.8381, time: 856.54 +INFO:local_logger:Now training epoch 21. LR=0.000079 +INFO:local_logger:Now training epoch 21. LR=0.000079 +INFO:local_logger:----- Epoch[020/800], Train Loss: 0.8378, time: 856.55 +INFO:local_logger:Now training epoch 21. LR=0.000079 +INFO:local_logger:----- Epoch[020/800], Train Loss: 0.8378, time: 856.54 +INFO:local_logger:Now training epoch 21. LR=0.000079 +INFO:local_logger:----- Epoch[020/800], Train Loss: 0.8385, time: 856.62 +INFO:local_logger:----- Epoch[020/800], Train Loss: 0.8379, time: 856.58 +INFO:local_logger:Now training epoch 21. LR=0.000079 +INFO:local_logger:Now training epoch 21. LR=0.000079 +INFO:local_logger:----- Epoch[020/800], Train Loss: 0.8377, time: 853.74 +INFO:master_logger:----- Epoch[020/800], Train Loss: 0.8379, time: 853.74 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-20-Loss-0.837697342612629.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-20-Loss-0.837697342612629.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-20-Loss-0.837697342612629.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-20-Loss-0.837697342612629.pdopt +INFO:local_logger:Now training epoch 21. LR=0.000079 +INFO:master_logger:Now training epoch 21. 
LR=0.000079 +INFO:local_logger:Epoch[021/800], Step[0000/0626], Avg Loss: 0.8247 +INFO:local_logger:Epoch[021/800], Step[0000/0626], Avg Loss: 0.8468 +INFO:master_logger:Epoch[021/800], Step[0000/0626], Avg Loss: 0.8311 +INFO:local_logger:Epoch[021/800], Step[0000/0626], Avg Loss: 0.8307 +INFO:local_logger:Epoch[021/800], Step[0000/0626], Avg Loss: 0.8301 +INFO:local_logger:Epoch[021/800], Step[0000/0626], Avg Loss: 0.8220 +INFO:local_logger:Epoch[021/800], Step[0000/0626], Avg Loss: 0.8352 +INFO:local_logger:Epoch[021/800], Step[0000/0626], Avg Loss: 0.8284 +INFO:local_logger:Epoch[021/800], Step[0000/0626], Avg Loss: 0.8313 +INFO:local_logger:Epoch[021/800], Step[0100/0626], Avg Loss: 0.8345 +INFO:local_logger:Epoch[021/800], Step[0100/0626], Avg Loss: 0.8327 +INFO:local_logger:Epoch[021/800], Step[0100/0626], Avg Loss: 0.8336 +INFO:local_logger:Epoch[021/800], Step[0100/0626], Avg Loss: 0.8344 +INFO:local_logger:Epoch[021/800], Step[0100/0626], Avg Loss: 0.8331 +INFO:local_logger:Epoch[021/800], Step[0100/0626], Avg Loss: 0.8343 +INFO:master_logger:Epoch[021/800], Step[0100/0626], Avg Loss: 0.8338 +INFO:local_logger:Epoch[021/800], Step[0100/0626], Avg Loss: 0.8339 +INFO:local_logger:Epoch[021/800], Step[0100/0626], Avg Loss: 0.8339 +INFO:local_logger:Epoch[021/800], Step[0200/0626], Avg Loss: 0.8343 +INFO:local_logger:Epoch[021/800], Step[0200/0626], Avg Loss: 0.8340 +INFO:local_logger:Epoch[021/800], Step[0200/0626], Avg Loss: 0.8344 +INFO:local_logger:Epoch[021/800], Step[0200/0626], Avg Loss: 0.8339 +INFO:master_logger:Epoch[021/800], Step[0200/0626], Avg Loss: 0.8338 +INFO:local_logger:Epoch[021/800], Step[0200/0626], Avg Loss: 0.8333 +INFO:local_logger:Epoch[021/800], Step[0200/0626], Avg Loss: 0.8332 +INFO:local_logger:Epoch[021/800], Step[0200/0626], Avg Loss: 0.8337 +INFO:local_logger:Epoch[021/800], Step[0200/0626], Avg Loss: 0.8335 +INFO:local_logger:Epoch[021/800], Step[0300/0626], Avg Loss: 0.8328 +INFO:local_logger:Epoch[021/800], Step[0300/0626], Avg Loss: 0.8336 +INFO:local_logger:Epoch[021/800], Step[0300/0626], Avg Loss: 0.8330 +INFO:local_logger:Epoch[021/800], Step[0300/0626], Avg Loss: 0.8331 +INFO:local_logger:Epoch[021/800], Step[0300/0626], Avg Loss: 0.8331 +INFO:local_logger:Epoch[021/800], Step[0300/0626], Avg Loss: 0.8337 +INFO:local_logger:Epoch[021/800], Step[0300/0626], Avg Loss: 0.8328 +INFO:local_logger:Epoch[021/800], Step[0300/0626], Avg Loss: 0.8336 +INFO:master_logger:Epoch[021/800], Step[0300/0626], Avg Loss: 0.8332 +INFO:local_logger:Epoch[021/800], Step[0400/0626], Avg Loss: 0.8324 +INFO:local_logger:Epoch[021/800], Step[0400/0626], Avg Loss: 0.8322 +INFO:local_logger:Epoch[021/800], Step[0400/0626], Avg Loss: 0.8329 +INFO:local_logger:Epoch[021/800], Step[0400/0626], Avg Loss: 0.8326 +INFO:local_logger:Epoch[021/800], Step[0400/0626], Avg Loss: 0.8333 +INFO:local_logger:Epoch[021/800], Step[0400/0626], Avg Loss: 0.8324 +INFO:local_logger:Epoch[021/800], Step[0400/0626], Avg Loss: 0.8332 +INFO:master_logger:Epoch[021/800], Step[0400/0626], Avg Loss: 0.8327 +INFO:local_logger:Epoch[021/800], Step[0400/0626], Avg Loss: 0.8328 +INFO:local_logger:Epoch[021/800], Step[0500/0626], Avg Loss: 0.8323 +INFO:local_logger:Epoch[021/800], Step[0500/0626], Avg Loss: 0.8322 +INFO:master_logger:Epoch[021/800], Step[0500/0626], Avg Loss: 0.8321 +INFO:local_logger:Epoch[021/800], Step[0500/0626], Avg Loss: 0.8325 +INFO:local_logger:Epoch[021/800], Step[0500/0626], Avg Loss: 0.8319 +INFO:local_logger:Epoch[021/800], Step[0500/0626], Avg Loss: 0.8317 
+INFO:local_logger:Epoch[021/800], Step[0500/0626], Avg Loss: 0.8323 +INFO:local_logger:Epoch[021/800], Step[0500/0626], Avg Loss: 0.8319 +INFO:local_logger:Epoch[021/800], Step[0500/0626], Avg Loss: 0.8320 +INFO:local_logger:Epoch[021/800], Step[0600/0626], Avg Loss: 0.8319 +INFO:local_logger:Epoch[021/800], Step[0600/0626], Avg Loss: 0.8316 +INFO:local_logger:Epoch[021/800], Step[0600/0626], Avg Loss: 0.8318 +INFO:local_logger:Epoch[021/800], Step[0600/0626], Avg Loss: 0.8314 +INFO:local_logger:Epoch[021/800], Step[0600/0626], Avg Loss: 0.8317 +INFO:local_logger:Epoch[021/800], Step[0600/0626], Avg Loss: 0.8314 +INFO:local_logger:Epoch[021/800], Step[0600/0626], Avg Loss: 0.8315 +INFO:master_logger:Epoch[021/800], Step[0600/0626], Avg Loss: 0.8316 +INFO:local_logger:Epoch[021/800], Step[0600/0626], Avg Loss: 0.8317 +INFO:local_logger:----- Epoch[021/800], Train Loss: 0.8314, time: 903.73 +INFO:master_logger:----- Epoch[021/800], Train Loss: 0.8313, time: 903.73 +INFO:local_logger:----- Epoch[021/800], Train Loss: 0.8311, time: 908.09 +INFO:local_logger:Now training epoch 22. LR=0.000083 +INFO:local_logger:----- Epoch[021/800], Train Loss: 0.8312, time: 908.50 +INFO:local_logger:Now training epoch 22. LR=0.000083 +INFO:local_logger:----- Epoch[021/800], Train Loss: 0.8312, time: 908.98 +INFO:local_logger:Now training epoch 22. LR=0.000083 +INFO:local_logger:----- Epoch[021/800], Train Loss: 0.8314, time: 908.52 +INFO:local_logger:Now training epoch 22. LR=0.000083 +INFO:local_logger:----- Epoch[021/800], Train Loss: 0.8314, time: 908.52 +INFO:local_logger:Now training epoch 22. LR=0.000083 +INFO:local_logger:----- Epoch[021/800], Train Loss: 0.8311, time: 908.52 +INFO:local_logger:Now training epoch 22. LR=0.000083 +INFO:local_logger:----- Epoch[021/800], Train Loss: 0.8317, time: 908.52 +INFO:local_logger:Now training epoch 22. LR=0.000083 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-21-Loss-0.8314437446567381.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-21-Loss-0.8314437446567381.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-21-Loss-0.8314437446567381.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-21-Loss-0.8314437446567381.pdopt +INFO:local_logger:Now training epoch 22. LR=0.000083 +INFO:master_logger:Now training epoch 22. 
LR=0.000083 +INFO:local_logger:Epoch[022/800], Step[0000/0626], Avg Loss: 0.8238 +INFO:local_logger:Epoch[022/800], Step[0000/0626], Avg Loss: 0.8287 +INFO:master_logger:Epoch[022/800], Step[0000/0626], Avg Loss: 0.8236 +INFO:local_logger:Epoch[022/800], Step[0000/0626], Avg Loss: 0.8314 +INFO:local_logger:Epoch[022/800], Step[0000/0626], Avg Loss: 0.8120 +INFO:local_logger:Epoch[022/800], Step[0000/0626], Avg Loss: 0.8206 +INFO:local_logger:Epoch[022/800], Step[0000/0626], Avg Loss: 0.8245 +INFO:local_logger:Epoch[022/800], Step[0000/0626], Avg Loss: 0.8214 +INFO:local_logger:Epoch[022/800], Step[0000/0626], Avg Loss: 0.8266 +INFO:local_logger:Epoch[022/800], Step[0100/0626], Avg Loss: 0.8256 +INFO:local_logger:Epoch[022/800], Step[0100/0626], Avg Loss: 0.8255 +INFO:local_logger:Epoch[022/800], Step[0100/0626], Avg Loss: 0.8256 +INFO:local_logger:Epoch[022/800], Step[0100/0626], Avg Loss: 0.8256 +INFO:local_logger:Epoch[022/800], Step[0100/0626], Avg Loss: 0.8274 +INFO:local_logger:Epoch[022/800], Step[0100/0626], Avg Loss: 0.8262 +INFO:local_logger:Epoch[022/800], Step[0100/0626], Avg Loss: 0.8270 +INFO:master_logger:Epoch[022/800], Step[0100/0626], Avg Loss: 0.8262 +INFO:local_logger:Epoch[022/800], Step[0100/0626], Avg Loss: 0.8269 +INFO:local_logger:Epoch[022/800], Step[0200/0626], Avg Loss: 0.8246 +INFO:local_logger:Epoch[022/800], Step[0200/0626], Avg Loss: 0.8258 +INFO:local_logger:Epoch[022/800], Step[0200/0626], Avg Loss: 0.8262 +INFO:local_logger:Epoch[022/800], Step[0200/0626], Avg Loss: 0.8250 +INFO:master_logger:Epoch[022/800], Step[0200/0626], Avg Loss: 0.8252 +INFO:local_logger:Epoch[022/800], Step[0200/0626], Avg Loss: 0.8250 +INFO:local_logger:Epoch[022/800], Step[0200/0626], Avg Loss: 0.8256 +INFO:local_logger:Epoch[022/800], Step[0200/0626], Avg Loss: 0.8245 +INFO:local_logger:Epoch[022/800], Step[0200/0626], Avg Loss: 0.8253 +INFO:local_logger:Epoch[022/800], Step[0300/0626], Avg Loss: 0.8236 +INFO:local_logger:Epoch[022/800], Step[0300/0626], Avg Loss: 0.8237 +INFO:local_logger:Epoch[022/800], Step[0300/0626], Avg Loss: 0.8235 +INFO:local_logger:Epoch[022/800], Step[0300/0626], Avg Loss: 0.8239 +INFO:local_logger:Epoch[022/800], Step[0300/0626], Avg Loss: 0.8245 +INFO:local_logger:Epoch[022/800], Step[0300/0626], Avg Loss: 0.8242 +INFO:local_logger:Epoch[022/800], Step[0300/0626], Avg Loss: 0.8232 +INFO:local_logger:Epoch[022/800], Step[0300/0626], Avg Loss: 0.8245 +INFO:master_logger:Epoch[022/800], Step[0300/0626], Avg Loss: 0.8239 +INFO:local_logger:Epoch[022/800], Step[0400/0626], Avg Loss: 0.8234 +INFO:local_logger:Epoch[022/800], Step[0400/0626], Avg Loss: 0.8226 +INFO:local_logger:Epoch[022/800], Step[0400/0626], Avg Loss: 0.8230 +INFO:local_logger:Epoch[022/800], Step[0400/0626], Avg Loss: 0.8239 +INFO:local_logger:Epoch[022/800], Step[0400/0626], Avg Loss: 0.8240 +INFO:local_logger:Epoch[022/800], Step[0400/0626], Avg Loss: 0.8231 +INFO:local_logger:Epoch[022/800], Step[0400/0626], Avg Loss: 0.8231 +INFO:local_logger:Epoch[022/800], Step[0400/0626], Avg Loss: 0.8235 +INFO:master_logger:Epoch[022/800], Step[0400/0626], Avg Loss: 0.8233 +INFO:local_logger:Epoch[022/800], Step[0500/0626], Avg Loss: 0.8231 +INFO:local_logger:Epoch[022/800], Step[0500/0626], Avg Loss: 0.8231 +INFO:local_logger:Epoch[022/800], Step[0500/0626], Avg Loss: 0.8222 +INFO:local_logger:Epoch[022/800], Step[0500/0626], Avg Loss: 0.8226 +INFO:master_logger:Epoch[022/800], Step[0500/0626], Avg Loss: 0.8225 +INFO:local_logger:Epoch[022/800], Step[0500/0626], Avg Loss: 0.8222 
+INFO:local_logger:Epoch[022/800], Step[0500/0626], Avg Loss: 0.8223 +INFO:local_logger:Epoch[022/800], Step[0500/0626], Avg Loss: 0.8226 +INFO:local_logger:Epoch[022/800], Step[0500/0626], Avg Loss: 0.8222 +INFO:local_logger:Epoch[022/800], Step[0600/0626], Avg Loss: 0.8221 +INFO:local_logger:Epoch[022/800], Step[0600/0626], Avg Loss: 0.8219 +INFO:local_logger:Epoch[022/800], Step[0600/0626], Avg Loss: 0.8219 +INFO:local_logger:Epoch[022/800], Step[0600/0626], Avg Loss: 0.8220 +INFO:local_logger:Epoch[022/800], Step[0600/0626], Avg Loss: 0.8223 +INFO:local_logger:Epoch[022/800], Step[0600/0626], Avg Loss: 0.8226 +INFO:master_logger:Epoch[022/800], Step[0600/0626], Avg Loss: 0.8222 +INFO:local_logger:Epoch[022/800], Step[0600/0626], Avg Loss: 0.8228 +INFO:local_logger:Epoch[022/800], Step[0600/0626], Avg Loss: 0.8222 +INFO:local_logger:----- Epoch[022/800], Train Loss: 0.8221, time: 859.97 +INFO:master_logger:----- Epoch[022/800], Train Loss: 0.8221, time: 859.97 +INFO:local_logger:----- Epoch[022/800], Train Loss: 0.8221, time: 863.27 +INFO:local_logger:Now training epoch 23. LR=0.000087 +INFO:local_logger:----- Epoch[022/800], Train Loss: 0.8219, time: 864.02 +INFO:local_logger:Now training epoch 23. LR=0.000087 +INFO:local_logger:----- Epoch[022/800], Train Loss: 0.8219, time: 863.88 +INFO:local_logger:Now training epoch 23. LR=0.000087 +INFO:local_logger:----- Epoch[022/800], Train Loss: 0.8227, time: 863.94 +INFO:local_logger:Now training epoch 23. LR=0.000087 +INFO:local_logger:----- Epoch[022/800], Train Loss: 0.8220, time: 863.94 +INFO:local_logger:Now training epoch 23. LR=0.000087 +INFO:local_logger:----- Epoch[022/800], Train Loss: 0.8223, time: 863.94 +INFO:local_logger:Now training epoch 23. LR=0.000087 +INFO:local_logger:----- Epoch[022/800], Train Loss: 0.8219, time: 863.94 +INFO:local_logger:Now training epoch 23. LR=0.000087 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-22-Loss-0.8221496500572387.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-22-Loss-0.8221496500572387.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-22-Loss-0.8221496500572387.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-22-Loss-0.8221496500572387.pdopt +INFO:local_logger:Now training epoch 23. LR=0.000087 +INFO:master_logger:Now training epoch 23. 
LR=0.000087 +INFO:local_logger:Epoch[023/800], Step[0000/0626], Avg Loss: 0.8185 +INFO:local_logger:Epoch[023/800], Step[0000/0626], Avg Loss: 0.8079 +INFO:local_logger:Epoch[023/800], Step[0000/0626], Avg Loss: 0.8203 +INFO:local_logger:Epoch[023/800], Step[0000/0626], Avg Loss: 0.8216 +INFO:master_logger:Epoch[023/800], Step[0000/0626], Avg Loss: 0.8178 +INFO:local_logger:Epoch[023/800], Step[0000/0626], Avg Loss: 0.8182 +INFO:local_logger:Epoch[023/800], Step[0000/0626], Avg Loss: 0.8154 +INFO:local_logger:Epoch[023/800], Step[0000/0626], Avg Loss: 0.8235 +INFO:local_logger:Epoch[023/800], Step[0000/0626], Avg Loss: 0.8168 +INFO:local_logger:Epoch[023/800], Step[0100/0626], Avg Loss: 0.8184 +INFO:local_logger:Epoch[023/800], Step[0100/0626], Avg Loss: 0.8174 +INFO:local_logger:Epoch[023/800], Step[0100/0626], Avg Loss: 0.8173 +INFO:local_logger:Epoch[023/800], Step[0100/0626], Avg Loss: 0.8176 +INFO:local_logger:Epoch[023/800], Step[0100/0626], Avg Loss: 0.8182 +INFO:master_logger:Epoch[023/800], Step[0100/0626], Avg Loss: 0.8177 +INFO:local_logger:Epoch[023/800], Step[0100/0626], Avg Loss: 0.8171 +INFO:local_logger:Epoch[023/800], Step[0100/0626], Avg Loss: 0.8178 +INFO:local_logger:Epoch[023/800], Step[0100/0626], Avg Loss: 0.8173 +INFO:local_logger:Epoch[023/800], Step[0200/0626], Avg Loss: 0.8173 +INFO:local_logger:Epoch[023/800], Step[0200/0626], Avg Loss: 0.8172 +INFO:local_logger:Epoch[023/800], Step[0200/0626], Avg Loss: 0.8169 +INFO:local_logger:Epoch[023/800], Step[0200/0626], Avg Loss: 0.8168 +INFO:local_logger:Epoch[023/800], Step[0200/0626], Avg Loss: 0.8171 +INFO:local_logger:Epoch[023/800], Step[0200/0626], Avg Loss: 0.8171 +INFO:local_logger:Epoch[023/800], Step[0200/0626], Avg Loss: 0.8172 +INFO:local_logger:Epoch[023/800], Step[0200/0626], Avg Loss: 0.8171 +INFO:master_logger:Epoch[023/800], Step[0200/0626], Avg Loss: 0.8171 +INFO:local_logger:Epoch[023/800], Step[0300/0626], Avg Loss: 0.8166 +INFO:local_logger:Epoch[023/800], Step[0300/0626], Avg Loss: 0.8163 +INFO:local_logger:Epoch[023/800], Step[0300/0626], Avg Loss: 0.8166 +INFO:local_logger:Epoch[023/800], Step[0300/0626], Avg Loss: 0.8167 +INFO:local_logger:Epoch[023/800], Step[0300/0626], Avg Loss: 0.8164 +INFO:master_logger:Epoch[023/800], Step[0300/0626], Avg Loss: 0.8166 +INFO:local_logger:Epoch[023/800], Step[0300/0626], Avg Loss: 0.8170 +INFO:local_logger:Epoch[023/800], Step[0300/0626], Avg Loss: 0.8170 +INFO:local_logger:Epoch[023/800], Step[0300/0626], Avg Loss: 0.8166 +INFO:local_logger:Epoch[023/800], Step[0400/0626], Avg Loss: 0.8161 +INFO:local_logger:Epoch[023/800], Step[0400/0626], Avg Loss: 0.8159 +INFO:local_logger:Epoch[023/800], Step[0400/0626], Avg Loss: 0.8161 +INFO:local_logger:Epoch[023/800], Step[0400/0626], Avg Loss: 0.8159 +INFO:local_logger:Epoch[023/800], Step[0400/0626], Avg Loss: 0.8158 +INFO:local_logger:Epoch[023/800], Step[0400/0626], Avg Loss: 0.8165 +INFO:master_logger:Epoch[023/800], Step[0400/0626], Avg Loss: 0.8161 +INFO:local_logger:Epoch[023/800], Step[0400/0626], Avg Loss: 0.8163 +INFO:local_logger:Epoch[023/800], Step[0400/0626], Avg Loss: 0.8165 +INFO:local_logger:Epoch[023/800], Step[0500/0626], Avg Loss: 0.8157 +INFO:local_logger:Epoch[023/800], Step[0500/0626], Avg Loss: 0.8160 +INFO:local_logger:Epoch[023/800], Step[0500/0626], Avg Loss: 0.8161 +INFO:local_logger:Epoch[023/800], Step[0500/0626], Avg Loss: 0.8155 +INFO:master_logger:Epoch[023/800], Step[0500/0626], Avg Loss: 0.8157 +INFO:local_logger:Epoch[023/800], Step[0500/0626], Avg Loss: 0.8154 
+INFO:local_logger:Epoch[023/800], Step[0500/0626], Avg Loss: 0.8155 +INFO:local_logger:Epoch[023/800], Step[0500/0626], Avg Loss: 0.8156 +INFO:local_logger:Epoch[023/800], Step[0500/0626], Avg Loss: 0.8155 +INFO:local_logger:Epoch[023/800], Step[0600/0626], Avg Loss: 0.8151 +INFO:local_logger:Epoch[023/800], Step[0600/0626], Avg Loss: 0.8149 +INFO:local_logger:Epoch[023/800], Step[0600/0626], Avg Loss: 0.8154 +INFO:local_logger:Epoch[023/800], Step[0600/0626], Avg Loss: 0.8151 +INFO:local_logger:Epoch[023/800], Step[0600/0626], Avg Loss: 0.8152 +INFO:local_logger:Epoch[023/800], Step[0600/0626], Avg Loss: 0.8149 +INFO:master_logger:Epoch[023/800], Step[0600/0626], Avg Loss: 0.8151 +INFO:local_logger:Epoch[023/800], Step[0600/0626], Avg Loss: 0.8151 +INFO:local_logger:Epoch[023/800], Step[0600/0626], Avg Loss: 0.8153 +INFO:local_logger:----- Epoch[023/800], Train Loss: 0.8150, time: 884.65 +INFO:master_logger:----- Epoch[023/800], Train Loss: 0.8150, time: 884.65 +INFO:local_logger:----- Epoch[023/800], Train Loss: 0.8151, time: 888.50 +INFO:local_logger:Now training epoch 24. LR=0.000090 +INFO:local_logger:----- Epoch[023/800], Train Loss: 0.8152, time: 889.17 +INFO:local_logger:Now training epoch 24. LR=0.000090 +INFO:local_logger:----- Epoch[023/800], Train Loss: 0.8152, time: 888.59 +INFO:local_logger:Now training epoch 24. LR=0.000090 +INFO:local_logger:----- Epoch[023/800], Train Loss: 0.8148, time: 888.85 +INFO:local_logger:Now training epoch 24. LR=0.000090 +INFO:local_logger:----- Epoch[023/800], Train Loss: 0.8149, time: 888.51 +INFO:local_logger:Now training epoch 24. LR=0.000090 +INFO:local_logger:----- Epoch[023/800], Train Loss: 0.8149, time: 888.52 +INFO:local_logger:Now training epoch 24. LR=0.000090 +INFO:local_logger:----- Epoch[023/800], Train Loss: 0.8146, time: 888.53 +INFO:local_logger:Now training epoch 24. LR=0.000090 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-23-Loss-0.8150022067021212.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-23-Loss-0.8150022067021212.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-23-Loss-0.8150022067021212.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-23-Loss-0.8150022067021212.pdopt +INFO:local_logger:Now training epoch 24. LR=0.000090 +INFO:master_logger:Now training epoch 24. 
LR=0.000090 +INFO:local_logger:Epoch[024/800], Step[0000/0626], Avg Loss: 0.8150 +INFO:local_logger:Epoch[024/800], Step[0000/0626], Avg Loss: 0.8170 +INFO:local_logger:Epoch[024/800], Step[0000/0626], Avg Loss: 0.8108 +INFO:local_logger:Epoch[024/800], Step[0000/0626], Avg Loss: 0.8043 +INFO:local_logger:Epoch[024/800], Step[0000/0626], Avg Loss: 0.8090 +INFO:local_logger:Epoch[024/800], Step[0000/0626], Avg Loss: 0.8224 +INFO:local_logger:Epoch[024/800], Step[0000/0626], Avg Loss: 0.8215 +INFO:master_logger:Epoch[024/800], Step[0000/0626], Avg Loss: 0.8135 +INFO:local_logger:Epoch[024/800], Step[0000/0626], Avg Loss: 0.8082 +INFO:local_logger:Epoch[024/800], Step[0100/0626], Avg Loss: 0.8115 +INFO:local_logger:Epoch[024/800], Step[0100/0626], Avg Loss: 0.8115 +INFO:local_logger:Epoch[024/800], Step[0100/0626], Avg Loss: 0.8110 +INFO:local_logger:Epoch[024/800], Step[0100/0626], Avg Loss: 0.8112 +INFO:local_logger:Epoch[024/800], Step[0100/0626], Avg Loss: 0.8117 +INFO:local_logger:Epoch[024/800], Step[0100/0626], Avg Loss: 0.8124 +INFO:local_logger:Epoch[024/800], Step[0100/0626], Avg Loss: 0.8113 +INFO:master_logger:Epoch[024/800], Step[0100/0626], Avg Loss: 0.8114 +INFO:local_logger:Epoch[024/800], Step[0100/0626], Avg Loss: 0.8105 +INFO:local_logger:Epoch[024/800], Step[0200/0626], Avg Loss: 0.8104 +INFO:local_logger:Epoch[024/800], Step[0200/0626], Avg Loss: 0.8110 +INFO:local_logger:Epoch[024/800], Step[0200/0626], Avg Loss: 0.8111 +INFO:local_logger:Epoch[024/800], Step[0200/0626], Avg Loss: 0.8115 +INFO:local_logger:Epoch[024/800], Step[0200/0626], Avg Loss: 0.8114 +INFO:local_logger:Epoch[024/800], Step[0200/0626], Avg Loss: 0.8115 +INFO:local_logger:Epoch[024/800], Step[0200/0626], Avg Loss: 0.8108 +INFO:local_logger:Epoch[024/800], Step[0200/0626], Avg Loss: 0.8112 +INFO:master_logger:Epoch[024/800], Step[0200/0626], Avg Loss: 0.8111 +INFO:local_logger:Epoch[024/800], Step[0300/0626], Avg Loss: 0.8101 +INFO:local_logger:Epoch[024/800], Step[0300/0626], Avg Loss: 0.8100 +INFO:local_logger:Epoch[024/800], Step[0300/0626], Avg Loss: 0.8106 +INFO:local_logger:Epoch[024/800], Step[0300/0626], Avg Loss: 0.8106 +INFO:local_logger:Epoch[024/800], Step[0300/0626], Avg Loss: 0.8099 +INFO:local_logger:Epoch[024/800], Step[0300/0626], Avg Loss: 0.8100 +INFO:local_logger:Epoch[024/800], Step[0300/0626], Avg Loss: 0.8101 +INFO:local_logger:Epoch[024/800], Step[0300/0626], Avg Loss: 0.8107 +INFO:master_logger:Epoch[024/800], Step[0300/0626], Avg Loss: 0.8103 +INFO:local_logger:Epoch[024/800], Step[0400/0626], Avg Loss: 0.8096 +INFO:local_logger:Epoch[024/800], Step[0400/0626], Avg Loss: 0.8096 +INFO:local_logger:Epoch[024/800], Step[0400/0626], Avg Loss: 0.8096 +INFO:local_logger:Epoch[024/800], Step[0400/0626], Avg Loss: 0.8099 +INFO:local_logger:Epoch[024/800], Step[0400/0626], Avg Loss: 0.8098 +INFO:local_logger:Epoch[024/800], Step[0400/0626], Avg Loss: 0.8097 +INFO:local_logger:Epoch[024/800], Step[0400/0626], Avg Loss: 0.8101 +INFO:master_logger:Epoch[024/800], Step[0400/0626], Avg Loss: 0.8098 +INFO:local_logger:Epoch[024/800], Step[0400/0626], Avg Loss: 0.8103 +INFO:local_logger:Epoch[024/800], Step[0500/0626], Avg Loss: 0.8089 +INFO:local_logger:Epoch[024/800], Step[0500/0626], Avg Loss: 0.8095 +INFO:local_logger:Epoch[024/800], Step[0500/0626], Avg Loss: 0.8094 +INFO:local_logger:Epoch[024/800], Step[0500/0626], Avg Loss: 0.8090 +INFO:local_logger:Epoch[024/800], Step[0500/0626], Avg Loss: 0.8094 +INFO:local_logger:Epoch[024/800], Step[0500/0626], Avg Loss: 0.8090 
+INFO:local_logger:Epoch[024/800], Step[0500/0626], Avg Loss: 0.8091 +INFO:master_logger:Epoch[024/800], Step[0500/0626], Avg Loss: 0.8092 +INFO:local_logger:Epoch[024/800], Step[0500/0626], Avg Loss: 0.8090 +INFO:local_logger:Epoch[024/800], Step[0600/0626], Avg Loss: 0.8088 +INFO:local_logger:Epoch[024/800], Step[0600/0626], Avg Loss: 0.8088 +INFO:local_logger:Epoch[024/800], Step[0600/0626], Avg Loss: 0.8093 +INFO:local_logger:Epoch[024/800], Step[0600/0626], Avg Loss: 0.8088 +INFO:local_logger:Epoch[024/800], Step[0600/0626], Avg Loss: 0.8087 +INFO:local_logger:Epoch[024/800], Step[0600/0626], Avg Loss: 0.8091 +INFO:local_logger:Epoch[024/800], Step[0600/0626], Avg Loss: 0.8091 +INFO:master_logger:Epoch[024/800], Step[0600/0626], Avg Loss: 0.8089 +INFO:local_logger:Epoch[024/800], Step[0600/0626], Avg Loss: 0.8088 +INFO:local_logger:----- Epoch[024/800], Train Loss: 0.8087, time: 870.26 +INFO:local_logger:Now training epoch 25. LR=0.000094 +INFO:local_logger:----- Epoch[024/800], Train Loss: 0.8084, time: 870.28 +INFO:local_logger:Now training epoch 25. LR=0.000094 +INFO:local_logger:----- Epoch[024/800], Train Loss: 0.8092, time: 867.64 +INFO:master_logger:----- Epoch[024/800], Train Loss: 0.8088, time: 867.64 +INFO:local_logger:----- Epoch[024/800], Train Loss: 0.8090, time: 870.78 +INFO:local_logger:Now training epoch 25. LR=0.000094 +INFO:local_logger:----- Epoch[024/800], Train Loss: 0.8089, time: 870.77 +INFO:local_logger:Now training epoch 25. LR=0.000094 +INFO:local_logger:----- Epoch[024/800], Train Loss: 0.8086, time: 870.75 +INFO:local_logger:Now training epoch 25. LR=0.000094 +INFO:local_logger:----- Epoch[024/800], Train Loss: 0.8086, time: 870.78 +INFO:local_logger:----- Epoch[024/800], Train Loss: 0.8087, time: 870.75 +INFO:local_logger:Now training epoch 25. LR=0.000094 +INFO:local_logger:Now training epoch 25. LR=0.000094 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-24-Loss-0.8091736378081739.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-24-Loss-0.8091736378081739.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-24-Loss-0.8091736378081739.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-24-Loss-0.8091736378081739.pdopt +INFO:local_logger:Now training epoch 25. LR=0.000094 +INFO:master_logger:Now training epoch 25. 
LR=0.000094 +INFO:local_logger:Epoch[025/800], Step[0000/0626], Avg Loss: 0.8030 +INFO:local_logger:Epoch[025/800], Step[0000/0626], Avg Loss: 0.8073 +INFO:local_logger:Epoch[025/800], Step[0000/0626], Avg Loss: 0.7969 +INFO:master_logger:Epoch[025/800], Step[0000/0626], Avg Loss: 0.8051 +INFO:local_logger:Epoch[025/800], Step[0000/0626], Avg Loss: 0.8085 +INFO:local_logger:Epoch[025/800], Step[0000/0626], Avg Loss: 0.8000 +INFO:local_logger:Epoch[025/800], Step[0000/0626], Avg Loss: 0.8034 +INFO:local_logger:Epoch[025/800], Step[0000/0626], Avg Loss: 0.8158 +INFO:local_logger:Epoch[025/800], Step[0000/0626], Avg Loss: 0.8061 +INFO:local_logger:Epoch[025/800], Step[0100/0626], Avg Loss: 0.8055 +INFO:local_logger:Epoch[025/800], Step[0100/0626], Avg Loss: 0.8046 +INFO:local_logger:Epoch[025/800], Step[0100/0626], Avg Loss: 0.8059 +INFO:local_logger:Epoch[025/800], Step[0100/0626], Avg Loss: 0.8052 +INFO:local_logger:Epoch[025/800], Step[0100/0626], Avg Loss: 0.8053 +INFO:local_logger:Epoch[025/800], Step[0100/0626], Avg Loss: 0.8054 +INFO:local_logger:Epoch[025/800], Step[0100/0626], Avg Loss: 0.8052 +INFO:local_logger:Epoch[025/800], Step[0100/0626], Avg Loss: 0.8057 +INFO:master_logger:Epoch[025/800], Step[0100/0626], Avg Loss: 0.8053 +INFO:local_logger:Epoch[025/800], Step[0200/0626], Avg Loss: 0.8049 +INFO:local_logger:Epoch[025/800], Step[0200/0626], Avg Loss: 0.8050 +INFO:local_logger:Epoch[025/800], Step[0200/0626], Avg Loss: 0.8051 +INFO:local_logger:Epoch[025/800], Step[0200/0626], Avg Loss: 0.8047 +INFO:local_logger:Epoch[025/800], Step[0200/0626], Avg Loss: 0.8054 +INFO:local_logger:Epoch[025/800], Step[0200/0626], Avg Loss: 0.8050 +INFO:local_logger:Epoch[025/800], Step[0200/0626], Avg Loss: 0.8057 +INFO:master_logger:Epoch[025/800], Step[0200/0626], Avg Loss: 0.8051 +INFO:local_logger:Epoch[025/800], Step[0200/0626], Avg Loss: 0.8051 +INFO:local_logger:Epoch[025/800], Step[0300/0626], Avg Loss: 0.8052 +INFO:local_logger:Epoch[025/800], Step[0300/0626], Avg Loss: 0.8049 +INFO:local_logger:Epoch[025/800], Step[0300/0626], Avg Loss: 0.8046 +INFO:local_logger:Epoch[025/800], Step[0300/0626], Avg Loss: 0.8044 +INFO:local_logger:Epoch[025/800], Step[0300/0626], Avg Loss: 0.8048 +INFO:local_logger:Epoch[025/800], Step[0300/0626], Avg Loss: 0.8046 +INFO:master_logger:Epoch[025/800], Step[0300/0626], Avg Loss: 0.8047 +INFO:local_logger:Epoch[025/800], Step[0300/0626], Avg Loss: 0.8044 +INFO:local_logger:Epoch[025/800], Step[0300/0626], Avg Loss: 0.8044 +INFO:local_logger:Epoch[025/800], Step[0400/0626], Avg Loss: 0.8043 +INFO:master_logger:Epoch[025/800], Step[0400/0626], Avg Loss: 0.8042 +INFO:local_logger:Epoch[025/800], Step[0400/0626], Avg Loss: 0.8042 +INFO:local_logger:Epoch[025/800], Step[0400/0626], Avg Loss: 0.8043 +INFO:local_logger:Epoch[025/800], Step[0400/0626], Avg Loss: 0.8040 +INFO:local_logger:Epoch[025/800], Step[0400/0626], Avg Loss: 0.8045 +INFO:local_logger:Epoch[025/800], Step[0400/0626], Avg Loss: 0.8038 +INFO:local_logger:Epoch[025/800], Step[0400/0626], Avg Loss: 0.8043 +INFO:local_logger:Epoch[025/800], Step[0400/0626], Avg Loss: 0.8043 +INFO:local_logger:Epoch[025/800], Step[0500/0626], Avg Loss: 0.8040 +INFO:local_logger:Epoch[025/800], Step[0500/0626], Avg Loss: 0.8039 +INFO:local_logger:Epoch[025/800], Step[0500/0626], Avg Loss: 0.8032 +INFO:local_logger:Epoch[025/800], Step[0500/0626], Avg Loss: 0.8034 +INFO:master_logger:Epoch[025/800], Step[0500/0626], Avg Loss: 0.8036 +INFO:local_logger:Epoch[025/800], Step[0500/0626], Avg Loss: 0.8037 
+[Log reformatted for readability: the duplicated per-GPU local_logger entries and intermediate per-step running averages for epochs 025-037 are omitted here; the master_logger per-epoch summaries are kept verbatim, plus the running averages for the still-in-progress epoch 037.]
+INFO:master_logger:----- Epoch[025/800], Train Loss: 0.8032, time: 885.88
+INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-25-Loss-0.8034991228641365.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-25-Loss-0.8034991228641365.pdopt
+INFO:master_logger:Now training epoch 26. LR=0.000098
+INFO:master_logger:----- Epoch[026/800], Train Loss: 0.7962, time: 867.96
+INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-26-Loss-0.796081899062454.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-26-Loss-0.796081899062454.pdopt
+INFO:master_logger:Now training epoch 27. LR=0.000102
+INFO:master_logger:----- Epoch[027/800], Train Loss: 0.7893, time: 875.73
+INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-27-Loss-0.7894227700390196.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-27-Loss-0.7894227700390196.pdopt
+INFO:master_logger:Now training epoch 28. LR=0.000105
+INFO:master_logger:----- Epoch[028/800], Train Loss: 0.7835, time: 868.22
+INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-28-Loss-0.783455797444855.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-28-Loss-0.783455797444855.pdopt
+INFO:master_logger:Now training epoch 29. LR=0.000109
+INFO:master_logger:----- Epoch[029/800], Train Loss: 0.7776, time: 865.73
+INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-29-Loss-0.77762673581754.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-29-Loss-0.77762673581754.pdopt
+INFO:master_logger:Now training epoch 30. LR=0.000113
+INFO:master_logger:----- Epoch[030/800], Train Loss: 0.7721, time: 881.45
+INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-30-Loss-0.7718740027551073.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-30-Loss-0.7718740027551073.pdopt
+INFO:master_logger:Now training epoch 31. LR=0.000116
+INFO:master_logger:----- Epoch[031/800], Train Loss: 0.7665, time: 864.87
+INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-31-Loss-0.7666413477179265.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-31-Loss-0.7666413477179265.pdopt
+INFO:master_logger:Now training epoch 32. LR=0.000120
+INFO:master_logger:----- Epoch[032/800], Train Loss: 0.7612, time: 874.33
+INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-32-Loss-0.761453199911086.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-32-Loss-0.761453199911086.pdopt
+INFO:master_logger:Now training epoch 33. LR=0.000124
+INFO:master_logger:----- Epoch[033/800], Train Loss: 0.7564, time: 852.50
+INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-33-Loss-0.7561021761674307.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-33-Loss-0.7561021761674307.pdopt
+INFO:master_logger:Now training epoch 34. LR=0.000128
+INFO:master_logger:----- Epoch[034/800], Train Loss: 0.7519, time: 882.34
+INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-34-Loss-0.7516406552539668.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-34-Loss-0.7516406552539668.pdopt
+INFO:master_logger:Now training epoch 35. LR=0.000131
+INFO:master_logger:----- Epoch[035/800], Train Loss: 0.7480, time: 855.75
+INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-35-Loss-0.7477136553201034.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-35-Loss-0.7477136553201034.pdopt
+INFO:master_logger:Now training epoch 36. LR=0.000135
+INFO:master_logger:----- Epoch[036/800], Train Loss: 0.7447, time: 887.41
+INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-36-Loss-0.7446498155174043.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-36-Loss-0.7446498155174043.pdopt
+INFO:master_logger:Now training epoch 37. LR=0.000139
+INFO:master_logger:Epoch[037/800], Step[0000/0626], Avg Loss: 0.7456
+INFO:master_logger:Epoch[037/800], Step[0100/0626], Avg Loss: 0.7426
+INFO:master_logger:Epoch[037/800], Step[0200/0626], Avg Loss: 0.7428
+INFO:master_logger:Epoch[037/800], Step[0300/0626], Avg Loss: 0.7423
+INFO:master_logger:Epoch[037/800], Step[0400/0626], Avg Loss: 0.7419
+INFO:local_logger:Epoch[037/800], Step[0500/0626], Avg Loss: 0.7415 +INFO:local_logger:Epoch[037/800], Step[0500/0626], Avg Loss: 0.7418 +INFO:master_logger:Epoch[037/800], Step[0500/0626], Avg Loss: 0.7417 +INFO:local_logger:Epoch[037/800], Step[0600/0626], Avg Loss: 0.7410 +INFO:local_logger:Epoch[037/800], Step[0600/0626], Avg Loss: 0.7416 +INFO:local_logger:Epoch[037/800], Step[0600/0626], Avg Loss: 0.7416 +INFO:local_logger:Epoch[037/800], Step[0600/0626], Avg Loss: 0.7417 +INFO:local_logger:Epoch[037/800], Step[0600/0626], Avg Loss: 0.7413 +INFO:local_logger:Epoch[037/800], Step[0600/0626], Avg Loss: 0.7413 +INFO:local_logger:Epoch[037/800], Step[0600/0626], Avg Loss: 0.7412 +INFO:master_logger:Epoch[037/800], Step[0600/0626], Avg Loss: 0.7414 +INFO:local_logger:Epoch[037/800], Step[0600/0626], Avg Loss: 0.7414 +INFO:local_logger:----- Epoch[037/800], Train Loss: 0.7415, time: 861.30 +INFO:local_logger:Now training epoch 38. LR=0.000143 +INFO:local_logger:----- Epoch[037/800], Train Loss: 0.7415, time: 861.30 +INFO:local_logger:Now training epoch 38. LR=0.000143 +INFO:local_logger:----- Epoch[037/800], Train Loss: 0.7414, time: 861.29 +INFO:local_logger:Now training epoch 38. LR=0.000143 +INFO:local_logger:----- Epoch[037/800], Train Loss: 0.7412, time: 861.53 +INFO:local_logger:Now training epoch 38. LR=0.000143 +INFO:local_logger:----- Epoch[037/800], Train Loss: 0.7412, time: 862.22 +INFO:local_logger:Now training epoch 38. LR=0.000143 +INFO:local_logger:----- Epoch[037/800], Train Loss: 0.7411, time: 861.67 +INFO:local_logger:Now training epoch 38. LR=0.000143 +INFO:local_logger:----- Epoch[037/800], Train Loss: 0.7410, time: 861.66 +INFO:local_logger:Now training epoch 38. LR=0.000143 +INFO:local_logger:----- Epoch[037/800], Train Loss: 0.7415, time: 858.01 +INFO:master_logger:----- Epoch[037/800], Train Loss: 0.7413, time: 858.01 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-37-Loss-0.7415279359559235.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-37-Loss-0.7415279359559235.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-37-Loss-0.7415279359559235.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-37-Loss-0.7415279359559235.pdopt +INFO:local_logger:Now training epoch 38. LR=0.000143 +INFO:master_logger:Now training epoch 38. 
LR=0.000143 +INFO:local_logger:Epoch[038/800], Step[0000/0626], Avg Loss: 0.7326 +INFO:local_logger:Epoch[038/800], Step[0000/0626], Avg Loss: 0.7470 +INFO:master_logger:Epoch[038/800], Step[0000/0626], Avg Loss: 0.7394 +INFO:local_logger:Epoch[038/800], Step[0000/0626], Avg Loss: 0.7295 +INFO:local_logger:Epoch[038/800], Step[0000/0626], Avg Loss: 0.7452 +INFO:local_logger:Epoch[038/800], Step[0000/0626], Avg Loss: 0.7429 +INFO:local_logger:Epoch[038/800], Step[0000/0626], Avg Loss: 0.7354 +INFO:local_logger:Epoch[038/800], Step[0000/0626], Avg Loss: 0.7382 +INFO:local_logger:Epoch[038/800], Step[0000/0626], Avg Loss: 0.7443 +INFO:local_logger:Epoch[038/800], Step[0100/0626], Avg Loss: 0.7383 +INFO:local_logger:Epoch[038/800], Step[0100/0626], Avg Loss: 0.7398 +INFO:local_logger:Epoch[038/800], Step[0100/0626], Avg Loss: 0.7393 +INFO:local_logger:Epoch[038/800], Step[0100/0626], Avg Loss: 0.7389 +INFO:local_logger:Epoch[038/800], Step[0100/0626], Avg Loss: 0.7386 +INFO:local_logger:Epoch[038/800], Step[0100/0626], Avg Loss: 0.7387 +INFO:master_logger:Epoch[038/800], Step[0100/0626], Avg Loss: 0.7392 +INFO:local_logger:Epoch[038/800], Step[0100/0626], Avg Loss: 0.7411 +INFO:local_logger:Epoch[038/800], Step[0100/0626], Avg Loss: 0.7392 +INFO:local_logger:Epoch[038/800], Step[0200/0626], Avg Loss: 0.7388 +INFO:local_logger:Epoch[038/800], Step[0200/0626], Avg Loss: 0.7388 +INFO:local_logger:Epoch[038/800], Step[0200/0626], Avg Loss: 0.7386 +INFO:local_logger:Epoch[038/800], Step[0200/0626], Avg Loss: 0.7386 +INFO:local_logger:Epoch[038/800], Step[0200/0626], Avg Loss: 0.7399 +INFO:local_logger:Epoch[038/800], Step[0200/0626], Avg Loss: 0.7386 +INFO:local_logger:Epoch[038/800], Step[0200/0626], Avg Loss: 0.7391 +INFO:master_logger:Epoch[038/800], Step[0200/0626], Avg Loss: 0.7389 +INFO:local_logger:Epoch[038/800], Step[0200/0626], Avg Loss: 0.7389 +INFO:local_logger:Epoch[038/800], Step[0300/0626], Avg Loss: 0.7388 +INFO:local_logger:Epoch[038/800], Step[0300/0626], Avg Loss: 0.7390 +INFO:local_logger:Epoch[038/800], Step[0300/0626], Avg Loss: 0.7388 +INFO:local_logger:Epoch[038/800], Step[0300/0626], Avg Loss: 0.7386 +INFO:local_logger:Epoch[038/800], Step[0300/0626], Avg Loss: 0.7388 +INFO:local_logger:Epoch[038/800], Step[0300/0626], Avg Loss: 0.7384 +INFO:local_logger:Epoch[038/800], Step[0300/0626], Avg Loss: 0.7385 +INFO:master_logger:Epoch[038/800], Step[0300/0626], Avg Loss: 0.7387 +INFO:local_logger:Epoch[038/800], Step[0300/0626], Avg Loss: 0.7389 +INFO:local_logger:Epoch[038/800], Step[0400/0626], Avg Loss: 0.7382 +INFO:local_logger:Epoch[038/800], Step[0400/0626], Avg Loss: 0.7388 +INFO:local_logger:Epoch[038/800], Step[0400/0626], Avg Loss: 0.7390 +INFO:local_logger:Epoch[038/800], Step[0400/0626], Avg Loss: 0.7387 +INFO:local_logger:Epoch[038/800], Step[0400/0626], Avg Loss: 0.7388 +INFO:master_logger:Epoch[038/800], Step[0400/0626], Avg Loss: 0.7387 +INFO:local_logger:Epoch[038/800], Step[0400/0626], Avg Loss: 0.7384 +INFO:local_logger:Epoch[038/800], Step[0400/0626], Avg Loss: 0.7386 +INFO:local_logger:Epoch[038/800], Step[0400/0626], Avg Loss: 0.7388 +INFO:local_logger:Epoch[038/800], Step[0500/0626], Avg Loss: 0.7383 +INFO:local_logger:Epoch[038/800], Step[0500/0626], Avg Loss: 0.7385 +INFO:local_logger:Epoch[038/800], Step[0500/0626], Avg Loss: 0.7385 +INFO:local_logger:Epoch[038/800], Step[0500/0626], Avg Loss: 0.7385 +INFO:local_logger:Epoch[038/800], Step[0500/0626], Avg Loss: 0.7390 +INFO:local_logger:Epoch[038/800], Step[0500/0626], Avg Loss: 0.7377 
+INFO:local_logger:Epoch[038/800], Step[0500/0626], Avg Loss: 0.7385 +INFO:master_logger:Epoch[038/800], Step[0500/0626], Avg Loss: 0.7385 +INFO:local_logger:Epoch[038/800], Step[0500/0626], Avg Loss: 0.7387 +INFO:local_logger:Epoch[038/800], Step[0600/0626], Avg Loss: 0.7378 +INFO:local_logger:Epoch[038/800], Step[0600/0626], Avg Loss: 0.7382 +INFO:local_logger:Epoch[038/800], Step[0600/0626], Avg Loss: 0.7389 +INFO:local_logger:Epoch[038/800], Step[0600/0626], Avg Loss: 0.7381 +INFO:local_logger:Epoch[038/800], Step[0600/0626], Avg Loss: 0.7381 +INFO:local_logger:Epoch[038/800], Step[0600/0626], Avg Loss: 0.7381 +INFO:master_logger:Epoch[038/800], Step[0600/0626], Avg Loss: 0.7383 +INFO:local_logger:Epoch[038/800], Step[0600/0626], Avg Loss: 0.7381 +INFO:local_logger:Epoch[038/800], Step[0600/0626], Avg Loss: 0.7386 +INFO:local_logger:----- Epoch[038/800], Train Loss: 0.7381, time: 895.61 +INFO:local_logger:Now training epoch 39. LR=0.000146 +INFO:local_logger:----- Epoch[038/800], Train Loss: 0.7378, time: 895.58 +INFO:local_logger:Now training epoch 39. LR=0.000146 +INFO:local_logger:----- Epoch[038/800], Train Loss: 0.7380, time: 895.99 +INFO:local_logger:Now training epoch 39. LR=0.000146 +INFO:local_logger:----- Epoch[038/800], Train Loss: 0.7381, time: 896.21 +INFO:local_logger:Now training epoch 39. LR=0.000146 +INFO:local_logger:----- Epoch[038/800], Train Loss: 0.7381, time: 892.19 +INFO:master_logger:----- Epoch[038/800], Train Loss: 0.7382, time: 892.19 +INFO:local_logger:----- Epoch[038/800], Train Loss: 0.7385, time: 895.95 +INFO:local_logger:Now training epoch 39. LR=0.000146 +INFO:local_logger:----- Epoch[038/800], Train Loss: 0.7388, time: 896.06 +INFO:local_logger:Now training epoch 39. LR=0.000146 +INFO:local_logger:----- Epoch[038/800], Train Loss: 0.7381, time: 895.94 +INFO:local_logger:Now training epoch 39. LR=0.000146 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-38-Loss-0.7380608445157859.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-38-Loss-0.7380608445157859.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-38-Loss-0.7380608445157859.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-38-Loss-0.7380608445157859.pdopt +INFO:local_logger:Now training epoch 39. LR=0.000146 +INFO:master_logger:Now training epoch 39. 
LR=0.000146 +INFO:local_logger:Epoch[039/800], Step[0000/0626], Avg Loss: 0.7392 +INFO:master_logger:Epoch[039/800], Step[0000/0626], Avg Loss: 0.7382 +INFO:local_logger:Epoch[039/800], Step[0000/0626], Avg Loss: 0.7373 +INFO:local_logger:Epoch[039/800], Step[0000/0626], Avg Loss: 0.7385 +INFO:local_logger:Epoch[039/800], Step[0000/0626], Avg Loss: 0.7375 +INFO:local_logger:Epoch[039/800], Step[0000/0626], Avg Loss: 0.7414 +INFO:local_logger:Epoch[039/800], Step[0000/0626], Avg Loss: 0.7378 +INFO:local_logger:Epoch[039/800], Step[0000/0626], Avg Loss: 0.7312 +INFO:local_logger:Epoch[039/800], Step[0000/0626], Avg Loss: 0.7426 +INFO:local_logger:Epoch[039/800], Step[0100/0626], Avg Loss: 0.7358 +INFO:local_logger:Epoch[039/800], Step[0100/0626], Avg Loss: 0.7363 +INFO:local_logger:Epoch[039/800], Step[0100/0626], Avg Loss: 0.7362 +INFO:local_logger:Epoch[039/800], Step[0100/0626], Avg Loss: 0.7361 +INFO:local_logger:Epoch[039/800], Step[0100/0626], Avg Loss: 0.7359 +INFO:local_logger:Epoch[039/800], Step[0100/0626], Avg Loss: 0.7369 +INFO:local_logger:Epoch[039/800], Step[0100/0626], Avg Loss: 0.7365 +INFO:master_logger:Epoch[039/800], Step[0100/0626], Avg Loss: 0.7363 +INFO:local_logger:Epoch[039/800], Step[0100/0626], Avg Loss: 0.7366 +INFO:local_logger:Epoch[039/800], Step[0200/0626], Avg Loss: 0.7366 +INFO:local_logger:Epoch[039/800], Step[0200/0626], Avg Loss: 0.7359 +INFO:local_logger:Epoch[039/800], Step[0200/0626], Avg Loss: 0.7371 +INFO:local_logger:Epoch[039/800], Step[0200/0626], Avg Loss: 0.7356 +INFO:local_logger:Epoch[039/800], Step[0200/0626], Avg Loss: 0.7361 +INFO:master_logger:Epoch[039/800], Step[0200/0626], Avg Loss: 0.7362 +INFO:local_logger:Epoch[039/800], Step[0200/0626], Avg Loss: 0.7363 +INFO:local_logger:Epoch[039/800], Step[0200/0626], Avg Loss: 0.7364 +INFO:local_logger:Epoch[039/800], Step[0200/0626], Avg Loss: 0.7360 +INFO:local_logger:Epoch[039/800], Step[0300/0626], Avg Loss: 0.7354 +INFO:local_logger:Epoch[039/800], Step[0300/0626], Avg Loss: 0.7356 +INFO:local_logger:Epoch[039/800], Step[0300/0626], Avg Loss: 0.7368 +INFO:local_logger:Epoch[039/800], Step[0300/0626], Avg Loss: 0.7364 +INFO:local_logger:Epoch[039/800], Step[0300/0626], Avg Loss: 0.7363 +INFO:master_logger:Epoch[039/800], Step[0300/0626], Avg Loss: 0.7360 +INFO:local_logger:Epoch[039/800], Step[0300/0626], Avg Loss: 0.7357 +INFO:local_logger:Epoch[039/800], Step[0300/0626], Avg Loss: 0.7359 +INFO:local_logger:Epoch[039/800], Step[0300/0626], Avg Loss: 0.7360 +INFO:local_logger:Epoch[039/800], Step[0400/0626], Avg Loss: 0.7359 +INFO:local_logger:Epoch[039/800], Step[0400/0626], Avg Loss: 0.7357 +INFO:local_logger:Epoch[039/800], Step[0400/0626], Avg Loss: 0.7355 +INFO:local_logger:Epoch[039/800], Step[0400/0626], Avg Loss: 0.7359 +INFO:local_logger:Epoch[039/800], Step[0400/0626], Avg Loss: 0.7359 +INFO:master_logger:Epoch[039/800], Step[0400/0626], Avg Loss: 0.7358 +INFO:local_logger:Epoch[039/800], Step[0400/0626], Avg Loss: 0.7365 +INFO:local_logger:Epoch[039/800], Step[0400/0626], Avg Loss: 0.7355 +INFO:local_logger:Epoch[039/800], Step[0400/0626], Avg Loss: 0.7356 +INFO:local_logger:Epoch[039/800], Step[0500/0626], Avg Loss: 0.7352 +INFO:local_logger:Epoch[039/800], Step[0500/0626], Avg Loss: 0.7352 +INFO:local_logger:Epoch[039/800], Step[0500/0626], Avg Loss: 0.7356 +INFO:local_logger:Epoch[039/800], Step[0500/0626], Avg Loss: 0.7355 +INFO:local_logger:Epoch[039/800], Step[0500/0626], Avg Loss: 0.7358 +INFO:local_logger:Epoch[039/800], Step[0500/0626], Avg Loss: 0.7352 
+INFO:local_logger:Epoch[039/800], Step[0500/0626], Avg Loss: 0.7356 +INFO:master_logger:Epoch[039/800], Step[0500/0626], Avg Loss: 0.7354 +INFO:local_logger:Epoch[039/800], Step[0500/0626], Avg Loss: 0.7353 +INFO:local_logger:Epoch[039/800], Step[0600/0626], Avg Loss: 0.7351 +INFO:local_logger:Epoch[039/800], Step[0600/0626], Avg Loss: 0.7351 +INFO:local_logger:Epoch[039/800], Step[0600/0626], Avg Loss: 0.7354 +INFO:local_logger:Epoch[039/800], Step[0600/0626], Avg Loss: 0.7353 +INFO:local_logger:Epoch[039/800], Step[0600/0626], Avg Loss: 0.7351 +INFO:local_logger:Epoch[039/800], Step[0600/0626], Avg Loss: 0.7356 +INFO:local_logger:Epoch[039/800], Step[0600/0626], Avg Loss: 0.7356 +INFO:master_logger:Epoch[039/800], Step[0600/0626], Avg Loss: 0.7354 +INFO:local_logger:Epoch[039/800], Step[0600/0626], Avg Loss: 0.7355 +INFO:local_logger:----- Epoch[039/800], Train Loss: 0.7351, time: 865.25 +INFO:local_logger:----- Epoch[039/800], Train Loss: 0.7350, time: 864.90 +INFO:local_logger:Now training epoch 40. LR=0.000150 +INFO:local_logger:Now training epoch 40. LR=0.000150 +INFO:local_logger:----- Epoch[039/800], Train Loss: 0.7353, time: 860.87 +INFO:master_logger:----- Epoch[039/800], Train Loss: 0.7353, time: 860.87 +INFO:local_logger:----- Epoch[039/800], Train Loss: 0.7352, time: 864.70 +INFO:local_logger:Now training epoch 40. LR=0.000150 +INFO:local_logger:----- Epoch[039/800], Train Loss: 0.7355, time: 864.66 +INFO:local_logger:Now training epoch 40. LR=0.000150 +INFO:local_logger:----- Epoch[039/800], Train Loss: 0.7355, time: 864.70 +INFO:local_logger:Now training epoch 40. LR=0.000150 +INFO:local_logger:----- Epoch[039/800], Train Loss: 0.7356, time: 864.70 +INFO:local_logger:Now training epoch 40. LR=0.000150 +INFO:local_logger:----- Epoch[039/800], Train Loss: 0.7351, time: 865.02 +INFO:local_logger:Now training epoch 40. LR=0.000150 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-39-Loss-0.7353187804344304.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-39-Loss-0.7353187804344304.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-39-Loss-0.7353187804344304.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-39-Loss-0.7353187804344304.pdopt +INFO:local_logger:Now training epoch 40. LR=0.000150 +INFO:master_logger:Now training epoch 40. 
LR=0.000150 +INFO:local_logger:Epoch[040/800], Step[0000/0626], Avg Loss: 0.7343 +INFO:local_logger:Epoch[040/800], Step[0000/0626], Avg Loss: 0.7310 +INFO:local_logger:Epoch[040/800], Step[0000/0626], Avg Loss: 0.7444 +INFO:local_logger:Epoch[040/800], Step[0000/0626], Avg Loss: 0.7305 +INFO:master_logger:Epoch[040/800], Step[0000/0626], Avg Loss: 0.7355 +INFO:local_logger:Epoch[040/800], Step[0000/0626], Avg Loss: 0.7357 +INFO:local_logger:Epoch[040/800], Step[0000/0626], Avg Loss: 0.7310 +INFO:local_logger:Epoch[040/800], Step[0000/0626], Avg Loss: 0.7439 +INFO:local_logger:Epoch[040/800], Step[0000/0626], Avg Loss: 0.7330 +INFO:local_logger:Epoch[040/800], Step[0100/0626], Avg Loss: 0.7348 +INFO:local_logger:Epoch[040/800], Step[0100/0626], Avg Loss: 0.7344 +INFO:local_logger:Epoch[040/800], Step[0100/0626], Avg Loss: 0.7346 +INFO:local_logger:Epoch[040/800], Step[0100/0626], Avg Loss: 0.7348 +INFO:master_logger:Epoch[040/800], Step[0100/0626], Avg Loss: 0.7341 +INFO:local_logger:Epoch[040/800], Step[0100/0626], Avg Loss: 0.7345 +INFO:local_logger:Epoch[040/800], Step[0100/0626], Avg Loss: 0.7336 +INFO:local_logger:Epoch[040/800], Step[0100/0626], Avg Loss: 0.7331 +INFO:local_logger:Epoch[040/800], Step[0100/0626], Avg Loss: 0.7333 +INFO:local_logger:Epoch[040/800], Step[0200/0626], Avg Loss: 0.7340 +INFO:local_logger:Epoch[040/800], Step[0200/0626], Avg Loss: 0.7343 +INFO:local_logger:Epoch[040/800], Step[0200/0626], Avg Loss: 0.7335 +INFO:local_logger:Epoch[040/800], Step[0200/0626], Avg Loss: 0.7340 +INFO:local_logger:Epoch[040/800], Step[0200/0626], Avg Loss: 0.7335 +INFO:local_logger:Epoch[040/800], Step[0200/0626], Avg Loss: 0.7334 +INFO:local_logger:Epoch[040/800], Step[0200/0626], Avg Loss: 0.7326 +INFO:master_logger:Epoch[040/800], Step[0200/0626], Avg Loss: 0.7336 +INFO:local_logger:Epoch[040/800], Step[0200/0626], Avg Loss: 0.7339 +INFO:local_logger:Epoch[040/800], Step[0300/0626], Avg Loss: 0.7338 +INFO:local_logger:Epoch[040/800], Step[0300/0626], Avg Loss: 0.7336 +INFO:local_logger:Epoch[040/800], Step[0300/0626], Avg Loss: 0.7331 +INFO:local_logger:Epoch[040/800], Step[0300/0626], Avg Loss: 0.7338 +INFO:local_logger:Epoch[040/800], Step[0300/0626], Avg Loss: 0.7332 +INFO:master_logger:Epoch[040/800], Step[0300/0626], Avg Loss: 0.7334 +INFO:local_logger:Epoch[040/800], Step[0300/0626], Avg Loss: 0.7329 +INFO:local_logger:Epoch[040/800], Step[0300/0626], Avg Loss: 0.7334 +INFO:local_logger:Epoch[040/800], Step[0300/0626], Avg Loss: 0.7338 +INFO:local_logger:Epoch[040/800], Step[0400/0626], Avg Loss: 0.7335 +INFO:local_logger:Epoch[040/800], Step[0400/0626], Avg Loss: 0.7335 +INFO:local_logger:Epoch[040/800], Step[0400/0626], Avg Loss: 0.7336 +INFO:local_logger:Epoch[040/800], Step[0400/0626], Avg Loss: 0.7334 +INFO:local_logger:Epoch[040/800], Step[0400/0626], Avg Loss: 0.7328 +INFO:local_logger:Epoch[040/800], Step[0400/0626], Avg Loss: 0.7332 +INFO:master_logger:Epoch[040/800], Step[0400/0626], Avg Loss: 0.7333 +INFO:local_logger:Epoch[040/800], Step[0400/0626], Avg Loss: 0.7332 +INFO:local_logger:Epoch[040/800], Step[0400/0626], Avg Loss: 0.7330 +INFO:local_logger:Epoch[040/800], Step[0500/0626], Avg Loss: 0.7330 +INFO:local_logger:Epoch[040/800], Step[0500/0626], Avg Loss: 0.7323 +INFO:local_logger:Epoch[040/800], Step[0500/0626], Avg Loss: 0.7333 +INFO:local_logger:Epoch[040/800], Step[0500/0626], Avg Loss: 0.7333 +INFO:local_logger:Epoch[040/800], Step[0500/0626], Avg Loss: 0.7329 +INFO:local_logger:Epoch[040/800], Step[0500/0626], Avg Loss: 0.7334 
+INFO:local_logger:Epoch[040/800], Step[0500/0626], Avg Loss: 0.7330 +INFO:local_logger:Epoch[040/800], Step[0500/0626], Avg Loss: 0.7329 +INFO:master_logger:Epoch[040/800], Step[0500/0626], Avg Loss: 0.7330 +INFO:local_logger:Epoch[040/800], Step[0600/0626], Avg Loss: 0.7329 +INFO:local_logger:Epoch[040/800], Step[0600/0626], Avg Loss: 0.7329 +INFO:local_logger:Epoch[040/800], Step[0600/0626], Avg Loss: 0.7328 +INFO:local_logger:Epoch[040/800], Step[0600/0626], Avg Loss: 0.7329 +INFO:local_logger:Epoch[040/800], Step[0600/0626], Avg Loss: 0.7328 +INFO:local_logger:Epoch[040/800], Step[0600/0626], Avg Loss: 0.7329 +INFO:local_logger:Epoch[040/800], Step[0600/0626], Avg Loss: 0.7322 +INFO:master_logger:Epoch[040/800], Step[0600/0626], Avg Loss: 0.7328 +INFO:local_logger:Epoch[040/800], Step[0600/0626], Avg Loss: 0.7328 +INFO:local_logger:----- Epoch[040/800], Train Loss: 0.7328, time: 895.01 +INFO:local_logger:Now training epoch 41. LR=0.000150 +INFO:local_logger:----- Epoch[040/800], Train Loss: 0.7328, time: 890.99 +INFO:local_logger:----- Epoch[040/800], Train Loss: 0.7328, time: 895.12 +INFO:master_logger:----- Epoch[040/800], Train Loss: 0.7327, time: 890.99 +INFO:local_logger:Now training epoch 41. LR=0.000150 +INFO:local_logger:----- Epoch[040/800], Train Loss: 0.7328, time: 895.13 +INFO:local_logger:Now training epoch 41. LR=0.000150 +INFO:local_logger:----- Epoch[040/800], Train Loss: 0.7328, time: 894.99 +INFO:local_logger:Now training epoch 41. LR=0.000150 +INFO:local_logger:----- Epoch[040/800], Train Loss: 0.7328, time: 894.98 +INFO:local_logger:Now training epoch 41. LR=0.000150 +INFO:local_logger:----- Epoch[040/800], Train Loss: 0.7321, time: 894.99 +INFO:local_logger:Now training epoch 41. LR=0.000150 +INFO:local_logger:----- Epoch[040/800], Train Loss: 0.7327, time: 895.07 +INFO:local_logger:Now training epoch 41. LR=0.000150 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-40-Loss-0.7327702230552797.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-40-Loss-0.7327702230552797.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-40-Loss-0.7327702230552797.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-40-Loss-0.7327702230552797.pdopt +INFO:local_logger:Now training epoch 41. LR=0.000150 +INFO:master_logger:Now training epoch 41. 
LR=0.000150 +INFO:local_logger:Epoch[041/800], Step[0000/0626], Avg Loss: 0.7292 +INFO:master_logger:Epoch[041/800], Step[0000/0626], Avg Loss: 0.7311 +INFO:local_logger:Epoch[041/800], Step[0000/0626], Avg Loss: 0.7233 +INFO:local_logger:Epoch[041/800], Step[0000/0626], Avg Loss: 0.7219 +INFO:local_logger:Epoch[041/800], Step[0000/0626], Avg Loss: 0.7395 +INFO:local_logger:Epoch[041/800], Step[0000/0626], Avg Loss: 0.7295 +INFO:local_logger:Epoch[041/800], Step[0000/0626], Avg Loss: 0.7279 +INFO:local_logger:Epoch[041/800], Step[0000/0626], Avg Loss: 0.7431 +INFO:local_logger:Epoch[041/800], Step[0000/0626], Avg Loss: 0.7341 +INFO:local_logger:Epoch[041/800], Step[0100/0626], Avg Loss: 0.7312 +INFO:local_logger:Epoch[041/800], Step[0100/0626], Avg Loss: 0.7315 +INFO:local_logger:Epoch[041/800], Step[0100/0626], Avg Loss: 0.7304 +INFO:local_logger:Epoch[041/800], Step[0100/0626], Avg Loss: 0.7317 +INFO:local_logger:Epoch[041/800], Step[0100/0626], Avg Loss: 0.7307 +INFO:master_logger:Epoch[041/800], Step[0100/0626], Avg Loss: 0.7311 +INFO:local_logger:Epoch[041/800], Step[0100/0626], Avg Loss: 0.7308 +INFO:local_logger:Epoch[041/800], Step[0100/0626], Avg Loss: 0.7302 +INFO:local_logger:Epoch[041/800], Step[0100/0626], Avg Loss: 0.7321 +INFO:local_logger:Epoch[041/800], Step[0200/0626], Avg Loss: 0.7308 +INFO:local_logger:Epoch[041/800], Step[0200/0626], Avg Loss: 0.7315 +INFO:local_logger:Epoch[041/800], Step[0200/0626], Avg Loss: 0.7308 +INFO:local_logger:Epoch[041/800], Step[0200/0626], Avg Loss: 0.7317 +INFO:local_logger:Epoch[041/800], Step[0200/0626], Avg Loss: 0.7303 +INFO:local_logger:Epoch[041/800], Step[0200/0626], Avg Loss: 0.7313 +INFO:master_logger:Epoch[041/800], Step[0200/0626], Avg Loss: 0.7310 +INFO:local_logger:Epoch[041/800], Step[0200/0626], Avg Loss: 0.7309 +INFO:local_logger:Epoch[041/800], Step[0200/0626], Avg Loss: 0.7308 +INFO:local_logger:Epoch[041/800], Step[0300/0626], Avg Loss: 0.7310 +INFO:local_logger:Epoch[041/800], Step[0300/0626], Avg Loss: 0.7307 +INFO:local_logger:Epoch[041/800], Step[0300/0626], Avg Loss: 0.7304 +INFO:local_logger:Epoch[041/800], Step[0300/0626], Avg Loss: 0.7307 +INFO:master_logger:Epoch[041/800], Step[0300/0626], Avg Loss: 0.7308 +INFO:local_logger:Epoch[041/800], Step[0300/0626], Avg Loss: 0.7310 +INFO:local_logger:Epoch[041/800], Step[0300/0626], Avg Loss: 0.7308 +INFO:local_logger:Epoch[041/800], Step[0300/0626], Avg Loss: 0.7312 +INFO:local_logger:Epoch[041/800], Step[0300/0626], Avg Loss: 0.7304 +INFO:local_logger:Epoch[041/800], Step[0400/0626], Avg Loss: 0.7312 +INFO:local_logger:Epoch[041/800], Step[0400/0626], Avg Loss: 0.7309 +INFO:local_logger:Epoch[041/800], Step[0400/0626], Avg Loss: 0.7301 +INFO:local_logger:Epoch[041/800], Step[0400/0626], Avg Loss: 0.7305 +INFO:local_logger:Epoch[041/800], Step[0400/0626], Avg Loss: 0.7305 +INFO:master_logger:Epoch[041/800], Step[0400/0626], Avg Loss: 0.7306 +INFO:local_logger:Epoch[041/800], Step[0400/0626], Avg Loss: 0.7307 +INFO:local_logger:Epoch[041/800], Step[0400/0626], Avg Loss: 0.7303 +INFO:local_logger:Epoch[041/800], Step[0400/0626], Avg Loss: 0.7307 +INFO:local_logger:Epoch[041/800], Step[0500/0626], Avg Loss: 0.7303 +INFO:local_logger:Epoch[041/800], Step[0500/0626], Avg Loss: 0.7302 +INFO:local_logger:Epoch[041/800], Step[0500/0626], Avg Loss: 0.7306 +INFO:local_logger:Epoch[041/800], Step[0500/0626], Avg Loss: 0.7305 +INFO:local_logger:Epoch[041/800], Step[0500/0626], Avg Loss: 0.7300 +INFO:local_logger:Epoch[041/800], Step[0500/0626], Avg Loss: 0.7306 
+INFO:local_logger:Epoch[041/800], Step[0500/0626], Avg Loss: 0.7304 +INFO:local_logger:Epoch[041/800], Step[0500/0626], Avg Loss: 0.7304 +INFO:master_logger:Epoch[041/800], Step[0500/0626], Avg Loss: 0.7304 +INFO:local_logger:Epoch[041/800], Step[0600/0626], Avg Loss: 0.7303 +INFO:local_logger:Epoch[041/800], Step[0600/0626], Avg Loss: 0.7304 +INFO:local_logger:Epoch[041/800], Step[0600/0626], Avg Loss: 0.7301 +INFO:local_logger:Epoch[041/800], Step[0600/0626], Avg Loss: 0.7299 +INFO:local_logger:Epoch[041/800], Step[0600/0626], Avg Loss: 0.7302 +INFO:master_logger:Epoch[041/800], Step[0600/0626], Avg Loss: 0.7301 +INFO:local_logger:Epoch[041/800], Step[0600/0626], Avg Loss: 0.7299 +INFO:local_logger:Epoch[041/800], Step[0600/0626], Avg Loss: 0.7302 +INFO:local_logger:Epoch[041/800], Step[0600/0626], Avg Loss: 0.7301 +INFO:local_logger:----- Epoch[041/800], Train Loss: 0.7305, time: 866.86 +INFO:local_logger:Now training epoch 42. LR=0.000150 +INFO:local_logger:----- Epoch[041/800], Train Loss: 0.7300, time: 866.90 +INFO:local_logger:Now training epoch 42. LR=0.000150 +INFO:local_logger:----- Epoch[041/800], Train Loss: 0.7304, time: 867.40 +INFO:local_logger:Now training epoch 42. LR=0.000150 +INFO:local_logger:----- Epoch[041/800], Train Loss: 0.7303, time: 863.76 +INFO:master_logger:----- Epoch[041/800], Train Loss: 0.7302, time: 863.76 +INFO:local_logger:----- Epoch[041/800], Train Loss: 0.7301, time: 867.52 +INFO:local_logger:Now training epoch 42. LR=0.000150 +INFO:local_logger:----- Epoch[041/800], Train Loss: 0.7303, time: 867.54 +INFO:local_logger:Now training epoch 42. LR=0.000150 +INFO:local_logger:----- Epoch[041/800], Train Loss: 0.7302, time: 867.65 +INFO:local_logger:Now training epoch 42. LR=0.000150 +INFO:local_logger:----- Epoch[041/800], Train Loss: 0.7300, time: 867.66 +INFO:local_logger:Now training epoch 42. LR=0.000150 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-41-Loss-0.7302703064055491.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-41-Loss-0.7302703064055491.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-41-Loss-0.7302703064055491.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-41-Loss-0.7302703064055491.pdopt +INFO:local_logger:Now training epoch 42. LR=0.000150 +INFO:master_logger:Now training epoch 42. 
LR=0.000150 +INFO:local_logger:Epoch[042/800], Step[0000/0626], Avg Loss: 0.7300 +INFO:local_logger:Epoch[042/800], Step[0000/0626], Avg Loss: 0.7398 +INFO:master_logger:Epoch[042/800], Step[0000/0626], Avg Loss: 0.7281 +INFO:local_logger:Epoch[042/800], Step[0000/0626], Avg Loss: 0.7322 +INFO:local_logger:Epoch[042/800], Step[0000/0626], Avg Loss: 0.7185 +INFO:local_logger:Epoch[042/800], Step[0000/0626], Avg Loss: 0.7206 +INFO:local_logger:Epoch[042/800], Step[0000/0626], Avg Loss: 0.7231 +INFO:local_logger:Epoch[042/800], Step[0000/0626], Avg Loss: 0.7259 +INFO:local_logger:Epoch[042/800], Step[0000/0626], Avg Loss: 0.7347 +INFO:local_logger:Epoch[042/800], Step[0100/0626], Avg Loss: 0.7299 +INFO:local_logger:Epoch[042/800], Step[0100/0626], Avg Loss: 0.7279 +INFO:local_logger:Epoch[042/800], Step[0100/0626], Avg Loss: 0.7292 +INFO:local_logger:Epoch[042/800], Step[0100/0626], Avg Loss: 0.7300 +INFO:local_logger:Epoch[042/800], Step[0100/0626], Avg Loss: 0.7287 +INFO:local_logger:Epoch[042/800], Step[0100/0626], Avg Loss: 0.7292 +INFO:master_logger:Epoch[042/800], Step[0100/0626], Avg Loss: 0.7291 +INFO:local_logger:Epoch[042/800], Step[0100/0626], Avg Loss: 0.7283 +INFO:local_logger:Epoch[042/800], Step[0100/0626], Avg Loss: 0.7297 +INFO:local_logger:Epoch[042/800], Step[0200/0626], Avg Loss: 0.7285 +INFO:local_logger:Epoch[042/800], Step[0200/0626], Avg Loss: 0.7286 +INFO:local_logger:Epoch[042/800], Step[0200/0626], Avg Loss: 0.7290 +INFO:local_logger:Epoch[042/800], Step[0200/0626], Avg Loss: 0.7289 +INFO:local_logger:Epoch[042/800], Step[0200/0626], Avg Loss: 0.7292 +INFO:local_logger:Epoch[042/800], Step[0200/0626], Avg Loss: 0.7282 +INFO:master_logger:Epoch[042/800], Step[0200/0626], Avg Loss: 0.7286 +INFO:local_logger:Epoch[042/800], Step[0200/0626], Avg Loss: 0.7284 +INFO:local_logger:Epoch[042/800], Step[0200/0626], Avg Loss: 0.7283 +INFO:local_logger:Epoch[042/800], Step[0300/0626], Avg Loss: 0.7283 +INFO:local_logger:Epoch[042/800], Step[0300/0626], Avg Loss: 0.7288 +INFO:local_logger:Epoch[042/800], Step[0300/0626], Avg Loss: 0.7283 +INFO:local_logger:Epoch[042/800], Step[0300/0626], Avg Loss: 0.7277 +INFO:local_logger:Epoch[042/800], Step[0300/0626], Avg Loss: 0.7284 +INFO:local_logger:Epoch[042/800], Step[0300/0626], Avg Loss: 0.7286 +INFO:master_logger:Epoch[042/800], Step[0300/0626], Avg Loss: 0.7285 +INFO:local_logger:Epoch[042/800], Step[0300/0626], Avg Loss: 0.7287 +INFO:local_logger:Epoch[042/800], Step[0300/0626], Avg Loss: 0.7288 +INFO:local_logger:Epoch[042/800], Step[0400/0626], Avg Loss: 0.7279 +INFO:local_logger:Epoch[042/800], Step[0400/0626], Avg Loss: 0.7285 +INFO:local_logger:Epoch[042/800], Step[0400/0626], Avg Loss: 0.7287 +INFO:local_logger:Epoch[042/800], Step[0400/0626], Avg Loss: 0.7288 +INFO:local_logger:Epoch[042/800], Step[0400/0626], Avg Loss: 0.7283 +INFO:local_logger:Epoch[042/800], Step[0400/0626], Avg Loss: 0.7275 +INFO:local_logger:Epoch[042/800], Step[0400/0626], Avg Loss: 0.7284 +INFO:master_logger:Epoch[042/800], Step[0400/0626], Avg Loss: 0.7283 +INFO:local_logger:Epoch[042/800], Step[0400/0626], Avg Loss: 0.7279 +INFO:local_logger:Epoch[042/800], Step[0500/0626], Avg Loss: 0.7278 +INFO:local_logger:Epoch[042/800], Step[0500/0626], Avg Loss: 0.7284 +INFO:local_logger:Epoch[042/800], Step[0500/0626], Avg Loss: 0.7285 +INFO:local_logger:Epoch[042/800], Step[0500/0626], Avg Loss: 0.7284 +INFO:local_logger:Epoch[042/800], Step[0500/0626], Avg Loss: 0.7277 +INFO:local_logger:Epoch[042/800], Step[0500/0626], Avg Loss: 0.7279 
+INFO:local_logger:Epoch[042/800], Step[0500/0626], Avg Loss: 0.7278 +INFO:local_logger:Epoch[042/800], Step[0500/0626], Avg Loss: 0.7285 +INFO:master_logger:Epoch[042/800], Step[0500/0626], Avg Loss: 0.7281 +INFO:local_logger:Epoch[042/800], Step[0600/0626], Avg Loss: 0.7278 +INFO:local_logger:Epoch[042/800], Step[0600/0626], Avg Loss: 0.7276 +INFO:local_logger:Epoch[042/800], Step[0600/0626], Avg Loss: 0.7285 +INFO:local_logger:Epoch[042/800], Step[0600/0626], Avg Loss: 0.7278 +INFO:local_logger:Epoch[042/800], Step[0600/0626], Avg Loss: 0.7277 +INFO:local_logger:Epoch[042/800], Step[0600/0626], Avg Loss: 0.7283 +INFO:local_logger:Epoch[042/800], Step[0600/0626], Avg Loss: 0.7282 +INFO:master_logger:Epoch[042/800], Step[0600/0626], Avg Loss: 0.7280 +INFO:local_logger:Epoch[042/800], Step[0600/0626], Avg Loss: 0.7280 +INFO:local_logger:----- Epoch[042/800], Train Loss: 0.7281, time: 892.19 +INFO:local_logger:Now training epoch 43. LR=0.000150 +INFO:local_logger:----- Epoch[042/800], Train Loss: 0.7280, time: 892.20 +INFO:local_logger:Now training epoch 43. LR=0.000150 +INFO:local_logger:----- Epoch[042/800], Train Loss: 0.7277, time: 892.57 +INFO:local_logger:Now training epoch 43. LR=0.000150 +INFO:local_logger:----- Epoch[042/800], Train Loss: 0.7276, time: 893.34 +INFO:local_logger:Now training epoch 43. LR=0.000150 +INFO:local_logger:----- Epoch[042/800], Train Loss: 0.7275, time: 892.63 +INFO:local_logger:Now training epoch 43. LR=0.000150 +INFO:local_logger:----- Epoch[042/800], Train Loss: 0.7283, time: 889.09 +INFO:master_logger:----- Epoch[042/800], Train Loss: 0.7279, time: 889.09 +INFO:local_logger:----- Epoch[042/800], Train Loss: 0.7282, time: 893.43 +INFO:local_logger:Now training epoch 43. LR=0.000150 +INFO:local_logger:----- Epoch[042/800], Train Loss: 0.7279, time: 892.89 +INFO:local_logger:Now training epoch 43. LR=0.000150 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-42-Loss-0.728333370828915.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-42-Loss-0.728333370828915.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-42-Loss-0.728333370828915.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-42-Loss-0.728333370828915.pdopt +INFO:local_logger:Now training epoch 43. LR=0.000150 +INFO:master_logger:Now training epoch 43. 
LR=0.000150 +INFO:local_logger:Epoch[043/800], Step[0000/0626], Avg Loss: 0.7259 +INFO:local_logger:Epoch[043/800], Step[0000/0626], Avg Loss: 0.7157 +INFO:master_logger:Epoch[043/800], Step[0000/0626], Avg Loss: 0.7299 +INFO:local_logger:Epoch[043/800], Step[0000/0626], Avg Loss: 0.7296 +INFO:local_logger:Epoch[043/800], Step[0000/0626], Avg Loss: 0.7389 +INFO:local_logger:Epoch[043/800], Step[0000/0626], Avg Loss: 0.7312 +INFO:local_logger:Epoch[043/800], Step[0000/0626], Avg Loss: 0.7351 +INFO:local_logger:Epoch[043/800], Step[0000/0626], Avg Loss: 0.7178 +INFO:local_logger:Epoch[043/800], Step[0000/0626], Avg Loss: 0.7452 +INFO:local_logger:Epoch[043/800], Step[0100/0626], Avg Loss: 0.7266 +INFO:local_logger:Epoch[043/800], Step[0100/0626], Avg Loss: 0.7263 +INFO:local_logger:Epoch[043/800], Step[0100/0626], Avg Loss: 0.7264 +INFO:local_logger:Epoch[043/800], Step[0100/0626], Avg Loss: 0.7253 +INFO:master_logger:Epoch[043/800], Step[0100/0626], Avg Loss: 0.7262 +INFO:local_logger:Epoch[043/800], Step[0100/0626], Avg Loss: 0.7269 +INFO:local_logger:Epoch[043/800], Step[0100/0626], Avg Loss: 0.7261 +INFO:local_logger:Epoch[043/800], Step[0100/0626], Avg Loss: 0.7256 +INFO:local_logger:Epoch[043/800], Step[0100/0626], Avg Loss: 0.7263 +INFO:local_logger:Epoch[043/800], Step[0200/0626], Avg Loss: 0.7260 +INFO:local_logger:Epoch[043/800], Step[0200/0626], Avg Loss: 0.7257 +INFO:local_logger:Epoch[043/800], Step[0200/0626], Avg Loss: 0.7268 +INFO:local_logger:Epoch[043/800], Step[0200/0626], Avg Loss: 0.7256 +INFO:local_logger:Epoch[043/800], Step[0200/0626], Avg Loss: 0.7263 +INFO:local_logger:Epoch[043/800], Step[0200/0626], Avg Loss: 0.7260 +INFO:master_logger:Epoch[043/800], Step[0200/0626], Avg Loss: 0.7260 +INFO:local_logger:Epoch[043/800], Step[0200/0626], Avg Loss: 0.7257 +INFO:local_logger:Epoch[043/800], Step[0200/0626], Avg Loss: 0.7258 +INFO:local_logger:Epoch[043/800], Step[0300/0626], Avg Loss: 0.7257 +INFO:local_logger:Epoch[043/800], Step[0300/0626], Avg Loss: 0.7262 +INFO:master_logger:Epoch[043/800], Step[0300/0626], Avg Loss: 0.7259 +INFO:local_logger:Epoch[043/800], Step[0300/0626], Avg Loss: 0.7260 +INFO:local_logger:Epoch[043/800], Step[0300/0626], Avg Loss: 0.7256 +INFO:local_logger:Epoch[043/800], Step[0300/0626], Avg Loss: 0.7264 +INFO:local_logger:Epoch[043/800], Step[0300/0626], Avg Loss: 0.7262 +INFO:local_logger:Epoch[043/800], Step[0300/0626], Avg Loss: 0.7257 +INFO:local_logger:Epoch[043/800], Step[0300/0626], Avg Loss: 0.7256 +INFO:local_logger:Epoch[043/800], Step[0400/0626], Avg Loss: 0.7257 +INFO:local_logger:Epoch[043/800], Step[0400/0626], Avg Loss: 0.7261 +INFO:local_logger:Epoch[043/800], Step[0400/0626], Avg Loss: 0.7257 +INFO:master_logger:Epoch[043/800], Step[0400/0626], Avg Loss: 0.7258 +INFO:local_logger:Epoch[043/800], Step[0400/0626], Avg Loss: 0.7257 +INFO:local_logger:Epoch[043/800], Step[0400/0626], Avg Loss: 0.7254 +INFO:local_logger:Epoch[043/800], Step[0400/0626], Avg Loss: 0.7253 +INFO:local_logger:Epoch[043/800], Step[0400/0626], Avg Loss: 0.7260 +INFO:local_logger:Epoch[043/800], Step[0400/0626], Avg Loss: 0.7263 +INFO:local_logger:Epoch[043/800], Step[0500/0626], Avg Loss: 0.7253 +INFO:local_logger:Epoch[043/800], Step[0500/0626], Avg Loss: 0.7257 +INFO:local_logger:Epoch[043/800], Step[0500/0626], Avg Loss: 0.7255 +INFO:local_logger:Epoch[043/800], Step[0500/0626], Avg Loss: 0.7258 +INFO:local_logger:Epoch[043/800], Step[0500/0626], Avg Loss: 0.7260 +INFO:master_logger:Epoch[043/800], Step[0500/0626], Avg Loss: 0.7256 
+INFO:local_logger:Epoch[043/800], Step[0500/0626], Avg Loss: 0.7253 +INFO:local_logger:Epoch[043/800], Step[0500/0626], Avg Loss: 0.7255 +INFO:local_logger:Epoch[043/800], Step[0500/0626], Avg Loss: 0.7259 +INFO:local_logger:Epoch[043/800], Step[0600/0626], Avg Loss: 0.7256 +INFO:local_logger:Epoch[043/800], Step[0600/0626], Avg Loss: 0.7255 +INFO:local_logger:Epoch[043/800], Step[0600/0626], Avg Loss: 0.7258 +INFO:local_logger:Epoch[043/800], Step[0600/0626], Avg Loss: 0.7258 +INFO:local_logger:Epoch[043/800], Step[0600/0626], Avg Loss: 0.7253 +INFO:master_logger:Epoch[043/800], Step[0600/0626], Avg Loss: 0.7256 +INFO:local_logger:Epoch[043/800], Step[0600/0626], Avg Loss: 0.7254 +INFO:local_logger:Epoch[043/800], Step[0600/0626], Avg Loss: 0.7252 +INFO:local_logger:Epoch[043/800], Step[0600/0626], Avg Loss: 0.7258 +INFO:local_logger:----- Epoch[043/800], Train Loss: 0.7254, time: 856.34 +INFO:local_logger:Now training epoch 44. LR=0.000150 +INFO:local_logger:----- Epoch[043/800], Train Loss: 0.7256, time: 855.79 +INFO:local_logger:Now training epoch 44. LR=0.000150 +INFO:local_logger:----- Epoch[043/800], Train Loss: 0.7253, time: 856.30 +INFO:local_logger:Now training epoch 44. LR=0.000150 +INFO:local_logger:----- Epoch[043/800], Train Loss: 0.7258, time: 856.24 +INFO:local_logger:Now training epoch 44. LR=0.000150 +INFO:local_logger:----- Epoch[043/800], Train Loss: 0.7256, time: 852.55 +INFO:master_logger:----- Epoch[043/800], Train Loss: 0.7255, time: 852.55 +INFO:local_logger:----- Epoch[043/800], Train Loss: 0.7252, time: 856.26 +INFO:local_logger:Now training epoch 44. LR=0.000150 +INFO:local_logger:----- Epoch[043/800], Train Loss: 0.7258, time: 856.83 +INFO:local_logger:Now training epoch 44. LR=0.000150 +INFO:local_logger:----- Epoch[043/800], Train Loss: 0.7257, time: 856.33 +INFO:local_logger:Now training epoch 44. LR=0.000150 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-43-Loss-0.7255999422779928.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-43-Loss-0.7255999422779928.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-43-Loss-0.7255999422779928.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-43-Loss-0.7255999422779928.pdopt +INFO:local_logger:Now training epoch 44. LR=0.000150 +INFO:master_logger:Now training epoch 44. 
LR=0.000150 +INFO:local_logger:Epoch[044/800], Step[0000/0626], Avg Loss: 0.7265 +INFO:local_logger:Epoch[044/800], Step[0000/0626], Avg Loss: 0.7228 +INFO:master_logger:Epoch[044/800], Step[0000/0626], Avg Loss: 0.7233 +INFO:local_logger:Epoch[044/800], Step[0000/0626], Avg Loss: 0.7197 +INFO:local_logger:Epoch[044/800], Step[0000/0626], Avg Loss: 0.7206 +INFO:local_logger:Epoch[044/800], Step[0000/0626], Avg Loss: 0.7307 +INFO:local_logger:Epoch[044/800], Step[0000/0626], Avg Loss: 0.7300 +INFO:local_logger:Epoch[044/800], Step[0000/0626], Avg Loss: 0.7199 +INFO:local_logger:Epoch[044/800], Step[0000/0626], Avg Loss: 0.7161 +INFO:local_logger:Epoch[044/800], Step[0100/0626], Avg Loss: 0.7253 +INFO:local_logger:Epoch[044/800], Step[0100/0626], Avg Loss: 0.7241 +INFO:local_logger:Epoch[044/800], Step[0100/0626], Avg Loss: 0.7232 +INFO:local_logger:Epoch[044/800], Step[0100/0626], Avg Loss: 0.7238 +INFO:local_logger:Epoch[044/800], Step[0100/0626], Avg Loss: 0.7251 +INFO:local_logger:Epoch[044/800], Step[0100/0626], Avg Loss: 0.7248 +INFO:local_logger:Epoch[044/800], Step[0100/0626], Avg Loss: 0.7246 +INFO:local_logger:Epoch[044/800], Step[0100/0626], Avg Loss: 0.7250 +INFO:master_logger:Epoch[044/800], Step[0100/0626], Avg Loss: 0.7245 +INFO:local_logger:Epoch[044/800], Step[0200/0626], Avg Loss: 0.7238 +INFO:local_logger:Epoch[044/800], Step[0200/0626], Avg Loss: 0.7246 +INFO:local_logger:Epoch[044/800], Step[0200/0626], Avg Loss: 0.7237 +INFO:local_logger:Epoch[044/800], Step[0200/0626], Avg Loss: 0.7243 +INFO:local_logger:Epoch[044/800], Step[0200/0626], Avg Loss: 0.7241 +INFO:local_logger:Epoch[044/800], Step[0200/0626], Avg Loss: 0.7251 +INFO:master_logger:Epoch[044/800], Step[0200/0626], Avg Loss: 0.7243 +INFO:local_logger:Epoch[044/800], Step[0200/0626], Avg Loss: 0.7249 +INFO:local_logger:Epoch[044/800], Step[0200/0626], Avg Loss: 0.7239 +INFO:local_logger:Epoch[044/800], Step[0300/0626], Avg Loss: 0.7239 +INFO:local_logger:Epoch[044/800], Step[0300/0626], Avg Loss: 0.7241 +INFO:master_logger:Epoch[044/800], Step[0300/0626], Avg Loss: 0.7240 +INFO:local_logger:Epoch[044/800], Step[0300/0626], Avg Loss: 0.7239 +INFO:local_logger:Epoch[044/800], Step[0300/0626], Avg Loss: 0.7241 +INFO:local_logger:Epoch[044/800], Step[0300/0626], Avg Loss: 0.7235 +INFO:local_logger:Epoch[044/800], Step[0300/0626], Avg Loss: 0.7242 +INFO:local_logger:Epoch[044/800], Step[0300/0626], Avg Loss: 0.7247 +INFO:local_logger:Epoch[044/800], Step[0300/0626], Avg Loss: 0.7236 +INFO:local_logger:Epoch[044/800], Step[0400/0626], Avg Loss: 0.7238 +INFO:local_logger:Epoch[044/800], Step[0400/0626], Avg Loss: 0.7236 +INFO:local_logger:Epoch[044/800], Step[0400/0626], Avg Loss: 0.7241 +INFO:master_logger:Epoch[044/800], Step[0400/0626], Avg Loss: 0.7238 +INFO:local_logger:Epoch[044/800], Step[0400/0626], Avg Loss: 0.7236 +INFO:local_logger:Epoch[044/800], Step[0400/0626], Avg Loss: 0.7233 +INFO:local_logger:Epoch[044/800], Step[0400/0626], Avg Loss: 0.7243 +INFO:local_logger:Epoch[044/800], Step[0400/0626], Avg Loss: 0.7240 +INFO:local_logger:Epoch[044/800], Step[0400/0626], Avg Loss: 0.7239 +INFO:local_logger:Epoch[044/800], Step[0500/0626], Avg Loss: 0.7237 +INFO:local_logger:Epoch[044/800], Step[0500/0626], Avg Loss: 0.7240 +INFO:local_logger:Epoch[044/800], Step[0500/0626], Avg Loss: 0.7234 +INFO:master_logger:Epoch[044/800], Step[0500/0626], Avg Loss: 0.7236 +INFO:local_logger:Epoch[044/800], Step[0500/0626], Avg Loss: 0.7238 +INFO:local_logger:Epoch[044/800], Step[0500/0626], Avg Loss: 0.7231 
+INFO:local_logger:Epoch[044/800], Step[0500/0626], Avg Loss: 0.7238 +INFO:local_logger:Epoch[044/800], Step[0500/0626], Avg Loss: 0.7237 +INFO:local_logger:Epoch[044/800], Step[0500/0626], Avg Loss: 0.7235 +INFO:local_logger:Epoch[044/800], Step[0600/0626], Avg Loss: 0.7236 +INFO:local_logger:Epoch[044/800], Step[0600/0626], Avg Loss: 0.7237 +INFO:local_logger:Epoch[044/800], Step[0600/0626], Avg Loss: 0.7235 +INFO:local_logger:Epoch[044/800], Step[0600/0626], Avg Loss: 0.7235 +INFO:local_logger:Epoch[044/800], Step[0600/0626], Avg Loss: 0.7231 +INFO:master_logger:Epoch[044/800], Step[0600/0626], Avg Loss: 0.7235 +INFO:local_logger:Epoch[044/800], Step[0600/0626], Avg Loss: 0.7237 +INFO:local_logger:Epoch[044/800], Step[0600/0626], Avg Loss: 0.7235 +INFO:local_logger:Epoch[044/800], Step[0600/0626], Avg Loss: 0.7233 +INFO:local_logger:----- Epoch[044/800], Train Loss: 0.7236, time: 893.49 +INFO:local_logger:Now training epoch 45. LR=0.000150 +INFO:local_logger:----- Epoch[044/800], Train Loss: 0.7231, time: 893.57 +INFO:local_logger:Now training epoch 45. LR=0.000150 +INFO:local_logger:----- Epoch[044/800], Train Loss: 0.7234, time: 893.55 +INFO:local_logger:Now training epoch 45. LR=0.000150 +INFO:local_logger:----- Epoch[044/800], Train Loss: 0.7234, time: 893.57 +INFO:local_logger:Now training epoch 45. LR=0.000150 +INFO:local_logger:----- Epoch[044/800], Train Loss: 0.7235, time: 889.85 +INFO:master_logger:----- Epoch[044/800], Train Loss: 0.7235, time: 889.85 +INFO:local_logger:----- Epoch[044/800], Train Loss: 0.7237, time: 894.16 +INFO:local_logger:Now training epoch 45. LR=0.000150 +INFO:local_logger:----- Epoch[044/800], Train Loss: 0.7238, time: 894.14 +INFO:local_logger:Now training epoch 45. LR=0.000150 +INFO:local_logger:----- Epoch[044/800], Train Loss: 0.7235, time: 893.86 +INFO:local_logger:Now training epoch 45. LR=0.000150 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-44-Loss-0.723522978396613.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-44-Loss-0.723522978396613.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-44-Loss-0.723522978396613.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-44-Loss-0.723522978396613.pdopt +INFO:local_logger:Now training epoch 45. LR=0.000150 +INFO:master_logger:Now training epoch 45. 
LR=0.000150 +INFO:local_logger:Epoch[045/800], Step[0000/0626], Avg Loss: 0.7270 +INFO:local_logger:Epoch[045/800], Step[0000/0626], Avg Loss: 0.7223 +INFO:local_logger:Epoch[045/800], Step[0000/0626], Avg Loss: 0.7158 +INFO:master_logger:Epoch[045/800], Step[0000/0626], Avg Loss: 0.7205 +INFO:local_logger:Epoch[045/800], Step[0000/0626], Avg Loss: 0.7192 +INFO:local_logger:Epoch[045/800], Step[0000/0626], Avg Loss: 0.7172 +INFO:local_logger:Epoch[045/800], Step[0000/0626], Avg Loss: 0.7222 +INFO:local_logger:Epoch[045/800], Step[0000/0626], Avg Loss: 0.7194 +INFO:local_logger:Epoch[045/800], Step[0000/0626], Avg Loss: 0.7208 +INFO:local_logger:Epoch[045/800], Step[0100/0626], Avg Loss: 0.7217 +INFO:local_logger:Epoch[045/800], Step[0100/0626], Avg Loss: 0.7221 +INFO:local_logger:Epoch[045/800], Step[0100/0626], Avg Loss: 0.7229 +INFO:local_logger:Epoch[045/800], Step[0100/0626], Avg Loss: 0.7221 +INFO:local_logger:Epoch[045/800], Step[0100/0626], Avg Loss: 0.7228 +INFO:local_logger:Epoch[045/800], Step[0100/0626], Avg Loss: 0.7213 +INFO:master_logger:Epoch[045/800], Step[0100/0626], Avg Loss: 0.7223 +INFO:local_logger:Epoch[045/800], Step[0100/0626], Avg Loss: 0.7222 +INFO:local_logger:Epoch[045/800], Step[0100/0626], Avg Loss: 0.7230 +INFO:local_logger:Epoch[045/800], Step[0200/0626], Avg Loss: 0.7218 +INFO:local_logger:Epoch[045/800], Step[0200/0626], Avg Loss: 0.7221 +INFO:local_logger:Epoch[045/800], Step[0200/0626], Avg Loss: 0.7231 +INFO:local_logger:Epoch[045/800], Step[0200/0626], Avg Loss: 0.7222 +INFO:local_logger:Epoch[045/800], Step[0200/0626], Avg Loss: 0.7228 +INFO:local_logger:Epoch[045/800], Step[0200/0626], Avg Loss: 0.7222 +INFO:local_logger:Epoch[045/800], Step[0200/0626], Avg Loss: 0.7222 +INFO:master_logger:Epoch[045/800], Step[0200/0626], Avg Loss: 0.7223 +INFO:local_logger:Epoch[045/800], Step[0200/0626], Avg Loss: 0.7219 +INFO:local_logger:Epoch[045/800], Step[0300/0626], Avg Loss: 0.7219 +INFO:local_logger:Epoch[045/800], Step[0300/0626], Avg Loss: 0.7220 +INFO:local_logger:Epoch[045/800], Step[0300/0626], Avg Loss: 0.7218 +INFO:local_logger:Epoch[045/800], Step[0300/0626], Avg Loss: 0.7219 +INFO:local_logger:Epoch[045/800], Step[0300/0626], Avg Loss: 0.7220 +INFO:master_logger:Epoch[045/800], Step[0300/0626], Avg Loss: 0.7219 +INFO:local_logger:Epoch[045/800], Step[0300/0626], Avg Loss: 0.7227 +INFO:local_logger:Epoch[045/800], Step[0300/0626], Avg Loss: 0.7213 +INFO:local_logger:Epoch[045/800], Step[0300/0626], Avg Loss: 0.7219 +INFO:local_logger:Epoch[045/800], Step[0400/0626], Avg Loss: 0.7217 +INFO:local_logger:Epoch[045/800], Step[0400/0626], Avg Loss: 0.7217 +INFO:local_logger:Epoch[045/800], Step[0400/0626], Avg Loss: 0.7217 +INFO:master_logger:Epoch[045/800], Step[0400/0626], Avg Loss: 0.7218 +INFO:local_logger:Epoch[045/800], Step[0400/0626], Avg Loss: 0.7212 +INFO:local_logger:Epoch[045/800], Step[0400/0626], Avg Loss: 0.7219 +INFO:local_logger:Epoch[045/800], Step[0400/0626], Avg Loss: 0.7220 +INFO:local_logger:Epoch[045/800], Step[0400/0626], Avg Loss: 0.7217 +INFO:local_logger:Epoch[045/800], Step[0400/0626], Avg Loss: 0.7224 +INFO:local_logger:Epoch[045/800], Step[0500/0626], Avg Loss: 0.7217 +INFO:local_logger:Epoch[045/800], Step[0500/0626], Avg Loss: 0.7217 +INFO:local_logger:Epoch[045/800], Step[0500/0626], Avg Loss: 0.7220 +INFO:master_logger:Epoch[045/800], Step[0500/0626], Avg Loss: 0.7218 +INFO:local_logger:Epoch[045/800], Step[0500/0626], Avg Loss: 0.7216 +INFO:local_logger:Epoch[045/800], Step[0500/0626], Avg Loss: 0.7218 
[MAE pre-training log, run ./output/train-20211219-17-07-40, epochs 045-057 of 800, LR=0.000150, 626 steps per epoch. Eight local loggers and one master logger report the running average loss every 100 steps; after each epoch the model and optimizer states are saved as MAE-Epoch-&lt;epoch&gt;-Loss-&lt;loss&gt;.pdparams / .pdopt. Per-epoch master_logger summaries:
Epoch[045/800] Train Loss: 0.7216, time: 855.42
Epoch[046/800] Train Loss: 0.7198, time: 892.44
Epoch[047/800] Train Loss: 0.7182, time: 860.27
Epoch[048/800] Train Loss: 0.7166, time: 891.19
Epoch[049/800] Train Loss: 0.7152, time: 854.33
Epoch[050/800] Train Loss: 0.7140, time: 878.40
Epoch[051/800] Train Loss: 0.7125, time: 845.42
Epoch[052/800] Train Loss: 0.7110, time: 896.39
Epoch[053/800] Train Loss: 0.7100, time: 889.34
Epoch[054/800] Train Loss: 0.7085, time: 900.86
Epoch[055/800] Train Loss: 0.7076, time: 855.79
Epoch[056/800] Train Loss: 0.7065, time: 887.67
Epoch[057/800] in progress: master running average loss 0.7058 at Step[0400/0626].]
+INFO:local_logger:Epoch[057/800], Step[0500/0626], Avg Loss: 0.7057 +INFO:local_logger:Epoch[057/800], Step[0500/0626], Avg Loss: 0.7059 +INFO:master_logger:Epoch[057/800], Step[0500/0626], Avg Loss: 0.7057 +INFO:local_logger:Epoch[057/800], Step[0600/0626], Avg Loss: 0.7051 +INFO:local_logger:Epoch[057/800], Step[0600/0626], Avg Loss: 0.7058 +INFO:local_logger:Epoch[057/800], Step[0600/0626], Avg Loss: 0.7054 +INFO:local_logger:Epoch[057/800], Step[0600/0626], Avg Loss: 0.7054 +INFO:local_logger:Epoch[057/800], Step[0600/0626], Avg Loss: 0.7057 +INFO:local_logger:Epoch[057/800], Step[0600/0626], Avg Loss: 0.7062 +INFO:local_logger:Epoch[057/800], Step[0600/0626], Avg Loss: 0.7062 +INFO:local_logger:Epoch[057/800], Step[0600/0626], Avg Loss: 0.7056 +INFO:master_logger:Epoch[057/800], Step[0600/0626], Avg Loss: 0.7057 +INFO:local_logger:----- Epoch[057/800], Train Loss: 0.7060, time: 853.51 +INFO:local_logger:Now training epoch 58. LR=0.000150 +INFO:local_logger:----- Epoch[057/800], Train Loss: 0.7053, time: 853.54 +INFO:local_logger:Now training epoch 58. LR=0.000150 +INFO:local_logger:----- Epoch[057/800], Train Loss: 0.7057, time: 854.16 +INFO:local_logger:Now training epoch 58. LR=0.000150 +INFO:local_logger:----- Epoch[057/800], Train Loss: 0.7050, time: 854.00 +INFO:local_logger:----- Epoch[057/800], Train Loss: 0.7061, time: 854.36 +INFO:local_logger:Now training epoch 58. LR=0.000150 +INFO:local_logger:Now training epoch 58. LR=0.000150 +INFO:local_logger:----- Epoch[057/800], Train Loss: 0.7056, time: 849.91 +INFO:local_logger:----- Epoch[057/800], Train Loss: 0.7054, time: 853.96 +INFO:master_logger:----- Epoch[057/800], Train Loss: 0.7056, time: 849.91 +INFO:local_logger:Now training epoch 58. LR=0.000150 +INFO:local_logger:----- Epoch[057/800], Train Loss: 0.7056, time: 853.96 +INFO:local_logger:Now training epoch 58. LR=0.000150 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-57-Loss-0.70561545900947.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-57-Loss-0.70561545900947.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-57-Loss-0.70561545900947.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-57-Loss-0.70561545900947.pdopt +INFO:local_logger:Now training epoch 58. LR=0.000150 +INFO:master_logger:Now training epoch 58. 
LR=0.000150 +INFO:local_logger:Epoch[058/800], Step[0000/0626], Avg Loss: 0.7087 +INFO:local_logger:Epoch[058/800], Step[0000/0626], Avg Loss: 0.6950 +INFO:master_logger:Epoch[058/800], Step[0000/0626], Avg Loss: 0.7032 +INFO:local_logger:Epoch[058/800], Step[0000/0626], Avg Loss: 0.7035 +INFO:local_logger:Epoch[058/800], Step[0000/0626], Avg Loss: 0.7008 +INFO:local_logger:Epoch[058/800], Step[0000/0626], Avg Loss: 0.6990 +INFO:local_logger:Epoch[058/800], Step[0000/0626], Avg Loss: 0.7091 +INFO:local_logger:Epoch[058/800], Step[0000/0626], Avg Loss: 0.7010 +INFO:local_logger:Epoch[058/800], Step[0000/0626], Avg Loss: 0.7083 +INFO:local_logger:Epoch[058/800], Step[0100/0626], Avg Loss: 0.7057 +INFO:local_logger:Epoch[058/800], Step[0100/0626], Avg Loss: 0.7047 +INFO:local_logger:Epoch[058/800], Step[0100/0626], Avg Loss: 0.7033 +INFO:local_logger:Epoch[058/800], Step[0100/0626], Avg Loss: 0.7035 +INFO:local_logger:Epoch[058/800], Step[0100/0626], Avg Loss: 0.7044 +INFO:local_logger:Epoch[058/800], Step[0100/0626], Avg Loss: 0.7042 +INFO:local_logger:Epoch[058/800], Step[0100/0626], Avg Loss: 0.7048 +INFO:local_logger:Epoch[058/800], Step[0100/0626], Avg Loss: 0.7051 +INFO:master_logger:Epoch[058/800], Step[0100/0626], Avg Loss: 0.7045 +INFO:local_logger:Epoch[058/800], Step[0200/0626], Avg Loss: 0.7056 +INFO:local_logger:Epoch[058/800], Step[0200/0626], Avg Loss: 0.7048 +INFO:local_logger:Epoch[058/800], Step[0200/0626], Avg Loss: 0.7050 +INFO:local_logger:Epoch[058/800], Step[0200/0626], Avg Loss: 0.7052 +INFO:local_logger:Epoch[058/800], Step[0200/0626], Avg Loss: 0.7048 +INFO:master_logger:Epoch[058/800], Step[0200/0626], Avg Loss: 0.7048 +INFO:local_logger:Epoch[058/800], Step[0200/0626], Avg Loss: 0.7046 +INFO:local_logger:Epoch[058/800], Step[0200/0626], Avg Loss: 0.7051 +INFO:local_logger:Epoch[058/800], Step[0200/0626], Avg Loss: 0.7032 +INFO:local_logger:Epoch[058/800], Step[0300/0626], Avg Loss: 0.7048 +INFO:local_logger:Epoch[058/800], Step[0300/0626], Avg Loss: 0.7052 +INFO:local_logger:Epoch[058/800], Step[0300/0626], Avg Loss: 0.7046 +INFO:local_logger:Epoch[058/800], Step[0300/0626], Avg Loss: 0.7046 +INFO:local_logger:Epoch[058/800], Step[0300/0626], Avg Loss: 0.7051 +INFO:local_logger:Epoch[058/800], Step[0300/0626], Avg Loss: 0.7046 +INFO:master_logger:Epoch[058/800], Step[0300/0626], Avg Loss: 0.7047 +INFO:local_logger:Epoch[058/800], Step[0300/0626], Avg Loss: 0.7049 +INFO:local_logger:Epoch[058/800], Step[0300/0626], Avg Loss: 0.7035 +INFO:local_logger:Epoch[058/800], Step[0400/0626], Avg Loss: 0.7047 +INFO:local_logger:Epoch[058/800], Step[0400/0626], Avg Loss: 0.7047 +INFO:local_logger:Epoch[058/800], Step[0400/0626], Avg Loss: 0.7048 +INFO:local_logger:Epoch[058/800], Step[0400/0626], Avg Loss: 0.7035 +INFO:local_logger:Epoch[058/800], Step[0400/0626], Avg Loss: 0.7048 +INFO:master_logger:Epoch[058/800], Step[0400/0626], Avg Loss: 0.7046 +INFO:local_logger:Epoch[058/800], Step[0400/0626], Avg Loss: 0.7047 +INFO:local_logger:Epoch[058/800], Step[0400/0626], Avg Loss: 0.7046 +INFO:local_logger:Epoch[058/800], Step[0400/0626], Avg Loss: 0.7047 +INFO:local_logger:Epoch[058/800], Step[0500/0626], Avg Loss: 0.7048 +INFO:local_logger:Epoch[058/800], Step[0500/0626], Avg Loss: 0.7046 +INFO:local_logger:Epoch[058/800], Step[0500/0626], Avg Loss: 0.7046 +INFO:local_logger:Epoch[058/800], Step[0500/0626], Avg Loss: 0.7046 +INFO:local_logger:Epoch[058/800], Step[0500/0626], Avg Loss: 0.7046 +INFO:local_logger:Epoch[058/800], Step[0500/0626], Avg Loss: 0.7043 
+INFO:local_logger:Epoch[058/800], Step[0500/0626], Avg Loss: 0.7034 +INFO:master_logger:Epoch[058/800], Step[0500/0626], Avg Loss: 0.7044 +INFO:local_logger:Epoch[058/800], Step[0500/0626], Avg Loss: 0.7046 +INFO:local_logger:Epoch[058/800], Step[0600/0626], Avg Loss: 0.7046 +INFO:local_logger:Epoch[058/800], Step[0600/0626], Avg Loss: 0.7046 +INFO:local_logger:Epoch[058/800], Step[0600/0626], Avg Loss: 0.7045 +INFO:local_logger:Epoch[058/800], Step[0600/0626], Avg Loss: 0.7036 +INFO:local_logger:Epoch[058/800], Step[0600/0626], Avg Loss: 0.7045 +INFO:local_logger:Epoch[058/800], Step[0600/0626], Avg Loss: 0.7046 +INFO:master_logger:Epoch[058/800], Step[0600/0626], Avg Loss: 0.7044 +INFO:local_logger:Epoch[058/800], Step[0600/0626], Avg Loss: 0.7045 +INFO:local_logger:Epoch[058/800], Step[0600/0626], Avg Loss: 0.7042 +INFO:local_logger:----- Epoch[058/800], Train Loss: 0.7045, time: 883.86 +INFO:master_logger:----- Epoch[058/800], Train Loss: 0.7044, time: 883.86 +INFO:local_logger:----- Epoch[058/800], Train Loss: 0.7044, time: 888.09 +INFO:local_logger:Now training epoch 59. LR=0.000151 +INFO:local_logger:----- Epoch[058/800], Train Loss: 0.7043, time: 888.09 +INFO:local_logger:Now training epoch 59. LR=0.000151 +INFO:local_logger:----- Epoch[058/800], Train Loss: 0.7045, time: 887.67 +INFO:local_logger:Now training epoch 59. LR=0.000151 +INFO:local_logger:----- Epoch[058/800], Train Loss: 0.7036, time: 887.78 +INFO:local_logger:Now training epoch 59. LR=0.000151 +INFO:local_logger:----- Epoch[058/800], Train Loss: 0.7046, time: 888.37 +INFO:local_logger:Now training epoch 59. LR=0.000151 +INFO:local_logger:----- Epoch[058/800], Train Loss: 0.7045, time: 887.96 +INFO:local_logger:Now training epoch 59. LR=0.000151 +INFO:local_logger:----- Epoch[058/800], Train Loss: 0.7046, time: 887.96 +INFO:local_logger:Now training epoch 59. LR=0.000151 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-58-Loss-0.704508643255555.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-58-Loss-0.704508643255555.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-58-Loss-0.704508643255555.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-58-Loss-0.704508643255555.pdopt +INFO:local_logger:Now training epoch 59. LR=0.000151 +INFO:master_logger:Now training epoch 59. 
LR=0.000151 +INFO:local_logger:Epoch[059/800], Step[0000/0626], Avg Loss: 0.6945 +INFO:master_logger:Epoch[059/800], Step[0000/0626], Avg Loss: 0.7041 +INFO:local_logger:Epoch[059/800], Step[0000/0626], Avg Loss: 0.7028 +INFO:local_logger:Epoch[059/800], Step[0000/0626], Avg Loss: 0.7075 +INFO:local_logger:Epoch[059/800], Step[0000/0626], Avg Loss: 0.7132 +INFO:local_logger:Epoch[059/800], Step[0000/0626], Avg Loss: 0.7134 +INFO:local_logger:Epoch[059/800], Step[0000/0626], Avg Loss: 0.7138 +INFO:local_logger:Epoch[059/800], Step[0000/0626], Avg Loss: 0.6925 +INFO:local_logger:Epoch[059/800], Step[0000/0626], Avg Loss: 0.6949 +INFO:local_logger:Epoch[059/800], Step[0100/0626], Avg Loss: 0.7022 +INFO:local_logger:Epoch[059/800], Step[0100/0626], Avg Loss: 0.7047 +INFO:local_logger:Epoch[059/800], Step[0100/0626], Avg Loss: 0.7039 +INFO:local_logger:Epoch[059/800], Step[0100/0626], Avg Loss: 0.7043 +INFO:master_logger:Epoch[059/800], Step[0100/0626], Avg Loss: 0.7041 +INFO:local_logger:Epoch[059/800], Step[0100/0626], Avg Loss: 0.7048 +INFO:local_logger:Epoch[059/800], Step[0100/0626], Avg Loss: 0.7046 +INFO:local_logger:Epoch[059/800], Step[0100/0626], Avg Loss: 0.7044 +INFO:local_logger:Epoch[059/800], Step[0100/0626], Avg Loss: 0.7034 +INFO:local_logger:Epoch[059/800], Step[0200/0626], Avg Loss: 0.7040 +INFO:local_logger:Epoch[059/800], Step[0200/0626], Avg Loss: 0.7040 +INFO:local_logger:Epoch[059/800], Step[0200/0626], Avg Loss: 0.7041 +INFO:local_logger:Epoch[059/800], Step[0200/0626], Avg Loss: 0.7032 +INFO:local_logger:Epoch[059/800], Step[0200/0626], Avg Loss: 0.7036 +INFO:local_logger:Epoch[059/800], Step[0200/0626], Avg Loss: 0.7039 +INFO:master_logger:Epoch[059/800], Step[0200/0626], Avg Loss: 0.7037 +INFO:local_logger:Epoch[059/800], Step[0200/0626], Avg Loss: 0.7021 +INFO:local_logger:Epoch[059/800], Step[0200/0626], Avg Loss: 0.7043 +INFO:local_logger:Epoch[059/800], Step[0300/0626], Avg Loss: 0.7041 +INFO:local_logger:Epoch[059/800], Step[0300/0626], Avg Loss: 0.7036 +INFO:local_logger:Epoch[059/800], Step[0300/0626], Avg Loss: 0.7036 +INFO:local_logger:Epoch[059/800], Step[0300/0626], Avg Loss: 0.7040 +INFO:master_logger:Epoch[059/800], Step[0300/0626], Avg Loss: 0.7036 +INFO:local_logger:Epoch[059/800], Step[0300/0626], Avg Loss: 0.7033 +INFO:local_logger:Epoch[059/800], Step[0300/0626], Avg Loss: 0.7027 +INFO:local_logger:Epoch[059/800], Step[0300/0626], Avg Loss: 0.7040 +INFO:local_logger:Epoch[059/800], Step[0300/0626], Avg Loss: 0.7034 +INFO:local_logger:Epoch[059/800], Step[0400/0626], Avg Loss: 0.7037 +INFO:local_logger:Epoch[059/800], Step[0400/0626], Avg Loss: 0.7040 +INFO:local_logger:Epoch[059/800], Step[0400/0626], Avg Loss: 0.7029 +INFO:local_logger:Epoch[059/800], Step[0400/0626], Avg Loss: 0.7037 +INFO:local_logger:Epoch[059/800], Step[0400/0626], Avg Loss: 0.7035 +INFO:local_logger:Epoch[059/800], Step[0400/0626], Avg Loss: 0.7033 +INFO:local_logger:Epoch[059/800], Step[0400/0626], Avg Loss: 0.7036 +INFO:local_logger:Epoch[059/800], Step[0400/0626], Avg Loss: 0.7039 +INFO:master_logger:Epoch[059/800], Step[0400/0626], Avg Loss: 0.7036 +INFO:local_logger:Epoch[059/800], Step[0500/0626], Avg Loss: 0.7036 +INFO:local_logger:Epoch[059/800], Step[0500/0626], Avg Loss: 0.7037 +INFO:local_logger:Epoch[059/800], Step[0500/0626], Avg Loss: 0.7034 +INFO:local_logger:Epoch[059/800], Step[0500/0626], Avg Loss: 0.7039 +INFO:local_logger:Epoch[059/800], Step[0500/0626], Avg Loss: 0.7031 +INFO:local_logger:Epoch[059/800], Step[0500/0626], Avg Loss: 0.7037 
+INFO:master_logger:Epoch[059/800], Step[0500/0626], Avg Loss: 0.7035 +INFO:local_logger:Epoch[059/800], Step[0500/0626], Avg Loss: 0.7031 +INFO:local_logger:Epoch[059/800], Step[0500/0626], Avg Loss: 0.7036 +INFO:local_logger:Epoch[059/800], Step[0600/0626], Avg Loss: 0.7035 +INFO:local_logger:Epoch[059/800], Step[0600/0626], Avg Loss: 0.7037 +INFO:local_logger:Epoch[059/800], Step[0600/0626], Avg Loss: 0.7035 +INFO:local_logger:Epoch[059/800], Step[0600/0626], Avg Loss: 0.7034 +INFO:master_logger:Epoch[059/800], Step[0600/0626], Avg Loss: 0.7034 +INFO:local_logger:Epoch[059/800], Step[0600/0626], Avg Loss: 0.7037 +INFO:local_logger:Epoch[059/800], Step[0600/0626], Avg Loss: 0.7034 +INFO:local_logger:Epoch[059/800], Step[0600/0626], Avg Loss: 0.7032 +INFO:local_logger:Epoch[059/800], Step[0600/0626], Avg Loss: 0.7030 +INFO:local_logger:----- Epoch[059/800], Train Loss: 0.7034, time: 855.63 +INFO:master_logger:----- Epoch[059/800], Train Loss: 0.7034, time: 855.63 +INFO:local_logger:----- Epoch[059/800], Train Loss: 0.7034, time: 859.41 +INFO:local_logger:Now training epoch 60. LR=0.000151 +INFO:local_logger:----- Epoch[059/800], Train Loss: 0.7035, time: 859.16 +INFO:local_logger:Now training epoch 60. LR=0.000151 +INFO:local_logger:----- Epoch[059/800], Train Loss: 0.7030, time: 859.83 +INFO:local_logger:Now training epoch 60. LR=0.000151 +INFO:local_logger:----- Epoch[059/800], Train Loss: 0.7033, time: 859.66 +INFO:local_logger:Now training epoch 60. LR=0.000151 +INFO:local_logger:----- Epoch[059/800], Train Loss: 0.7036, time: 859.94 +INFO:local_logger:Now training epoch 60. LR=0.000151 +INFO:local_logger:----- Epoch[059/800], Train Loss: 0.7037, time: 859.65 +INFO:local_logger:Now training epoch 60. LR=0.000151 +INFO:local_logger:----- Epoch[059/800], Train Loss: 0.7032, time: 859.97 +INFO:local_logger:Now training epoch 60. LR=0.000151 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-59-Loss-0.7033895635093321.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-59-Loss-0.7033895635093321.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-59-Loss-0.7033895635093321.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-59-Loss-0.7033895635093321.pdopt +INFO:local_logger:Now training epoch 60. LR=0.000151 +INFO:master_logger:Now training epoch 60. 
LR=0.000151 +INFO:local_logger:Epoch[060/800], Step[0000/0626], Avg Loss: 0.7205 +INFO:local_logger:Epoch[060/800], Step[0000/0626], Avg Loss: 0.7122 +INFO:local_logger:Epoch[060/800], Step[0000/0626], Avg Loss: 0.7150 +INFO:master_logger:Epoch[060/800], Step[0000/0626], Avg Loss: 0.7093 +INFO:local_logger:Epoch[060/800], Step[0000/0626], Avg Loss: 0.7005 +INFO:local_logger:Epoch[060/800], Step[0000/0626], Avg Loss: 0.6991 +INFO:local_logger:Epoch[060/800], Step[0000/0626], Avg Loss: 0.7068 +INFO:local_logger:Epoch[060/800], Step[0000/0626], Avg Loss: 0.7212 +INFO:local_logger:Epoch[060/800], Step[0000/0626], Avg Loss: 0.6992 +INFO:local_logger:Epoch[060/800], Step[0100/0626], Avg Loss: 0.7019 +INFO:local_logger:Epoch[060/800], Step[0100/0626], Avg Loss: 0.7031 +INFO:local_logger:Epoch[060/800], Step[0100/0626], Avg Loss: 0.7024 +INFO:local_logger:Epoch[060/800], Step[0100/0626], Avg Loss: 0.7027 +INFO:local_logger:Epoch[060/800], Step[0100/0626], Avg Loss: 0.7024 +INFO:local_logger:Epoch[060/800], Step[0100/0626], Avg Loss: 0.7041 +INFO:master_logger:Epoch[060/800], Step[0100/0626], Avg Loss: 0.7029 +INFO:local_logger:Epoch[060/800], Step[0100/0626], Avg Loss: 0.7026 +INFO:local_logger:Epoch[060/800], Step[0100/0626], Avg Loss: 0.7037 +INFO:local_logger:Epoch[060/800], Step[0200/0626], Avg Loss: 0.7033 +INFO:local_logger:Epoch[060/800], Step[0200/0626], Avg Loss: 0.7022 +INFO:local_logger:Epoch[060/800], Step[0200/0626], Avg Loss: 0.7043 +INFO:local_logger:Epoch[060/800], Step[0200/0626], Avg Loss: 0.7025 +INFO:local_logger:Epoch[060/800], Step[0200/0626], Avg Loss: 0.7025 +INFO:local_logger:Epoch[060/800], Step[0200/0626], Avg Loss: 0.7030 +INFO:master_logger:Epoch[060/800], Step[0200/0626], Avg Loss: 0.7029 +INFO:local_logger:Epoch[060/800], Step[0200/0626], Avg Loss: 0.7025 +INFO:local_logger:Epoch[060/800], Step[0200/0626], Avg Loss: 0.7033 +INFO:local_logger:Epoch[060/800], Step[0300/0626], Avg Loss: 0.7030 +INFO:local_logger:Epoch[060/800], Step[0300/0626], Avg Loss: 0.7022 +INFO:local_logger:Epoch[060/800], Step[0300/0626], Avg Loss: 0.7026 +INFO:local_logger:Epoch[060/800], Step[0300/0626], Avg Loss: 0.7029 +INFO:local_logger:Epoch[060/800], Step[0300/0626], Avg Loss: 0.7034 +INFO:master_logger:Epoch[060/800], Step[0300/0626], Avg Loss: 0.7028 +INFO:local_logger:Epoch[060/800], Step[0300/0626], Avg Loss: 0.7027 +INFO:local_logger:Epoch[060/800], Step[0300/0626], Avg Loss: 0.7025 +INFO:local_logger:Epoch[060/800], Step[0300/0626], Avg Loss: 0.7030 +INFO:local_logger:Epoch[060/800], Step[0400/0626], Avg Loss: 0.7024 +INFO:local_logger:Epoch[060/800], Step[0400/0626], Avg Loss: 0.7024 +INFO:local_logger:Epoch[060/800], Step[0400/0626], Avg Loss: 0.7025 +INFO:local_logger:Epoch[060/800], Step[0400/0626], Avg Loss: 0.7026 +INFO:local_logger:Epoch[060/800], Step[0400/0626], Avg Loss: 0.7031 +INFO:local_logger:Epoch[060/800], Step[0400/0626], Avg Loss: 0.7029 +INFO:master_logger:Epoch[060/800], Step[0400/0626], Avg Loss: 0.7026 +INFO:local_logger:Epoch[060/800], Step[0400/0626], Avg Loss: 0.7028 +INFO:local_logger:Epoch[060/800], Step[0400/0626], Avg Loss: 0.7023 +INFO:local_logger:Epoch[060/800], Step[0500/0626], Avg Loss: 0.7026 +INFO:local_logger:Epoch[060/800], Step[0500/0626], Avg Loss: 0.7023 +INFO:local_logger:Epoch[060/800], Step[0500/0626], Avg Loss: 0.7029 +INFO:local_logger:Epoch[060/800], Step[0500/0626], Avg Loss: 0.7024 +INFO:local_logger:Epoch[060/800], Step[0500/0626], Avg Loss: 0.7029 +INFO:local_logger:Epoch[060/800], Step[0500/0626], Avg Loss: 0.7025 
+INFO:local_logger:Epoch[060/800], Step[0500/0626], Avg Loss: 0.7028 +INFO:local_logger:Epoch[060/800], Step[0500/0626], Avg Loss: 0.7022 +INFO:master_logger:Epoch[060/800], Step[0500/0626], Avg Loss: 0.7026 +INFO:local_logger:Epoch[060/800], Step[0600/0626], Avg Loss: 0.7026 +INFO:local_logger:Epoch[060/800], Step[0600/0626], Avg Loss: 0.7029 +INFO:local_logger:Epoch[060/800], Step[0600/0626], Avg Loss: 0.7025 +INFO:master_logger:Epoch[060/800], Step[0600/0626], Avg Loss: 0.7025 +INFO:local_logger:Epoch[060/800], Step[0600/0626], Avg Loss: 0.7023 +INFO:local_logger:Epoch[060/800], Step[0600/0626], Avg Loss: 0.7021 +INFO:local_logger:Epoch[060/800], Step[0600/0626], Avg Loss: 0.7023 +INFO:local_logger:Epoch[060/800], Step[0600/0626], Avg Loss: 0.7024 +INFO:local_logger:Epoch[060/800], Step[0600/0626], Avg Loss: 0.7028 +INFO:local_logger:----- Epoch[060/800], Train Loss: 0.7020, time: 883.22 +INFO:local_logger:Now training epoch 61. LR=0.000151 +INFO:local_logger:----- Epoch[060/800], Train Loss: 0.7027, time: 884.34 +INFO:local_logger:Now training epoch 61. LR=0.000151 +INFO:local_logger:----- Epoch[060/800], Train Loss: 0.7025, time: 883.83 +INFO:local_logger:Now training epoch 61. LR=0.000151 +INFO:local_logger:----- Epoch[060/800], Train Loss: 0.7022, time: 883.86 +INFO:local_logger:Now training epoch 61. LR=0.000151 +INFO:local_logger:----- Epoch[060/800], Train Loss: 0.7030, time: 883.87 +INFO:local_logger:Now training epoch 61. LR=0.000151 +INFO:local_logger:----- Epoch[060/800], Train Loss: 0.7023, time: 883.92 +INFO:local_logger:Now training epoch 61. LR=0.000151 +INFO:local_logger:----- Epoch[060/800], Train Loss: 0.7024, time: 883.95 +INFO:local_logger:Now training epoch 61. LR=0.000151 +INFO:local_logger:----- Epoch[060/800], Train Loss: 0.7025, time: 880.74 +INFO:master_logger:----- Epoch[060/800], Train Loss: 0.7024, time: 880.74 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-60-Loss-0.7024735177213701.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-60-Loss-0.7024735177213701.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-60-Loss-0.7024735177213701.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-60-Loss-0.7024735177213701.pdopt +INFO:local_logger:Now training epoch 61. LR=0.000151 +INFO:master_logger:Now training epoch 61. 
LR=0.000151 +INFO:local_logger:Epoch[061/800], Step[0000/0626], Avg Loss: 0.6994 +INFO:local_logger:Epoch[061/800], Step[0000/0626], Avg Loss: 0.6940 +INFO:master_logger:Epoch[061/800], Step[0000/0626], Avg Loss: 0.6945 +INFO:local_logger:Epoch[061/800], Step[0000/0626], Avg Loss: 0.6854 +INFO:local_logger:Epoch[061/800], Step[0000/0626], Avg Loss: 0.6741 +INFO:local_logger:Epoch[061/800], Step[0000/0626], Avg Loss: 0.7025 +INFO:local_logger:Epoch[061/800], Step[0000/0626], Avg Loss: 0.7017 +INFO:local_logger:Epoch[061/800], Step[0000/0626], Avg Loss: 0.6991 +INFO:local_logger:Epoch[061/800], Step[0000/0626], Avg Loss: 0.7000 +INFO:local_logger:Epoch[061/800], Step[0100/0626], Avg Loss: 0.7019 +INFO:local_logger:Epoch[061/800], Step[0100/0626], Avg Loss: 0.7020 +INFO:local_logger:Epoch[061/800], Step[0100/0626], Avg Loss: 0.7019 +INFO:local_logger:Epoch[061/800], Step[0100/0626], Avg Loss: 0.7011 +INFO:local_logger:Epoch[061/800], Step[0100/0626], Avg Loss: 0.7002 +INFO:local_logger:Epoch[061/800], Step[0100/0626], Avg Loss: 0.7016 +INFO:master_logger:Epoch[061/800], Step[0100/0626], Avg Loss: 0.7018 +INFO:local_logger:Epoch[061/800], Step[0100/0626], Avg Loss: 0.7024 +INFO:local_logger:Epoch[061/800], Step[0100/0626], Avg Loss: 0.7036 +INFO:local_logger:Epoch[061/800], Step[0200/0626], Avg Loss: 0.7020 +INFO:local_logger:Epoch[061/800], Step[0200/0626], Avg Loss: 0.7028 +INFO:local_logger:Epoch[061/800], Step[0200/0626], Avg Loss: 0.7020 +INFO:local_logger:Epoch[061/800], Step[0200/0626], Avg Loss: 0.7015 +INFO:local_logger:Epoch[061/800], Step[0200/0626], Avg Loss: 0.7027 +INFO:local_logger:Epoch[061/800], Step[0200/0626], Avg Loss: 0.7022 +INFO:local_logger:Epoch[061/800], Step[0200/0626], Avg Loss: 0.7017 +INFO:local_logger:Epoch[061/800], Step[0200/0626], Avg Loss: 0.7011 +INFO:master_logger:Epoch[061/800], Step[0200/0626], Avg Loss: 0.7020 +INFO:local_logger:Epoch[061/800], Step[0300/0626], Avg Loss: 0.7016 +INFO:local_logger:Epoch[061/800], Step[0300/0626], Avg Loss: 0.7010 +INFO:local_logger:Epoch[061/800], Step[0300/0626], Avg Loss: 0.7024 +INFO:local_logger:Epoch[061/800], Step[0300/0626], Avg Loss: 0.7012 +INFO:local_logger:Epoch[061/800], Step[0300/0626], Avg Loss: 0.7020 +INFO:local_logger:Epoch[061/800], Step[0300/0626], Avg Loss: 0.7020 +INFO:master_logger:Epoch[061/800], Step[0300/0626], Avg Loss: 0.7018 +INFO:local_logger:Epoch[061/800], Step[0300/0626], Avg Loss: 0.7017 +INFO:local_logger:Epoch[061/800], Step[0300/0626], Avg Loss: 0.7026 +INFO:local_logger:Epoch[061/800], Step[0400/0626], Avg Loss: 0.7012 +INFO:local_logger:Epoch[061/800], Step[0400/0626], Avg Loss: 0.7012 +INFO:master_logger:Epoch[061/800], Step[0400/0626], Avg Loss: 0.7018 +INFO:local_logger:Epoch[061/800], Step[0400/0626], Avg Loss: 0.7019 +INFO:local_logger:Epoch[061/800], Step[0400/0626], Avg Loss: 0.7021 +INFO:local_logger:Epoch[061/800], Step[0400/0626], Avg Loss: 0.7020 +INFO:local_logger:Epoch[061/800], Step[0400/0626], Avg Loss: 0.7022 +INFO:local_logger:Epoch[061/800], Step[0400/0626], Avg Loss: 0.7016 +INFO:local_logger:Epoch[061/800], Step[0400/0626], Avg Loss: 0.7019 +INFO:local_logger:Epoch[061/800], Step[0500/0626], Avg Loss: 0.7012 +INFO:local_logger:Epoch[061/800], Step[0500/0626], Avg Loss: 0.7013 +INFO:local_logger:Epoch[061/800], Step[0500/0626], Avg Loss: 0.7017 +INFO:local_logger:Epoch[061/800], Step[0500/0626], Avg Loss: 0.7019 +INFO:local_logger:Epoch[061/800], Step[0500/0626], Avg Loss: 0.7022 +INFO:local_logger:Epoch[061/800], Step[0500/0626], Avg Loss: 0.7022 
+INFO:master_logger:Epoch[061/800], Step[0500/0626], Avg Loss: 0.7017 +INFO:local_logger:Epoch[061/800], Step[0500/0626], Avg Loss: 0.7018 +INFO:local_logger:Epoch[061/800], Step[0500/0626], Avg Loss: 0.7015 +INFO:local_logger:Epoch[061/800], Step[0600/0626], Avg Loss: 0.7021 +INFO:local_logger:Epoch[061/800], Step[0600/0626], Avg Loss: 0.7013 +INFO:local_logger:Epoch[061/800], Step[0600/0626], Avg Loss: 0.7018 +INFO:local_logger:Epoch[061/800], Step[0600/0626], Avg Loss: 0.7013 +INFO:local_logger:Epoch[061/800], Step[0600/0626], Avg Loss: 0.7012 +INFO:local_logger:Epoch[061/800], Step[0600/0626], Avg Loss: 0.7015 +INFO:local_logger:Epoch[061/800], Step[0600/0626], Avg Loss: 0.7014 +INFO:master_logger:Epoch[061/800], Step[0600/0626], Avg Loss: 0.7016 +INFO:local_logger:Epoch[061/800], Step[0600/0626], Avg Loss: 0.7019 +INFO:local_logger:----- Epoch[061/800], Train Loss: 0.7015, time: 862.93 +INFO:local_logger:Now training epoch 62. LR=0.000151 +INFO:local_logger:----- Epoch[061/800], Train Loss: 0.7014, time: 863.80 +INFO:local_logger:Now training epoch 62. LR=0.000151 +INFO:local_logger:----- Epoch[061/800], Train Loss: 0.7013, time: 863.80 +INFO:local_logger:Now training epoch 62. LR=0.000151 +INFO:local_logger:----- Epoch[061/800], Train Loss: 0.7012, time: 860.38 +INFO:master_logger:----- Epoch[061/800], Train Loss: 0.7015, time: 860.38 +INFO:local_logger:----- Epoch[061/800], Train Loss: 0.7017, time: 864.16 +INFO:local_logger:Now training epoch 62. LR=0.000151 +INFO:local_logger:----- Epoch[061/800], Train Loss: 0.7012, time: 864.25 +INFO:local_logger:Now training epoch 62. LR=0.000151 +INFO:local_logger:----- Epoch[061/800], Train Loss: 0.7019, time: 865.44 +INFO:local_logger:Now training epoch 62. LR=0.000151 +INFO:local_logger:----- Epoch[061/800], Train Loss: 0.7021, time: 864.25 +INFO:local_logger:Now training epoch 62. LR=0.000151 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-61-Loss-0.7012385332086987.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-61-Loss-0.7012385332086987.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-61-Loss-0.7012385332086987.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-61-Loss-0.7012385332086987.pdopt +INFO:local_logger:Now training epoch 62. LR=0.000151 +INFO:master_logger:Now training epoch 62. 
LR=0.000151 +INFO:local_logger:Epoch[062/800], Step[0000/0626], Avg Loss: 0.6976 +INFO:local_logger:Epoch[062/800], Step[0000/0626], Avg Loss: 0.6886 +INFO:local_logger:Epoch[062/800], Step[0000/0626], Avg Loss: 0.7062 +INFO:master_logger:Epoch[062/800], Step[0000/0626], Avg Loss: 0.7003 +INFO:local_logger:Epoch[062/800], Step[0000/0626], Avg Loss: 0.6922 +INFO:local_logger:Epoch[062/800], Step[0000/0626], Avg Loss: 0.7018 +INFO:local_logger:Epoch[062/800], Step[0000/0626], Avg Loss: 0.7010 +INFO:local_logger:Epoch[062/800], Step[0000/0626], Avg Loss: 0.7098 +INFO:local_logger:Epoch[062/800], Step[0000/0626], Avg Loss: 0.7053 +INFO:local_logger:Epoch[062/800], Step[0100/0626], Avg Loss: 0.7008 +INFO:local_logger:Epoch[062/800], Step[0100/0626], Avg Loss: 0.7009 +INFO:local_logger:Epoch[062/800], Step[0100/0626], Avg Loss: 0.7013 +INFO:local_logger:Epoch[062/800], Step[0100/0626], Avg Loss: 0.7006 +INFO:local_logger:Epoch[062/800], Step[0100/0626], Avg Loss: 0.7017 +INFO:local_logger:Epoch[062/800], Step[0100/0626], Avg Loss: 0.6986 +INFO:master_logger:Epoch[062/800], Step[0100/0626], Avg Loss: 0.7006 +INFO:local_logger:Epoch[062/800], Step[0100/0626], Avg Loss: 0.7006 +INFO:local_logger:Epoch[062/800], Step[0100/0626], Avg Loss: 0.7000 +INFO:local_logger:Epoch[062/800], Step[0200/0626], Avg Loss: 0.7008 +INFO:local_logger:Epoch[062/800], Step[0200/0626], Avg Loss: 0.7005 +INFO:local_logger:Epoch[062/800], Step[0200/0626], Avg Loss: 0.7011 +INFO:local_logger:Epoch[062/800], Step[0200/0626], Avg Loss: 0.6999 +INFO:local_logger:Epoch[062/800], Step[0200/0626], Avg Loss: 0.6996 +INFO:local_logger:Epoch[062/800], Step[0200/0626], Avg Loss: 0.7010 +INFO:local_logger:Epoch[062/800], Step[0200/0626], Avg Loss: 0.7009 +INFO:master_logger:Epoch[062/800], Step[0200/0626], Avg Loss: 0.7006 +INFO:local_logger:Epoch[062/800], Step[0200/0626], Avg Loss: 0.7012 +INFO:local_logger:Epoch[062/800], Step[0300/0626], Avg Loss: 0.6999 +INFO:local_logger:Epoch[062/800], Step[0300/0626], Avg Loss: 0.7012 +INFO:local_logger:Epoch[062/800], Step[0300/0626], Avg Loss: 0.7001 +INFO:local_logger:Epoch[062/800], Step[0300/0626], Avg Loss: 0.7006 +INFO:local_logger:Epoch[062/800], Step[0300/0626], Avg Loss: 0.7012 +INFO:local_logger:Epoch[062/800], Step[0300/0626], Avg Loss: 0.7011 +INFO:master_logger:Epoch[062/800], Step[0300/0626], Avg Loss: 0.7007 +INFO:local_logger:Epoch[062/800], Step[0300/0626], Avg Loss: 0.7008 +INFO:local_logger:Epoch[062/800], Step[0300/0626], Avg Loss: 0.7007 +INFO:local_logger:Epoch[062/800], Step[0400/0626], Avg Loss: 0.7004 +INFO:local_logger:Epoch[062/800], Step[0400/0626], Avg Loss: 0.7012 +INFO:local_logger:Epoch[062/800], Step[0400/0626], Avg Loss: 0.7005 +INFO:local_logger:Epoch[062/800], Step[0400/0626], Avg Loss: 0.7000 +INFO:master_logger:Epoch[062/800], Step[0400/0626], Avg Loss: 0.7006 +INFO:local_logger:Epoch[062/800], Step[0400/0626], Avg Loss: 0.7005 +INFO:local_logger:Epoch[062/800], Step[0400/0626], Avg Loss: 0.7008 +INFO:local_logger:Epoch[062/800], Step[0400/0626], Avg Loss: 0.7005 +INFO:local_logger:Epoch[062/800], Step[0400/0626], Avg Loss: 0.7011 +INFO:local_logger:Epoch[062/800], Step[0500/0626], Avg Loss: 0.7002 +INFO:local_logger:Epoch[062/800], Step[0500/0626], Avg Loss: 0.7014 +INFO:local_logger:Epoch[062/800], Step[0500/0626], Avg Loss: 0.7009 +INFO:local_logger:Epoch[062/800], Step[0500/0626], Avg Loss: 0.7009 +INFO:local_logger:Epoch[062/800], Step[0500/0626], Avg Loss: 0.7005 +INFO:local_logger:Epoch[062/800], Step[0500/0626], Avg Loss: 0.7006 
+INFO:local_logger:Epoch[062/800], Step[0500/0626], Avg Loss: 0.7002 +INFO:master_logger:Epoch[062/800], Step[0500/0626], Avg Loss: 0.7006 +INFO:local_logger:Epoch[062/800], Step[0500/0626], Avg Loss: 0.7005 +INFO:local_logger:Epoch[062/800], Step[0600/0626], Avg Loss: 0.7014 +INFO:local_logger:Epoch[062/800], Step[0600/0626], Avg Loss: 0.7008 +INFO:local_logger:Epoch[062/800], Step[0600/0626], Avg Loss: 0.7001 +INFO:local_logger:Epoch[062/800], Step[0600/0626], Avg Loss: 0.7008 +INFO:local_logger:Epoch[062/800], Step[0600/0626], Avg Loss: 0.7000 +INFO:local_logger:Epoch[062/800], Step[0600/0626], Avg Loss: 0.7006 +INFO:master_logger:Epoch[062/800], Step[0600/0626], Avg Loss: 0.7006 +INFO:local_logger:Epoch[062/800], Step[0600/0626], Avg Loss: 0.7002 +INFO:local_logger:Epoch[062/800], Step[0600/0626], Avg Loss: 0.7006 +INFO:local_logger:----- Epoch[062/800], Train Loss: 0.7001, time: 880.83 +INFO:local_logger:Now training epoch 63. LR=0.000151 +INFO:local_logger:----- Epoch[062/800], Train Loss: 0.7001, time: 877.21 +INFO:master_logger:----- Epoch[062/800], Train Loss: 0.7005, time: 877.21 +INFO:local_logger:----- Epoch[062/800], Train Loss: 0.7008, time: 880.96 +INFO:local_logger:Now training epoch 63. LR=0.000151 +INFO:local_logger:----- Epoch[062/800], Train Loss: 0.7008, time: 881.01 +INFO:local_logger:Now training epoch 63. LR=0.000151 +INFO:local_logger:----- Epoch[062/800], Train Loss: 0.7006, time: 881.03 +INFO:local_logger:Now training epoch 63. LR=0.000151 +INFO:local_logger:----- Epoch[062/800], Train Loss: 0.7005, time: 881.40 +INFO:local_logger:Now training epoch 63. LR=0.000151 +INFO:local_logger:----- Epoch[062/800], Train Loss: 0.7001, time: 881.51 +INFO:local_logger:Now training epoch 63. LR=0.000151 +INFO:local_logger:----- Epoch[062/800], Train Loss: 0.7013, time: 882.39 +INFO:local_logger:Now training epoch 63. LR=0.000151 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-62-Loss-0.7001281701651857.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-62-Loss-0.7001281701651857.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-62-Loss-0.7001281701651857.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-62-Loss-0.7001281701651857.pdopt +INFO:local_logger:Now training epoch 63. LR=0.000151 +INFO:master_logger:Now training epoch 63. 
LR=0.000151 +INFO:local_logger:Epoch[063/800], Step[0000/0626], Avg Loss: 0.7051 +INFO:local_logger:Epoch[063/800], Step[0000/0626], Avg Loss: 0.7172 +INFO:local_logger:Epoch[063/800], Step[0000/0626], Avg Loss: 0.6949 +INFO:master_logger:Epoch[063/800], Step[0000/0626], Avg Loss: 0.6984 +INFO:local_logger:Epoch[063/800], Step[0000/0626], Avg Loss: 0.7023 +INFO:local_logger:Epoch[063/800], Step[0000/0626], Avg Loss: 0.7033 +INFO:local_logger:Epoch[063/800], Step[0000/0626], Avg Loss: 0.6961 +INFO:local_logger:Epoch[063/800], Step[0000/0626], Avg Loss: 0.6916 +INFO:local_logger:Epoch[063/800], Step[0000/0626], Avg Loss: 0.6763 +INFO:local_logger:Epoch[063/800], Step[0100/0626], Avg Loss: 0.6997 +INFO:local_logger:Epoch[063/800], Step[0100/0626], Avg Loss: 0.7004 +INFO:local_logger:Epoch[063/800], Step[0100/0626], Avg Loss: 0.7001 +INFO:master_logger:Epoch[063/800], Step[0100/0626], Avg Loss: 0.7000 +INFO:local_logger:Epoch[063/800], Step[0100/0626], Avg Loss: 0.7002 +INFO:local_logger:Epoch[063/800], Step[0100/0626], Avg Loss: 0.7000 +INFO:local_logger:Epoch[063/800], Step[0100/0626], Avg Loss: 0.6995 +INFO:local_logger:Epoch[063/800], Step[0100/0626], Avg Loss: 0.7000 +INFO:local_logger:Epoch[063/800], Step[0100/0626], Avg Loss: 0.7001 +INFO:local_logger:Epoch[063/800], Step[0200/0626], Avg Loss: 0.7000 +INFO:local_logger:Epoch[063/800], Step[0200/0626], Avg Loss: 0.6995 +INFO:local_logger:Epoch[063/800], Step[0200/0626], Avg Loss: 0.7004 +INFO:local_logger:Epoch[063/800], Step[0200/0626], Avg Loss: 0.6985 +INFO:local_logger:Epoch[063/800], Step[0200/0626], Avg Loss: 0.7006 +INFO:local_logger:Epoch[063/800], Step[0200/0626], Avg Loss: 0.6999 +INFO:master_logger:Epoch[063/800], Step[0200/0626], Avg Loss: 0.6998 +INFO:local_logger:Epoch[063/800], Step[0200/0626], Avg Loss: 0.6997 +INFO:local_logger:Epoch[063/800], Step[0200/0626], Avg Loss: 0.7001 +INFO:local_logger:Epoch[063/800], Step[0300/0626], Avg Loss: 0.6999 +INFO:local_logger:Epoch[063/800], Step[0300/0626], Avg Loss: 0.7002 +INFO:local_logger:Epoch[063/800], Step[0300/0626], Avg Loss: 0.6997 +INFO:local_logger:Epoch[063/800], Step[0300/0626], Avg Loss: 0.7005 +INFO:local_logger:Epoch[063/800], Step[0300/0626], Avg Loss: 0.7013 +INFO:master_logger:Epoch[063/800], Step[0300/0626], Avg Loss: 0.7001 +INFO:local_logger:Epoch[063/800], Step[0300/0626], Avg Loss: 0.7002 +INFO:local_logger:Epoch[063/800], Step[0300/0626], Avg Loss: 0.6989 +INFO:local_logger:Epoch[063/800], Step[0300/0626], Avg Loss: 0.6998 +INFO:local_logger:Epoch[063/800], Step[0400/0626], Avg Loss: 0.6996 +INFO:local_logger:Epoch[063/800], Step[0400/0626], Avg Loss: 0.6997 +INFO:local_logger:Epoch[063/800], Step[0400/0626], Avg Loss: 0.6995 +INFO:local_logger:Epoch[063/800], Step[0400/0626], Avg Loss: 0.6991 +INFO:local_logger:Epoch[063/800], Step[0400/0626], Avg Loss: 0.7004 +INFO:local_logger:Epoch[063/800], Step[0400/0626], Avg Loss: 0.7011 +INFO:local_logger:Epoch[063/800], Step[0400/0626], Avg Loss: 0.7001 +INFO:master_logger:Epoch[063/800], Step[0400/0626], Avg Loss: 0.6999 +INFO:local_logger:Epoch[063/800], Step[0400/0626], Avg Loss: 0.7000 +INFO:local_logger:Epoch[063/800], Step[0500/0626], Avg Loss: 0.6998 +INFO:local_logger:Epoch[063/800], Step[0500/0626], Avg Loss: 0.7001 +INFO:local_logger:Epoch[063/800], Step[0500/0626], Avg Loss: 0.7000 +INFO:local_logger:Epoch[063/800], Step[0500/0626], Avg Loss: 0.6995 +INFO:local_logger:Epoch[063/800], Step[0500/0626], Avg Loss: 0.7001 +INFO:local_logger:Epoch[063/800], Step[0500/0626], Avg Loss: 0.7000 
+INFO:local_logger:Epoch[063/800], Step[0500/0626], Avg Loss: 0.7008 +INFO:master_logger:Epoch[063/800], Step[0500/0626], Avg Loss: 0.7000 +INFO:local_logger:Epoch[063/800], Step[0500/0626], Avg Loss: 0.6994 +INFO:local_logger:Epoch[063/800], Step[0600/0626], Avg Loss: 0.6997 +INFO:local_logger:Epoch[063/800], Step[0600/0626], Avg Loss: 0.6993 +INFO:local_logger:Epoch[063/800], Step[0600/0626], Avg Loss: 0.6993 +INFO:local_logger:Epoch[063/800], Step[0600/0626], Avg Loss: 0.6997 +INFO:local_logger:Epoch[063/800], Step[0600/0626], Avg Loss: 0.6998 +INFO:local_logger:Epoch[063/800], Step[0600/0626], Avg Loss: 0.7005 +INFO:local_logger:Epoch[063/800], Step[0600/0626], Avg Loss: 0.6997 +INFO:master_logger:Epoch[063/800], Step[0600/0626], Avg Loss: 0.6997 +INFO:local_logger:Epoch[063/800], Step[0600/0626], Avg Loss: 0.6998 +INFO:local_logger:----- Epoch[063/800], Train Loss: 0.6997, time: 870.66 +INFO:master_logger:----- Epoch[063/800], Train Loss: 0.6997, time: 870.66 +INFO:local_logger:----- Epoch[063/800], Train Loss: 0.6998, time: 874.64 +INFO:local_logger:Now training epoch 64. LR=0.000151 +INFO:local_logger:----- Epoch[063/800], Train Loss: 0.6993, time: 874.57 +INFO:local_logger:Now training epoch 64. LR=0.000151 +INFO:local_logger:----- Epoch[063/800], Train Loss: 0.6998, time: 874.56 +INFO:local_logger:Now training epoch 64. LR=0.000151 +INFO:local_logger:----- Epoch[063/800], Train Loss: 0.6997, time: 874.53 +INFO:local_logger:Now training epoch 64. LR=0.000151 +INFO:local_logger:----- Epoch[063/800], Train Loss: 0.6993, time: 874.53 +INFO:local_logger:Now training epoch 64. LR=0.000151 +INFO:local_logger:----- Epoch[063/800], Train Loss: 0.6998, time: 874.79 +INFO:local_logger:Now training epoch 64. LR=0.000151 +INFO:local_logger:----- Epoch[063/800], Train Loss: 0.7004, time: 874.66 +INFO:local_logger:Now training epoch 64. LR=0.000151 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-63-Loss-0.6997380468063858.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-63-Loss-0.6997380468063858.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-63-Loss-0.6997380468063858.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-63-Loss-0.6997380468063858.pdopt +INFO:local_logger:Now training epoch 64. LR=0.000151 +INFO:master_logger:Now training epoch 64. 
LR=0.000151 +INFO:local_logger:Epoch[064/800], Step[0000/0626], Avg Loss: 0.7015 +INFO:local_logger:Epoch[064/800], Step[0000/0626], Avg Loss: 0.6893 +INFO:master_logger:Epoch[064/800], Step[0000/0626], Avg Loss: 0.7009 +INFO:local_logger:Epoch[064/800], Step[0000/0626], Avg Loss: 0.7021 +INFO:local_logger:Epoch[064/800], Step[0000/0626], Avg Loss: 0.6949 +INFO:local_logger:Epoch[064/800], Step[0000/0626], Avg Loss: 0.7124 +INFO:local_logger:Epoch[064/800], Step[0000/0626], Avg Loss: 0.7052 +INFO:local_logger:Epoch[064/800], Step[0000/0626], Avg Loss: 0.7017 +INFO:local_logger:Epoch[064/800], Step[0000/0626], Avg Loss: 0.6999 +INFO:local_logger:Epoch[064/800], Step[0100/0626], Avg Loss: 0.7004 +INFO:local_logger:Epoch[064/800], Step[0100/0626], Avg Loss: 0.6997 +INFO:local_logger:Epoch[064/800], Step[0100/0626], Avg Loss: 0.6991 +INFO:local_logger:Epoch[064/800], Step[0100/0626], Avg Loss: 0.6999 +INFO:local_logger:Epoch[064/800], Step[0100/0626], Avg Loss: 0.6988 +INFO:local_logger:Epoch[064/800], Step[0100/0626], Avg Loss: 0.6993 +INFO:master_logger:Epoch[064/800], Step[0100/0626], Avg Loss: 0.6995 +INFO:local_logger:Epoch[064/800], Step[0100/0626], Avg Loss: 0.6986 +INFO:local_logger:Epoch[064/800], Step[0100/0626], Avg Loss: 0.6997 +INFO:local_logger:Epoch[064/800], Step[0200/0626], Avg Loss: 0.6995 +INFO:local_logger:Epoch[064/800], Step[0200/0626], Avg Loss: 0.6994 +INFO:local_logger:Epoch[064/800], Step[0200/0626], Avg Loss: 0.6987 +INFO:local_logger:Epoch[064/800], Step[0200/0626], Avg Loss: 0.7007 +INFO:local_logger:Epoch[064/800], Step[0200/0626], Avg Loss: 0.6995 +INFO:local_logger:Epoch[064/800], Step[0200/0626], Avg Loss: 0.6998 +INFO:master_logger:Epoch[064/800], Step[0200/0626], Avg Loss: 0.6997 +INFO:local_logger:Epoch[064/800], Step[0200/0626], Avg Loss: 0.6999 +INFO:local_logger:Epoch[064/800], Step[0200/0626], Avg Loss: 0.7003 +INFO:local_logger:Epoch[064/800], Step[0300/0626], Avg Loss: 0.6992 +INFO:local_logger:Epoch[064/800], Step[0300/0626], Avg Loss: 0.6998 +INFO:local_logger:Epoch[064/800], Step[0300/0626], Avg Loss: 0.6999 +INFO:local_logger:Epoch[064/800], Step[0300/0626], Avg Loss: 0.6997 +INFO:local_logger:Epoch[064/800], Step[0300/0626], Avg Loss: 0.6986 +INFO:master_logger:Epoch[064/800], Step[0300/0626], Avg Loss: 0.6994 +INFO:local_logger:Epoch[064/800], Step[0300/0626], Avg Loss: 0.6991 +INFO:local_logger:Epoch[064/800], Step[0300/0626], Avg Loss: 0.6993 +INFO:local_logger:Epoch[064/800], Step[0300/0626], Avg Loss: 0.6998 +INFO:local_logger:Epoch[064/800], Step[0400/0626], Avg Loss: 0.6996 +INFO:local_logger:Epoch[064/800], Step[0400/0626], Avg Loss: 0.6997 +INFO:local_logger:Epoch[064/800], Step[0400/0626], Avg Loss: 0.6995 +INFO:local_logger:Epoch[064/800], Step[0400/0626], Avg Loss: 0.6996 +INFO:local_logger:Epoch[064/800], Step[0400/0626], Avg Loss: 0.6992 +INFO:master_logger:Epoch[064/800], Step[0400/0626], Avg Loss: 0.6993 +INFO:local_logger:Epoch[064/800], Step[0400/0626], Avg Loss: 0.6992 +INFO:local_logger:Epoch[064/800], Step[0400/0626], Avg Loss: 0.6992 +INFO:local_logger:Epoch[064/800], Step[0400/0626], Avg Loss: 0.6984 +INFO:local_logger:Epoch[064/800], Step[0500/0626], Avg Loss: 0.6989 +INFO:local_logger:Epoch[064/800], Step[0500/0626], Avg Loss: 0.6983 +INFO:local_logger:Epoch[064/800], Step[0500/0626], Avg Loss: 0.6994 +INFO:local_logger:Epoch[064/800], Step[0500/0626], Avg Loss: 0.6992 +INFO:local_logger:Epoch[064/800], Step[0500/0626], Avg Loss: 0.6996 +INFO:local_logger:Epoch[064/800], Step[0500/0626], Avg Loss: 0.6991 
+INFO:local_logger:Epoch[064/800], Step[0500/0626], Avg Loss: 0.6988 +INFO:master_logger:Epoch[064/800], Step[0500/0626], Avg Loss: 0.6991 +INFO:local_logger:Epoch[064/800], Step[0500/0626], Avg Loss: 0.6994 +INFO:local_logger:Epoch[064/800], Step[0600/0626], Avg Loss: 0.6991 +INFO:local_logger:Epoch[064/800], Step[0600/0626], Avg Loss: 0.6992 +INFO:local_logger:Epoch[064/800], Step[0600/0626], Avg Loss: 0.6987 +INFO:local_logger:Epoch[064/800], Step[0600/0626], Avg Loss: 0.6993 +INFO:local_logger:Epoch[064/800], Step[0600/0626], Avg Loss: 0.6989 +INFO:master_logger:Epoch[064/800], Step[0600/0626], Avg Loss: 0.6990 +INFO:local_logger:Epoch[064/800], Step[0600/0626], Avg Loss: 0.6988 +INFO:local_logger:Epoch[064/800], Step[0600/0626], Avg Loss: 0.6993 +INFO:local_logger:Epoch[064/800], Step[0600/0626], Avg Loss: 0.6984 +INFO:local_logger:----- Epoch[064/800], Train Loss: 0.6992, time: 883.12 +INFO:local_logger:Now training epoch 65. LR=0.000151 +INFO:local_logger:----- Epoch[064/800], Train Loss: 0.6983, time: 883.49 +INFO:local_logger:Now training epoch 65. LR=0.000151 +INFO:local_logger:----- Epoch[064/800], Train Loss: 0.6991, time: 883.58 +INFO:local_logger:Now training epoch 65. LR=0.000151 +INFO:local_logger:----- Epoch[064/800], Train Loss: 0.6987, time: 883.60 +INFO:local_logger:Now training epoch 65. LR=0.000151 +INFO:local_logger:----- Epoch[064/800], Train Loss: 0.6992, time: 880.22 +INFO:master_logger:----- Epoch[064/800], Train Loss: 0.6990, time: 880.22 +INFO:local_logger:----- Epoch[064/800], Train Loss: 0.6988, time: 883.73 +INFO:local_logger:Now training epoch 65. LR=0.000151 +INFO:local_logger:----- Epoch[064/800], Train Loss: 0.6989, time: 883.71 +INFO:local_logger:Now training epoch 65. LR=0.000151 +INFO:local_logger:----- Epoch[064/800], Train Loss: 0.6993, time: 883.76 +INFO:local_logger:Now training epoch 65. LR=0.000151 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-64-Loss-0.6992388257000982.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-64-Loss-0.6992388257000982.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-64-Loss-0.6992388257000982.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-64-Loss-0.6992388257000982.pdopt +INFO:local_logger:Now training epoch 65. LR=0.000151 +INFO:master_logger:Now training epoch 65. 
LR=0.000151 +INFO:local_logger:Epoch[065/800], Step[0000/0626], Avg Loss: 0.7020 +INFO:local_logger:Epoch[065/800], Step[0000/0626], Avg Loss: 0.7021 +INFO:local_logger:Epoch[065/800], Step[0000/0626], Avg Loss: 0.7097 +INFO:master_logger:Epoch[065/800], Step[0000/0626], Avg Loss: 0.6995 +INFO:local_logger:Epoch[065/800], Step[0000/0626], Avg Loss: 0.7040 +INFO:local_logger:Epoch[065/800], Step[0000/0626], Avg Loss: 0.6936 +INFO:local_logger:Epoch[065/800], Step[0000/0626], Avg Loss: 0.6956 +INFO:local_logger:Epoch[065/800], Step[0000/0626], Avg Loss: 0.6907 +INFO:local_logger:Epoch[065/800], Step[0000/0626], Avg Loss: 0.6983 +INFO:local_logger:Epoch[065/800], Step[0100/0626], Avg Loss: 0.6981 +INFO:local_logger:Epoch[065/800], Step[0100/0626], Avg Loss: 0.6980 +INFO:local_logger:Epoch[065/800], Step[0100/0626], Avg Loss: 0.6977 +INFO:local_logger:Epoch[065/800], Step[0100/0626], Avg Loss: 0.6977 +INFO:local_logger:Epoch[065/800], Step[0100/0626], Avg Loss: 0.6990 +INFO:local_logger:Epoch[065/800], Step[0100/0626], Avg Loss: 0.6990 +INFO:local_logger:Epoch[065/800], Step[0100/0626], Avg Loss: 0.6972 +INFO:local_logger:Epoch[065/800], Step[0100/0626], Avg Loss: 0.6976 +INFO:master_logger:Epoch[065/800], Step[0100/0626], Avg Loss: 0.6980 +INFO:local_logger:Epoch[065/800], Step[0200/0626], Avg Loss: 0.6989 +INFO:local_logger:Epoch[065/800], Step[0200/0626], Avg Loss: 0.6979 +INFO:local_logger:Epoch[065/800], Step[0200/0626], Avg Loss: 0.6986 +INFO:local_logger:Epoch[065/800], Step[0200/0626], Avg Loss: 0.6989 +INFO:local_logger:Epoch[065/800], Step[0200/0626], Avg Loss: 0.6983 +INFO:master_logger:Epoch[065/800], Step[0200/0626], Avg Loss: 0.6984 +INFO:local_logger:Epoch[065/800], Step[0200/0626], Avg Loss: 0.6981 +INFO:local_logger:Epoch[065/800], Step[0200/0626], Avg Loss: 0.6985 +INFO:local_logger:Epoch[065/800], Step[0200/0626], Avg Loss: 0.6981 +INFO:local_logger:Epoch[065/800], Step[0300/0626], Avg Loss: 0.6983 +INFO:local_logger:Epoch[065/800], Step[0300/0626], Avg Loss: 0.6990 +INFO:local_logger:Epoch[065/800], Step[0300/0626], Avg Loss: 0.6980 +INFO:local_logger:Epoch[065/800], Step[0300/0626], Avg Loss: 0.6984 +INFO:local_logger:Epoch[065/800], Step[0300/0626], Avg Loss: 0.6986 +INFO:local_logger:Epoch[065/800], Step[0300/0626], Avg Loss: 0.6982 +INFO:master_logger:Epoch[065/800], Step[0300/0626], Avg Loss: 0.6984 +INFO:local_logger:Epoch[065/800], Step[0300/0626], Avg Loss: 0.6984 +INFO:local_logger:Epoch[065/800], Step[0300/0626], Avg Loss: 0.6979 +INFO:local_logger:Epoch[065/800], Step[0400/0626], Avg Loss: 0.6985 +INFO:local_logger:Epoch[065/800], Step[0400/0626], Avg Loss: 0.6980 +INFO:local_logger:Epoch[065/800], Step[0400/0626], Avg Loss: 0.6981 +INFO:local_logger:Epoch[065/800], Step[0400/0626], Avg Loss: 0.6990 +INFO:master_logger:Epoch[065/800], Step[0400/0626], Avg Loss: 0.6982 +INFO:local_logger:Epoch[065/800], Step[0400/0626], Avg Loss: 0.6980 +INFO:local_logger:Epoch[065/800], Step[0400/0626], Avg Loss: 0.6978 +INFO:local_logger:Epoch[065/800], Step[0400/0626], Avg Loss: 0.6981 +INFO:local_logger:Epoch[065/800], Step[0400/0626], Avg Loss: 0.6982 +INFO:local_logger:Epoch[065/800], Step[0500/0626], Avg Loss: 0.6984 +INFO:local_logger:Epoch[065/800], Step[0500/0626], Avg Loss: 0.6983 +INFO:local_logger:Epoch[065/800], Step[0500/0626], Avg Loss: 0.6990 +INFO:local_logger:Epoch[065/800], Step[0500/0626], Avg Loss: 0.6981 +INFO:local_logger:Epoch[065/800], Step[0500/0626], Avg Loss: 0.6976 +INFO:local_logger:Epoch[065/800], Step[0500/0626], Avg Loss: 0.6980 
+INFO:master_logger:Epoch[065/800], Step[0500/0626], Avg Loss: 0.6981 +INFO:local_logger:Epoch[065/800], Step[0500/0626], Avg Loss: 0.6981 +INFO:local_logger:Epoch[065/800], Step[0500/0626], Avg Loss: 0.6977 +INFO:local_logger:Epoch[065/800], Step[0600/0626], Avg Loss: 0.6982 +INFO:local_logger:Epoch[065/800], Step[0600/0626], Avg Loss: 0.6982 +INFO:local_logger:Epoch[065/800], Step[0600/0626], Avg Loss: 0.6981 +INFO:local_logger:Epoch[065/800], Step[0600/0626], Avg Loss: 0.6977 +INFO:local_logger:Epoch[065/800], Step[0600/0626], Avg Loss: 0.6988 +INFO:master_logger:Epoch[065/800], Step[0600/0626], Avg Loss: 0.6981 +INFO:local_logger:Epoch[065/800], Step[0600/0626], Avg Loss: 0.6978 +INFO:local_logger:Epoch[065/800], Step[0600/0626], Avg Loss: 0.6975 +INFO:local_logger:Epoch[065/800], Step[0600/0626], Avg Loss: 0.6981 +INFO:local_logger:----- Epoch[065/800], Train Loss: 0.6981, time: 871.05 +INFO:local_logger:Now training epoch 66. LR=0.000151 +INFO:local_logger:----- Epoch[065/800], Train Loss: 0.6988, time: 867.64 +INFO:master_logger:----- Epoch[065/800], Train Loss: 0.6980, time: 867.64 +INFO:local_logger:----- Epoch[065/800], Train Loss: 0.6977, time: 871.33 +INFO:local_logger:Now training epoch 66. LR=0.000151 +INFO:local_logger:----- Epoch[065/800], Train Loss: 0.6980, time: 872.01 +INFO:local_logger:Now training epoch 66. LR=0.000151 +INFO:local_logger:----- Epoch[065/800], Train Loss: 0.6975, time: 871.43 +INFO:local_logger:Now training epoch 66. LR=0.000151 +INFO:local_logger:----- Epoch[065/800], Train Loss: 0.6982, time: 871.92 +INFO:local_logger:Now training epoch 66. LR=0.000151 +INFO:local_logger:----- Epoch[065/800], Train Loss: 0.6981, time: 871.81 +INFO:local_logger:Now training epoch 66. LR=0.000151 +INFO:local_logger:----- Epoch[065/800], Train Loss: 0.6978, time: 871.98 +INFO:local_logger:Now training epoch 66. LR=0.000151 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-65-Loss-0.6988199688404415.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-65-Loss-0.6988199688404415.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-65-Loss-0.6988199688404415.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-65-Loss-0.6988199688404415.pdopt +INFO:local_logger:Now training epoch 66. LR=0.000151 +INFO:master_logger:Now training epoch 66. 
LR=0.000151 +INFO:local_logger:Epoch[066/800], Step[0000/0626], Avg Loss: 0.6946 +INFO:local_logger:Epoch[066/800], Step[0000/0626], Avg Loss: 0.6997 +INFO:master_logger:Epoch[066/800], Step[0000/0626], Avg Loss: 0.6995 +INFO:local_logger:Epoch[066/800], Step[0000/0626], Avg Loss: 0.6986 +INFO:local_logger:Epoch[066/800], Step[0000/0626], Avg Loss: 0.7179 +INFO:local_logger:Epoch[066/800], Step[0000/0626], Avg Loss: 0.6919 +INFO:local_logger:Epoch[066/800], Step[0000/0626], Avg Loss: 0.6927 +INFO:local_logger:Epoch[066/800], Step[0000/0626], Avg Loss: 0.7034 +INFO:local_logger:Epoch[066/800], Step[0000/0626], Avg Loss: 0.6970 +INFO:local_logger:Epoch[066/800], Step[0100/0626], Avg Loss: 0.6978 +INFO:local_logger:Epoch[066/800], Step[0100/0626], Avg Loss: 0.6981 +INFO:local_logger:Epoch[066/800], Step[0100/0626], Avg Loss: 0.6987 +INFO:master_logger:Epoch[066/800], Step[0100/0626], Avg Loss: 0.6977 +INFO:local_logger:Epoch[066/800], Step[0100/0626], Avg Loss: 0.6980 +INFO:local_logger:Epoch[066/800], Step[0100/0626], Avg Loss: 0.6973 +INFO:local_logger:Epoch[066/800], Step[0100/0626], Avg Loss: 0.6973 +INFO:local_logger:Epoch[066/800], Step[0100/0626], Avg Loss: 0.6979 +INFO:local_logger:Epoch[066/800], Step[0100/0626], Avg Loss: 0.6965 +INFO:local_logger:Epoch[066/800], Step[0200/0626], Avg Loss: 0.6971 +INFO:local_logger:Epoch[066/800], Step[0200/0626], Avg Loss: 0.6974 +INFO:local_logger:Epoch[066/800], Step[0200/0626], Avg Loss: 0.6972 +INFO:local_logger:Epoch[066/800], Step[0200/0626], Avg Loss: 0.6980 +INFO:local_logger:Epoch[066/800], Step[0200/0626], Avg Loss: 0.6976 +INFO:local_logger:Epoch[066/800], Step[0200/0626], Avg Loss: 0.6988 +INFO:local_logger:Epoch[066/800], Step[0200/0626], Avg Loss: 0.6973 +INFO:local_logger:Epoch[066/800], Step[0200/0626], Avg Loss: 0.6983 +INFO:master_logger:Epoch[066/800], Step[0200/0626], Avg Loss: 0.6977 +INFO:local_logger:Epoch[066/800], Step[0300/0626], Avg Loss: 0.6977 +INFO:local_logger:Epoch[066/800], Step[0300/0626], Avg Loss: 0.6978 +INFO:local_logger:Epoch[066/800], Step[0300/0626], Avg Loss: 0.6973 +INFO:local_logger:Epoch[066/800], Step[0300/0626], Avg Loss: 0.6976 +INFO:local_logger:Epoch[066/800], Step[0300/0626], Avg Loss: 0.6971 +INFO:local_logger:Epoch[066/800], Step[0300/0626], Avg Loss: 0.6975 +INFO:local_logger:Epoch[066/800], Step[0300/0626], Avg Loss: 0.6984 +INFO:master_logger:Epoch[066/800], Step[0300/0626], Avg Loss: 0.6977 +INFO:local_logger:Epoch[066/800], Step[0300/0626], Avg Loss: 0.6984 +INFO:local_logger:Epoch[066/800], Step[0400/0626], Avg Loss: 0.6972 +INFO:local_logger:Epoch[066/800], Step[0400/0626], Avg Loss: 0.6973 +INFO:local_logger:Epoch[066/800], Step[0400/0626], Avg Loss: 0.6975 +INFO:local_logger:Epoch[066/800], Step[0400/0626], Avg Loss: 0.6974 +INFO:local_logger:Epoch[066/800], Step[0400/0626], Avg Loss: 0.6981 +INFO:local_logger:Epoch[066/800], Step[0400/0626], Avg Loss: 0.6970 +INFO:master_logger:Epoch[066/800], Step[0400/0626], Avg Loss: 0.6976 +INFO:local_logger:Epoch[066/800], Step[0400/0626], Avg Loss: 0.6980 +INFO:local_logger:Epoch[066/800], Step[0400/0626], Avg Loss: 0.6979 +INFO:local_logger:Epoch[066/800], Step[0500/0626], Avg Loss: 0.6972 +INFO:local_logger:Epoch[066/800], Step[0500/0626], Avg Loss: 0.6978 +INFO:local_logger:Epoch[066/800], Step[0500/0626], Avg Loss: 0.6972 +INFO:local_logger:Epoch[066/800], Step[0500/0626], Avg Loss: 0.6971 +INFO:local_logger:Epoch[066/800], Step[0500/0626], Avg Loss: 0.6970 +INFO:master_logger:Epoch[066/800], Step[0500/0626], Avg Loss: 0.6974 
+INFO:local_logger:Epoch[066/800], Step[0500/0626], Avg Loss: 0.6977 +INFO:local_logger:Epoch[066/800], Step[0500/0626], Avg Loss: 0.6975 +INFO:local_logger:Epoch[066/800], Step[0500/0626], Avg Loss: 0.6977 +INFO:local_logger:Epoch[066/800], Step[0600/0626], Avg Loss: 0.6971 +INFO:local_logger:Epoch[066/800], Step[0600/0626], Avg Loss: 0.6977 +INFO:local_logger:Epoch[066/800], Step[0600/0626], Avg Loss: 0.6976 +INFO:local_logger:Epoch[066/800], Step[0600/0626], Avg Loss: 0.6970 +INFO:local_logger:Epoch[066/800], Step[0600/0626], Avg Loss: 0.6976 +INFO:local_logger:Epoch[066/800], Step[0600/0626], Avg Loss: 0.6972 +INFO:local_logger:Epoch[066/800], Step[0600/0626], Avg Loss: 0.6970 +INFO:local_logger:Epoch[066/800], Step[0600/0626], Avg Loss: 0.6975 +INFO:master_logger:Epoch[066/800], Step[0600/0626], Avg Loss: 0.6973 +INFO:local_logger:----- Epoch[066/800], Train Loss: 0.6970, time: 874.18 +INFO:master_logger:----- Epoch[066/800], Train Loss: 0.6973, time: 874.18 +INFO:local_logger:----- Epoch[066/800], Train Loss: 0.6976, time: 878.36 +INFO:local_logger:Now training epoch 67. LR=0.000151 +INFO:local_logger:----- Epoch[066/800], Train Loss: 0.6969, time: 877.55 +INFO:local_logger:Now training epoch 67. LR=0.000151 +INFO:local_logger:----- Epoch[066/800], Train Loss: 0.6971, time: 878.04 +INFO:local_logger:Now training epoch 67. LR=0.000151 +INFO:local_logger:----- Epoch[066/800], Train Loss: 0.6977, time: 877.55 +INFO:local_logger:Now training epoch 67. LR=0.000151 +INFO:local_logger:----- Epoch[066/800], Train Loss: 0.6970, time: 878.01 +INFO:local_logger:Now training epoch 67. LR=0.000151 +INFO:local_logger:----- Epoch[066/800], Train Loss: 0.6975, time: 878.03 +INFO:local_logger:Now training epoch 67. LR=0.000151 +INFO:local_logger:----- Epoch[066/800], Train Loss: 0.6975, time: 878.01 +INFO:local_logger:Now training epoch 67. LR=0.000151 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-66-Loss-0.6970274622691075.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-66-Loss-0.6970274622691075.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-66-Loss-0.6970274622691075.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-66-Loss-0.6970274622691075.pdopt +INFO:local_logger:Now training epoch 67. LR=0.000151 +INFO:master_logger:Now training epoch 67. 
LR=0.000151 +INFO:local_logger:Epoch[067/800], Step[0000/0626], Avg Loss: 0.6978 +INFO:master_logger:Epoch[067/800], Step[0000/0626], Avg Loss: 0.6958 +INFO:local_logger:Epoch[067/800], Step[0000/0626], Avg Loss: 0.6925 +INFO:local_logger:Epoch[067/800], Step[0000/0626], Avg Loss: 0.6998 +INFO:local_logger:Epoch[067/800], Step[0000/0626], Avg Loss: 0.7094 +INFO:local_logger:Epoch[067/800], Step[0000/0626], Avg Loss: 0.6842 +INFO:local_logger:Epoch[067/800], Step[0000/0626], Avg Loss: 0.6943 +INFO:local_logger:Epoch[067/800], Step[0000/0626], Avg Loss: 0.6945 +INFO:local_logger:Epoch[067/800], Step[0000/0626], Avg Loss: 0.6944 +INFO:local_logger:Epoch[067/800], Step[0100/0626], Avg Loss: 0.6961 +INFO:local_logger:Epoch[067/800], Step[0100/0626], Avg Loss: 0.6981 +INFO:local_logger:Epoch[067/800], Step[0100/0626], Avg Loss: 0.6961 +INFO:local_logger:Epoch[067/800], Step[0100/0626], Avg Loss: 0.6969 +INFO:local_logger:Epoch[067/800], Step[0100/0626], Avg Loss: 0.6956 +INFO:local_logger:Epoch[067/800], Step[0100/0626], Avg Loss: 0.6968 +INFO:local_logger:Epoch[067/800], Step[0100/0626], Avg Loss: 0.6962 +INFO:master_logger:Epoch[067/800], Step[0100/0626], Avg Loss: 0.6966 +INFO:local_logger:Epoch[067/800], Step[0100/0626], Avg Loss: 0.6968 +INFO:local_logger:Epoch[067/800], Step[0200/0626], Avg Loss: 0.6973 +INFO:local_logger:Epoch[067/800], Step[0200/0626], Avg Loss: 0.6961 +INFO:local_logger:Epoch[067/800], Step[0200/0626], Avg Loss: 0.6968 +INFO:local_logger:Epoch[067/800], Step[0200/0626], Avg Loss: 0.6974 +INFO:local_logger:Epoch[067/800], Step[0200/0626], Avg Loss: 0.6961 +INFO:local_logger:Epoch[067/800], Step[0200/0626], Avg Loss: 0.6971 +INFO:master_logger:Epoch[067/800], Step[0200/0626], Avg Loss: 0.6968 +INFO:local_logger:Epoch[067/800], Step[0200/0626], Avg Loss: 0.6974 +INFO:local_logger:Epoch[067/800], Step[0200/0626], Avg Loss: 0.6966 +INFO:local_logger:Epoch[067/800], Step[0300/0626], Avg Loss: 0.6969 +INFO:local_logger:Epoch[067/800], Step[0300/0626], Avg Loss: 0.6967 +INFO:local_logger:Epoch[067/800], Step[0300/0626], Avg Loss: 0.6966 +INFO:local_logger:Epoch[067/800], Step[0300/0626], Avg Loss: 0.6961 +INFO:local_logger:Epoch[067/800], Step[0300/0626], Avg Loss: 0.6973 +INFO:local_logger:Epoch[067/800], Step[0300/0626], Avg Loss: 0.6958 +INFO:local_logger:Epoch[067/800], Step[0300/0626], Avg Loss: 0.6972 +INFO:local_logger:Epoch[067/800], Step[0300/0626], Avg Loss: 0.6964 +INFO:master_logger:Epoch[067/800], Step[0300/0626], Avg Loss: 0.6966 +INFO:local_logger:Epoch[067/800], Step[0400/0626], Avg Loss: 0.6960 +INFO:local_logger:Epoch[067/800], Step[0400/0626], Avg Loss: 0.6972 +INFO:local_logger:Epoch[067/800], Step[0400/0626], Avg Loss: 0.6964 +INFO:local_logger:Epoch[067/800], Step[0400/0626], Avg Loss: 0.6971 +INFO:local_logger:Epoch[067/800], Step[0400/0626], Avg Loss: 0.6967 +INFO:local_logger:Epoch[067/800], Step[0400/0626], Avg Loss: 0.6966 +INFO:master_logger:Epoch[067/800], Step[0400/0626], Avg Loss: 0.6966 +INFO:local_logger:Epoch[067/800], Step[0400/0626], Avg Loss: 0.6968 +INFO:local_logger:Epoch[067/800], Step[0400/0626], Avg Loss: 0.6958 +INFO:local_logger:Epoch[067/800], Step[0500/0626], Avg Loss: 0.6959 +INFO:local_logger:Epoch[067/800], Step[0500/0626], Avg Loss: 0.6968 +INFO:local_logger:Epoch[067/800], Step[0500/0626], Avg Loss: 0.6965 +INFO:local_logger:Epoch[067/800], Step[0500/0626], Avg Loss: 0.6960 +INFO:local_logger:Epoch[067/800], Step[0500/0626], Avg Loss: 0.6964 +INFO:local_logger:Epoch[067/800], Step[0500/0626], Avg Loss: 0.6968 
+INFO:master_logger:Epoch[067/800], Step[0500/0626], Avg Loss: 0.6966 +INFO:local_logger:Epoch[067/800], Step[0500/0626], Avg Loss: 0.6972 +INFO:local_logger:Epoch[067/800], Step[0500/0626], Avg Loss: 0.6971 +INFO:local_logger:Epoch[067/800], Step[0600/0626], Avg Loss: 0.6960 +INFO:local_logger:Epoch[067/800], Step[0600/0626], Avg Loss: 0.6965 +INFO:local_logger:Epoch[067/800], Step[0600/0626], Avg Loss: 0.6968 +INFO:local_logger:Epoch[067/800], Step[0600/0626], Avg Loss: 0.6970 +INFO:local_logger:Epoch[067/800], Step[0600/0626], Avg Loss: 0.6965 +INFO:local_logger:Epoch[067/800], Step[0600/0626], Avg Loss: 0.6965 +INFO:local_logger:Epoch[067/800], Step[0600/0626], Avg Loss: 0.6970 +INFO:master_logger:Epoch[067/800], Step[0600/0626], Avg Loss: 0.6965 +INFO:local_logger:Epoch[067/800], Step[0600/0626], Avg Loss: 0.6961 +INFO:local_logger:----- Epoch[067/800], Train Loss: 0.6970, time: 875.35 +INFO:local_logger:Now training epoch 68. LR=0.000151 +INFO:local_logger:----- Epoch[067/800], Train Loss: 0.6968, time: 872.12 +INFO:master_logger:----- Epoch[067/800], Train Loss: 0.6965, time: 872.12 +INFO:local_logger:----- Epoch[067/800], Train Loss: 0.6965, time: 876.01 +INFO:local_logger:Now training epoch 68. LR=0.000151 +INFO:local_logger:----- Epoch[067/800], Train Loss: 0.6965, time: 875.96 +INFO:local_logger:Now training epoch 68. LR=0.000151 +INFO:local_logger:----- Epoch[067/800], Train Loss: 0.6970, time: 875.59 +INFO:local_logger:Now training epoch 68. LR=0.000151 +INFO:local_logger:----- Epoch[067/800], Train Loss: 0.6961, time: 876.05 +INFO:local_logger:Now training epoch 68. LR=0.000151 +INFO:local_logger:----- Epoch[067/800], Train Loss: 0.6965, time: 875.92 +INFO:local_logger:Now training epoch 68. LR=0.000151 +INFO:local_logger:----- Epoch[067/800], Train Loss: 0.6960, time: 876.13 +INFO:local_logger:Now training epoch 68. LR=0.000151 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-67-Loss-0.696807305239932.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-67-Loss-0.696807305239932.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-67-Loss-0.696807305239932.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-67-Loss-0.696807305239932.pdopt +INFO:local_logger:Now training epoch 68. LR=0.000151 +INFO:master_logger:Now training epoch 68. 
LR=0.000151 +INFO:local_logger:Epoch[068/800], Step[0000/0626], Avg Loss: 0.6982 +INFO:local_logger:Epoch[068/800], Step[0000/0626], Avg Loss: 0.6938 +INFO:local_logger:Epoch[068/800], Step[0000/0626], Avg Loss: 0.6916 +INFO:local_logger:Epoch[068/800], Step[0000/0626], Avg Loss: 0.7004 +INFO:master_logger:Epoch[068/800], Step[0000/0626], Avg Loss: 0.6972 +INFO:local_logger:Epoch[068/800], Step[0000/0626], Avg Loss: 0.7004 +INFO:local_logger:Epoch[068/800], Step[0000/0626], Avg Loss: 0.7004 +INFO:local_logger:Epoch[068/800], Step[0000/0626], Avg Loss: 0.7036 +INFO:local_logger:Epoch[068/800], Step[0000/0626], Avg Loss: 0.6894 +INFO:local_logger:Epoch[068/800], Step[0100/0626], Avg Loss: 0.6947 +INFO:local_logger:Epoch[068/800], Step[0100/0626], Avg Loss: 0.6969 +INFO:master_logger:Epoch[068/800], Step[0100/0626], Avg Loss: 0.6961 +INFO:local_logger:Epoch[068/800], Step[0100/0626], Avg Loss: 0.6964 +INFO:local_logger:Epoch[068/800], Step[0100/0626], Avg Loss: 0.6949 +INFO:local_logger:Epoch[068/800], Step[0100/0626], Avg Loss: 0.6950 +INFO:local_logger:Epoch[068/800], Step[0100/0626], Avg Loss: 0.6970 +INFO:local_logger:Epoch[068/800], Step[0100/0626], Avg Loss: 0.6966 +INFO:local_logger:Epoch[068/800], Step[0100/0626], Avg Loss: 0.6976 +INFO:local_logger:Epoch[068/800], Step[0200/0626], Avg Loss: 0.6962 +INFO:local_logger:Epoch[068/800], Step[0200/0626], Avg Loss: 0.6954 +INFO:local_logger:Epoch[068/800], Step[0200/0626], Avg Loss: 0.6952 +INFO:local_logger:Epoch[068/800], Step[0200/0626], Avg Loss: 0.6968 +INFO:local_logger:Epoch[068/800], Step[0200/0626], Avg Loss: 0.6961 +INFO:local_logger:Epoch[068/800], Step[0200/0626], Avg Loss: 0.6969 +INFO:local_logger:Epoch[068/800], Step[0200/0626], Avg Loss: 0.6966 +INFO:local_logger:Epoch[068/800], Step[0200/0626], Avg Loss: 0.6954 +INFO:master_logger:Epoch[068/800], Step[0200/0626], Avg Loss: 0.6961 +INFO:local_logger:Epoch[068/800], Step[0300/0626], Avg Loss: 0.6949 +INFO:local_logger:Epoch[068/800], Step[0300/0626], Avg Loss: 0.6958 +INFO:local_logger:Epoch[068/800], Step[0300/0626], Avg Loss: 0.6964 +INFO:local_logger:Epoch[068/800], Step[0300/0626], Avg Loss: 0.6963 +INFO:local_logger:Epoch[068/800], Step[0300/0626], Avg Loss: 0.6960 +INFO:local_logger:Epoch[068/800], Step[0300/0626], Avg Loss: 0.6950 +INFO:local_logger:Epoch[068/800], Step[0300/0626], Avg Loss: 0.6970 +INFO:local_logger:Epoch[068/800], Step[0300/0626], Avg Loss: 0.6958 +INFO:master_logger:Epoch[068/800], Step[0300/0626], Avg Loss: 0.6959 +INFO:local_logger:Epoch[068/800], Step[0400/0626], Avg Loss: 0.6961 +INFO:local_logger:Epoch[068/800], Step[0400/0626], Avg Loss: 0.6963 +INFO:local_logger:Epoch[068/800], Step[0400/0626], Avg Loss: 0.6962 +INFO:local_logger:Epoch[068/800], Step[0400/0626], Avg Loss: 0.6956 +INFO:local_logger:Epoch[068/800], Step[0400/0626], Avg Loss: 0.6965 +INFO:local_logger:Epoch[068/800], Step[0400/0626], Avg Loss: 0.6959 +INFO:master_logger:Epoch[068/800], Step[0400/0626], Avg Loss: 0.6959 +INFO:local_logger:Epoch[068/800], Step[0400/0626], Avg Loss: 0.6948 +INFO:local_logger:Epoch[068/800], Step[0400/0626], Avg Loss: 0.6955 +INFO:local_logger:Epoch[068/800], Step[0500/0626], Avg Loss: 0.6962 +INFO:local_logger:Epoch[068/800], Step[0500/0626], Avg Loss: 0.6954 +INFO:local_logger:Epoch[068/800], Step[0500/0626], Avg Loss: 0.6946 +INFO:local_logger:Epoch[068/800], Step[0500/0626], Avg Loss: 0.6965 +INFO:local_logger:Epoch[068/800], Step[0500/0626], Avg Loss: 0.6961 +INFO:local_logger:Epoch[068/800], Step[0500/0626], Avg Loss: 0.6959 
+INFO:local_logger:Epoch[068/800], Step[0500/0626], Avg Loss: 0.6957 +INFO:master_logger:Epoch[068/800], Step[0500/0626], Avg Loss: 0.6958 +INFO:local_logger:Epoch[068/800], Step[0500/0626], Avg Loss: 0.6958 +INFO:local_logger:Epoch[068/800], Step[0600/0626], Avg Loss: 0.6946 +INFO:local_logger:Epoch[068/800], Step[0600/0626], Avg Loss: 0.6960 +INFO:local_logger:Epoch[068/800], Step[0600/0626], Avg Loss: 0.6958 +INFO:local_logger:Epoch[068/800], Step[0600/0626], Avg Loss: 0.6957 +INFO:local_logger:Epoch[068/800], Step[0600/0626], Avg Loss: 0.6957 +INFO:local_logger:Epoch[068/800], Step[0600/0626], Avg Loss: 0.6958 +INFO:local_logger:Epoch[068/800], Step[0600/0626], Avg Loss: 0.6962 +INFO:local_logger:Epoch[068/800], Step[0600/0626], Avg Loss: 0.6960 +INFO:master_logger:Epoch[068/800], Step[0600/0626], Avg Loss: 0.6957 +INFO:local_logger:----- Epoch[068/800], Train Loss: 0.6946, time: 870.06 +INFO:local_logger:Now training epoch 69. LR=0.000151 +INFO:local_logger:----- Epoch[068/800], Train Loss: 0.6959, time: 870.67 +INFO:local_logger:----- Epoch[068/800], Train Loss: 0.6958, time: 871.34 +INFO:local_logger:Now training epoch 69. LR=0.000151 +INFO:local_logger:Now training epoch 69. LR=0.000151 +INFO:local_logger:----- Epoch[068/800], Train Loss: 0.6961, time: 870.65 +INFO:local_logger:Now training epoch 69. LR=0.000151 +INFO:local_logger:----- Epoch[068/800], Train Loss: 0.6958, time: 870.67 +INFO:local_logger:Now training epoch 69. LR=0.000151 +INFO:local_logger:----- Epoch[068/800], Train Loss: 0.6961, time: 870.74 +INFO:local_logger:Now training epoch 69. LR=0.000151 +INFO:local_logger:----- Epoch[068/800], Train Loss: 0.6960, time: 870.73 +INFO:local_logger:Now training epoch 69. LR=0.000151 +INFO:local_logger:----- Epoch[068/800], Train Loss: 0.6957, time: 867.42 +INFO:master_logger:----- Epoch[068/800], Train Loss: 0.6957, time: 867.42 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-68-Loss-0.6956601794348751.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-68-Loss-0.6956601794348751.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-68-Loss-0.6956601794348751.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-68-Loss-0.6956601794348751.pdopt +INFO:local_logger:Now training epoch 69. LR=0.000151 +INFO:master_logger:Now training epoch 69. 
LR=0.000151 +INFO:local_logger:Epoch[069/800], Step[0000/0626], Avg Loss: 0.6915 +INFO:local_logger:Epoch[069/800], Step[0000/0626], Avg Loss: 0.6951 +INFO:local_logger:Epoch[069/800], Step[0000/0626], Avg Loss: 0.6895 +INFO:local_logger:Epoch[069/800], Step[0000/0626], Avg Loss: 0.6998 +INFO:local_logger:Epoch[069/800], Step[0000/0626], Avg Loss: 0.6966 +INFO:local_logger:Epoch[069/800], Step[0000/0626], Avg Loss: 0.6932 +INFO:master_logger:Epoch[069/800], Step[0000/0626], Avg Loss: 0.6968 +INFO:local_logger:Epoch[069/800], Step[0000/0626], Avg Loss: 0.7177 +INFO:local_logger:Epoch[069/800], Step[0000/0626], Avg Loss: 0.6914 +INFO:local_logger:Epoch[069/800], Step[0100/0626], Avg Loss: 0.6957 +INFO:local_logger:Epoch[069/800], Step[0100/0626], Avg Loss: 0.6964 +INFO:local_logger:Epoch[069/800], Step[0100/0626], Avg Loss: 0.6955 +INFO:local_logger:Epoch[069/800], Step[0100/0626], Avg Loss: 0.6955 +INFO:local_logger:Epoch[069/800], Step[0100/0626], Avg Loss: 0.6945 +INFO:local_logger:Epoch[069/800], Step[0100/0626], Avg Loss: 0.6961 +INFO:local_logger:Epoch[069/800], Step[0100/0626], Avg Loss: 0.6959 +INFO:master_logger:Epoch[069/800], Step[0100/0626], Avg Loss: 0.6956 +INFO:local_logger:Epoch[069/800], Step[0100/0626], Avg Loss: 0.6950 +INFO:local_logger:Epoch[069/800], Step[0200/0626], Avg Loss: 0.6950 +INFO:local_logger:Epoch[069/800], Step[0200/0626], Avg Loss: 0.6948 +INFO:local_logger:Epoch[069/800], Step[0200/0626], Avg Loss: 0.6954 +INFO:local_logger:Epoch[069/800], Step[0200/0626], Avg Loss: 0.6948 +INFO:local_logger:Epoch[069/800], Step[0200/0626], Avg Loss: 0.6947 +INFO:local_logger:Epoch[069/800], Step[0200/0626], Avg Loss: 0.6960 +INFO:local_logger:Epoch[069/800], Step[0200/0626], Avg Loss: 0.6949 +INFO:master_logger:Epoch[069/800], Step[0200/0626], Avg Loss: 0.6950 +INFO:local_logger:Epoch[069/800], Step[0200/0626], Avg Loss: 0.6945 +INFO:local_logger:Epoch[069/800], Step[0300/0626], Avg Loss: 0.6958 +INFO:local_logger:Epoch[069/800], Step[0300/0626], Avg Loss: 0.6949 +INFO:local_logger:Epoch[069/800], Step[0300/0626], Avg Loss: 0.6953 +INFO:local_logger:Epoch[069/800], Step[0300/0626], Avg Loss: 0.6950 +INFO:local_logger:Epoch[069/800], Step[0300/0626], Avg Loss: 0.6944 +INFO:master_logger:Epoch[069/800], Step[0300/0626], Avg Loss: 0.6951 +INFO:local_logger:Epoch[069/800], Step[0300/0626], Avg Loss: 0.6958 +INFO:local_logger:Epoch[069/800], Step[0300/0626], Avg Loss: 0.6947 +INFO:local_logger:Epoch[069/800], Step[0300/0626], Avg Loss: 0.6946 +INFO:local_logger:Epoch[069/800], Step[0400/0626], Avg Loss: 0.6947 +INFO:local_logger:Epoch[069/800], Step[0400/0626], Avg Loss: 0.6952 +INFO:local_logger:Epoch[069/800], Step[0400/0626], Avg Loss: 0.6960 +INFO:local_logger:Epoch[069/800], Step[0400/0626], Avg Loss: 0.6947 +INFO:local_logger:Epoch[069/800], Step[0400/0626], Avg Loss: 0.6953 +INFO:master_logger:Epoch[069/800], Step[0400/0626], Avg Loss: 0.6952 +INFO:local_logger:Epoch[069/800], Step[0400/0626], Avg Loss: 0.6960 +INFO:local_logger:Epoch[069/800], Step[0400/0626], Avg Loss: 0.6948 +INFO:local_logger:Epoch[069/800], Step[0400/0626], Avg Loss: 0.6950 +INFO:local_logger:Epoch[069/800], Step[0500/0626], Avg Loss: 0.6956 +INFO:local_logger:Epoch[069/800], Step[0500/0626], Avg Loss: 0.6949 +INFO:local_logger:Epoch[069/800], Step[0500/0626], Avg Loss: 0.6947 +INFO:local_logger:Epoch[069/800], Step[0500/0626], Avg Loss: 0.6949 +INFO:local_logger:Epoch[069/800], Step[0500/0626], Avg Loss: 0.6954 +INFO:local_logger:Epoch[069/800], Step[0500/0626], Avg Loss: 0.6955 
+INFO:local_logger:Epoch[069/800], Step[0500/0626], Avg Loss: 0.6947 +INFO:local_logger:Epoch[069/800], Step[0500/0626], Avg Loss: 0.6950 +INFO:master_logger:Epoch[069/800], Step[0500/0626], Avg Loss: 0.6951 +INFO:local_logger:Epoch[069/800], Step[0600/0626], Avg Loss: 0.6946 +INFO:local_logger:Epoch[069/800], Step[0600/0626], Avg Loss: 0.6948 +INFO:local_logger:Epoch[069/800], Step[0600/0626], Avg Loss: 0.6948 +INFO:local_logger:Epoch[069/800], Step[0600/0626], Avg Loss: 0.6957 +INFO:local_logger:Epoch[069/800], Step[0600/0626], Avg Loss: 0.6949 +INFO:local_logger:Epoch[069/800], Step[0600/0626], Avg Loss: 0.6947 +INFO:master_logger:Epoch[069/800], Step[0600/0626], Avg Loss: 0.6950 +INFO:local_logger:Epoch[069/800], Step[0600/0626], Avg Loss: 0.6953 +INFO:local_logger:Epoch[069/800], Step[0600/0626], Avg Loss: 0.6953 +INFO:local_logger:----- Epoch[069/800], Train Loss: 0.6957, time: 877.32 +INFO:local_logger:Now training epoch 70. LR=0.000151 +INFO:local_logger:----- Epoch[069/800], Train Loss: 0.6947, time: 876.82 +INFO:local_logger:Now training epoch 70. LR=0.000151 +INFO:local_logger:----- Epoch[069/800], Train Loss: 0.6954, time: 877.40 +INFO:local_logger:Now training epoch 70. LR=0.000151 +INFO:local_logger:----- Epoch[069/800], Train Loss: 0.6947, time: 877.36 +INFO:local_logger:Now training epoch 70. LR=0.000151 +INFO:local_logger:----- Epoch[069/800], Train Loss: 0.6953, time: 877.48 +INFO:local_logger:Now training epoch 70. LR=0.000151 +INFO:local_logger:----- Epoch[069/800], Train Loss: 0.6948, time: 877.49 +INFO:local_logger:Now training epoch 70. LR=0.000151 +INFO:local_logger:----- Epoch[069/800], Train Loss: 0.6945, time: 877.44 +INFO:local_logger:Now training epoch 70. LR=0.000151 +INFO:local_logger:----- Epoch[069/800], Train Loss: 0.6950, time: 873.74 +INFO:master_logger:----- Epoch[069/800], Train Loss: 0.6950, time: 873.74 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-69-Loss-0.6949500013049544.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-69-Loss-0.6949500013049544.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-69-Loss-0.6949500013049544.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-69-Loss-0.6949500013049544.pdopt +INFO:local_logger:Now training epoch 70. LR=0.000151 +INFO:master_logger:Now training epoch 70. 
LR=0.000151 +INFO:local_logger:Epoch[070/800], Step[0000/0626], Avg Loss: 0.6925 +INFO:local_logger:Epoch[070/800], Step[0000/0626], Avg Loss: 0.6983 +INFO:master_logger:Epoch[070/800], Step[0000/0626], Avg Loss: 0.6937 +INFO:local_logger:Epoch[070/800], Step[0000/0626], Avg Loss: 0.6927 +INFO:local_logger:Epoch[070/800], Step[0000/0626], Avg Loss: 0.7030 +INFO:local_logger:Epoch[070/800], Step[0000/0626], Avg Loss: 0.7050 +INFO:local_logger:Epoch[070/800], Step[0000/0626], Avg Loss: 0.6782 +INFO:local_logger:Epoch[070/800], Step[0000/0626], Avg Loss: 0.6987 +INFO:local_logger:Epoch[070/800], Step[0000/0626], Avg Loss: 0.6814 +INFO:local_logger:Epoch[070/800], Step[0100/0626], Avg Loss: 0.6955 +INFO:local_logger:Epoch[070/800], Step[0100/0626], Avg Loss: 0.6950 +INFO:local_logger:Epoch[070/800], Step[0100/0626], Avg Loss: 0.6950 +INFO:local_logger:Epoch[070/800], Step[0100/0626], Avg Loss: 0.6946 +INFO:master_logger:Epoch[070/800], Step[0100/0626], Avg Loss: 0.6949 +INFO:local_logger:Epoch[070/800], Step[0100/0626], Avg Loss: 0.6943 +INFO:local_logger:Epoch[070/800], Step[0100/0626], Avg Loss: 0.6958 +INFO:local_logger:Epoch[070/800], Step[0100/0626], Avg Loss: 0.6940 +INFO:local_logger:Epoch[070/800], Step[0100/0626], Avg Loss: 0.6951 +INFO:local_logger:Epoch[070/800], Step[0200/0626], Avg Loss: 0.6956 +INFO:local_logger:Epoch[070/800], Step[0200/0626], Avg Loss: 0.6947 +INFO:local_logger:Epoch[070/800], Step[0200/0626], Avg Loss: 0.6949 +INFO:local_logger:Epoch[070/800], Step[0200/0626], Avg Loss: 0.6952 +INFO:local_logger:Epoch[070/800], Step[0200/0626], Avg Loss: 0.6948 +INFO:local_logger:Epoch[070/800], Step[0200/0626], Avg Loss: 0.6944 +INFO:master_logger:Epoch[070/800], Step[0200/0626], Avg Loss: 0.6950 +INFO:local_logger:Epoch[070/800], Step[0200/0626], Avg Loss: 0.6951 +INFO:local_logger:Epoch[070/800], Step[0200/0626], Avg Loss: 0.6958 +INFO:local_logger:Epoch[070/800], Step[0300/0626], Avg Loss: 0.6945 +INFO:local_logger:Epoch[070/800], Step[0300/0626], Avg Loss: 0.6952 +INFO:local_logger:Epoch[070/800], Step[0300/0626], Avg Loss: 0.6946 +INFO:master_logger:Epoch[070/800], Step[0300/0626], Avg Loss: 0.6949 +INFO:local_logger:Epoch[070/800], Step[0300/0626], Avg Loss: 0.6959 +INFO:local_logger:Epoch[070/800], Step[0300/0626], Avg Loss: 0.6953 +INFO:local_logger:Epoch[070/800], Step[0300/0626], Avg Loss: 0.6949 +INFO:local_logger:Epoch[070/800], Step[0300/0626], Avg Loss: 0.6944 +INFO:local_logger:Epoch[070/800], Step[0300/0626], Avg Loss: 0.6944 +INFO:local_logger:Epoch[070/800], Step[0400/0626], Avg Loss: 0.6945 +INFO:local_logger:Epoch[070/800], Step[0400/0626], Avg Loss: 0.6946 +INFO:local_logger:Epoch[070/800], Step[0400/0626], Avg Loss: 0.6944 +INFO:local_logger:Epoch[070/800], Step[0400/0626], Avg Loss: 0.6954 +INFO:local_logger:Epoch[070/800], Step[0400/0626], Avg Loss: 0.6943 +INFO:local_logger:Epoch[070/800], Step[0400/0626], Avg Loss: 0.6951 +INFO:local_logger:Epoch[070/800], Step[0400/0626], Avg Loss: 0.6942 +INFO:local_logger:Epoch[070/800], Step[0400/0626], Avg Loss: 0.6949 +INFO:master_logger:Epoch[070/800], Step[0400/0626], Avg Loss: 0.6947 +INFO:local_logger:Epoch[070/800], Step[0500/0626], Avg Loss: 0.6947 +INFO:local_logger:Epoch[070/800], Step[0500/0626], Avg Loss: 0.6942 +INFO:local_logger:Epoch[070/800], Step[0500/0626], Avg Loss: 0.6944 +INFO:local_logger:Epoch[070/800], Step[0500/0626], Avg Loss: 0.6947 +INFO:local_logger:Epoch[070/800], Step[0500/0626], Avg Loss: 0.6954 +INFO:local_logger:Epoch[070/800], Step[0500/0626], Avg Loss: 0.6949 
+INFO:local_logger:Epoch[070/800], Step[0500/0626], Avg Loss: 0.6941 +INFO:master_logger:Epoch[070/800], Step[0500/0626], Avg Loss: 0.6946 +INFO:local_logger:Epoch[070/800], Step[0500/0626], Avg Loss: 0.6944 +INFO:local_logger:Epoch[070/800], Step[0600/0626], Avg Loss: 0.6947 +INFO:local_logger:Epoch[070/800], Step[0600/0626], Avg Loss: 0.6948 +INFO:local_logger:Epoch[070/800], Step[0600/0626], Avg Loss: 0.6942 +INFO:local_logger:Epoch[070/800], Step[0600/0626], Avg Loss: 0.6944 +INFO:local_logger:Epoch[070/800], Step[0600/0626], Avg Loss: 0.6947 +INFO:local_logger:Epoch[070/800], Step[0600/0626], Avg Loss: 0.6941 +INFO:local_logger:Epoch[070/800], Step[0600/0626], Avg Loss: 0.6950 +INFO:local_logger:Epoch[070/800], Step[0600/0626], Avg Loss: 0.6941 +INFO:master_logger:Epoch[070/800], Step[0600/0626], Avg Loss: 0.6945 +INFO:local_logger:----- Epoch[070/800], Train Loss: 0.6942, time: 864.91 +INFO:local_logger:Now training epoch 71. LR=0.000151 +INFO:local_logger:----- Epoch[070/800], Train Loss: 0.6942, time: 865.61 +INFO:local_logger:Now training epoch 71. LR=0.000151 +INFO:local_logger:----- Epoch[070/800], Train Loss: 0.6941, time: 865.52 +INFO:local_logger:Now training epoch 71. LR=0.000151 +INFO:local_logger:----- Epoch[070/800], Train Loss: 0.6949, time: 864.93 +INFO:local_logger:Now training epoch 71. LR=0.000151 +INFO:local_logger:----- Epoch[070/800], Train Loss: 0.6946, time: 861.12 +INFO:master_logger:----- Epoch[070/800], Train Loss: 0.6945, time: 861.12 +INFO:local_logger:----- Epoch[070/800], Train Loss: 0.6945, time: 864.92 +INFO:local_logger:Now training epoch 71. LR=0.000151 +INFO:local_logger:----- Epoch[070/800], Train Loss: 0.6947, time: 864.88 +INFO:local_logger:Now training epoch 71. LR=0.000151 +INFO:local_logger:----- Epoch[070/800], Train Loss: 0.6948, time: 864.89 +INFO:local_logger:Now training epoch 71. LR=0.000151 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-70-Loss-0.6946037372341894.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-70-Loss-0.6946037372341894.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-70-Loss-0.6946037372341894.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-70-Loss-0.6946037372341894.pdopt +INFO:local_logger:Now training epoch 71. LR=0.000151 +INFO:master_logger:Now training epoch 71. 
LR=0.000151 +INFO:local_logger:Epoch[071/800], Step[0000/0626], Avg Loss: 0.6850 +INFO:master_logger:Epoch[071/800], Step[0000/0626], Avg Loss: 0.6913 +INFO:local_logger:Epoch[071/800], Step[0000/0626], Avg Loss: 0.6913 +INFO:local_logger:Epoch[071/800], Step[0000/0626], Avg Loss: 0.6757 +INFO:local_logger:Epoch[071/800], Step[0000/0626], Avg Loss: 0.6944 +INFO:local_logger:Epoch[071/800], Step[0000/0626], Avg Loss: 0.7043 +INFO:local_logger:Epoch[071/800], Step[0000/0626], Avg Loss: 0.7002 +INFO:local_logger:Epoch[071/800], Step[0000/0626], Avg Loss: 0.6977 +INFO:local_logger:Epoch[071/800], Step[0000/0626], Avg Loss: 0.6818 +INFO:local_logger:Epoch[071/800], Step[0100/0626], Avg Loss: 0.6936 +INFO:local_logger:Epoch[071/800], Step[0100/0626], Avg Loss: 0.6939 +INFO:local_logger:Epoch[071/800], Step[0100/0626], Avg Loss: 0.6950 +INFO:local_logger:Epoch[071/800], Step[0100/0626], Avg Loss: 0.6947 +INFO:local_logger:Epoch[071/800], Step[0100/0626], Avg Loss: 0.6932 +INFO:local_logger:Epoch[071/800], Step[0100/0626], Avg Loss: 0.6924 +INFO:local_logger:Epoch[071/800], Step[0100/0626], Avg Loss: 0.6937 +INFO:master_logger:Epoch[071/800], Step[0100/0626], Avg Loss: 0.6939 +INFO:local_logger:Epoch[071/800], Step[0100/0626], Avg Loss: 0.6949 +INFO:local_logger:Epoch[071/800], Step[0200/0626], Avg Loss: 0.6943 +INFO:local_logger:Epoch[071/800], Step[0200/0626], Avg Loss: 0.6941 +INFO:local_logger:Epoch[071/800], Step[0200/0626], Avg Loss: 0.6942 +INFO:local_logger:Epoch[071/800], Step[0200/0626], Avg Loss: 0.6935 +INFO:local_logger:Epoch[071/800], Step[0200/0626], Avg Loss: 0.6942 +INFO:local_logger:Epoch[071/800], Step[0200/0626], Avg Loss: 0.6931 +INFO:local_logger:Epoch[071/800], Step[0200/0626], Avg Loss: 0.6936 +INFO:master_logger:Epoch[071/800], Step[0200/0626], Avg Loss: 0.6938 +INFO:local_logger:Epoch[071/800], Step[0200/0626], Avg Loss: 0.6930 +INFO:local_logger:Epoch[071/800], Step[0300/0626], Avg Loss: 0.6929 +INFO:local_logger:Epoch[071/800], Step[0300/0626], Avg Loss: 0.6938 +INFO:local_logger:Epoch[071/800], Step[0300/0626], Avg Loss: 0.6945 +INFO:local_logger:Epoch[071/800], Step[0300/0626], Avg Loss: 0.6937 +INFO:local_logger:Epoch[071/800], Step[0300/0626], Avg Loss: 0.6939 +INFO:local_logger:Epoch[071/800], Step[0300/0626], Avg Loss: 0.6944 +INFO:master_logger:Epoch[071/800], Step[0300/0626], Avg Loss: 0.6937 +INFO:local_logger:Epoch[071/800], Step[0300/0626], Avg Loss: 0.6928 +INFO:local_logger:Epoch[071/800], Step[0300/0626], Avg Loss: 0.6939 +INFO:local_logger:Epoch[071/800], Step[0400/0626], Avg Loss: 0.6946 +INFO:local_logger:Epoch[071/800], Step[0400/0626], Avg Loss: 0.6938 +INFO:local_logger:Epoch[071/800], Step[0400/0626], Avg Loss: 0.6931 +INFO:local_logger:Epoch[071/800], Step[0400/0626], Avg Loss: 0.6944 +INFO:local_logger:Epoch[071/800], Step[0400/0626], Avg Loss: 0.6940 +INFO:local_logger:Epoch[071/800], Step[0400/0626], Avg Loss: 0.6931 +INFO:local_logger:Epoch[071/800], Step[0400/0626], Avg Loss: 0.6940 +INFO:master_logger:Epoch[071/800], Step[0400/0626], Avg Loss: 0.6938 +INFO:local_logger:Epoch[071/800], Step[0400/0626], Avg Loss: 0.6938 +INFO:local_logger:Epoch[071/800], Step[0500/0626], Avg Loss: 0.6938 +INFO:local_logger:Epoch[071/800], Step[0500/0626], Avg Loss: 0.6934 +INFO:local_logger:Epoch[071/800], Step[0500/0626], Avg Loss: 0.6945 +INFO:local_logger:Epoch[071/800], Step[0500/0626], Avg Loss: 0.6939 +INFO:local_logger:Epoch[071/800], Step[0500/0626], Avg Loss: 0.6937 +INFO:local_logger:Epoch[071/800], Step[0500/0626], Avg Loss: 0.6941 
+INFO:local_logger:Epoch[071/800], Step[0500/0626], Avg Loss: 0.6944 +INFO:master_logger:Epoch[071/800], Step[0500/0626], Avg Loss: 0.6939 +INFO:local_logger:Epoch[071/800], Step[0500/0626], Avg Loss: 0.6932 +INFO:local_logger:Epoch[071/800], Step[0600/0626], Avg Loss: 0.6939 +INFO:local_logger:Epoch[071/800], Step[0600/0626], Avg Loss: 0.6933 +INFO:local_logger:Epoch[071/800], Step[0600/0626], Avg Loss: 0.6944 +INFO:local_logger:Epoch[071/800], Step[0600/0626], Avg Loss: 0.6941 +INFO:local_logger:Epoch[071/800], Step[0600/0626], Avg Loss: 0.6938 +INFO:local_logger:Epoch[071/800], Step[0600/0626], Avg Loss: 0.6937 +INFO:master_logger:Epoch[071/800], Step[0600/0626], Avg Loss: 0.6938 +INFO:local_logger:Epoch[071/800], Step[0600/0626], Avg Loss: 0.6934 +INFO:local_logger:Epoch[071/800], Step[0600/0626], Avg Loss: 0.6943 +INFO:local_logger:----- Epoch[071/800], Train Loss: 0.6943, time: 885.74 +INFO:local_logger:Now training epoch 72. LR=0.000152 +INFO:local_logger:----- Epoch[071/800], Train Loss: 0.6939, time: 886.70 +INFO:local_logger:Now training epoch 72. LR=0.000152 +INFO:local_logger:----- Epoch[071/800], Train Loss: 0.6940, time: 886.74 +INFO:local_logger:----- Epoch[071/800], Train Loss: 0.6935, time: 886.79 +INFO:local_logger:Now training epoch 72. LR=0.000152 +INFO:local_logger:Now training epoch 72. LR=0.000152 +INFO:local_logger:----- Epoch[071/800], Train Loss: 0.6938, time: 886.70 +INFO:local_logger:Now training epoch 72. LR=0.000152 +INFO:local_logger:----- Epoch[071/800], Train Loss: 0.6934, time: 886.76 +INFO:local_logger:Now training epoch 72. LR=0.000152 +INFO:local_logger:----- Epoch[071/800], Train Loss: 0.6937, time: 882.95 +INFO:master_logger:----- Epoch[071/800], Train Loss: 0.6939, time: 882.95 +INFO:local_logger:----- Epoch[071/800], Train Loss: 0.6945, time: 886.78 +INFO:local_logger:Now training epoch 72. LR=0.000152 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-71-Loss-0.6937192819392223.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-71-Loss-0.6937192819392223.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-71-Loss-0.6937192819392223.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-71-Loss-0.6937192819392223.pdopt +INFO:local_logger:Now training epoch 72. LR=0.000152 +INFO:master_logger:Now training epoch 72. 
LR=0.000152 +INFO:local_logger:Epoch[072/800], Step[0000/0626], Avg Loss: 0.6943 +INFO:master_logger:Epoch[072/800], Step[0000/0626], Avg Loss: 0.6974 +INFO:local_logger:Epoch[072/800], Step[0000/0626], Avg Loss: 0.6913 +INFO:local_logger:Epoch[072/800], Step[0000/0626], Avg Loss: 0.7013 +INFO:local_logger:Epoch[072/800], Step[0000/0626], Avg Loss: 0.6924 +INFO:local_logger:Epoch[072/800], Step[0000/0626], Avg Loss: 0.7045 +INFO:local_logger:Epoch[072/800], Step[0000/0626], Avg Loss: 0.6941 +INFO:local_logger:Epoch[072/800], Step[0000/0626], Avg Loss: 0.6954 +INFO:local_logger:Epoch[072/800], Step[0000/0626], Avg Loss: 0.7056 +INFO:local_logger:Epoch[072/800], Step[0100/0626], Avg Loss: 0.6924 +INFO:local_logger:Epoch[072/800], Step[0100/0626], Avg Loss: 0.6941 +INFO:local_logger:Epoch[072/800], Step[0100/0626], Avg Loss: 0.6941 +INFO:local_logger:Epoch[072/800], Step[0100/0626], Avg Loss: 0.6947 +INFO:local_logger:Epoch[072/800], Step[0100/0626], Avg Loss: 0.6934 +INFO:local_logger:Epoch[072/800], Step[0100/0626], Avg Loss: 0.6944 +INFO:master_logger:Epoch[072/800], Step[0100/0626], Avg Loss: 0.6939 +INFO:local_logger:Epoch[072/800], Step[0100/0626], Avg Loss: 0.6948 +INFO:local_logger:Epoch[072/800], Step[0100/0626], Avg Loss: 0.6936 +INFO:local_logger:Epoch[072/800], Step[0200/0626], Avg Loss: 0.6930 +INFO:local_logger:Epoch[072/800], Step[0200/0626], Avg Loss: 0.6933 +INFO:local_logger:Epoch[072/800], Step[0200/0626], Avg Loss: 0.6936 +INFO:local_logger:Epoch[072/800], Step[0200/0626], Avg Loss: 0.6929 +INFO:local_logger:Epoch[072/800], Step[0200/0626], Avg Loss: 0.6944 +INFO:local_logger:Epoch[072/800], Step[0200/0626], Avg Loss: 0.6937 +INFO:master_logger:Epoch[072/800], Step[0200/0626], Avg Loss: 0.6935 +INFO:local_logger:Epoch[072/800], Step[0200/0626], Avg Loss: 0.6929 +INFO:local_logger:Epoch[072/800], Step[0200/0626], Avg Loss: 0.6941 +INFO:local_logger:Epoch[072/800], Step[0300/0626], Avg Loss: 0.6932 +INFO:local_logger:Epoch[072/800], Step[0300/0626], Avg Loss: 0.6936 +INFO:local_logger:Epoch[072/800], Step[0300/0626], Avg Loss: 0.6939 +INFO:local_logger:Epoch[072/800], Step[0300/0626], Avg Loss: 0.6930 +INFO:local_logger:Epoch[072/800], Step[0300/0626], Avg Loss: 0.6935 +INFO:master_logger:Epoch[072/800], Step[0300/0626], Avg Loss: 0.6934 +INFO:local_logger:Epoch[072/800], Step[0300/0626], Avg Loss: 0.6942 +INFO:local_logger:Epoch[072/800], Step[0300/0626], Avg Loss: 0.6931 +INFO:local_logger:Epoch[072/800], Step[0300/0626], Avg Loss: 0.6928 +INFO:local_logger:Epoch[072/800], Step[0400/0626], Avg Loss: 0.6931 +INFO:local_logger:Epoch[072/800], Step[0400/0626], Avg Loss: 0.6927 +INFO:local_logger:Epoch[072/800], Step[0400/0626], Avg Loss: 0.6934 +INFO:local_logger:Epoch[072/800], Step[0400/0626], Avg Loss: 0.6939 +INFO:local_logger:Epoch[072/800], Step[0400/0626], Avg Loss: 0.6938 +INFO:local_logger:Epoch[072/800], Step[0400/0626], Avg Loss: 0.6936 +INFO:local_logger:Epoch[072/800], Step[0400/0626], Avg Loss: 0.6930 +INFO:master_logger:Epoch[072/800], Step[0400/0626], Avg Loss: 0.6933 +INFO:local_logger:Epoch[072/800], Step[0400/0626], Avg Loss: 0.6926 +INFO:local_logger:Epoch[072/800], Step[0500/0626], Avg Loss: 0.6938 +INFO:local_logger:Epoch[072/800], Step[0500/0626], Avg Loss: 0.6934 +INFO:local_logger:Epoch[072/800], Step[0500/0626], Avg Loss: 0.6938 +INFO:local_logger:Epoch[072/800], Step[0500/0626], Avg Loss: 0.6937 +INFO:master_logger:Epoch[072/800], Step[0500/0626], Avg Loss: 0.6933 +INFO:local_logger:Epoch[072/800], Step[0500/0626], Avg Loss: 0.6928 
+INFO:local_logger:Epoch[072/800], Step[0500/0626], Avg Loss: 0.6930 +INFO:local_logger:Epoch[072/800], Step[0500/0626], Avg Loss: 0.6929 +INFO:local_logger:Epoch[072/800], Step[0500/0626], Avg Loss: 0.6929 +INFO:local_logger:Epoch[072/800], Step[0600/0626], Avg Loss: 0.6936 +INFO:local_logger:Epoch[072/800], Step[0600/0626], Avg Loss: 0.6934 +INFO:local_logger:Epoch[072/800], Step[0600/0626], Avg Loss: 0.6929 +INFO:local_logger:Epoch[072/800], Step[0600/0626], Avg Loss: 0.6930 +INFO:local_logger:Epoch[072/800], Step[0600/0626], Avg Loss: 0.6930 +INFO:local_logger:Epoch[072/800], Step[0600/0626], Avg Loss: 0.6937 +INFO:master_logger:Epoch[072/800], Step[0600/0626], Avg Loss: 0.6933 +INFO:local_logger:Epoch[072/800], Step[0600/0626], Avg Loss: 0.6929 +INFO:local_logger:Epoch[072/800], Step[0600/0626], Avg Loss: 0.6937 +INFO:local_logger:----- Epoch[072/800], Train Loss: 0.6928, time: 870.67 +INFO:local_logger:Now training epoch 73. LR=0.000152 +INFO:local_logger:----- Epoch[072/800], Train Loss: 0.6930, time: 869.72 +INFO:local_logger:Now training epoch 73. LR=0.000152 +INFO:local_logger:----- Epoch[072/800], Train Loss: 0.6936, time: 869.72 +INFO:local_logger:Now training epoch 73. LR=0.000152 +INFO:local_logger:----- Epoch[072/800], Train Loss: 0.6930, time: 869.77 +INFO:local_logger:Now training epoch 73. LR=0.000152 +INFO:local_logger:----- Epoch[072/800], Train Loss: 0.6934, time: 870.14 +INFO:local_logger:Now training epoch 73. LR=0.000152 +INFO:local_logger:----- Epoch[072/800], Train Loss: 0.6936, time: 870.14 +INFO:local_logger:Now training epoch 73. LR=0.000152 +INFO:local_logger:----- Epoch[072/800], Train Loss: 0.6936, time: 866.43 +INFO:local_logger:----- Epoch[072/800], Train Loss: 0.6929, time: 870.14 +INFO:master_logger:----- Epoch[072/800], Train Loss: 0.6932, time: 866.43 +INFO:local_logger:Now training epoch 73. LR=0.000152 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-72-Loss-0.693554089917601.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-72-Loss-0.693554089917601.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-72-Loss-0.693554089917601.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-72-Loss-0.693554089917601.pdopt +INFO:local_logger:Now training epoch 73. LR=0.000152 +INFO:master_logger:Now training epoch 73. 
LR=0.000152 +INFO:local_logger:Epoch[073/800], Step[0000/0626], Avg Loss: 0.6893 +INFO:local_logger:Epoch[073/800], Step[0000/0626], Avg Loss: 0.6943 +INFO:master_logger:Epoch[073/800], Step[0000/0626], Avg Loss: 0.6940 +INFO:local_logger:Epoch[073/800], Step[0000/0626], Avg Loss: 0.6961 +INFO:local_logger:Epoch[073/800], Step[0000/0626], Avg Loss: 0.6941 +INFO:local_logger:Epoch[073/800], Step[0000/0626], Avg Loss: 0.6980 +INFO:local_logger:Epoch[073/800], Step[0000/0626], Avg Loss: 0.6944 +INFO:local_logger:Epoch[073/800], Step[0000/0626], Avg Loss: 0.6816 +INFO:local_logger:Epoch[073/800], Step[0000/0626], Avg Loss: 0.7038 +INFO:local_logger:Epoch[073/800], Step[0100/0626], Avg Loss: 0.6919 +INFO:local_logger:Epoch[073/800], Step[0100/0626], Avg Loss: 0.6915 +INFO:local_logger:Epoch[073/800], Step[0100/0626], Avg Loss: 0.6929 +INFO:local_logger:Epoch[073/800], Step[0100/0626], Avg Loss: 0.6938 +INFO:local_logger:Epoch[073/800], Step[0100/0626], Avg Loss: 0.6935 +INFO:local_logger:Epoch[073/800], Step[0100/0626], Avg Loss: 0.6940 +INFO:master_logger:Epoch[073/800], Step[0100/0626], Avg Loss: 0.6928 +INFO:local_logger:Epoch[073/800], Step[0100/0626], Avg Loss: 0.6929 +INFO:local_logger:Epoch[073/800], Step[0100/0626], Avg Loss: 0.6922 +INFO:local_logger:Epoch[073/800], Step[0200/0626], Avg Loss: 0.6924 +INFO:local_logger:Epoch[073/800], Step[0200/0626], Avg Loss: 0.6928 +INFO:local_logger:Epoch[073/800], Step[0200/0626], Avg Loss: 0.6930 +INFO:local_logger:Epoch[073/800], Step[0200/0626], Avg Loss: 0.6925 +INFO:master_logger:Epoch[073/800], Step[0200/0626], Avg Loss: 0.6927 +INFO:local_logger:Epoch[073/800], Step[0200/0626], Avg Loss: 0.6925 +INFO:local_logger:Epoch[073/800], Step[0200/0626], Avg Loss: 0.6916 +INFO:local_logger:Epoch[073/800], Step[0200/0626], Avg Loss: 0.6939 +INFO:local_logger:Epoch[073/800], Step[0200/0626], Avg Loss: 0.6934 +INFO:local_logger:Epoch[073/800], Step[0300/0626], Avg Loss: 0.6928 +INFO:local_logger:Epoch[073/800], Step[0300/0626], Avg Loss: 0.6932 +INFO:local_logger:Epoch[073/800], Step[0300/0626], Avg Loss: 0.6930 +INFO:local_logger:Epoch[073/800], Step[0300/0626], Avg Loss: 0.6922 +INFO:local_logger:Epoch[073/800], Step[0300/0626], Avg Loss: 0.6937 +INFO:master_logger:Epoch[073/800], Step[0300/0626], Avg Loss: 0.6928 +INFO:local_logger:Epoch[073/800], Step[0300/0626], Avg Loss: 0.6919 +INFO:local_logger:Epoch[073/800], Step[0300/0626], Avg Loss: 0.6928 +INFO:local_logger:Epoch[073/800], Step[0300/0626], Avg Loss: 0.6926 +INFO:local_logger:Epoch[073/800], Step[0400/0626], Avg Loss: 0.6923 +INFO:local_logger:Epoch[073/800], Step[0400/0626], Avg Loss: 0.6929 +INFO:local_logger:Epoch[073/800], Step[0400/0626], Avg Loss: 0.6917 +INFO:local_logger:Epoch[073/800], Step[0400/0626], Avg Loss: 0.6933 +INFO:local_logger:Epoch[073/800], Step[0400/0626], Avg Loss: 0.6930 +INFO:local_logger:Epoch[073/800], Step[0400/0626], Avg Loss: 0.6928 +INFO:local_logger:Epoch[073/800], Step[0400/0626], Avg Loss: 0.6933 +INFO:master_logger:Epoch[073/800], Step[0400/0626], Avg Loss: 0.6927 +INFO:local_logger:Epoch[073/800], Step[0400/0626], Avg Loss: 0.6926 +INFO:local_logger:Epoch[073/800], Step[0500/0626], Avg Loss: 0.6920 +INFO:local_logger:Epoch[073/800], Step[0500/0626], Avg Loss: 0.6916 +INFO:local_logger:Epoch[073/800], Step[0500/0626], Avg Loss: 0.6923 +INFO:local_logger:Epoch[073/800], Step[0500/0626], Avg Loss: 0.6931 +INFO:local_logger:Epoch[073/800], Step[0500/0626], Avg Loss: 0.6928 +INFO:local_logger:Epoch[073/800], Step[0500/0626], Avg Loss: 0.6927 
+INFO:local_logger:Epoch[073/800], Step[0500/0626], Avg Loss: 0.6929 +INFO:master_logger:Epoch[073/800], Step[0500/0626], Avg Loss: 0.6926 +INFO:local_logger:Epoch[073/800], Step[0500/0626], Avg Loss: 0.6933 +INFO:local_logger:Epoch[073/800], Step[0600/0626], Avg Loss: 0.6921 +INFO:local_logger:Epoch[073/800], Step[0600/0626], Avg Loss: 0.6927 +INFO:local_logger:Epoch[073/800], Step[0600/0626], Avg Loss: 0.6927 +INFO:local_logger:Epoch[073/800], Step[0600/0626], Avg Loss: 0.6928 +INFO:master_logger:Epoch[073/800], Step[0600/0626], Avg Loss: 0.6925 +INFO:local_logger:Epoch[073/800], Step[0600/0626], Avg Loss: 0.6921 +INFO:local_logger:Epoch[073/800], Step[0600/0626], Avg Loss: 0.6918 +INFO:local_logger:Epoch[073/800], Step[0600/0626], Avg Loss: 0.6930 +INFO:local_logger:Epoch[073/800], Step[0600/0626], Avg Loss: 0.6929 +INFO:local_logger:----- Epoch[073/800], Train Loss: 0.6926, time: 884.48 +INFO:local_logger:Now training epoch 74. LR=0.000152 +INFO:local_logger:----- Epoch[073/800], Train Loss: 0.6921, time: 884.48 +INFO:local_logger:Now training epoch 74. LR=0.000152 +INFO:local_logger:----- Epoch[073/800], Train Loss: 0.6928, time: 884.08 +INFO:local_logger:Now training epoch 74. LR=0.000152 +INFO:local_logger:----- Epoch[073/800], Train Loss: 0.6926, time: 880.49 +INFO:master_logger:----- Epoch[073/800], Train Loss: 0.6925, time: 880.49 +INFO:local_logger:----- Epoch[073/800], Train Loss: 0.6917, time: 884.29 +INFO:local_logger:Now training epoch 74. LR=0.000152 +INFO:local_logger:----- Epoch[073/800], Train Loss: 0.6920, time: 884.32 +INFO:local_logger:Now training epoch 74. LR=0.000152 +INFO:local_logger:----- Epoch[073/800], Train Loss: 0.6929, time: 884.79 +INFO:local_logger:Now training epoch 74. LR=0.000152 +INFO:local_logger:----- Epoch[073/800], Train Loss: 0.6930, time: 884.73 +INFO:local_logger:Now training epoch 74. LR=0.000152 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-73-Loss-0.6925669066549239.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-73-Loss-0.6925669066549239.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-73-Loss-0.6925669066549239.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-73-Loss-0.6925669066549239.pdopt +INFO:local_logger:Now training epoch 74. LR=0.000152 +INFO:master_logger:Now training epoch 74. 
LR=0.000152 +INFO:local_logger:Epoch[074/800], Step[0000/0626], Avg Loss: 0.6904 +INFO:local_logger:Epoch[074/800], Step[0000/0626], Avg Loss: 0.6894 +INFO:master_logger:Epoch[074/800], Step[0000/0626], Avg Loss: 0.6933 +INFO:local_logger:Epoch[074/800], Step[0000/0626], Avg Loss: 0.7013 +INFO:local_logger:Epoch[074/800], Step[0000/0626], Avg Loss: 0.6897 +INFO:local_logger:Epoch[074/800], Step[0000/0626], Avg Loss: 0.6899 +INFO:local_logger:Epoch[074/800], Step[0000/0626], Avg Loss: 0.6870 +INFO:local_logger:Epoch[074/800], Step[0000/0626], Avg Loss: 0.6982 +INFO:local_logger:Epoch[074/800], Step[0000/0626], Avg Loss: 0.7004 +INFO:local_logger:Epoch[074/800], Step[0100/0626], Avg Loss: 0.6923 +INFO:local_logger:Epoch[074/800], Step[0100/0626], Avg Loss: 0.6921 +INFO:local_logger:Epoch[074/800], Step[0100/0626], Avg Loss: 0.6912 +INFO:local_logger:Epoch[074/800], Step[0100/0626], Avg Loss: 0.6908 +INFO:local_logger:Epoch[074/800], Step[0100/0626], Avg Loss: 0.6927 +INFO:master_logger:Epoch[074/800], Step[0100/0626], Avg Loss: 0.6920 +INFO:local_logger:Epoch[074/800], Step[0100/0626], Avg Loss: 0.6914 +INFO:local_logger:Epoch[074/800], Step[0100/0626], Avg Loss: 0.6919 +INFO:local_logger:Epoch[074/800], Step[0100/0626], Avg Loss: 0.6938 +INFO:local_logger:Epoch[074/800], Step[0200/0626], Avg Loss: 0.6919 +INFO:local_logger:Epoch[074/800], Step[0200/0626], Avg Loss: 0.6924 +INFO:local_logger:Epoch[074/800], Step[0200/0626], Avg Loss: 0.6925 +INFO:local_logger:Epoch[074/800], Step[0200/0626], Avg Loss: 0.6917 +INFO:local_logger:Epoch[074/800], Step[0200/0626], Avg Loss: 0.6913 +INFO:local_logger:Epoch[074/800], Step[0200/0626], Avg Loss: 0.6916 +INFO:master_logger:Epoch[074/800], Step[0200/0626], Avg Loss: 0.6919 +INFO:local_logger:Epoch[074/800], Step[0200/0626], Avg Loss: 0.6908 +INFO:local_logger:Epoch[074/800], Step[0200/0626], Avg Loss: 0.6927 +INFO:local_logger:Epoch[074/800], Step[0300/0626], Avg Loss: 0.6916 +INFO:local_logger:Epoch[074/800], Step[0300/0626], Avg Loss: 0.6912 +INFO:local_logger:Epoch[074/800], Step[0300/0626], Avg Loss: 0.6920 +INFO:local_logger:Epoch[074/800], Step[0300/0626], Avg Loss: 0.6929 +INFO:master_logger:Epoch[074/800], Step[0300/0626], Avg Loss: 0.6920 +INFO:local_logger:Epoch[074/800], Step[0300/0626], Avg Loss: 0.6923 +INFO:local_logger:Epoch[074/800], Step[0300/0626], Avg Loss: 0.6919 +INFO:local_logger:Epoch[074/800], Step[0300/0626], Avg Loss: 0.6924 +INFO:local_logger:Epoch[074/800], Step[0300/0626], Avg Loss: 0.6921 +INFO:local_logger:Epoch[074/800], Step[0400/0626], Avg Loss: 0.6919 +INFO:local_logger:Epoch[074/800], Step[0400/0626], Avg Loss: 0.6925 +INFO:local_logger:Epoch[074/800], Step[0400/0626], Avg Loss: 0.6921 +INFO:local_logger:Epoch[074/800], Step[0400/0626], Avg Loss: 0.6918 +INFO:local_logger:Epoch[074/800], Step[0400/0626], Avg Loss: 0.6913 +INFO:local_logger:Epoch[074/800], Step[0400/0626], Avg Loss: 0.6926 +INFO:local_logger:Epoch[074/800], Step[0400/0626], Avg Loss: 0.6920 +INFO:master_logger:Epoch[074/800], Step[0400/0626], Avg Loss: 0.6920 +INFO:local_logger:Epoch[074/800], Step[0400/0626], Avg Loss: 0.6918 +INFO:local_logger:Epoch[074/800], Step[0500/0626], Avg Loss: 0.6919 +INFO:local_logger:Epoch[074/800], Step[0500/0626], Avg Loss: 0.6919 +INFO:local_logger:Epoch[074/800], Step[0500/0626], Avg Loss: 0.6922 +INFO:local_logger:Epoch[074/800], Step[0500/0626], Avg Loss: 0.6923 +INFO:local_logger:Epoch[074/800], Step[0500/0626], Avg Loss: 0.6922 +INFO:local_logger:Epoch[074/800], Step[0500/0626], Avg Loss: 0.6918 
+INFO:local_logger:Epoch[074/800], Step[0500/0626], Avg Loss: 0.6920 +INFO:master_logger:Epoch[074/800], Step[0500/0626], Avg Loss: 0.6920 +INFO:local_logger:Epoch[074/800], Step[0500/0626], Avg Loss: 0.6915 +INFO:local_logger:Epoch[074/800], Step[0600/0626], Avg Loss: 0.6923 +INFO:local_logger:Epoch[074/800], Step[0600/0626], Avg Loss: 0.6919 +INFO:local_logger:Epoch[074/800], Step[0600/0626], Avg Loss: 0.6918 +INFO:local_logger:Epoch[074/800], Step[0600/0626], Avg Loss: 0.6917 +INFO:local_logger:Epoch[074/800], Step[0600/0626], Avg Loss: 0.6917 +INFO:local_logger:Epoch[074/800], Step[0600/0626], Avg Loss: 0.6921 +INFO:local_logger:Epoch[074/800], Step[0600/0626], Avg Loss: 0.6922 +INFO:master_logger:Epoch[074/800], Step[0600/0626], Avg Loss: 0.6920 +INFO:local_logger:Epoch[074/800], Step[0600/0626], Avg Loss: 0.6922 +INFO:local_logger:----- Epoch[074/800], Train Loss: 0.6922, time: 868.83 +INFO:local_logger:Now training epoch 75. LR=0.000152 +INFO:local_logger:----- Epoch[074/800], Train Loss: 0.6922, time: 869.13 +INFO:local_logger:Now training epoch 75. LR=0.000152 +INFO:local_logger:----- Epoch[074/800], Train Loss: 0.6916, time: 868.88 +INFO:local_logger:Now training epoch 75. LR=0.000152 +INFO:local_logger:----- Epoch[074/800], Train Loss: 0.6920, time: 868.87 +INFO:local_logger:Now training epoch 75. LR=0.000152 +INFO:local_logger:----- Epoch[074/800], Train Loss: 0.6917, time: 869.16 +INFO:local_logger:Now training epoch 75. LR=0.000152 +INFO:local_logger:----- Epoch[074/800], Train Loss: 0.6917, time: 868.97 +INFO:local_logger:Now training epoch 75. LR=0.000152 +INFO:local_logger:----- Epoch[074/800], Train Loss: 0.6918, time: 865.36 +INFO:master_logger:----- Epoch[074/800], Train Loss: 0.6919, time: 865.36 +INFO:local_logger:----- Epoch[074/800], Train Loss: 0.6922, time: 869.28 +INFO:local_logger:Now training epoch 75. LR=0.000152 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-74-Loss-0.6918195025700205.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-74-Loss-0.6918195025700205.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-74-Loss-0.6918195025700205.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-74-Loss-0.6918195025700205.pdopt +INFO:local_logger:Now training epoch 75. LR=0.000152 +INFO:master_logger:Now training epoch 75. 
LR=0.000152 +INFO:local_logger:Epoch[075/800], Step[0000/0626], Avg Loss: 0.6855 +INFO:master_logger:Epoch[075/800], Step[0000/0626], Avg Loss: 0.6909 +INFO:local_logger:Epoch[075/800], Step[0000/0626], Avg Loss: 0.6841 +INFO:local_logger:Epoch[075/800], Step[0000/0626], Avg Loss: 0.6902 +INFO:local_logger:Epoch[075/800], Step[0000/0626], Avg Loss: 0.6874 +INFO:local_logger:Epoch[075/800], Step[0000/0626], Avg Loss: 0.6973 +INFO:local_logger:Epoch[075/800], Step[0000/0626], Avg Loss: 0.6886 +INFO:local_logger:Epoch[075/800], Step[0000/0626], Avg Loss: 0.6943 +INFO:local_logger:Epoch[075/800], Step[0000/0626], Avg Loss: 0.6994 +INFO:local_logger:Epoch[075/800], Step[0100/0626], Avg Loss: 0.6908 +INFO:local_logger:Epoch[075/800], Step[0100/0626], Avg Loss: 0.6913 +INFO:local_logger:Epoch[075/800], Step[0100/0626], Avg Loss: 0.6910 +INFO:local_logger:Epoch[075/800], Step[0100/0626], Avg Loss: 0.6919 +INFO:local_logger:Epoch[075/800], Step[0100/0626], Avg Loss: 0.6908 +INFO:local_logger:Epoch[075/800], Step[0100/0626], Avg Loss: 0.6928 +INFO:master_logger:Epoch[075/800], Step[0100/0626], Avg Loss: 0.6914 +INFO:local_logger:Epoch[075/800], Step[0100/0626], Avg Loss: 0.6907 +INFO:local_logger:Epoch[075/800], Step[0100/0626], Avg Loss: 0.6915 +INFO:local_logger:Epoch[075/800], Step[0200/0626], Avg Loss: 0.6921 +INFO:local_logger:Epoch[075/800], Step[0200/0626], Avg Loss: 0.6912 +INFO:local_logger:Epoch[075/800], Step[0200/0626], Avg Loss: 0.6916 +INFO:local_logger:Epoch[075/800], Step[0200/0626], Avg Loss: 0.6914 +INFO:local_logger:Epoch[075/800], Step[0200/0626], Avg Loss: 0.6906 +INFO:local_logger:Epoch[075/800], Step[0200/0626], Avg Loss: 0.6921 +INFO:local_logger:Epoch[075/800], Step[0200/0626], Avg Loss: 0.6911 +INFO:master_logger:Epoch[075/800], Step[0200/0626], Avg Loss: 0.6913 +INFO:local_logger:Epoch[075/800], Step[0200/0626], Avg Loss: 0.6904 +INFO:local_logger:Epoch[075/800], Step[0300/0626], Avg Loss: 0.6908 +INFO:local_logger:Epoch[075/800], Step[0300/0626], Avg Loss: 0.6915 +INFO:local_logger:Epoch[075/800], Step[0300/0626], Avg Loss: 0.6911 +INFO:local_logger:Epoch[075/800], Step[0300/0626], Avg Loss: 0.6906 +INFO:local_logger:Epoch[075/800], Step[0300/0626], Avg Loss: 0.6916 +INFO:local_logger:Epoch[075/800], Step[0300/0626], Avg Loss: 0.6913 +INFO:master_logger:Epoch[075/800], Step[0300/0626], Avg Loss: 0.6912 +INFO:local_logger:Epoch[075/800], Step[0300/0626], Avg Loss: 0.6916 +INFO:local_logger:Epoch[075/800], Step[0300/0626], Avg Loss: 0.6915 +INFO:local_logger:Epoch[075/800], Step[0400/0626], Avg Loss: 0.6906 +INFO:local_logger:Epoch[075/800], Step[0400/0626], Avg Loss: 0.6910 +INFO:local_logger:Epoch[075/800], Step[0400/0626], Avg Loss: 0.6917 +INFO:local_logger:Epoch[075/800], Step[0400/0626], Avg Loss: 0.6910 +INFO:local_logger:Epoch[075/800], Step[0400/0626], Avg Loss: 0.6919 +INFO:master_logger:Epoch[075/800], Step[0400/0626], Avg Loss: 0.6913 +INFO:local_logger:Epoch[075/800], Step[0400/0626], Avg Loss: 0.6915 +INFO:local_logger:Epoch[075/800], Step[0400/0626], Avg Loss: 0.6914 +INFO:local_logger:Epoch[075/800], Step[0400/0626], Avg Loss: 0.6913 +INFO:local_logger:Epoch[075/800], Step[0500/0626], Avg Loss: 0.6918 +INFO:local_logger:Epoch[075/800], Step[0500/0626], Avg Loss: 0.6911 +INFO:local_logger:Epoch[075/800], Step[0500/0626], Avg Loss: 0.6917 +INFO:local_logger:Epoch[075/800], Step[0500/0626], Avg Loss: 0.6911 +INFO:local_logger:Epoch[075/800], Step[0500/0626], Avg Loss: 0.6911 +INFO:local_logger:Epoch[075/800], Step[0500/0626], Avg Loss: 0.6918 
+INFO:master_logger:----- Epoch[075/800], Train Loss: 0.6913, time: 885.80
+INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-75-Loss-0.6910110246189355.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-75-Loss-0.6910110246189355.pdopt
+INFO:master_logger:Now training epoch 76. LR=0.000152
+INFO:master_logger:----- Epoch[076/800], Train Loss: 0.6906, time: 855.76
+INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-76-Loss-0.6905609986769955.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-76-Loss-0.6905609986769955.pdopt
+INFO:master_logger:Now training epoch 77. LR=0.000152
+INFO:master_logger:----- Epoch[077/800], Train Loss: 0.6902, time: 880.45
+INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-77-Loss-0.6897522572932259.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-77-Loss-0.6897522572932259.pdopt
+INFO:master_logger:Now training epoch 78. LR=0.000152
+INFO:master_logger:----- Epoch[078/800], Train Loss: 0.6896, time: 846.72
+INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-78-Loss-0.6895287958086865.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-78-Loss-0.6895287958086865.pdopt
+INFO:master_logger:Now training epoch 79. LR=0.000152
+INFO:master_logger:----- Epoch[079/800], Train Loss: 0.6892, time: 884.16
+INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-79-Loss-0.6886302635034396.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-79-Loss-0.6886302635034396.pdopt
+INFO:master_logger:Now training epoch 80. LR=0.000152
+INFO:master_logger:----- Epoch[080/800], Train Loss: 0.6889, time: 845.47
+INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-80-Loss-0.6887507163269282.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-80-Loss-0.6887507163269282.pdopt
+INFO:master_logger:Now training epoch 81. LR=0.000153
+INFO:master_logger:----- Epoch[081/800], Train Loss: 0.6882, time: 883.70
+INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-81-Loss-0.6885438528329545.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-81-Loss-0.6885438528329545.pdopt
+INFO:master_logger:Now training epoch 82. LR=0.000153
+INFO:master_logger:----- Epoch[082/800], Train Loss: 0.6876, time: 847.31
+INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-82-Loss-0.687688001142508.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-82-Loss-0.687688001142508.pdopt
+INFO:master_logger:Now training epoch 83. LR=0.000153
+INFO:master_logger:----- Epoch[083/800], Train Loss: 0.6872, time: 890.84
+INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-83-Loss-0.6870564654976226.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-83-Loss-0.6870564654976226.pdopt
+INFO:master_logger:Now training epoch 84. LR=0.000153
+INFO:master_logger:----- Epoch[084/800], Train Loss: 0.6869, time: 854.61
+INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-84-Loss-0.6870198997374206.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-84-Loss-0.6870198997374206.pdopt
+INFO:master_logger:Now training epoch 85. LR=0.000153
+INFO:master_logger:----- Epoch[085/800], Train Loss: 0.6862, time: 890.01
+INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-85-Loss-0.6859439270970955.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-85-Loss-0.6859439270970955.pdopt
+INFO:master_logger:Now training epoch 86. LR=0.000153
+INFO:master_logger:----- Epoch[086/800], Train Loss: 0.6860, time: 856.92
+INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-86-Loss-0.6862062182739747.pdparams
+INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-86-Loss-0.6862062182739747.pdopt
+INFO:master_logger:Now training epoch 87. LR=0.000153
+INFO:master_logger:Epoch[087/800], Step[0500/0626], Avg Loss: 0.6857 +INFO:local_logger:Epoch[087/800], Step[0500/0626], Avg Loss: 0.6859 +INFO:local_logger:Epoch[087/800], Step[0500/0626], Avg Loss: 0.6853 +INFO:local_logger:Epoch[087/800], Step[0600/0626], Avg Loss: 0.6857 +INFO:local_logger:Epoch[087/800], Step[0600/0626], Avg Loss: 0.6857 +INFO:local_logger:Epoch[087/800], Step[0600/0626], Avg Loss: 0.6855 +INFO:local_logger:Epoch[087/800], Step[0600/0626], Avg Loss: 0.6860 +INFO:local_logger:Epoch[087/800], Step[0600/0626], Avg Loss: 0.6853 +INFO:local_logger:Epoch[087/800], Step[0600/0626], Avg Loss: 0.6855 +INFO:local_logger:Epoch[087/800], Step[0600/0626], Avg Loss: 0.6858 +INFO:master_logger:Epoch[087/800], Step[0600/0626], Avg Loss: 0.6856 +INFO:local_logger:Epoch[087/800], Step[0600/0626], Avg Loss: 0.6852 +INFO:local_logger:----- Epoch[087/800], Train Loss: 0.6853, time: 895.08 +INFO:local_logger:Now training epoch 88. LR=0.000153 +INFO:local_logger:----- Epoch[087/800], Train Loss: 0.6855, time: 895.09 +INFO:local_logger:Now training epoch 88. LR=0.000153 +INFO:local_logger:----- Epoch[087/800], Train Loss: 0.6860, time: 895.73 +INFO:local_logger:Now training epoch 88. LR=0.000153 +INFO:local_logger:----- Epoch[087/800], Train Loss: 0.6852, time: 896.01 +INFO:local_logger:Now training epoch 88. LR=0.000153 +INFO:local_logger:----- Epoch[087/800], Train Loss: 0.6856, time: 895.77 +INFO:local_logger:----- Epoch[087/800], Train Loss: 0.6857, time: 896.11 +INFO:local_logger:Now training epoch 88. LR=0.000153 +INFO:local_logger:Now training epoch 88. LR=0.000153 +INFO:local_logger:----- Epoch[087/800], Train Loss: 0.6857, time: 892.15 +INFO:master_logger:----- Epoch[087/800], Train Loss: 0.6856, time: 892.15 +INFO:local_logger:----- Epoch[087/800], Train Loss: 0.6855, time: 895.77 +INFO:local_logger:Now training epoch 88. LR=0.000153 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-87-Loss-0.6857299549603768.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-87-Loss-0.6857299549603768.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-87-Loss-0.6857299549603768.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-87-Loss-0.6857299549603768.pdopt +INFO:local_logger:Now training epoch 88. LR=0.000153 +INFO:master_logger:Now training epoch 88. 
LR=0.000153 +INFO:local_logger:Epoch[088/800], Step[0000/0626], Avg Loss: 0.6931 +INFO:local_logger:Epoch[088/800], Step[0000/0626], Avg Loss: 0.6886 +INFO:local_logger:Epoch[088/800], Step[0000/0626], Avg Loss: 0.6873 +INFO:local_logger:Epoch[088/800], Step[0000/0626], Avg Loss: 0.6866 +INFO:local_logger:Epoch[088/800], Step[0000/0626], Avg Loss: 0.6765 +INFO:local_logger:Epoch[088/800], Step[0000/0626], Avg Loss: 0.6795 +INFO:local_logger:Epoch[088/800], Step[0000/0626], Avg Loss: 0.6839 +INFO:master_logger:Epoch[088/800], Step[0000/0626], Avg Loss: 0.6844 +INFO:local_logger:Epoch[088/800], Step[0000/0626], Avg Loss: 0.6800 +INFO:local_logger:Epoch[088/800], Step[0100/0626], Avg Loss: 0.6870 +INFO:local_logger:Epoch[088/800], Step[0100/0626], Avg Loss: 0.6861 +INFO:local_logger:Epoch[088/800], Step[0100/0626], Avg Loss: 0.6876 +INFO:local_logger:Epoch[088/800], Step[0100/0626], Avg Loss: 0.6861 +INFO:local_logger:Epoch[088/800], Step[0100/0626], Avg Loss: 0.6868 +INFO:master_logger:Epoch[088/800], Step[0100/0626], Avg Loss: 0.6863 +INFO:local_logger:Epoch[088/800], Step[0100/0626], Avg Loss: 0.6847 +INFO:local_logger:Epoch[088/800], Step[0100/0626], Avg Loss: 0.6859 +INFO:local_logger:Epoch[088/800], Step[0100/0626], Avg Loss: 0.6861 +INFO:local_logger:Epoch[088/800], Step[0200/0626], Avg Loss: 0.6857 +INFO:local_logger:Epoch[088/800], Step[0200/0626], Avg Loss: 0.6861 +INFO:local_logger:Epoch[088/800], Step[0200/0626], Avg Loss: 0.6863 +INFO:local_logger:Epoch[088/800], Step[0200/0626], Avg Loss: 0.6859 +INFO:local_logger:Epoch[088/800], Step[0200/0626], Avg Loss: 0.6852 +INFO:local_logger:Epoch[088/800], Step[0200/0626], Avg Loss: 0.6860 +INFO:master_logger:Epoch[088/800], Step[0200/0626], Avg Loss: 0.6858 +INFO:local_logger:Epoch[088/800], Step[0200/0626], Avg Loss: 0.6858 +INFO:local_logger:Epoch[088/800], Step[0200/0626], Avg Loss: 0.6855 +INFO:local_logger:Epoch[088/800], Step[0300/0626], Avg Loss: 0.6854 +INFO:local_logger:Epoch[088/800], Step[0300/0626], Avg Loss: 0.6859 +INFO:local_logger:Epoch[088/800], Step[0300/0626], Avg Loss: 0.6857 +INFO:local_logger:Epoch[088/800], Step[0300/0626], Avg Loss: 0.6861 +INFO:local_logger:Epoch[088/800], Step[0300/0626], Avg Loss: 0.6860 +INFO:local_logger:Epoch[088/800], Step[0300/0626], Avg Loss: 0.6859 +INFO:master_logger:Epoch[088/800], Step[0300/0626], Avg Loss: 0.6858 +INFO:local_logger:Epoch[088/800], Step[0300/0626], Avg Loss: 0.6855 +INFO:local_logger:Epoch[088/800], Step[0300/0626], Avg Loss: 0.6855 +INFO:local_logger:Epoch[088/800], Step[0400/0626], Avg Loss: 0.6858 +INFO:local_logger:Epoch[088/800], Step[0400/0626], Avg Loss: 0.6855 +INFO:local_logger:Epoch[088/800], Step[0400/0626], Avg Loss: 0.6860 +INFO:local_logger:Epoch[088/800], Step[0400/0626], Avg Loss: 0.6853 +INFO:local_logger:Epoch[088/800], Step[0400/0626], Avg Loss: 0.6857 +INFO:local_logger:Epoch[088/800], Step[0400/0626], Avg Loss: 0.6854 +INFO:local_logger:Epoch[088/800], Step[0400/0626], Avg Loss: 0.6859 +INFO:master_logger:Epoch[088/800], Step[0400/0626], Avg Loss: 0.6856 +INFO:local_logger:Epoch[088/800], Step[0400/0626], Avg Loss: 0.6854 +INFO:local_logger:Epoch[088/800], Step[0500/0626], Avg Loss: 0.6855 +INFO:local_logger:Epoch[088/800], Step[0500/0626], Avg Loss: 0.6859 +INFO:local_logger:Epoch[088/800], Step[0500/0626], Avg Loss: 0.6854 +INFO:local_logger:Epoch[088/800], Step[0500/0626], Avg Loss: 0.6854 +INFO:local_logger:Epoch[088/800], Step[0500/0626], Avg Loss: 0.6857 +INFO:master_logger:Epoch[088/800], Step[0500/0626], Avg Loss: 0.6855 
+INFO:local_logger:Epoch[088/800], Step[0500/0626], Avg Loss: 0.6853 +INFO:local_logger:Epoch[088/800], Step[0500/0626], Avg Loss: 0.6853 +INFO:local_logger:Epoch[088/800], Step[0500/0626], Avg Loss: 0.6857 +INFO:local_logger:Epoch[088/800], Step[0600/0626], Avg Loss: 0.6853 +INFO:local_logger:Epoch[088/800], Step[0600/0626], Avg Loss: 0.6854 +INFO:local_logger:Epoch[088/800], Step[0600/0626], Avg Loss: 0.6855 +INFO:local_logger:Epoch[088/800], Step[0600/0626], Avg Loss: 0.6854 +INFO:local_logger:Epoch[088/800], Step[0600/0626], Avg Loss: 0.6855 +INFO:local_logger:Epoch[088/800], Step[0600/0626], Avg Loss: 0.6854 +INFO:master_logger:Epoch[088/800], Step[0600/0626], Avg Loss: 0.6853 +INFO:local_logger:Epoch[088/800], Step[0600/0626], Avg Loss: 0.6850 +INFO:local_logger:Epoch[088/800], Step[0600/0626], Avg Loss: 0.6853 +INFO:local_logger:----- Epoch[088/800], Train Loss: 0.6855, time: 854.47 +INFO:master_logger:----- Epoch[088/800], Train Loss: 0.6853, time: 854.47 +INFO:local_logger:----- Epoch[088/800], Train Loss: 0.6854, time: 859.85 +INFO:local_logger:Now training epoch 89. LR=0.000154 +INFO:local_logger:----- Epoch[088/800], Train Loss: 0.6849, time: 859.24 +INFO:local_logger:Now training epoch 89. LR=0.000154 +INFO:local_logger:----- Epoch[088/800], Train Loss: 0.6852, time: 859.32 +INFO:local_logger:Now training epoch 89. LR=0.000154 +INFO:local_logger:----- Epoch[088/800], Train Loss: 0.6854, time: 859.33 +INFO:local_logger:Now training epoch 89. LR=0.000154 +INFO:local_logger:----- Epoch[088/800], Train Loss: 0.6853, time: 859.35 +INFO:local_logger:----- Epoch[088/800], Train Loss: 0.6853, time: 859.33 +INFO:local_logger:Now training epoch 89. LR=0.000154 +INFO:local_logger:Now training epoch 89. LR=0.000154 +INFO:local_logger:----- Epoch[088/800], Train Loss: 0.6856, time: 860.00 +INFO:local_logger:Now training epoch 89. LR=0.000154 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-88-Loss-0.6854734038612285.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-88-Loss-0.6854734038612285.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-88-Loss-0.6854734038612285.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-88-Loss-0.6854734038612285.pdopt +INFO:local_logger:Now training epoch 89. LR=0.000154 +INFO:master_logger:Now training epoch 89. 
LR=0.000154 +INFO:local_logger:Epoch[089/800], Step[0000/0626], Avg Loss: 0.6780 +INFO:local_logger:Epoch[089/800], Step[0000/0626], Avg Loss: 0.6912 +INFO:local_logger:Epoch[089/800], Step[0000/0626], Avg Loss: 0.6884 +INFO:local_logger:Epoch[089/800], Step[0000/0626], Avg Loss: 0.6960 +INFO:local_logger:Epoch[089/800], Step[0000/0626], Avg Loss: 0.6791 +INFO:local_logger:Epoch[089/800], Step[0000/0626], Avg Loss: 0.6747 +INFO:local_logger:Epoch[089/800], Step[0000/0626], Avg Loss: 0.6772 +INFO:local_logger:Epoch[089/800], Step[0000/0626], Avg Loss: 0.6825 +INFO:master_logger:Epoch[089/800], Step[0000/0626], Avg Loss: 0.6834 +INFO:local_logger:Epoch[089/800], Step[0100/0626], Avg Loss: 0.6853 +INFO:local_logger:Epoch[089/800], Step[0100/0626], Avg Loss: 0.6840 +INFO:local_logger:Epoch[089/800], Step[0100/0626], Avg Loss: 0.6856 +INFO:local_logger:Epoch[089/800], Step[0100/0626], Avg Loss: 0.6847 +INFO:local_logger:Epoch[089/800], Step[0100/0626], Avg Loss: 0.6850 +INFO:master_logger:Epoch[089/800], Step[0100/0626], Avg Loss: 0.6849 +INFO:local_logger:Epoch[089/800], Step[0100/0626], Avg Loss: 0.6843 +INFO:local_logger:Epoch[089/800], Step[0100/0626], Avg Loss: 0.6848 +INFO:local_logger:Epoch[089/800], Step[0100/0626], Avg Loss: 0.6858 +INFO:local_logger:Epoch[089/800], Step[0200/0626], Avg Loss: 0.6851 +INFO:local_logger:Epoch[089/800], Step[0200/0626], Avg Loss: 0.6844 +INFO:local_logger:Epoch[089/800], Step[0200/0626], Avg Loss: 0.6846 +INFO:local_logger:Epoch[089/800], Step[0200/0626], Avg Loss: 0.6852 +INFO:local_logger:Epoch[089/800], Step[0200/0626], Avg Loss: 0.6844 +INFO:master_logger:Epoch[089/800], Step[0200/0626], Avg Loss: 0.6849 +INFO:local_logger:Epoch[089/800], Step[0200/0626], Avg Loss: 0.6853 +INFO:local_logger:Epoch[089/800], Step[0200/0626], Avg Loss: 0.6852 +INFO:local_logger:Epoch[089/800], Step[0200/0626], Avg Loss: 0.6854 +INFO:local_logger:Epoch[089/800], Step[0300/0626], Avg Loss: 0.6854 +INFO:local_logger:Epoch[089/800], Step[0300/0626], Avg Loss: 0.6853 +INFO:local_logger:Epoch[089/800], Step[0300/0626], Avg Loss: 0.6841 +INFO:local_logger:Epoch[089/800], Step[0300/0626], Avg Loss: 0.6845 +INFO:local_logger:Epoch[089/800], Step[0300/0626], Avg Loss: 0.6847 +INFO:local_logger:Epoch[089/800], Step[0300/0626], Avg Loss: 0.6853 +INFO:local_logger:Epoch[089/800], Step[0300/0626], Avg Loss: 0.6852 +INFO:local_logger:Epoch[089/800], Step[0300/0626], Avg Loss: 0.6847 +INFO:master_logger:Epoch[089/800], Step[0300/0626], Avg Loss: 0.6849 +INFO:local_logger:Epoch[089/800], Step[0400/0626], Avg Loss: 0.6845 +INFO:local_logger:Epoch[089/800], Step[0400/0626], Avg Loss: 0.6845 +INFO:local_logger:Epoch[089/800], Step[0400/0626], Avg Loss: 0.6849 +INFO:local_logger:Epoch[089/800], Step[0400/0626], Avg Loss: 0.6850 +INFO:local_logger:Epoch[089/800], Step[0400/0626], Avg Loss: 0.6854 +INFO:master_logger:Epoch[089/800], Step[0400/0626], Avg Loss: 0.6850 +INFO:local_logger:Epoch[089/800], Step[0400/0626], Avg Loss: 0.6851 +INFO:local_logger:Epoch[089/800], Step[0400/0626], Avg Loss: 0.6856 +INFO:local_logger:Epoch[089/800], Step[0400/0626], Avg Loss: 0.6849 +INFO:local_logger:Epoch[089/800], Step[0500/0626], Avg Loss: 0.6849 +INFO:local_logger:Epoch[089/800], Step[0500/0626], Avg Loss: 0.6845 +INFO:local_logger:Epoch[089/800], Step[0500/0626], Avg Loss: 0.6851 +INFO:local_logger:Epoch[089/800], Step[0500/0626], Avg Loss: 0.6848 +INFO:local_logger:Epoch[089/800], Step[0500/0626], Avg Loss: 0.6853 +INFO:local_logger:Epoch[089/800], Step[0500/0626], Avg Loss: 0.6847 
+INFO:local_logger:Epoch[089/800], Step[0500/0626], Avg Loss: 0.6849 +INFO:master_logger:Epoch[089/800], Step[0500/0626], Avg Loss: 0.6849 +INFO:local_logger:Epoch[089/800], Step[0500/0626], Avg Loss: 0.6850 +INFO:local_logger:Epoch[089/800], Step[0600/0626], Avg Loss: 0.6847 +INFO:local_logger:Epoch[089/800], Step[0600/0626], Avg Loss: 0.6850 +INFO:local_logger:Epoch[089/800], Step[0600/0626], Avg Loss: 0.6845 +INFO:local_logger:Epoch[089/800], Step[0600/0626], Avg Loss: 0.6849 +INFO:local_logger:Epoch[089/800], Step[0600/0626], Avg Loss: 0.6845 +INFO:master_logger:Epoch[089/800], Step[0600/0626], Avg Loss: 0.6847 +INFO:local_logger:Epoch[089/800], Step[0600/0626], Avg Loss: 0.6849 +INFO:local_logger:Epoch[089/800], Step[0600/0626], Avg Loss: 0.6848 +INFO:local_logger:Epoch[089/800], Step[0600/0626], Avg Loss: 0.6849 +INFO:local_logger:----- Epoch[089/800], Train Loss: 0.6845, time: 884.60 +INFO:local_logger:Now training epoch 90. LR=0.000154 +INFO:local_logger:----- Epoch[089/800], Train Loss: 0.6849, time: 885.46 +INFO:local_logger:Now training epoch 90. LR=0.000154 +INFO:local_logger:----- Epoch[089/800], Train Loss: 0.6846, time: 882.71 +INFO:master_logger:----- Epoch[089/800], Train Loss: 0.6848, time: 882.71 +INFO:local_logger:----- Epoch[089/800], Train Loss: 0.6850, time: 885.49 +INFO:local_logger:Now training epoch 90. LR=0.000154 +INFO:local_logger:----- Epoch[089/800], Train Loss: 0.6845, time: 885.58 +INFO:local_logger:Now training epoch 90. LR=0.000154 +INFO:local_logger:----- Epoch[089/800], Train Loss: 0.6848, time: 885.61 +INFO:local_logger:Now training epoch 90. LR=0.000154 +INFO:local_logger:----- Epoch[089/800], Train Loss: 0.6849, time: 885.54 +INFO:local_logger:Now training epoch 90. LR=0.000154 +INFO:local_logger:----- Epoch[089/800], Train Loss: 0.6849, time: 885.50 +INFO:local_logger:Now training epoch 90. LR=0.000154 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-89-Loss-0.6845652029200572.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-89-Loss-0.6845652029200572.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-89-Loss-0.6845652029200572.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-89-Loss-0.6845652029200572.pdopt +INFO:local_logger:Now training epoch 90. LR=0.000154 +INFO:master_logger:Now training epoch 90. 
LR=0.000154 +INFO:local_logger:Epoch[090/800], Step[0000/0626], Avg Loss: 0.6924 +INFO:local_logger:Epoch[090/800], Step[0000/0626], Avg Loss: 0.6899 +INFO:local_logger:Epoch[090/800], Step[0000/0626], Avg Loss: 0.6905 +INFO:local_logger:Epoch[090/800], Step[0000/0626], Avg Loss: 0.6763 +INFO:local_logger:Epoch[090/800], Step[0000/0626], Avg Loss: 0.6748 +INFO:master_logger:Epoch[090/800], Step[0000/0626], Avg Loss: 0.6842 +INFO:local_logger:Epoch[090/800], Step[0000/0626], Avg Loss: 0.6827 +INFO:local_logger:Epoch[090/800], Step[0000/0626], Avg Loss: 0.6787 +INFO:local_logger:Epoch[090/800], Step[0000/0626], Avg Loss: 0.6883 +INFO:local_logger:Epoch[090/800], Step[0100/0626], Avg Loss: 0.6841 +INFO:local_logger:Epoch[090/800], Step[0100/0626], Avg Loss: 0.6851 +INFO:local_logger:Epoch[090/800], Step[0100/0626], Avg Loss: 0.6853 +INFO:master_logger:Epoch[090/800], Step[0100/0626], Avg Loss: 0.6846 +INFO:local_logger:Epoch[090/800], Step[0100/0626], Avg Loss: 0.6853 +INFO:local_logger:Epoch[090/800], Step[0100/0626], Avg Loss: 0.6835 +INFO:local_logger:Epoch[090/800], Step[0100/0626], Avg Loss: 0.6841 +INFO:local_logger:Epoch[090/800], Step[0100/0626], Avg Loss: 0.6847 +INFO:local_logger:Epoch[090/800], Step[0100/0626], Avg Loss: 0.6849 +INFO:local_logger:Epoch[090/800], Step[0200/0626], Avg Loss: 0.6839 +INFO:local_logger:Epoch[090/800], Step[0200/0626], Avg Loss: 0.6842 +INFO:local_logger:Epoch[090/800], Step[0200/0626], Avg Loss: 0.6840 +INFO:local_logger:Epoch[090/800], Step[0200/0626], Avg Loss: 0.6839 +INFO:local_logger:Epoch[090/800], Step[0200/0626], Avg Loss: 0.6850 +INFO:master_logger:Epoch[090/800], Step[0200/0626], Avg Loss: 0.6844 +INFO:local_logger:Epoch[090/800], Step[0200/0626], Avg Loss: 0.6843 +INFO:local_logger:Epoch[090/800], Step[0200/0626], Avg Loss: 0.6853 +INFO:local_logger:Epoch[090/800], Step[0200/0626], Avg Loss: 0.6845 +INFO:local_logger:Epoch[090/800], Step[0300/0626], Avg Loss: 0.6851 +INFO:local_logger:Epoch[090/800], Step[0300/0626], Avg Loss: 0.6842 +INFO:local_logger:Epoch[090/800], Step[0300/0626], Avg Loss: 0.6842 +INFO:local_logger:Epoch[090/800], Step[0300/0626], Avg Loss: 0.6838 +INFO:local_logger:Epoch[090/800], Step[0300/0626], Avg Loss: 0.6843 +INFO:local_logger:Epoch[090/800], Step[0300/0626], Avg Loss: 0.6848 +INFO:local_logger:Epoch[090/800], Step[0300/0626], Avg Loss: 0.6841 +INFO:master_logger:Epoch[090/800], Step[0300/0626], Avg Loss: 0.6843 +INFO:local_logger:Epoch[090/800], Step[0300/0626], Avg Loss: 0.6841 +INFO:local_logger:Epoch[090/800], Step[0400/0626], Avg Loss: 0.6844 +INFO:local_logger:Epoch[090/800], Step[0400/0626], Avg Loss: 0.6842 +INFO:local_logger:Epoch[090/800], Step[0400/0626], Avg Loss: 0.6843 +INFO:local_logger:Epoch[090/800], Step[0400/0626], Avg Loss: 0.6845 +INFO:local_logger:Epoch[090/800], Step[0400/0626], Avg Loss: 0.6845 +INFO:local_logger:Epoch[090/800], Step[0400/0626], Avg Loss: 0.6845 +INFO:master_logger:Epoch[090/800], Step[0400/0626], Avg Loss: 0.6844 +INFO:local_logger:Epoch[090/800], Step[0400/0626], Avg Loss: 0.6845 +INFO:local_logger:Epoch[090/800], Step[0400/0626], Avg Loss: 0.6840 +INFO:local_logger:Epoch[090/800], Step[0500/0626], Avg Loss: 0.6842 +INFO:local_logger:Epoch[090/800], Step[0500/0626], Avg Loss: 0.6846 +INFO:local_logger:Epoch[090/800], Step[0500/0626], Avg Loss: 0.6845 +INFO:local_logger:Epoch[090/800], Step[0500/0626], Avg Loss: 0.6840 +INFO:local_logger:Epoch[090/800], Step[0500/0626], Avg Loss: 0.6842 +INFO:master_logger:Epoch[090/800], Step[0500/0626], Avg Loss: 0.6843 
+INFO:local_logger:Epoch[090/800], Step[0500/0626], Avg Loss: 0.6841 +INFO:local_logger:Epoch[090/800], Step[0500/0626], Avg Loss: 0.6844 +INFO:local_logger:Epoch[090/800], Step[0500/0626], Avg Loss: 0.6846 +INFO:local_logger:Epoch[090/800], Step[0600/0626], Avg Loss: 0.6844 +INFO:local_logger:Epoch[090/800], Step[0600/0626], Avg Loss: 0.6844 +INFO:local_logger:Epoch[090/800], Step[0600/0626], Avg Loss: 0.6842 +INFO:master_logger:Epoch[090/800], Step[0600/0626], Avg Loss: 0.6843 +INFO:local_logger:Epoch[090/800], Step[0600/0626], Avg Loss: 0.6840 +INFO:local_logger:Epoch[090/800], Step[0600/0626], Avg Loss: 0.6839 +INFO:local_logger:Epoch[090/800], Step[0600/0626], Avg Loss: 0.6846 +INFO:local_logger:Epoch[090/800], Step[0600/0626], Avg Loss: 0.6843 +INFO:local_logger:Epoch[090/800], Step[0600/0626], Avg Loss: 0.6845 +INFO:local_logger:----- Epoch[090/800], Train Loss: 0.6845, time: 851.60 +INFO:local_logger:Now training epoch 91. LR=0.000154 +INFO:local_logger:----- Epoch[090/800], Train Loss: 0.6839, time: 851.05 +INFO:local_logger:Now training epoch 91. LR=0.000154 +INFO:local_logger:----- Epoch[090/800], Train Loss: 0.6844, time: 851.17 +INFO:local_logger:Now training epoch 91. LR=0.000154 +INFO:local_logger:----- Epoch[090/800], Train Loss: 0.6844, time: 851.26 +INFO:local_logger:Now training epoch 91. LR=0.000154 +INFO:local_logger:----- Epoch[090/800], Train Loss: 0.6847, time: 851.26 +INFO:local_logger:Now training epoch 91. LR=0.000154 +INFO:local_logger:----- Epoch[090/800], Train Loss: 0.6844, time: 851.23 +INFO:local_logger:----- Epoch[090/800], Train Loss: 0.6841, time: 847.55 +INFO:local_logger:Now training epoch 91. LR=0.000154 +INFO:master_logger:----- Epoch[090/800], Train Loss: 0.6843, time: 847.55 +INFO:local_logger:----- Epoch[090/800], Train Loss: 0.6840, time: 851.24 +INFO:local_logger:Now training epoch 91. LR=0.000154 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-90-Loss-0.6841203801495528.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-90-Loss-0.6841203801495528.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-90-Loss-0.6841203801495528.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-90-Loss-0.6841203801495528.pdopt +INFO:local_logger:Now training epoch 91. LR=0.000154 +INFO:master_logger:Now training epoch 91. 
LR=0.000154 +INFO:local_logger:Epoch[091/800], Step[0000/0626], Avg Loss: 0.6826 +INFO:local_logger:Epoch[091/800], Step[0000/0626], Avg Loss: 0.6872 +INFO:local_logger:Epoch[091/800], Step[0000/0626], Avg Loss: 0.6756 +INFO:master_logger:Epoch[091/800], Step[0000/0626], Avg Loss: 0.6839 +INFO:local_logger:Epoch[091/800], Step[0000/0626], Avg Loss: 0.6859 +INFO:local_logger:Epoch[091/800], Step[0000/0626], Avg Loss: 0.6748 +INFO:local_logger:Epoch[091/800], Step[0000/0626], Avg Loss: 0.6793 +INFO:local_logger:Epoch[091/800], Step[0000/0626], Avg Loss: 0.6961 +INFO:local_logger:Epoch[091/800], Step[0000/0626], Avg Loss: 0.6897 +INFO:local_logger:Epoch[091/800], Step[0100/0626], Avg Loss: 0.6846 +INFO:local_logger:Epoch[091/800], Step[0100/0626], Avg Loss: 0.6830 +INFO:local_logger:Epoch[091/800], Step[0100/0626], Avg Loss: 0.6833 +INFO:local_logger:Epoch[091/800], Step[0100/0626], Avg Loss: 0.6830 +INFO:local_logger:Epoch[091/800], Step[0100/0626], Avg Loss: 0.6842 +INFO:local_logger:Epoch[091/800], Step[0100/0626], Avg Loss: 0.6833 +INFO:master_logger:Epoch[091/800], Step[0100/0626], Avg Loss: 0.6836 +INFO:local_logger:Epoch[091/800], Step[0100/0626], Avg Loss: 0.6840 +INFO:local_logger:Epoch[091/800], Step[0100/0626], Avg Loss: 0.6836 +INFO:local_logger:Epoch[091/800], Step[0200/0626], Avg Loss: 0.6834 +INFO:local_logger:Epoch[091/800], Step[0200/0626], Avg Loss: 0.6840 +INFO:local_logger:Epoch[091/800], Step[0200/0626], Avg Loss: 0.6842 +INFO:local_logger:Epoch[091/800], Step[0200/0626], Avg Loss: 0.6833 +INFO:local_logger:Epoch[091/800], Step[0200/0626], Avg Loss: 0.6843 +INFO:local_logger:Epoch[091/800], Step[0200/0626], Avg Loss: 0.6846 +INFO:local_logger:Epoch[091/800], Step[0200/0626], Avg Loss: 0.6844 +INFO:local_logger:Epoch[091/800], Step[0200/0626], Avg Loss: 0.6837 +INFO:master_logger:Epoch[091/800], Step[0200/0626], Avg Loss: 0.6840 +INFO:local_logger:Epoch[091/800], Step[0300/0626], Avg Loss: 0.6839 +INFO:local_logger:Epoch[091/800], Step[0300/0626], Avg Loss: 0.6850 +INFO:local_logger:Epoch[091/800], Step[0300/0626], Avg Loss: 0.6842 +INFO:local_logger:Epoch[091/800], Step[0300/0626], Avg Loss: 0.6845 +INFO:local_logger:Epoch[091/800], Step[0300/0626], Avg Loss: 0.6834 +INFO:master_logger:Epoch[091/800], Step[0300/0626], Avg Loss: 0.6843 +INFO:local_logger:Epoch[091/800], Step[0300/0626], Avg Loss: 0.6849 +INFO:local_logger:Epoch[091/800], Step[0300/0626], Avg Loss: 0.6847 +INFO:local_logger:Epoch[091/800], Step[0300/0626], Avg Loss: 0.6839 +INFO:local_logger:Epoch[091/800], Step[0400/0626], Avg Loss: 0.6839 +INFO:local_logger:Epoch[091/800], Step[0400/0626], Avg Loss: 0.6838 +INFO:local_logger:Epoch[091/800], Step[0400/0626], Avg Loss: 0.6847 +INFO:local_logger:Epoch[091/800], Step[0400/0626], Avg Loss: 0.6847 +INFO:master_logger:Epoch[091/800], Step[0400/0626], Avg Loss: 0.6842 +INFO:local_logger:Epoch[091/800], Step[0400/0626], Avg Loss: 0.6841 +INFO:local_logger:Epoch[091/800], Step[0400/0626], Avg Loss: 0.6835 +INFO:local_logger:Epoch[091/800], Step[0400/0626], Avg Loss: 0.6838 +INFO:local_logger:Epoch[091/800], Step[0400/0626], Avg Loss: 0.6848 +INFO:local_logger:Epoch[091/800], Step[0500/0626], Avg Loss: 0.6846 +INFO:local_logger:Epoch[091/800], Step[0500/0626], Avg Loss: 0.6837 +INFO:local_logger:Epoch[091/800], Step[0500/0626], Avg Loss: 0.6841 +INFO:local_logger:Epoch[091/800], Step[0500/0626], Avg Loss: 0.6849 +INFO:local_logger:Epoch[091/800], Step[0500/0626], Avg Loss: 0.6838 +INFO:local_logger:Epoch[091/800], Step[0500/0626], Avg Loss: 0.6840 
+INFO:master_logger:Epoch[091/800], Step[0500/0626], Avg Loss: 0.6842 +INFO:local_logger:Epoch[091/800], Step[0500/0626], Avg Loss: 0.6846 +INFO:local_logger:Epoch[091/800], Step[0500/0626], Avg Loss: 0.6838 +INFO:local_logger:Epoch[091/800], Step[0600/0626], Avg Loss: 0.6849 +INFO:local_logger:Epoch[091/800], Step[0600/0626], Avg Loss: 0.6845 +INFO:local_logger:Epoch[091/800], Step[0600/0626], Avg Loss: 0.6840 +INFO:local_logger:Epoch[091/800], Step[0600/0626], Avg Loss: 0.6839 +INFO:local_logger:Epoch[091/800], Step[0600/0626], Avg Loss: 0.6845 +INFO:local_logger:Epoch[091/800], Step[0600/0626], Avg Loss: 0.6838 +INFO:local_logger:Epoch[091/800], Step[0600/0626], Avg Loss: 0.6838 +INFO:local_logger:Epoch[091/800], Step[0600/0626], Avg Loss: 0.6842 +INFO:master_logger:Epoch[091/800], Step[0600/0626], Avg Loss: 0.6842 +INFO:local_logger:----- Epoch[091/800], Train Loss: 0.6843, time: 886.91 +INFO:local_logger:Now training epoch 92. LR=0.000154 +INFO:local_logger:----- Epoch[091/800], Train Loss: 0.6838, time: 886.90 +INFO:local_logger:Now training epoch 92. LR=0.000154 +INFO:local_logger:----- Epoch[091/800], Train Loss: 0.6848, time: 887.37 +INFO:local_logger:Now training epoch 92. LR=0.000154 +INFO:local_logger:----- Epoch[091/800], Train Loss: 0.6844, time: 887.49 +INFO:local_logger:Now training epoch 92. LR=0.000154 +INFO:local_logger:----- Epoch[091/800], Train Loss: 0.6842, time: 887.83 +INFO:local_logger:Now training epoch 92. LR=0.000154 +INFO:local_logger:----- Epoch[091/800], Train Loss: 0.6839, time: 887.30 +INFO:local_logger:Now training epoch 92. LR=0.000154 +INFO:local_logger:----- Epoch[091/800], Train Loss: 0.6839, time: 883.60 +INFO:master_logger:----- Epoch[091/800], Train Loss: 0.6841, time: 883.60 +INFO:local_logger:----- Epoch[091/800], Train Loss: 0.6838, time: 887.32 +INFO:local_logger:Now training epoch 92. LR=0.000154 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-91-Loss-0.6838902651592956.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-91-Loss-0.6838902651592956.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-91-Loss-0.6838902651592956.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-91-Loss-0.6838902651592956.pdopt +INFO:local_logger:Now training epoch 92. LR=0.000154 +INFO:master_logger:Now training epoch 92. 
LR=0.000154 +INFO:local_logger:Epoch[092/800], Step[0000/0626], Avg Loss: 0.6957 +INFO:local_logger:Epoch[092/800], Step[0000/0626], Avg Loss: 0.6676 +INFO:local_logger:Epoch[092/800], Step[0000/0626], Avg Loss: 0.6676 +INFO:local_logger:Epoch[092/800], Step[0000/0626], Avg Loss: 0.6866 +INFO:master_logger:Epoch[092/800], Step[0000/0626], Avg Loss: 0.6845 +INFO:local_logger:Epoch[092/800], Step[0000/0626], Avg Loss: 0.6922 +INFO:local_logger:Epoch[092/800], Step[0000/0626], Avg Loss: 0.6936 +INFO:local_logger:Epoch[092/800], Step[0000/0626], Avg Loss: 0.6858 +INFO:local_logger:Epoch[092/800], Step[0000/0626], Avg Loss: 0.6874 +INFO:local_logger:Epoch[092/800], Step[0100/0626], Avg Loss: 0.6844 +INFO:local_logger:Epoch[092/800], Step[0100/0626], Avg Loss: 0.6836 +INFO:local_logger:Epoch[092/800], Step[0100/0626], Avg Loss: 0.6841 +INFO:local_logger:Epoch[092/800], Step[0100/0626], Avg Loss: 0.6830 +INFO:local_logger:Epoch[092/800], Step[0100/0626], Avg Loss: 0.6834 +INFO:local_logger:Epoch[092/800], Step[0100/0626], Avg Loss: 0.6840 +INFO:master_logger:Epoch[092/800], Step[0100/0626], Avg Loss: 0.6839 +INFO:local_logger:Epoch[092/800], Step[0100/0626], Avg Loss: 0.6850 +INFO:local_logger:Epoch[092/800], Step[0100/0626], Avg Loss: 0.6836 +INFO:local_logger:Epoch[092/800], Step[0200/0626], Avg Loss: 0.6837 +INFO:local_logger:Epoch[092/800], Step[0200/0626], Avg Loss: 0.6834 +INFO:local_logger:Epoch[092/800], Step[0200/0626], Avg Loss: 0.6831 +INFO:master_logger:Epoch[092/800], Step[0200/0626], Avg Loss: 0.6836 +INFO:local_logger:Epoch[092/800], Step[0200/0626], Avg Loss: 0.6844 +INFO:local_logger:Epoch[092/800], Step[0200/0626], Avg Loss: 0.6832 +INFO:local_logger:Epoch[092/800], Step[0200/0626], Avg Loss: 0.6836 +INFO:local_logger:Epoch[092/800], Step[0200/0626], Avg Loss: 0.6829 +INFO:local_logger:Epoch[092/800], Step[0200/0626], Avg Loss: 0.6842 +INFO:local_logger:Epoch[092/800], Step[0300/0626], Avg Loss: 0.6832 +INFO:local_logger:Epoch[092/800], Step[0300/0626], Avg Loss: 0.6835 +INFO:local_logger:Epoch[092/800], Step[0300/0626], Avg Loss: 0.6831 +INFO:local_logger:Epoch[092/800], Step[0300/0626], Avg Loss: 0.6843 +INFO:local_logger:Epoch[092/800], Step[0300/0626], Avg Loss: 0.6841 +INFO:local_logger:Epoch[092/800], Step[0300/0626], Avg Loss: 0.6836 +INFO:local_logger:Epoch[092/800], Step[0300/0626], Avg Loss: 0.6835 +INFO:local_logger:Epoch[092/800], Step[0300/0626], Avg Loss: 0.6839 +INFO:master_logger:Epoch[092/800], Step[0300/0626], Avg Loss: 0.6836 +INFO:local_logger:Epoch[092/800], Step[0400/0626], Avg Loss: 0.6834 +INFO:local_logger:Epoch[092/800], Step[0400/0626], Avg Loss: 0.6836 +INFO:local_logger:Epoch[092/800], Step[0400/0626], Avg Loss: 0.6841 +INFO:local_logger:Epoch[092/800], Step[0400/0626], Avg Loss: 0.6841 +INFO:master_logger:Epoch[092/800], Step[0400/0626], Avg Loss: 0.6838 +INFO:local_logger:Epoch[092/800], Step[0400/0626], Avg Loss: 0.6839 +INFO:local_logger:Epoch[092/800], Step[0400/0626], Avg Loss: 0.6834 +INFO:local_logger:Epoch[092/800], Step[0400/0626], Avg Loss: 0.6836 +INFO:local_logger:Epoch[092/800], Step[0400/0626], Avg Loss: 0.6840 +INFO:local_logger:Epoch[092/800], Step[0500/0626], Avg Loss: 0.6837 +INFO:local_logger:Epoch[092/800], Step[0500/0626], Avg Loss: 0.6839 +INFO:local_logger:Epoch[092/800], Step[0500/0626], Avg Loss: 0.6835 +INFO:local_logger:Epoch[092/800], Step[0500/0626], Avg Loss: 0.6834 +INFO:local_logger:Epoch[092/800], Step[0500/0626], Avg Loss: 0.6834 +INFO:local_logger:Epoch[092/800], Step[0500/0626], Avg Loss: 0.6840 
+INFO:local_logger:Epoch[092/800], Step[0500/0626], Avg Loss: 0.6840 +INFO:master_logger:Epoch[092/800], Step[0500/0626], Avg Loss: 0.6838 +INFO:local_logger:Epoch[092/800], Step[0500/0626], Avg Loss: 0.6842 +INFO:local_logger:Epoch[092/800], Step[0600/0626], Avg Loss: 0.6834 +INFO:local_logger:Epoch[092/800], Step[0600/0626], Avg Loss: 0.6834 +INFO:local_logger:Epoch[092/800], Step[0600/0626], Avg Loss: 0.6840 +INFO:local_logger:Epoch[092/800], Step[0600/0626], Avg Loss: 0.6839 +INFO:master_logger:Epoch[092/800], Step[0600/0626], Avg Loss: 0.6837 +INFO:local_logger:Epoch[092/800], Step[0600/0626], Avg Loss: 0.6840 +INFO:local_logger:Epoch[092/800], Step[0600/0626], Avg Loss: 0.6838 +INFO:local_logger:Epoch[092/800], Step[0600/0626], Avg Loss: 0.6836 +INFO:local_logger:Epoch[092/800], Step[0600/0626], Avg Loss: 0.6836 +INFO:local_logger:----- Epoch[092/800], Train Loss: 0.6839, time: 858.55 +INFO:local_logger:Now training epoch 93. LR=0.000154 +INFO:local_logger:----- Epoch[092/800], Train Loss: 0.6839, time: 858.56 +INFO:local_logger:Now training epoch 93. LR=0.000154 +INFO:local_logger:----- Epoch[092/800], Train Loss: 0.6836, time: 858.97 +INFO:local_logger:Now training epoch 93. LR=0.000154 +INFO:local_logger:----- Epoch[092/800], Train Loss: 0.6833, time: 858.75 +INFO:local_logger:Now training epoch 93. LR=0.000154 +INFO:local_logger:----- Epoch[092/800], Train Loss: 0.6837, time: 858.76 +INFO:local_logger:Now training epoch 93. LR=0.000154 +INFO:local_logger:----- Epoch[092/800], Train Loss: 0.6835, time: 858.85 +INFO:local_logger:Now training epoch 93. LR=0.000154 +INFO:local_logger:----- Epoch[092/800], Train Loss: 0.6841, time: 855.23 +INFO:master_logger:----- Epoch[092/800], Train Loss: 0.6837, time: 855.23 +INFO:local_logger:----- Epoch[092/800], Train Loss: 0.6836, time: 859.34 +INFO:local_logger:Now training epoch 93. LR=0.000154 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-92-Loss-0.6840875268930822.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-92-Loss-0.6840875268930822.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-92-Loss-0.6840875268930822.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-92-Loss-0.6840875268930822.pdopt +INFO:local_logger:Now training epoch 93. LR=0.000154 +INFO:master_logger:Now training epoch 93. 
LR=0.000154 +INFO:local_logger:Epoch[093/800], Step[0000/0626], Avg Loss: 0.6662 +INFO:local_logger:Epoch[093/800], Step[0000/0626], Avg Loss: 0.6933 +INFO:local_logger:Epoch[093/800], Step[0000/0626], Avg Loss: 0.6938 +INFO:local_logger:Epoch[093/800], Step[0000/0626], Avg Loss: 0.6859 +INFO:master_logger:Epoch[093/800], Step[0000/0626], Avg Loss: 0.6846 +INFO:local_logger:Epoch[093/800], Step[0000/0626], Avg Loss: 0.6770 +INFO:local_logger:Epoch[093/800], Step[0000/0626], Avg Loss: 0.6934 +INFO:local_logger:Epoch[093/800], Step[0000/0626], Avg Loss: 0.6802 +INFO:local_logger:Epoch[093/800], Step[0000/0626], Avg Loss: 0.6873 +INFO:local_logger:Epoch[093/800], Step[0100/0626], Avg Loss: 0.6837 +INFO:local_logger:Epoch[093/800], Step[0100/0626], Avg Loss: 0.6839 +INFO:local_logger:Epoch[093/800], Step[0100/0626], Avg Loss: 0.6828 +INFO:local_logger:Epoch[093/800], Step[0100/0626], Avg Loss: 0.6824 +INFO:local_logger:Epoch[093/800], Step[0100/0626], Avg Loss: 0.6837 +INFO:local_logger:Epoch[093/800], Step[0100/0626], Avg Loss: 0.6848 +INFO:local_logger:Epoch[093/800], Step[0100/0626], Avg Loss: 0.6832 +INFO:master_logger:Epoch[093/800], Step[0100/0626], Avg Loss: 0.6835 +INFO:local_logger:Epoch[093/800], Step[0100/0626], Avg Loss: 0.6834 +INFO:local_logger:Epoch[093/800], Step[0200/0626], Avg Loss: 0.6835 +INFO:local_logger:Epoch[093/800], Step[0200/0626], Avg Loss: 0.6838 +INFO:local_logger:Epoch[093/800], Step[0200/0626], Avg Loss: 0.6828 +INFO:local_logger:Epoch[093/800], Step[0200/0626], Avg Loss: 0.6835 +INFO:local_logger:Epoch[093/800], Step[0200/0626], Avg Loss: 0.6844 +INFO:local_logger:Epoch[093/800], Step[0200/0626], Avg Loss: 0.6835 +INFO:master_logger:Epoch[093/800], Step[0200/0626], Avg Loss: 0.6836 +INFO:local_logger:Epoch[093/800], Step[0200/0626], Avg Loss: 0.6837 +INFO:local_logger:Epoch[093/800], Step[0200/0626], Avg Loss: 0.6838 +INFO:local_logger:Epoch[093/800], Step[0300/0626], Avg Loss: 0.6830 +INFO:local_logger:Epoch[093/800], Step[0300/0626], Avg Loss: 0.6832 +INFO:local_logger:Epoch[093/800], Step[0300/0626], Avg Loss: 0.6836 +INFO:local_logger:Epoch[093/800], Step[0300/0626], Avg Loss: 0.6834 +INFO:local_logger:Epoch[093/800], Step[0300/0626], Avg Loss: 0.6840 +INFO:local_logger:Epoch[093/800], Step[0300/0626], Avg Loss: 0.6838 +INFO:local_logger:Epoch[093/800], Step[0300/0626], Avg Loss: 0.6835 +INFO:local_logger:Epoch[093/800], Step[0300/0626], Avg Loss: 0.6831 +INFO:master_logger:Epoch[093/800], Step[0300/0626], Avg Loss: 0.6835 +INFO:local_logger:Epoch[093/800], Step[0400/0626], Avg Loss: 0.6832 +INFO:local_logger:Epoch[093/800], Step[0400/0626], Avg Loss: 0.6839 +INFO:local_logger:Epoch[093/800], Step[0400/0626], Avg Loss: 0.6835 +INFO:local_logger:Epoch[093/800], Step[0400/0626], Avg Loss: 0.6834 +INFO:local_logger:Epoch[093/800], Step[0400/0626], Avg Loss: 0.6834 +INFO:local_logger:Epoch[093/800], Step[0400/0626], Avg Loss: 0.6829 +INFO:local_logger:Epoch[093/800], Step[0400/0626], Avg Loss: 0.6840 +INFO:local_logger:Epoch[093/800], Step[0400/0626], Avg Loss: 0.6830 +INFO:master_logger:Epoch[093/800], Step[0400/0626], Avg Loss: 0.6834 +INFO:local_logger:Epoch[093/800], Step[0500/0626], Avg Loss: 0.6839 +INFO:local_logger:Epoch[093/800], Step[0500/0626], Avg Loss: 0.6836 +INFO:local_logger:Epoch[093/800], Step[0500/0626], Avg Loss: 0.6835 +INFO:local_logger:Epoch[093/800], Step[0500/0626], Avg Loss: 0.6832 +INFO:local_logger:Epoch[093/800], Step[0500/0626], Avg Loss: 0.6834 +INFO:local_logger:Epoch[093/800], Step[0500/0626], Avg Loss: 0.6831 
+INFO:local_logger:Epoch[093/800], Step[0500/0626], Avg Loss: 0.6837 +INFO:local_logger:Epoch[093/800], Step[0500/0626], Avg Loss: 0.6830 +INFO:master_logger:Epoch[093/800], Step[0500/0626], Avg Loss: 0.6834 +INFO:local_logger:Epoch[093/800], Step[0600/0626], Avg Loss: 0.6832 +INFO:local_logger:Epoch[093/800], Step[0600/0626], Avg Loss: 0.6836 +INFO:local_logger:Epoch[093/800], Step[0600/0626], Avg Loss: 0.6833 +INFO:local_logger:Epoch[093/800], Step[0600/0626], Avg Loss: 0.6831 +INFO:local_logger:Epoch[093/800], Step[0600/0626], Avg Loss: 0.6835 +INFO:local_logger:Epoch[093/800], Step[0600/0626], Avg Loss: 0.6827 +INFO:master_logger:Epoch[093/800], Step[0600/0626], Avg Loss: 0.6833 +INFO:local_logger:Epoch[093/800], Step[0600/0626], Avg Loss: 0.6836 +INFO:local_logger:Epoch[093/800], Step[0600/0626], Avg Loss: 0.6835 +INFO:local_logger:----- Epoch[093/800], Train Loss: 0.6837, time: 883.02 +INFO:local_logger:Now training epoch 94. LR=0.000154 +INFO:local_logger:----- Epoch[093/800], Train Loss: 0.6832, time: 882.93 +INFO:local_logger:Now training epoch 94. LR=0.000154 +INFO:local_logger:----- Epoch[093/800], Train Loss: 0.6827, time: 879.60 +INFO:master_logger:----- Epoch[093/800], Train Loss: 0.6834, time: 879.60 +INFO:local_logger:----- Epoch[093/800], Train Loss: 0.6832, time: 883.71 +INFO:local_logger:Now training epoch 94. LR=0.000154 +INFO:local_logger:----- Epoch[093/800], Train Loss: 0.6834, time: 883.93 +INFO:local_logger:Now training epoch 94. LR=0.000154 +INFO:local_logger:----- Epoch[093/800], Train Loss: 0.6837, time: 883.92 +INFO:local_logger:Now training epoch 94. LR=0.000154 +INFO:local_logger:----- Epoch[093/800], Train Loss: 0.6836, time: 884.00 +INFO:local_logger:Now training epoch 94. LR=0.000154 +INFO:local_logger:----- Epoch[093/800], Train Loss: 0.6834, time: 883.71 +INFO:local_logger:Now training epoch 94. LR=0.000154 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-93-Loss-0.6827037774682657.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-93-Loss-0.6827037774682657.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-93-Loss-0.6827037774682657.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-93-Loss-0.6827037774682657.pdopt +INFO:local_logger:Now training epoch 94. LR=0.000154 +INFO:master_logger:Now training epoch 94. 
LR=0.000154 +INFO:local_logger:Epoch[094/800], Step[0000/0626], Avg Loss: 0.6799 +INFO:local_logger:Epoch[094/800], Step[0000/0626], Avg Loss: 0.6772 +INFO:local_logger:Epoch[094/800], Step[0000/0626], Avg Loss: 0.6722 +INFO:master_logger:Epoch[094/800], Step[0000/0626], Avg Loss: 0.6832 +INFO:local_logger:Epoch[094/800], Step[0000/0626], Avg Loss: 0.6904 +INFO:local_logger:Epoch[094/800], Step[0000/0626], Avg Loss: 0.6963 +INFO:local_logger:Epoch[094/800], Step[0000/0626], Avg Loss: 0.6866 +INFO:local_logger:Epoch[094/800], Step[0000/0626], Avg Loss: 0.6815 +INFO:local_logger:Epoch[094/800], Step[0000/0626], Avg Loss: 0.6810 +INFO:local_logger:Epoch[094/800], Step[0100/0626], Avg Loss: 0.6826 +INFO:local_logger:Epoch[094/800], Step[0100/0626], Avg Loss: 0.6834 +INFO:master_logger:Epoch[094/800], Step[0100/0626], Avg Loss: 0.6831 +INFO:local_logger:Epoch[094/800], Step[0100/0626], Avg Loss: 0.6831 +INFO:local_logger:Epoch[094/800], Step[0100/0626], Avg Loss: 0.6840 +INFO:local_logger:Epoch[094/800], Step[0100/0626], Avg Loss: 0.6837 +INFO:local_logger:Epoch[094/800], Step[0100/0626], Avg Loss: 0.6828 +INFO:local_logger:Epoch[094/800], Step[0100/0626], Avg Loss: 0.6823 +INFO:local_logger:Epoch[094/800], Step[0100/0626], Avg Loss: 0.6828 +INFO:local_logger:Epoch[094/800], Step[0200/0626], Avg Loss: 0.6837 +INFO:local_logger:Epoch[094/800], Step[0200/0626], Avg Loss: 0.6834 +INFO:local_logger:Epoch[094/800], Step[0200/0626], Avg Loss: 0.6847 +INFO:local_logger:Epoch[094/800], Step[0200/0626], Avg Loss: 0.6837 +INFO:local_logger:Epoch[094/800], Step[0200/0626], Avg Loss: 0.6821 +INFO:local_logger:Epoch[094/800], Step[0200/0626], Avg Loss: 0.6828 +INFO:master_logger:Epoch[094/800], Step[0200/0626], Avg Loss: 0.6831 +INFO:local_logger:Epoch[094/800], Step[0200/0626], Avg Loss: 0.6820 +INFO:local_logger:Epoch[094/800], Step[0200/0626], Avg Loss: 0.6825 +INFO:local_logger:Epoch[094/800], Step[0300/0626], Avg Loss: 0.6824 +INFO:local_logger:Epoch[094/800], Step[0300/0626], Avg Loss: 0.6849 +INFO:local_logger:Epoch[094/800], Step[0300/0626], Avg Loss: 0.6830 +INFO:local_logger:Epoch[094/800], Step[0300/0626], Avg Loss: 0.6835 +INFO:local_logger:Epoch[094/800], Step[0300/0626], Avg Loss: 0.6825 +INFO:local_logger:Epoch[094/800], Step[0300/0626], Avg Loss: 0.6831 +INFO:local_logger:Epoch[094/800], Step[0300/0626], Avg Loss: 0.6828 +INFO:master_logger:Epoch[094/800], Step[0300/0626], Avg Loss: 0.6832 +INFO:local_logger:Epoch[094/800], Step[0300/0626], Avg Loss: 0.6832 +INFO:local_logger:Epoch[094/800], Step[0400/0626], Avg Loss: 0.6825 +INFO:local_logger:Epoch[094/800], Step[0400/0626], Avg Loss: 0.6830 +INFO:local_logger:Epoch[094/800], Step[0400/0626], Avg Loss: 0.6824 +INFO:local_logger:Epoch[094/800], Step[0400/0626], Avg Loss: 0.6845 +INFO:local_logger:Epoch[094/800], Step[0400/0626], Avg Loss: 0.6825 +INFO:local_logger:Epoch[094/800], Step[0400/0626], Avg Loss: 0.6834 +INFO:local_logger:Epoch[094/800], Step[0400/0626], Avg Loss: 0.6836 +INFO:local_logger:Epoch[094/800], Step[0400/0626], Avg Loss: 0.6831 +INFO:master_logger:Epoch[094/800], Step[0400/0626], Avg Loss: 0.6831 +INFO:local_logger:Epoch[094/800], Step[0500/0626], Avg Loss: 0.6833 +INFO:local_logger:Epoch[094/800], Step[0500/0626], Avg Loss: 0.6826 +INFO:local_logger:Epoch[094/800], Step[0500/0626], Avg Loss: 0.6825 +INFO:local_logger:Epoch[094/800], Step[0500/0626], Avg Loss: 0.6835 +INFO:local_logger:Epoch[094/800], Step[0500/0626], Avg Loss: 0.6841 +INFO:local_logger:Epoch[094/800], Step[0500/0626], Avg Loss: 0.6826 
+INFO:local_logger:Epoch[094/800], Step[0500/0626], Avg Loss: 0.6829 +INFO:local_logger:Epoch[094/800], Step[0500/0626], Avg Loss: 0.6832 +INFO:master_logger:Epoch[094/800], Step[0500/0626], Avg Loss: 0.6831 +INFO:local_logger:Epoch[094/800], Step[0600/0626], Avg Loss: 0.6835 +INFO:local_logger:Epoch[094/800], Step[0600/0626], Avg Loss: 0.6834 +INFO:local_logger:Epoch[094/800], Step[0600/0626], Avg Loss: 0.6840 +INFO:local_logger:Epoch[094/800], Step[0600/0626], Avg Loss: 0.6830 +INFO:local_logger:Epoch[094/800], Step[0600/0626], Avg Loss: 0.6830 +INFO:local_logger:Epoch[094/800], Step[0600/0626], Avg Loss: 0.6828 +INFO:master_logger:Epoch[094/800], Step[0600/0626], Avg Loss: 0.6831 +INFO:local_logger:Epoch[094/800], Step[0600/0626], Avg Loss: 0.6827 +INFO:local_logger:Epoch[094/800], Step[0600/0626], Avg Loss: 0.6825 +INFO:local_logger:----- Epoch[094/800], Train Loss: 0.6835, time: 860.64 +INFO:local_logger:Now training epoch 95. LR=0.000155 +INFO:local_logger:----- Epoch[094/800], Train Loss: 0.6829, time: 861.10 +INFO:local_logger:Now training epoch 95. LR=0.000155 +INFO:local_logger:----- Epoch[094/800], Train Loss: 0.6834, time: 861.04 +INFO:local_logger:Now training epoch 95. LR=0.000155 +INFO:local_logger:----- Epoch[094/800], Train Loss: 0.6827, time: 861.88 +INFO:local_logger:Now training epoch 95. LR=0.000155 +INFO:local_logger:----- Epoch[094/800], Train Loss: 0.6838, time: 861.14 +INFO:local_logger:Now training epoch 95. LR=0.000155 +INFO:local_logger:----- Epoch[094/800], Train Loss: 0.6830, time: 857.69 +INFO:master_logger:----- Epoch[094/800], Train Loss: 0.6831, time: 857.69 +INFO:local_logger:----- Epoch[094/800], Train Loss: 0.6825, time: 861.06 +INFO:local_logger:Now training epoch 95. LR=0.000155 +INFO:local_logger:----- Epoch[094/800], Train Loss: 0.6830, time: 861.17 +INFO:local_logger:Now training epoch 95. LR=0.000155 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-94-Loss-0.6830039001247509.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-94-Loss-0.6830039001247509.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-94-Loss-0.6830039001247509.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-94-Loss-0.6830039001247509.pdopt +INFO:local_logger:Now training epoch 95. LR=0.000155 +INFO:master_logger:Now training epoch 95. 
LR=0.000155 +INFO:local_logger:Epoch[095/800], Step[0000/0626], Avg Loss: 0.6832 +INFO:master_logger:Epoch[095/800], Step[0000/0626], Avg Loss: 0.6819 +INFO:local_logger:Epoch[095/800], Step[0000/0626], Avg Loss: 0.6771 +INFO:local_logger:Epoch[095/800], Step[0000/0626], Avg Loss: 0.6843 +INFO:local_logger:Epoch[095/800], Step[0000/0626], Avg Loss: 0.6798 +INFO:local_logger:Epoch[095/800], Step[0000/0626], Avg Loss: 0.6706 +INFO:local_logger:Epoch[095/800], Step[0000/0626], Avg Loss: 0.6905 +INFO:local_logger:Epoch[095/800], Step[0000/0626], Avg Loss: 0.6843 +INFO:local_logger:Epoch[095/800], Step[0000/0626], Avg Loss: 0.6855 +INFO:local_logger:Epoch[095/800], Step[0100/0626], Avg Loss: 0.6831 +INFO:local_logger:Epoch[095/800], Step[0100/0626], Avg Loss: 0.6835 +INFO:local_logger:Epoch[095/800], Step[0100/0626], Avg Loss: 0.6823 +INFO:local_logger:Epoch[095/800], Step[0100/0626], Avg Loss: 0.6826 +INFO:master_logger:Epoch[095/800], Step[0100/0626], Avg Loss: 0.6829 +INFO:local_logger:Epoch[095/800], Step[0100/0626], Avg Loss: 0.6827 +INFO:local_logger:Epoch[095/800], Step[0100/0626], Avg Loss: 0.6827 +INFO:local_logger:Epoch[095/800], Step[0100/0626], Avg Loss: 0.6826 +INFO:local_logger:Epoch[095/800], Step[0100/0626], Avg Loss: 0.6835 +INFO:local_logger:Epoch[095/800], Step[0200/0626], Avg Loss: 0.6833 +INFO:local_logger:Epoch[095/800], Step[0200/0626], Avg Loss: 0.6826 +INFO:local_logger:Epoch[095/800], Step[0200/0626], Avg Loss: 0.6821 +INFO:local_logger:Epoch[095/800], Step[0200/0626], Avg Loss: 0.6829 +INFO:local_logger:Epoch[095/800], Step[0200/0626], Avg Loss: 0.6836 +INFO:local_logger:Epoch[095/800], Step[0200/0626], Avg Loss: 0.6828 +INFO:local_logger:Epoch[095/800], Step[0200/0626], Avg Loss: 0.6829 +INFO:master_logger:Epoch[095/800], Step[0200/0626], Avg Loss: 0.6828 +INFO:local_logger:Epoch[095/800], Step[0200/0626], Avg Loss: 0.6825 +INFO:local_logger:Epoch[095/800], Step[0300/0626], Avg Loss: 0.6830 +INFO:local_logger:Epoch[095/800], Step[0300/0626], Avg Loss: 0.6826 +INFO:local_logger:Epoch[095/800], Step[0300/0626], Avg Loss: 0.6825 +INFO:local_logger:Epoch[095/800], Step[0300/0626], Avg Loss: 0.6827 +INFO:local_logger:Epoch[095/800], Step[0300/0626], Avg Loss: 0.6830 +INFO:local_logger:Epoch[095/800], Step[0300/0626], Avg Loss: 0.6833 +INFO:local_logger:Epoch[095/800], Step[0300/0626], Avg Loss: 0.6820 +INFO:master_logger:Epoch[095/800], Step[0300/0626], Avg Loss: 0.6828 +INFO:local_logger:Epoch[095/800], Step[0300/0626], Avg Loss: 0.6835 +INFO:local_logger:Epoch[095/800], Step[0400/0626], Avg Loss: 0.6825 +INFO:local_logger:Epoch[095/800], Step[0400/0626], Avg Loss: 0.6832 +INFO:local_logger:Epoch[095/800], Step[0400/0626], Avg Loss: 0.6828 +INFO:local_logger:Epoch[095/800], Step[0400/0626], Avg Loss: 0.6829 +INFO:local_logger:Epoch[095/800], Step[0400/0626], Avg Loss: 0.6832 +INFO:local_logger:Epoch[095/800], Step[0400/0626], Avg Loss: 0.6825 +INFO:local_logger:Epoch[095/800], Step[0400/0626], Avg Loss: 0.6830 +INFO:local_logger:Epoch[095/800], Step[0400/0626], Avg Loss: 0.6826 +INFO:master_logger:Epoch[095/800], Step[0400/0626], Avg Loss: 0.6828 +INFO:local_logger:Epoch[095/800], Step[0500/0626], Avg Loss: 0.6832 +INFO:local_logger:Epoch[095/800], Step[0500/0626], Avg Loss: 0.6825 +INFO:local_logger:Epoch[095/800], Step[0500/0626], Avg Loss: 0.6826 +INFO:local_logger:Epoch[095/800], Step[0500/0626], Avg Loss: 0.6828 +INFO:master_logger:Epoch[095/800], Step[0500/0626], Avg Loss: 0.6828 +INFO:local_logger:Epoch[095/800], Step[0500/0626], Avg Loss: 0.6827 
+INFO:local_logger:Epoch[095/800], Step[0500/0626], Avg Loss: 0.6829 +INFO:local_logger:Epoch[095/800], Step[0500/0626], Avg Loss: 0.6823 +INFO:local_logger:Epoch[095/800], Step[0500/0626], Avg Loss: 0.6830 +INFO:local_logger:Epoch[095/800], Step[0600/0626], Avg Loss: 0.6831 +INFO:local_logger:Epoch[095/800], Step[0600/0626], Avg Loss: 0.6828 +INFO:local_logger:Epoch[095/800], Step[0600/0626], Avg Loss: 0.6831 +INFO:local_logger:Epoch[095/800], Step[0600/0626], Avg Loss: 0.6827 +INFO:local_logger:Epoch[095/800], Step[0600/0626], Avg Loss: 0.6825 +INFO:local_logger:Epoch[095/800], Step[0600/0626], Avg Loss: 0.6828 +INFO:local_logger:Epoch[095/800], Step[0600/0626], Avg Loss: 0.6825 +INFO:master_logger:Epoch[095/800], Step[0600/0626], Avg Loss: 0.6828 +INFO:local_logger:Epoch[095/800], Step[0600/0626], Avg Loss: 0.6826 +INFO:local_logger:----- Epoch[095/800], Train Loss: 0.6831, time: 887.46 +INFO:local_logger:Now training epoch 96. LR=0.000155 +INFO:local_logger:----- Epoch[095/800], Train Loss: 0.6830, time: 886.49 +INFO:local_logger:Now training epoch 96. LR=0.000155 +INFO:local_logger:----- Epoch[095/800], Train Loss: 0.6825, time: 886.66 +INFO:local_logger:Now training epoch 96. LR=0.000155 +INFO:local_logger:----- Epoch[095/800], Train Loss: 0.6825, time: 886.64 +INFO:local_logger:Now training epoch 96. LR=0.000155 +INFO:local_logger:----- Epoch[095/800], Train Loss: 0.6829, time: 886.77 +INFO:local_logger:Now training epoch 96. LR=0.000155 +INFO:local_logger:----- Epoch[095/800], Train Loss: 0.6826, time: 886.77 +INFO:local_logger:Now training epoch 96. LR=0.000155 +INFO:local_logger:----- Epoch[095/800], Train Loss: 0.6826, time: 883.07 +INFO:master_logger:----- Epoch[095/800], Train Loss: 0.6827, time: 883.07 +INFO:local_logger:----- Epoch[095/800], Train Loss: 0.6827, time: 886.80 +INFO:local_logger:Now training epoch 96. LR=0.000155 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-95-Loss-0.6825694100624208.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-95-Loss-0.6825694100624208.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-95-Loss-0.6825694100624208.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-95-Loss-0.6825694100624208.pdopt +INFO:local_logger:Now training epoch 96. LR=0.000155 +INFO:master_logger:Now training epoch 96. 
LR=0.000155 +INFO:local_logger:Epoch[096/800], Step[0000/0626], Avg Loss: 0.6925 +INFO:local_logger:Epoch[096/800], Step[0000/0626], Avg Loss: 0.6757 +INFO:local_logger:Epoch[096/800], Step[0000/0626], Avg Loss: 0.6684 +INFO:master_logger:Epoch[096/800], Step[0000/0626], Avg Loss: 0.6799 +INFO:local_logger:Epoch[096/800], Step[0000/0626], Avg Loss: 0.6804 +INFO:local_logger:Epoch[096/800], Step[0000/0626], Avg Loss: 0.6683 +INFO:local_logger:Epoch[096/800], Step[0000/0626], Avg Loss: 0.6801 +INFO:local_logger:Epoch[096/800], Step[0000/0626], Avg Loss: 0.6900 +INFO:local_logger:Epoch[096/800], Step[0000/0626], Avg Loss: 0.6837 +INFO:local_logger:Epoch[096/800], Step[0100/0626], Avg Loss: 0.6819 +INFO:local_logger:Epoch[096/800], Step[0100/0626], Avg Loss: 0.6818 +INFO:local_logger:Epoch[096/800], Step[0100/0626], Avg Loss: 0.6836 +INFO:master_logger:Epoch[096/800], Step[0100/0626], Avg Loss: 0.6825 +INFO:local_logger:Epoch[096/800], Step[0100/0626], Avg Loss: 0.6825 +INFO:local_logger:Epoch[096/800], Step[0100/0626], Avg Loss: 0.6819 +INFO:local_logger:Epoch[096/800], Step[0100/0626], Avg Loss: 0.6841 +INFO:local_logger:Epoch[096/800], Step[0100/0626], Avg Loss: 0.6827 +INFO:local_logger:Epoch[096/800], Step[0100/0626], Avg Loss: 0.6813 +INFO:local_logger:Epoch[096/800], Step[0200/0626], Avg Loss: 0.6822 +INFO:local_logger:Epoch[096/800], Step[0200/0626], Avg Loss: 0.6816 +INFO:local_logger:Epoch[096/800], Step[0200/0626], Avg Loss: 0.6816 +INFO:local_logger:Epoch[096/800], Step[0200/0626], Avg Loss: 0.6831 +INFO:local_logger:Epoch[096/800], Step[0200/0626], Avg Loss: 0.6830 +INFO:local_logger:Epoch[096/800], Step[0200/0626], Avg Loss: 0.6825 +INFO:master_logger:Epoch[096/800], Step[0200/0626], Avg Loss: 0.6823 +INFO:local_logger:Epoch[096/800], Step[0200/0626], Avg Loss: 0.6822 +INFO:local_logger:Epoch[096/800], Step[0200/0626], Avg Loss: 0.6821 +INFO:local_logger:Epoch[096/800], Step[0300/0626], Avg Loss: 0.6829 +INFO:local_logger:Epoch[096/800], Step[0300/0626], Avg Loss: 0.6821 +INFO:local_logger:Epoch[096/800], Step[0300/0626], Avg Loss: 0.6823 +INFO:local_logger:Epoch[096/800], Step[0300/0626], Avg Loss: 0.6830 +INFO:local_logger:Epoch[096/800], Step[0300/0626], Avg Loss: 0.6819 +INFO:master_logger:Epoch[096/800], Step[0300/0626], Avg Loss: 0.6824 +INFO:local_logger:Epoch[096/800], Step[0300/0626], Avg Loss: 0.6824 +INFO:local_logger:Epoch[096/800], Step[0300/0626], Avg Loss: 0.6827 +INFO:local_logger:Epoch[096/800], Step[0300/0626], Avg Loss: 0.6819 +INFO:local_logger:Epoch[096/800], Step[0400/0626], Avg Loss: 0.6827 +INFO:local_logger:Epoch[096/800], Step[0400/0626], Avg Loss: 0.6825 +INFO:local_logger:Epoch[096/800], Step[0400/0626], Avg Loss: 0.6819 +INFO:local_logger:Epoch[096/800], Step[0400/0626], Avg Loss: 0.6822 +INFO:local_logger:Epoch[096/800], Step[0400/0626], Avg Loss: 0.6829 +INFO:local_logger:Epoch[096/800], Step[0400/0626], Avg Loss: 0.6825 +INFO:local_logger:Epoch[096/800], Step[0400/0626], Avg Loss: 0.6822 +INFO:master_logger:Epoch[096/800], Step[0400/0626], Avg Loss: 0.6824 +INFO:local_logger:Epoch[096/800], Step[0400/0626], Avg Loss: 0.6820 +INFO:local_logger:Epoch[096/800], Step[0500/0626], Avg Loss: 0.6823 +INFO:local_logger:Epoch[096/800], Step[0500/0626], Avg Loss: 0.6828 +INFO:local_logger:Epoch[096/800], Step[0500/0626], Avg Loss: 0.6826 +INFO:local_logger:Epoch[096/800], Step[0500/0626], Avg Loss: 0.6822 +INFO:local_logger:Epoch[096/800], Step[0500/0626], Avg Loss: 0.6822 +INFO:local_logger:Epoch[096/800], Step[0500/0626], Avg Loss: 0.6821 
+INFO:master_logger:Epoch[096/800], Step[0500/0626], Avg Loss: 0.6823 +INFO:local_logger:Epoch[096/800], Step[0500/0626], Avg Loss: 0.6825 +INFO:local_logger:Epoch[096/800], Step[0500/0626], Avg Loss: 0.6819 +INFO:local_logger:Epoch[096/800], Step[0600/0626], Avg Loss: 0.6826 +INFO:local_logger:Epoch[096/800], Step[0600/0626], Avg Loss: 0.6822 +INFO:local_logger:Epoch[096/800], Step[0600/0626], Avg Loss: 0.6821 +INFO:local_logger:Epoch[096/800], Step[0600/0626], Avg Loss: 0.6826 +INFO:local_logger:Epoch[096/800], Step[0600/0626], Avg Loss: 0.6820 +INFO:local_logger:Epoch[096/800], Step[0600/0626], Avg Loss: 0.6823 +INFO:local_logger:Epoch[096/800], Step[0600/0626], Avg Loss: 0.6825 +INFO:local_logger:Epoch[096/800], Step[0600/0626], Avg Loss: 0.6827 +INFO:master_logger:Epoch[096/800], Step[0600/0626], Avg Loss: 0.6824 +INFO:local_logger:----- Epoch[096/800], Train Loss: 0.6821, time: 868.69 +INFO:local_logger:Now training epoch 97. LR=0.000155 +INFO:local_logger:----- Epoch[096/800], Train Loss: 0.6821, time: 869.09 +INFO:local_logger:Now training epoch 97. LR=0.000155 +INFO:local_logger:----- Epoch[096/800], Train Loss: 0.6823, time: 868.99 +INFO:local_logger:Now training epoch 97. LR=0.000155 +INFO:local_logger:----- Epoch[096/800], Train Loss: 0.6826, time: 868.81 +INFO:local_logger:Now training epoch 97. LR=0.000155 +INFO:local_logger:----- Epoch[096/800], Train Loss: 0.6824, time: 868.69 +INFO:local_logger:----- Epoch[096/800], Train Loss: 0.6827, time: 864.96 +INFO:local_logger:Now training epoch 97. LR=0.000155 +INFO:master_logger:----- Epoch[096/800], Train Loss: 0.6824, time: 864.96 +INFO:local_logger:----- Epoch[096/800], Train Loss: 0.6828, time: 868.82 +INFO:local_logger:Now training epoch 97. LR=0.000155 +INFO:local_logger:----- Epoch[096/800], Train Loss: 0.6826, time: 868.66 +INFO:local_logger:Now training epoch 97. LR=0.000155 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-96-Loss-0.6826821214191926.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-96-Loss-0.6826821214191926.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-96-Loss-0.6826821214191926.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-96-Loss-0.6826821214191926.pdopt +INFO:local_logger:Now training epoch 97. LR=0.000155 +INFO:master_logger:Now training epoch 97. 
LR=0.000155 +INFO:local_logger:Epoch[097/800], Step[0000/0626], Avg Loss: 0.6880 +INFO:master_logger:Epoch[097/800], Step[0000/0626], Avg Loss: 0.6850 +INFO:local_logger:Epoch[097/800], Step[0000/0626], Avg Loss: 0.6818 +INFO:local_logger:Epoch[097/800], Step[0000/0626], Avg Loss: 0.6940 +INFO:local_logger:Epoch[097/800], Step[0000/0626], Avg Loss: 0.6844 +INFO:local_logger:Epoch[097/800], Step[0000/0626], Avg Loss: 0.6810 +INFO:local_logger:Epoch[097/800], Step[0000/0626], Avg Loss: 0.6829 +INFO:local_logger:Epoch[097/800], Step[0000/0626], Avg Loss: 0.6755 +INFO:local_logger:Epoch[097/800], Step[0000/0626], Avg Loss: 0.6928 +INFO:local_logger:Epoch[097/800], Step[0100/0626], Avg Loss: 0.6828 +INFO:local_logger:Epoch[097/800], Step[0100/0626], Avg Loss: 0.6817 +INFO:local_logger:Epoch[097/800], Step[0100/0626], Avg Loss: 0.6830 +INFO:local_logger:Epoch[097/800], Step[0100/0626], Avg Loss: 0.6825 +INFO:local_logger:Epoch[097/800], Step[0100/0626], Avg Loss: 0.6822 +INFO:master_logger:Epoch[097/800], Step[0100/0626], Avg Loss: 0.6825 +INFO:local_logger:Epoch[097/800], Step[0100/0626], Avg Loss: 0.6829 +INFO:local_logger:Epoch[097/800], Step[0100/0626], Avg Loss: 0.6831 +INFO:local_logger:Epoch[097/800], Step[0100/0626], Avg Loss: 0.6820 +INFO:local_logger:Epoch[097/800], Step[0200/0626], Avg Loss: 0.6825 +INFO:local_logger:Epoch[097/800], Step[0200/0626], Avg Loss: 0.6826 +INFO:local_logger:Epoch[097/800], Step[0200/0626], Avg Loss: 0.6822 +INFO:local_logger:Epoch[097/800], Step[0200/0626], Avg Loss: 0.6823 +INFO:local_logger:Epoch[097/800], Step[0200/0626], Avg Loss: 0.6823 +INFO:local_logger:Epoch[097/800], Step[0200/0626], Avg Loss: 0.6821 +INFO:local_logger:Epoch[097/800], Step[0200/0626], Avg Loss: 0.6823 +INFO:master_logger:Epoch[097/800], Step[0200/0626], Avg Loss: 0.6823 +INFO:local_logger:Epoch[097/800], Step[0200/0626], Avg Loss: 0.6824 +INFO:local_logger:Epoch[097/800], Step[0300/0626], Avg Loss: 0.6826 +INFO:local_logger:Epoch[097/800], Step[0300/0626], Avg Loss: 0.6826 +INFO:local_logger:Epoch[097/800], Step[0300/0626], Avg Loss: 0.6826 +INFO:local_logger:Epoch[097/800], Step[0300/0626], Avg Loss: 0.6819 +INFO:local_logger:Epoch[097/800], Step[0300/0626], Avg Loss: 0.6822 +INFO:master_logger:Epoch[097/800], Step[0300/0626], Avg Loss: 0.6823 +INFO:local_logger:Epoch[097/800], Step[0300/0626], Avg Loss: 0.6821 +INFO:local_logger:Epoch[097/800], Step[0300/0626], Avg Loss: 0.6825 +INFO:local_logger:Epoch[097/800], Step[0300/0626], Avg Loss: 0.6821 +INFO:local_logger:Epoch[097/800], Step[0400/0626], Avg Loss: 0.6821 +INFO:local_logger:Epoch[097/800], Step[0400/0626], Avg Loss: 0.6821 +INFO:local_logger:Epoch[097/800], Step[0400/0626], Avg Loss: 0.6818 +INFO:local_logger:Epoch[097/800], Step[0400/0626], Avg Loss: 0.6825 +INFO:local_logger:Epoch[097/800], Step[0400/0626], Avg Loss: 0.6820 +INFO:local_logger:Epoch[097/800], Step[0400/0626], Avg Loss: 0.6825 +INFO:local_logger:Epoch[097/800], Step[0400/0626], Avg Loss: 0.6824 +INFO:local_logger:Epoch[097/800], Step[0400/0626], Avg Loss: 0.6818 +INFO:master_logger:Epoch[097/800], Step[0400/0626], Avg Loss: 0.6821 +INFO:local_logger:Epoch[097/800], Step[0500/0626], Avg Loss: 0.6823 +INFO:local_logger:Epoch[097/800], Step[0500/0626], Avg Loss: 0.6815 +INFO:local_logger:Epoch[097/800], Step[0500/0626], Avg Loss: 0.6818 +INFO:local_logger:Epoch[097/800], Step[0500/0626], Avg Loss: 0.6826 +INFO:master_logger:Epoch[097/800], Step[0500/0626], Avg Loss: 0.6821 +INFO:local_logger:Epoch[097/800], Step[0500/0626], Avg Loss: 0.6826 
+INFO:local_logger:Epoch[097/800], Step[0500/0626], Avg Loss: 0.6820 +INFO:local_logger:Epoch[097/800], Step[0500/0626], Avg Loss: 0.6820 +INFO:local_logger:Epoch[097/800], Step[0500/0626], Avg Loss: 0.6822 +INFO:local_logger:Epoch[097/800], Step[0600/0626], Avg Loss: 0.6820 +INFO:local_logger:Epoch[097/800], Step[0600/0626], Avg Loss: 0.6822 +INFO:local_logger:Epoch[097/800], Step[0600/0626], Avg Loss: 0.6822 +INFO:local_logger:Epoch[097/800], Step[0600/0626], Avg Loss: 0.6822 +INFO:local_logger:Epoch[097/800], Step[0600/0626], Avg Loss: 0.6818 +INFO:local_logger:Epoch[097/800], Step[0600/0626], Avg Loss: 0.6821 +INFO:local_logger:Epoch[097/800], Step[0600/0626], Avg Loss: 0.6825 +INFO:master_logger:Epoch[097/800], Step[0600/0626], Avg Loss: 0.6820 +INFO:local_logger:Epoch[097/800], Step[0600/0626], Avg Loss: 0.6814 +INFO:local_logger:----- Epoch[097/800], Train Loss: 0.6822, time: 881.18 +INFO:local_logger:Now training epoch 98. LR=0.000155 +INFO:local_logger:----- Epoch[097/800], Train Loss: 0.6818, time: 882.30 +INFO:local_logger:Now training epoch 98. LR=0.000155 +INFO:local_logger:----- Epoch[097/800], Train Loss: 0.6822, time: 882.31 +INFO:local_logger:Now training epoch 98. LR=0.000155 +INFO:local_logger:----- Epoch[097/800], Train Loss: 0.6826, time: 882.32 +INFO:local_logger:Now training epoch 98. LR=0.000155 +INFO:local_logger:----- Epoch[097/800], Train Loss: 0.6813, time: 882.38 +INFO:local_logger:Now training epoch 98. LR=0.000155 +INFO:local_logger:----- Epoch[097/800], Train Loss: 0.6820, time: 882.40 +INFO:local_logger:Now training epoch 98. LR=0.000155 +INFO:local_logger:----- Epoch[097/800], Train Loss: 0.6818, time: 878.64 +INFO:master_logger:----- Epoch[097/800], Train Loss: 0.6820, time: 878.64 +INFO:local_logger:----- Epoch[097/800], Train Loss: 0.6823, time: 882.40 +INFO:local_logger:Now training epoch 98. LR=0.000155 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-97-Loss-0.6818455600972856.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-97-Loss-0.6818455600972856.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-97-Loss-0.6818455600972856.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-97-Loss-0.6818455600972856.pdopt +INFO:local_logger:Now training epoch 98. LR=0.000155 +INFO:master_logger:Now training epoch 98. 
LR=0.000155 +INFO:local_logger:Epoch[098/800], Step[0000/0626], Avg Loss: 0.6930 +INFO:local_logger:Epoch[098/800], Step[0000/0626], Avg Loss: 0.6810 +INFO:local_logger:Epoch[098/800], Step[0000/0626], Avg Loss: 0.6821 +INFO:master_logger:Epoch[098/800], Step[0000/0626], Avg Loss: 0.6860 +INFO:local_logger:Epoch[098/800], Step[0000/0626], Avg Loss: 0.6802 +INFO:local_logger:Epoch[098/800], Step[0000/0626], Avg Loss: 0.6848 +INFO:local_logger:Epoch[098/800], Step[0000/0626], Avg Loss: 0.6756 +INFO:local_logger:Epoch[098/800], Step[0000/0626], Avg Loss: 0.7011 +INFO:local_logger:Epoch[098/800], Step[0000/0626], Avg Loss: 0.6899 +INFO:local_logger:Epoch[098/800], Step[0100/0626], Avg Loss: 0.6826 +INFO:local_logger:Epoch[098/800], Step[0100/0626], Avg Loss: 0.6824 +INFO:master_logger:Epoch[098/800], Step[0100/0626], Avg Loss: 0.6823 +INFO:local_logger:Epoch[098/800], Step[0100/0626], Avg Loss: 0.6826 +INFO:local_logger:Epoch[098/800], Step[0100/0626], Avg Loss: 0.6824 +INFO:local_logger:Epoch[098/800], Step[0100/0626], Avg Loss: 0.6819 +INFO:local_logger:Epoch[098/800], Step[0100/0626], Avg Loss: 0.6820 +INFO:local_logger:Epoch[098/800], Step[0100/0626], Avg Loss: 0.6814 +INFO:local_logger:Epoch[098/800], Step[0100/0626], Avg Loss: 0.6829 +INFO:local_logger:Epoch[098/800], Step[0200/0626], Avg Loss: 0.6824 +INFO:local_logger:Epoch[098/800], Step[0200/0626], Avg Loss: 0.6826 +INFO:local_logger:Epoch[098/800], Step[0200/0626], Avg Loss: 0.6830 +INFO:local_logger:Epoch[098/800], Step[0200/0626], Avg Loss: 0.6826 +INFO:local_logger:Epoch[098/800], Step[0200/0626], Avg Loss: 0.6826 +INFO:master_logger:Epoch[098/800], Step[0200/0626], Avg Loss: 0.6827 +INFO:local_logger:Epoch[098/800], Step[0200/0626], Avg Loss: 0.6829 +INFO:local_logger:Epoch[098/800], Step[0200/0626], Avg Loss: 0.6823 +INFO:local_logger:Epoch[098/800], Step[0200/0626], Avg Loss: 0.6830 +INFO:local_logger:Epoch[098/800], Step[0300/0626], Avg Loss: 0.6821 +INFO:local_logger:Epoch[098/800], Step[0300/0626], Avg Loss: 0.6821 +INFO:local_logger:Epoch[098/800], Step[0300/0626], Avg Loss: 0.6829 +INFO:local_logger:Epoch[098/800], Step[0300/0626], Avg Loss: 0.6823 +INFO:local_logger:Epoch[098/800], Step[0300/0626], Avg Loss: 0.6824 +INFO:local_logger:Epoch[098/800], Step[0300/0626], Avg Loss: 0.6822 +INFO:master_logger:Epoch[098/800], Step[0300/0626], Avg Loss: 0.6822 +INFO:local_logger:Epoch[098/800], Step[0300/0626], Avg Loss: 0.6818 +INFO:local_logger:Epoch[098/800], Step[0300/0626], Avg Loss: 0.6819 +INFO:local_logger:Epoch[098/800], Step[0400/0626], Avg Loss: 0.6817 +INFO:local_logger:Epoch[098/800], Step[0400/0626], Avg Loss: 0.6822 +INFO:local_logger:Epoch[098/800], Step[0400/0626], Avg Loss: 0.6827 +INFO:local_logger:Epoch[098/800], Step[0400/0626], Avg Loss: 0.6823 +INFO:local_logger:Epoch[098/800], Step[0400/0626], Avg Loss: 0.6824 +INFO:local_logger:Epoch[098/800], Step[0400/0626], Avg Loss: 0.6823 +INFO:local_logger:Epoch[098/800], Step[0400/0626], Avg Loss: 0.6818 +INFO:local_logger:Epoch[098/800], Step[0400/0626], Avg Loss: 0.6817 +INFO:master_logger:Epoch[098/800], Step[0400/0626], Avg Loss: 0.6821 +INFO:local_logger:Epoch[098/800], Step[0500/0626], Avg Loss: 0.6817 +INFO:local_logger:Epoch[098/800], Step[0500/0626], Avg Loss: 0.6823 +INFO:local_logger:Epoch[098/800], Step[0500/0626], Avg Loss: 0.6817 +INFO:local_logger:Epoch[098/800], Step[0500/0626], Avg Loss: 0.6824 +INFO:local_logger:Epoch[098/800], Step[0500/0626], Avg Loss: 0.6820 +INFO:local_logger:Epoch[098/800], Step[0500/0626], Avg Loss: 0.6823 
+INFO:master_logger:Epoch[098/800], Step[0500/0626], Avg Loss: 0.6820 +INFO:local_logger:Epoch[098/800], Step[0500/0626], Avg Loss: 0.6816 +INFO:local_logger:Epoch[098/800], Step[0500/0626], Avg Loss: 0.6821 +INFO:local_logger:Epoch[098/800], Step[0600/0626], Avg Loss: 0.6824 +INFO:local_logger:Epoch[098/800], Step[0600/0626], Avg Loss: 0.6820 +INFO:local_logger:Epoch[098/800], Step[0600/0626], Avg Loss: 0.6815 +INFO:local_logger:Epoch[098/800], Step[0600/0626], Avg Loss: 0.6823 +INFO:local_logger:Epoch[098/800], Step[0600/0626], Avg Loss: 0.6819 +INFO:local_logger:Epoch[098/800], Step[0600/0626], Avg Loss: 0.6816 +INFO:master_logger:Epoch[098/800], Step[0600/0626], Avg Loss: 0.6819 +INFO:local_logger:Epoch[098/800], Step[0600/0626], Avg Loss: 0.6817 +INFO:local_logger:Epoch[098/800], Step[0600/0626], Avg Loss: 0.6820 +INFO:local_logger:----- Epoch[098/800], Train Loss: 0.6819, time: 870.20 +INFO:master_logger:----- Epoch[098/800], Train Loss: 0.6820, time: 870.20 +INFO:local_logger:----- Epoch[098/800], Train Loss: 0.6823, time: 875.16 +INFO:local_logger:Now training epoch 99. LR=0.000155 +INFO:local_logger:----- Epoch[098/800], Train Loss: 0.6820, time: 874.01 +INFO:local_logger:Now training epoch 99. LR=0.000155 +INFO:local_logger:----- Epoch[098/800], Train Loss: 0.6825, time: 874.00 +INFO:local_logger:Now training epoch 99. LR=0.000155 +INFO:local_logger:----- Epoch[098/800], Train Loss: 0.6817, time: 874.10 +INFO:local_logger:Now training epoch 99. LR=0.000155 +INFO:local_logger:----- Epoch[098/800], Train Loss: 0.6822, time: 874.02 +INFO:local_logger:Now training epoch 99. LR=0.000155 +INFO:local_logger:----- Epoch[098/800], Train Loss: 0.6815, time: 874.10 +INFO:local_logger:Now training epoch 99. LR=0.000155 +INFO:local_logger:----- Epoch[098/800], Train Loss: 0.6816, time: 874.02 +INFO:local_logger:Now training epoch 99. LR=0.000155 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-98-Loss-0.681889634827903.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-98-Loss-0.681889634827903.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-98-Loss-0.681889634827903.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-98-Loss-0.681889634827903.pdopt +INFO:local_logger:Now training epoch 99. LR=0.000155 +INFO:master_logger:Now training epoch 99. 
LR=0.000155 +INFO:local_logger:Epoch[099/800], Step[0000/0626], Avg Loss: 0.6968 +INFO:local_logger:Epoch[099/800], Step[0000/0626], Avg Loss: 0.6870 +INFO:local_logger:Epoch[099/800], Step[0000/0626], Avg Loss: 0.6865 +INFO:master_logger:Epoch[099/800], Step[0000/0626], Avg Loss: 0.6826 +INFO:local_logger:Epoch[099/800], Step[0000/0626], Avg Loss: 0.6853 +INFO:local_logger:Epoch[099/800], Step[0000/0626], Avg Loss: 0.6660 +INFO:local_logger:Epoch[099/800], Step[0000/0626], Avg Loss: 0.6719 +INFO:local_logger:Epoch[099/800], Step[0000/0626], Avg Loss: 0.6836 +INFO:local_logger:Epoch[099/800], Step[0000/0626], Avg Loss: 0.6839 +INFO:local_logger:Epoch[099/800], Step[0100/0626], Avg Loss: 0.6821 +INFO:local_logger:Epoch[099/800], Step[0100/0626], Avg Loss: 0.6825 +INFO:local_logger:Epoch[099/800], Step[0100/0626], Avg Loss: 0.6834 +INFO:local_logger:Epoch[099/800], Step[0100/0626], Avg Loss: 0.6824 +INFO:local_logger:Epoch[099/800], Step[0100/0626], Avg Loss: 0.6826 +INFO:master_logger:Epoch[099/800], Step[0100/0626], Avg Loss: 0.6825 +INFO:local_logger:Epoch[099/800], Step[0100/0626], Avg Loss: 0.6824 +INFO:local_logger:Epoch[099/800], Step[0100/0626], Avg Loss: 0.6817 +INFO:local_logger:Epoch[099/800], Step[0100/0626], Avg Loss: 0.6826 +INFO:local_logger:Epoch[099/800], Step[0200/0626], Avg Loss: 0.6825 +INFO:local_logger:Epoch[099/800], Step[0200/0626], Avg Loss: 0.6814 +INFO:local_logger:Epoch[099/800], Step[0200/0626], Avg Loss: 0.6823 +INFO:local_logger:Epoch[099/800], Step[0200/0626], Avg Loss: 0.6828 +INFO:local_logger:Epoch[099/800], Step[0200/0626], Avg Loss: 0.6817 +INFO:local_logger:Epoch[099/800], Step[0200/0626], Avg Loss: 0.6821 +INFO:local_logger:Epoch[099/800], Step[0200/0626], Avg Loss: 0.6830 +INFO:local_logger:Epoch[099/800], Step[0200/0626], Avg Loss: 0.6822 +INFO:master_logger:Epoch[099/800], Step[0200/0626], Avg Loss: 0.6823 +INFO:local_logger:Epoch[099/800], Step[0300/0626], Avg Loss: 0.6817 +INFO:local_logger:Epoch[099/800], Step[0300/0626], Avg Loss: 0.6825 +INFO:local_logger:Epoch[099/800], Step[0300/0626], Avg Loss: 0.6821 +INFO:local_logger:Epoch[099/800], Step[0300/0626], Avg Loss: 0.6824 +INFO:local_logger:Epoch[099/800], Step[0300/0626], Avg Loss: 0.6811 +INFO:master_logger:Epoch[099/800], Step[0300/0626], Avg Loss: 0.6819 +INFO:local_logger:Epoch[099/800], Step[0300/0626], Avg Loss: 0.6823 +INFO:local_logger:Epoch[099/800], Step[0300/0626], Avg Loss: 0.6815 +INFO:local_logger:Epoch[099/800], Step[0300/0626], Avg Loss: 0.6819 +INFO:local_logger:Epoch[099/800], Step[0400/0626], Avg Loss: 0.6814 +INFO:local_logger:Epoch[099/800], Step[0400/0626], Avg Loss: 0.6821 +INFO:local_logger:Epoch[099/800], Step[0400/0626], Avg Loss: 0.6812 +INFO:local_logger:Epoch[099/800], Step[0400/0626], Avg Loss: 0.6818 +INFO:local_logger:Epoch[099/800], Step[0400/0626], Avg Loss: 0.6820 +INFO:local_logger:Epoch[099/800], Step[0400/0626], Avg Loss: 0.6825 +INFO:local_logger:Epoch[099/800], Step[0400/0626], Avg Loss: 0.6816 +INFO:master_logger:Epoch[099/800], Step[0400/0626], Avg Loss: 0.6818 +INFO:local_logger:Epoch[099/800], Step[0400/0626], Avg Loss: 0.6818 +INFO:local_logger:Epoch[099/800], Step[0500/0626], Avg Loss: 0.6821 +INFO:local_logger:Epoch[099/800], Step[0500/0626], Avg Loss: 0.6818 +INFO:local_logger:Epoch[099/800], Step[0500/0626], Avg Loss: 0.6819 +INFO:local_logger:Epoch[099/800], Step[0500/0626], Avg Loss: 0.6822 +INFO:local_logger:Epoch[099/800], Step[0500/0626], Avg Loss: 0.6815 +INFO:local_logger:Epoch[099/800], Step[0500/0626], Avg Loss: 0.6815 
+INFO:master_logger:Epoch[099/800], Step[0500/0626], Avg Loss: 0.6817 +INFO:local_logger:Epoch[099/800], Step[0500/0626], Avg Loss: 0.6811 +INFO:local_logger:Epoch[099/800], Step[0500/0626], Avg Loss: 0.6818 +INFO:local_logger:Epoch[099/800], Step[0600/0626], Avg Loss: 0.6815 +INFO:local_logger:Epoch[099/800], Step[0600/0626], Avg Loss: 0.6817 +INFO:local_logger:Epoch[099/800], Step[0600/0626], Avg Loss: 0.6820 +INFO:local_logger:Epoch[099/800], Step[0600/0626], Avg Loss: 0.6820 +INFO:local_logger:Epoch[099/800], Step[0600/0626], Avg Loss: 0.6811 +INFO:local_logger:Epoch[099/800], Step[0600/0626], Avg Loss: 0.6814 +INFO:local_logger:Epoch[099/800], Step[0600/0626], Avg Loss: 0.6817 +INFO:master_logger:Epoch[099/800], Step[0600/0626], Avg Loss: 0.6816 +INFO:local_logger:Epoch[099/800], Step[0600/0626], Avg Loss: 0.6817 +INFO:local_logger:----- Epoch[099/800], Train Loss: 0.6811, time: 873.44 +INFO:local_logger:Now training epoch 100. LR=0.000155 +INFO:local_logger:----- Epoch[099/800], Train Loss: 0.6817, time: 874.12 +INFO:local_logger:Now training epoch 100. LR=0.000155 +INFO:local_logger:----- Epoch[099/800], Train Loss: 0.6814, time: 874.14 +INFO:local_logger:Now training epoch 100. LR=0.000155 +INFO:local_logger:----- Epoch[099/800], Train Loss: 0.6815, time: 874.31 +INFO:local_logger:Now training epoch 100. LR=0.000155 +INFO:local_logger:----- Epoch[099/800], Train Loss: 0.6816, time: 874.69 +INFO:local_logger:Now training epoch 100. LR=0.000155 +INFO:local_logger:----- Epoch[099/800], Train Loss: 0.6820, time: 874.75 +INFO:local_logger:Now training epoch 100. LR=0.000155 +INFO:local_logger:----- Epoch[099/800], Train Loss: 0.6814, time: 871.01 +INFO:master_logger:----- Epoch[099/800], Train Loss: 0.6816, time: 871.01 +INFO:local_logger:----- Epoch[099/800], Train Loss: 0.6820, time: 874.69 +INFO:local_logger:Now training epoch 100. LR=0.000155 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-99-Loss-0.6813920508235197.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-99-Loss-0.6813920508235197.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-99-Loss-0.6813920508235197.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-99-Loss-0.6813920508235197.pdopt +INFO:local_logger:Now training epoch 100. LR=0.000155 +INFO:master_logger:Now training epoch 100. 
LR=0.000155 +INFO:local_logger:Epoch[100/800], Step[0000/0626], Avg Loss: 0.6874 +INFO:local_logger:Epoch[100/800], Step[0000/0626], Avg Loss: 0.6858 +INFO:local_logger:Epoch[100/800], Step[0000/0626], Avg Loss: 0.6763 +INFO:master_logger:Epoch[100/800], Step[0000/0626], Avg Loss: 0.6815 +INFO:local_logger:Epoch[100/800], Step[0000/0626], Avg Loss: 0.6794 +INFO:local_logger:Epoch[100/800], Step[0000/0626], Avg Loss: 0.6727 +INFO:local_logger:Epoch[100/800], Step[0000/0626], Avg Loss: 0.6920 +INFO:local_logger:Epoch[100/800], Step[0000/0626], Avg Loss: 0.6764 +INFO:local_logger:Epoch[100/800], Step[0000/0626], Avg Loss: 0.6822 +INFO:local_logger:Epoch[100/800], Step[0100/0626], Avg Loss: 0.6806 +INFO:local_logger:Epoch[100/800], Step[0100/0626], Avg Loss: 0.6824 +INFO:local_logger:Epoch[100/800], Step[0100/0626], Avg Loss: 0.6801 +INFO:local_logger:Epoch[100/800], Step[0100/0626], Avg Loss: 0.6821 +INFO:local_logger:Epoch[100/800], Step[0100/0626], Avg Loss: 0.6810 +INFO:local_logger:Epoch[100/800], Step[0100/0626], Avg Loss: 0.6821 +INFO:master_logger:Epoch[100/800], Step[0100/0626], Avg Loss: 0.6812 +INFO:local_logger:Epoch[100/800], Step[0100/0626], Avg Loss: 0.6807 +INFO:local_logger:Epoch[100/800], Step[0100/0626], Avg Loss: 0.6809 +INFO:local_logger:Epoch[100/800], Step[0200/0626], Avg Loss: 0.6815 +INFO:local_logger:Epoch[100/800], Step[0200/0626], Avg Loss: 0.6804 +INFO:local_logger:Epoch[100/800], Step[0200/0626], Avg Loss: 0.6816 +INFO:local_logger:Epoch[100/800], Step[0200/0626], Avg Loss: 0.6816 +INFO:local_logger:Epoch[100/800], Step[0200/0626], Avg Loss: 0.6810 +INFO:local_logger:Epoch[100/800], Step[0200/0626], Avg Loss: 0.6812 +INFO:master_logger:Epoch[100/800], Step[0200/0626], Avg Loss: 0.6813 +INFO:local_logger:Epoch[100/800], Step[0200/0626], Avg Loss: 0.6814 +INFO:local_logger:Epoch[100/800], Step[0200/0626], Avg Loss: 0.6817 +INFO:local_logger:Epoch[100/800], Step[0300/0626], Avg Loss: 0.6816 +INFO:local_logger:Epoch[100/800], Step[0300/0626], Avg Loss: 0.6814 +INFO:local_logger:Epoch[100/800], Step[0300/0626], Avg Loss: 0.6817 +INFO:local_logger:Epoch[100/800], Step[0300/0626], Avg Loss: 0.6821 +INFO:local_logger:Epoch[100/800], Step[0300/0626], Avg Loss: 0.6812 +INFO:local_logger:Epoch[100/800], Step[0300/0626], Avg Loss: 0.6818 +INFO:local_logger:Epoch[100/800], Step[0300/0626], Avg Loss: 0.6812 +INFO:master_logger:Epoch[100/800], Step[0300/0626], Avg Loss: 0.6815 +INFO:local_logger:Epoch[100/800], Step[0300/0626], Avg Loss: 0.6807 +INFO:local_logger:Epoch[100/800], Step[0400/0626], Avg Loss: 0.6811 +INFO:local_logger:Epoch[100/800], Step[0400/0626], Avg Loss: 0.6810 +INFO:local_logger:Epoch[100/800], Step[0400/0626], Avg Loss: 0.6814 +INFO:local_logger:Epoch[100/800], Step[0400/0626], Avg Loss: 0.6808 +INFO:local_logger:Epoch[100/800], Step[0400/0626], Avg Loss: 0.6815 +INFO:master_logger:Epoch[100/800], Step[0400/0626], Avg Loss: 0.6814 +INFO:local_logger:Epoch[100/800], Step[0400/0626], Avg Loss: 0.6819 +INFO:local_logger:Epoch[100/800], Step[0400/0626], Avg Loss: 0.6813 +INFO:local_logger:Epoch[100/800], Step[0400/0626], Avg Loss: 0.6818 +INFO:local_logger:Epoch[100/800], Step[0500/0626], Avg Loss: 0.6815 +INFO:local_logger:Epoch[100/800], Step[0500/0626], Avg Loss: 0.6818 +INFO:local_logger:Epoch[100/800], Step[0500/0626], Avg Loss: 0.6808 +INFO:local_logger:Epoch[100/800], Step[0500/0626], Avg Loss: 0.6815 +INFO:local_logger:Epoch[100/800], Step[0500/0626], Avg Loss: 0.6812 +INFO:master_logger:Epoch[100/800], Step[0500/0626], Avg Loss: 0.6814 
+INFO:local_logger:Epoch[100/800], Step[0500/0626], Avg Loss: 0.6815 +INFO:local_logger:Epoch[100/800], Step[0500/0626], Avg Loss: 0.6816 +INFO:local_logger:Epoch[100/800], Step[0500/0626], Avg Loss: 0.6813 +INFO:local_logger:Epoch[100/800], Step[0600/0626], Avg Loss: 0.6814 +INFO:local_logger:Epoch[100/800], Step[0600/0626], Avg Loss: 0.6817 +INFO:local_logger:Epoch[100/800], Step[0600/0626], Avg Loss: 0.6812 +INFO:local_logger:Epoch[100/800], Step[0600/0626], Avg Loss: 0.6812 +INFO:local_logger:Epoch[100/800], Step[0600/0626], Avg Loss: 0.6813 +INFO:local_logger:Epoch[100/800], Step[0600/0626], Avg Loss: 0.6810 +INFO:local_logger:Epoch[100/800], Step[0600/0626], Avg Loss: 0.6814 +INFO:master_logger:Epoch[100/800], Step[0600/0626], Avg Loss: 0.6813 +INFO:local_logger:Epoch[100/800], Step[0600/0626], Avg Loss: 0.6814 +INFO:local_logger:----- Epoch[100/800], Train Loss: 0.6815, time: 871.01 +INFO:local_logger:Now training epoch 101. LR=0.000156 +INFO:local_logger:----- Epoch[100/800], Train Loss: 0.6814, time: 870.82 +INFO:local_logger:Now training epoch 101. LR=0.000156 +INFO:local_logger:----- Epoch[100/800], Train Loss: 0.6814, time: 871.44 +INFO:local_logger:Now training epoch 101. LR=0.000156 +INFO:local_logger:----- Epoch[100/800], Train Loss: 0.6812, time: 871.50 +INFO:local_logger:Now training epoch 101. LR=0.000156 +INFO:local_logger:----- Epoch[100/800], Train Loss: 0.6812, time: 867.05 +INFO:master_logger:----- Epoch[100/800], Train Loss: 0.6813, time: 867.05 +INFO:local_logger:----- Epoch[100/800], Train Loss: 0.6811, time: 872.20 +INFO:local_logger:Now training epoch 101. LR=0.000156 +INFO:local_logger:----- Epoch[100/800], Train Loss: 0.6814, time: 870.95 +INFO:local_logger:Now training epoch 101. LR=0.000156 +INFO:local_logger:----- Epoch[100/800], Train Loss: 0.6814, time: 871.00 +INFO:local_logger:Now training epoch 101. LR=0.000156 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-100-Loss-0.6812047341083004.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-100-Loss-0.6812047341083004.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-100-Loss-0.6812047341083004.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-100-Loss-0.6812047341083004.pdopt +INFO:local_logger:Now training epoch 101. LR=0.000156 +INFO:master_logger:Now training epoch 101. 
LR=0.000156 +INFO:local_logger:Epoch[101/800], Step[0000/0626], Avg Loss: 0.6922 +INFO:local_logger:Epoch[101/800], Step[0000/0626], Avg Loss: 0.6755 +INFO:local_logger:Epoch[101/800], Step[0000/0626], Avg Loss: 0.6834 +INFO:master_logger:Epoch[101/800], Step[0000/0626], Avg Loss: 0.6808 +INFO:local_logger:Epoch[101/800], Step[0000/0626], Avg Loss: 0.6807 +INFO:local_logger:Epoch[101/800], Step[0000/0626], Avg Loss: 0.6772 +INFO:local_logger:Epoch[101/800], Step[0000/0626], Avg Loss: 0.6966 +INFO:local_logger:Epoch[101/800], Step[0000/0626], Avg Loss: 0.6690 +INFO:local_logger:Epoch[101/800], Step[0000/0626], Avg Loss: 0.6717 +INFO:local_logger:Epoch[101/800], Step[0100/0626], Avg Loss: 0.6810 +INFO:local_logger:Epoch[101/800], Step[0100/0626], Avg Loss: 0.6798 +INFO:local_logger:Epoch[101/800], Step[0100/0626], Avg Loss: 0.6819 +INFO:master_logger:Epoch[101/800], Step[0100/0626], Avg Loss: 0.6809 +INFO:local_logger:Epoch[101/800], Step[0100/0626], Avg Loss: 0.6809 +INFO:local_logger:Epoch[101/800], Step[0100/0626], Avg Loss: 0.6810 +INFO:local_logger:Epoch[101/800], Step[0100/0626], Avg Loss: 0.6805 +INFO:local_logger:Epoch[101/800], Step[0100/0626], Avg Loss: 0.6806 +INFO:local_logger:Epoch[101/800], Step[0100/0626], Avg Loss: 0.6813 +INFO:local_logger:Epoch[101/800], Step[0200/0626], Avg Loss: 0.6815 +INFO:local_logger:Epoch[101/800], Step[0200/0626], Avg Loss: 0.6813 +INFO:master_logger:Epoch[101/800], Step[0200/0626], Avg Loss: 0.6811 +INFO:local_logger:Epoch[101/800], Step[0200/0626], Avg Loss: 0.6817 +INFO:local_logger:Epoch[101/800], Step[0200/0626], Avg Loss: 0.6805 +INFO:local_logger:Epoch[101/800], Step[0200/0626], Avg Loss: 0.6808 +INFO:local_logger:Epoch[101/800], Step[0200/0626], Avg Loss: 0.6811 +INFO:local_logger:Epoch[101/800], Step[0200/0626], Avg Loss: 0.6806 +INFO:local_logger:Epoch[101/800], Step[0200/0626], Avg Loss: 0.6814 +INFO:local_logger:Epoch[101/800], Step[0300/0626], Avg Loss: 0.6809 +INFO:local_logger:Epoch[101/800], Step[0300/0626], Avg Loss: 0.6808 +INFO:local_logger:Epoch[101/800], Step[0300/0626], Avg Loss: 0.6812 +INFO:local_logger:Epoch[101/800], Step[0300/0626], Avg Loss: 0.6813 +INFO:local_logger:Epoch[101/800], Step[0300/0626], Avg Loss: 0.6807 +INFO:master_logger:Epoch[101/800], Step[0300/0626], Avg Loss: 0.6809 +INFO:local_logger:Epoch[101/800], Step[0300/0626], Avg Loss: 0.6807 +INFO:local_logger:Epoch[101/800], Step[0300/0626], Avg Loss: 0.6805 +INFO:local_logger:Epoch[101/800], Step[0300/0626], Avg Loss: 0.6812 +INFO:local_logger:Epoch[101/800], Step[0400/0626], Avg Loss: 0.6811 +INFO:local_logger:Epoch[101/800], Step[0400/0626], Avg Loss: 0.6804 +INFO:local_logger:Epoch[101/800], Step[0400/0626], Avg Loss: 0.6810 +INFO:local_logger:Epoch[101/800], Step[0400/0626], Avg Loss: 0.6806 +INFO:local_logger:Epoch[101/800], Step[0400/0626], Avg Loss: 0.6806 +INFO:local_logger:Epoch[101/800], Step[0400/0626], Avg Loss: 0.6810 +INFO:local_logger:Epoch[101/800], Step[0400/0626], Avg Loss: 0.6810 +INFO:local_logger:Epoch[101/800], Step[0400/0626], Avg Loss: 0.6810 +INFO:master_logger:Epoch[101/800], Step[0400/0626], Avg Loss: 0.6808 +INFO:local_logger:Epoch[101/800], Step[0500/0626], Avg Loss: 0.6808 +INFO:local_logger:Epoch[101/800], Step[0500/0626], Avg Loss: 0.6811 +INFO:local_logger:Epoch[101/800], Step[0500/0626], Avg Loss: 0.6805 +INFO:local_logger:Epoch[101/800], Step[0500/0626], Avg Loss: 0.6807 +INFO:local_logger:Epoch[101/800], Step[0500/0626], Avg Loss: 0.6808 +INFO:local_logger:Epoch[101/800], Step[0500/0626], Avg Loss: 0.6811 
+INFO:local_logger:Epoch[101/800], Step[0500/0626], Avg Loss: 0.6810 +INFO:master_logger:Epoch[101/800], Step[0500/0626], Avg Loss: 0.6808 +INFO:local_logger:Epoch[101/800], Step[0500/0626], Avg Loss: 0.6805 +INFO:local_logger:Epoch[101/800], Step[0600/0626], Avg Loss: 0.6811 +INFO:local_logger:Epoch[101/800], Step[0600/0626], Avg Loss: 0.6805 +INFO:local_logger:Epoch[101/800], Step[0600/0626], Avg Loss: 0.6807 +INFO:local_logger:Epoch[101/800], Step[0600/0626], Avg Loss: 0.6806 +INFO:local_logger:Epoch[101/800], Step[0600/0626], Avg Loss: 0.6811 +INFO:local_logger:Epoch[101/800], Step[0600/0626], Avg Loss: 0.6806 +INFO:local_logger:Epoch[101/800], Step[0600/0626], Avg Loss: 0.6808 +INFO:master_logger:Epoch[101/800], Step[0600/0626], Avg Loss: 0.6808 +INFO:local_logger:Epoch[101/800], Step[0600/0626], Avg Loss: 0.6809 +INFO:local_logger:----- Epoch[101/800], Train Loss: 0.6805, time: 867.69 +INFO:local_logger:Now training epoch 102. LR=0.000156 +INFO:local_logger:----- Epoch[101/800], Train Loss: 0.6806, time: 868.18 +INFO:local_logger:Now training epoch 102. LR=0.000156 +INFO:local_logger:----- Epoch[101/800], Train Loss: 0.6806, time: 868.58 +INFO:local_logger:Now training epoch 102. LR=0.000156 +INFO:local_logger:----- Epoch[101/800], Train Loss: 0.6808, time: 864.53 +INFO:master_logger:----- Epoch[101/800], Train Loss: 0.6808, time: 864.53 +INFO:local_logger:----- Epoch[101/800], Train Loss: 0.6807, time: 868.26 +INFO:local_logger:Now training epoch 102. LR=0.000156 +INFO:local_logger:----- Epoch[101/800], Train Loss: 0.6810, time: 868.29 +INFO:local_logger:Now training epoch 102. LR=0.000156 +INFO:local_logger:----- Epoch[101/800], Train Loss: 0.6811, time: 868.42 +INFO:local_logger:Now training epoch 102. LR=0.000156 +INFO:local_logger:----- Epoch[101/800], Train Loss: 0.6812, time: 868.25 +INFO:local_logger:Now training epoch 102. LR=0.000156 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-101-Loss-0.6808065795139766.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-101-Loss-0.6808065795139766.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-101-Loss-0.6808065795139766.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-101-Loss-0.6808065795139766.pdopt +INFO:local_logger:Now training epoch 102. LR=0.000156 +INFO:master_logger:Now training epoch 102. 
LR=0.000156 +INFO:local_logger:Epoch[102/800], Step[0000/0626], Avg Loss: 0.6861 +INFO:master_logger:Epoch[102/800], Step[0000/0626], Avg Loss: 0.6794 +INFO:local_logger:Epoch[102/800], Step[0000/0626], Avg Loss: 0.6910 +INFO:local_logger:Epoch[102/800], Step[0000/0626], Avg Loss: 0.6753 +INFO:local_logger:Epoch[102/800], Step[0000/0626], Avg Loss: 0.6669 +INFO:local_logger:Epoch[102/800], Step[0000/0626], Avg Loss: 0.6722 +INFO:local_logger:Epoch[102/800], Step[0000/0626], Avg Loss: 0.6901 +INFO:local_logger:Epoch[102/800], Step[0000/0626], Avg Loss: 0.6797 +INFO:local_logger:Epoch[102/800], Step[0000/0626], Avg Loss: 0.6743 +INFO:local_logger:Epoch[102/800], Step[0100/0626], Avg Loss: 0.6830 +INFO:local_logger:Epoch[102/800], Step[0100/0626], Avg Loss: 0.6810 +INFO:local_logger:Epoch[102/800], Step[0100/0626], Avg Loss: 0.6822 +INFO:local_logger:Epoch[102/800], Step[0100/0626], Avg Loss: 0.6813 +INFO:local_logger:Epoch[102/800], Step[0100/0626], Avg Loss: 0.6817 +INFO:master_logger:Epoch[102/800], Step[0100/0626], Avg Loss: 0.6815 +INFO:local_logger:Epoch[102/800], Step[0100/0626], Avg Loss: 0.6808 +INFO:local_logger:Epoch[102/800], Step[0100/0626], Avg Loss: 0.6815 +INFO:local_logger:Epoch[102/800], Step[0100/0626], Avg Loss: 0.6804 +INFO:local_logger:Epoch[102/800], Step[0200/0626], Avg Loss: 0.6814 +INFO:local_logger:Epoch[102/800], Step[0200/0626], Avg Loss: 0.6806 +INFO:local_logger:Epoch[102/800], Step[0200/0626], Avg Loss: 0.6807 +INFO:master_logger:Epoch[102/800], Step[0200/0626], Avg Loss: 0.6811 +INFO:local_logger:Epoch[102/800], Step[0200/0626], Avg Loss: 0.6808 +INFO:local_logger:Epoch[102/800], Step[0200/0626], Avg Loss: 0.6819 +INFO:local_logger:Epoch[102/800], Step[0200/0626], Avg Loss: 0.6817 +INFO:local_logger:Epoch[102/800], Step[0200/0626], Avg Loss: 0.6815 +INFO:local_logger:Epoch[102/800], Step[0200/0626], Avg Loss: 0.6803 +INFO:local_logger:Epoch[102/800], Step[0300/0626], Avg Loss: 0.6811 +INFO:local_logger:Epoch[102/800], Step[0300/0626], Avg Loss: 0.6806 +INFO:local_logger:Epoch[102/800], Step[0300/0626], Avg Loss: 0.6815 +INFO:local_logger:Epoch[102/800], Step[0300/0626], Avg Loss: 0.6808 +INFO:local_logger:Epoch[102/800], Step[0300/0626], Avg Loss: 0.6812 +INFO:local_logger:Epoch[102/800], Step[0300/0626], Avg Loss: 0.6814 +INFO:master_logger:Epoch[102/800], Step[0300/0626], Avg Loss: 0.6812 +INFO:local_logger:Epoch[102/800], Step[0300/0626], Avg Loss: 0.6821 +INFO:local_logger:Epoch[102/800], Step[0300/0626], Avg Loss: 0.6806 +INFO:local_logger:Epoch[102/800], Step[0400/0626], Avg Loss: 0.6816 +INFO:local_logger:Epoch[102/800], Step[0400/0626], Avg Loss: 0.6811 +INFO:local_logger:Epoch[102/800], Step[0400/0626], Avg Loss: 0.6804 +INFO:local_logger:Epoch[102/800], Step[0400/0626], Avg Loss: 0.6812 +INFO:local_logger:Epoch[102/800], Step[0400/0626], Avg Loss: 0.6812 +INFO:local_logger:Epoch[102/800], Step[0400/0626], Avg Loss: 0.6805 +INFO:local_logger:Epoch[102/800], Step[0400/0626], Avg Loss: 0.6805 +INFO:local_logger:Epoch[102/800], Step[0400/0626], Avg Loss: 0.6810 +INFO:master_logger:Epoch[102/800], Step[0400/0626], Avg Loss: 0.6809 +INFO:local_logger:Epoch[102/800], Step[0500/0626], Avg Loss: 0.6810 +INFO:local_logger:Epoch[102/800], Step[0500/0626], Avg Loss: 0.6815 +INFO:local_logger:Epoch[102/800], Step[0500/0626], Avg Loss: 0.6812 +INFO:local_logger:Epoch[102/800], Step[0500/0626], Avg Loss: 0.6809 +INFO:local_logger:Epoch[102/800], Step[0500/0626], Avg Loss: 0.6810 +INFO:local_logger:Epoch[102/800], Step[0500/0626], Avg Loss: 0.6807 
+INFO:local_logger:Epoch[102/800], Step[0500/0626], Avg Loss: 0.6820 +INFO:local_logger:Epoch[102/800], Step[0500/0626], Avg Loss: 0.6811 +INFO:master_logger:Epoch[102/800], Step[0500/0626], Avg Loss: 0.6812 +INFO:local_logger:Epoch[102/800], Step[0600/0626], Avg Loss: 0.6808 +INFO:local_logger:Epoch[102/800], Step[0600/0626], Avg Loss: 0.6811 +INFO:local_logger:Epoch[102/800], Step[0600/0626], Avg Loss: 0.6817 +INFO:local_logger:Epoch[102/800], Step[0600/0626], Avg Loss: 0.6805 +INFO:local_logger:Epoch[102/800], Step[0600/0626], Avg Loss: 0.6815 +INFO:local_logger:Epoch[102/800], Step[0600/0626], Avg Loss: 0.6810 +INFO:master_logger:Epoch[102/800], Step[0600/0626], Avg Loss: 0.6811 +INFO:local_logger:Epoch[102/800], Step[0600/0626], Avg Loss: 0.6809 +INFO:local_logger:Epoch[102/800], Step[0600/0626], Avg Loss: 0.6813 +INFO:local_logger:----- Epoch[102/800], Train Loss: 0.6812, time: 872.04 +INFO:local_logger:Now training epoch 103. LR=0.000156 +INFO:local_logger:----- Epoch[102/800], Train Loss: 0.6812, time: 868.78 +INFO:master_logger:----- Epoch[102/800], Train Loss: 0.6811, time: 868.78 +INFO:local_logger:----- Epoch[102/800], Train Loss: 0.6809, time: 872.56 +INFO:local_logger:Now training epoch 103. LR=0.000156 +INFO:local_logger:----- Epoch[102/800], Train Loss: 0.6805, time: 872.77 +INFO:local_logger:Now training epoch 103. LR=0.000156 +INFO:local_logger:----- Epoch[102/800], Train Loss: 0.6808, time: 873.71 +INFO:local_logger:Now training epoch 103. LR=0.000156 +INFO:local_logger:----- Epoch[102/800], Train Loss: 0.6809, time: 873.10 +INFO:local_logger:Now training epoch 103. LR=0.000156 +INFO:local_logger:----- Epoch[102/800], Train Loss: 0.6815, time: 873.19 +INFO:local_logger:Now training epoch 103. LR=0.000156 +INFO:local_logger:----- Epoch[102/800], Train Loss: 0.6815, time: 873.09 +INFO:local_logger:Now training epoch 103. LR=0.000156 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-102-Loss-0.681159841605351.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-102-Loss-0.681159841605351.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-102-Loss-0.681159841605351.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-102-Loss-0.681159841605351.pdopt +INFO:local_logger:Now training epoch 103. LR=0.000156 +INFO:master_logger:Now training epoch 103. 
LR=0.000156 +INFO:local_logger:Epoch[103/800], Step[0000/0626], Avg Loss: 0.6787 +INFO:master_logger:Epoch[103/800], Step[0000/0626], Avg Loss: 0.6808 +INFO:local_logger:Epoch[103/800], Step[0000/0626], Avg Loss: 0.6646 +INFO:local_logger:Epoch[103/800], Step[0000/0626], Avg Loss: 0.6794 +INFO:local_logger:Epoch[103/800], Step[0000/0626], Avg Loss: 0.6850 +INFO:local_logger:Epoch[103/800], Step[0000/0626], Avg Loss: 0.6838 +INFO:local_logger:Epoch[103/800], Step[0000/0626], Avg Loss: 0.6909 +INFO:local_logger:Epoch[103/800], Step[0000/0626], Avg Loss: 0.6721 +INFO:local_logger:Epoch[103/800], Step[0000/0626], Avg Loss: 0.6917 +INFO:local_logger:Epoch[103/800], Step[0100/0626], Avg Loss: 0.6806 +INFO:local_logger:Epoch[103/800], Step[0100/0626], Avg Loss: 0.6811 +INFO:master_logger:Epoch[103/800], Step[0100/0626], Avg Loss: 0.6812 +INFO:local_logger:Epoch[103/800], Step[0100/0626], Avg Loss: 0.6829 +INFO:local_logger:Epoch[103/800], Step[0100/0626], Avg Loss: 0.6810 +INFO:local_logger:Epoch[103/800], Step[0100/0626], Avg Loss: 0.6795 +INFO:local_logger:Epoch[103/800], Step[0100/0626], Avg Loss: 0.6823 +INFO:local_logger:Epoch[103/800], Step[0100/0626], Avg Loss: 0.6818 +INFO:local_logger:Epoch[103/800], Step[0100/0626], Avg Loss: 0.6802 +INFO:local_logger:Epoch[103/800], Step[0200/0626], Avg Loss: 0.6799 +INFO:local_logger:Epoch[103/800], Step[0200/0626], Avg Loss: 0.6806 +INFO:local_logger:Epoch[103/800], Step[0200/0626], Avg Loss: 0.6805 +INFO:local_logger:Epoch[103/800], Step[0200/0626], Avg Loss: 0.6810 +INFO:master_logger:Epoch[103/800], Step[0200/0626], Avg Loss: 0.6805 +INFO:local_logger:Epoch[103/800], Step[0200/0626], Avg Loss: 0.6815 +INFO:local_logger:Epoch[103/800], Step[0200/0626], Avg Loss: 0.6804 +INFO:local_logger:Epoch[103/800], Step[0200/0626], Avg Loss: 0.6797 +INFO:local_logger:Epoch[103/800], Step[0200/0626], Avg Loss: 0.6806 +INFO:local_logger:Epoch[103/800], Step[0300/0626], Avg Loss: 0.6807 +INFO:local_logger:Epoch[103/800], Step[0300/0626], Avg Loss: 0.6803 +INFO:local_logger:Epoch[103/800], Step[0300/0626], Avg Loss: 0.6814 +INFO:local_logger:Epoch[103/800], Step[0300/0626], Avg Loss: 0.6799 +INFO:local_logger:Epoch[103/800], Step[0300/0626], Avg Loss: 0.6803 +INFO:local_logger:Epoch[103/800], Step[0300/0626], Avg Loss: 0.6801 +INFO:local_logger:Epoch[103/800], Step[0300/0626], Avg Loss: 0.6806 +INFO:master_logger:Epoch[103/800], Step[0300/0626], Avg Loss: 0.6805 +INFO:local_logger:Epoch[103/800], Step[0300/0626], Avg Loss: 0.6805 +INFO:local_logger:Epoch[103/800], Step[0400/0626], Avg Loss: 0.6807 +INFO:local_logger:Epoch[103/800], Step[0400/0626], Avg Loss: 0.6809 +INFO:local_logger:Epoch[103/800], Step[0400/0626], Avg Loss: 0.6802 +INFO:local_logger:Epoch[103/800], Step[0400/0626], Avg Loss: 0.6800 +INFO:local_logger:Epoch[103/800], Step[0400/0626], Avg Loss: 0.6805 +INFO:local_logger:Epoch[103/800], Step[0400/0626], Avg Loss: 0.6805 +INFO:master_logger:Epoch[103/800], Step[0400/0626], Avg Loss: 0.6806 +INFO:local_logger:Epoch[103/800], Step[0400/0626], Avg Loss: 0.6814 +INFO:local_logger:Epoch[103/800], Step[0400/0626], Avg Loss: 0.6803 +INFO:local_logger:Epoch[103/800], Step[0500/0626], Avg Loss: 0.6802 +INFO:local_logger:Epoch[103/800], Step[0500/0626], Avg Loss: 0.6812 +INFO:local_logger:Epoch[103/800], Step[0500/0626], Avg Loss: 0.6802 +INFO:local_logger:Epoch[103/800], Step[0500/0626], Avg Loss: 0.6807 +INFO:local_logger:Epoch[103/800], Step[0500/0626], Avg Loss: 0.6808 +INFO:local_logger:Epoch[103/800], Step[0500/0626], Avg Loss: 0.6809 
+INFO:local_logger:Epoch[103/800], Step[0500/0626], Avg Loss: 0.6804 +INFO:master_logger:Epoch[103/800], Step[0500/0626], Avg Loss: 0.6806 +INFO:local_logger:Epoch[103/800], Step[0500/0626], Avg Loss: 0.6804 +INFO:local_logger:Epoch[103/800], Step[0600/0626], Avg Loss: 0.6806 +INFO:local_logger:Epoch[103/800], Step[0600/0626], Avg Loss: 0.6809 +INFO:local_logger:Epoch[103/800], Step[0600/0626], Avg Loss: 0.6801 +INFO:local_logger:Epoch[103/800], Step[0600/0626], Avg Loss: 0.6802 +INFO:master_logger:Epoch[103/800], Step[0600/0626], Avg Loss: 0.6805 +INFO:local_logger:Epoch[103/800], Step[0600/0626], Avg Loss: 0.6802 +INFO:local_logger:Epoch[103/800], Step[0600/0626], Avg Loss: 0.6809 +INFO:local_logger:Epoch[103/800], Step[0600/0626], Avg Loss: 0.6802 +INFO:local_logger:Epoch[103/800], Step[0600/0626], Avg Loss: 0.6806 +INFO:local_logger:----- Epoch[103/800], Train Loss: 0.6801, time: 859.70 +INFO:local_logger:Now training epoch 104. LR=0.000156 +INFO:local_logger:----- Epoch[103/800], Train Loss: 0.6803, time: 859.89 +INFO:local_logger:Now training epoch 104. LR=0.000156 +INFO:local_logger:----- Epoch[103/800], Train Loss: 0.6805, time: 859.89 +INFO:local_logger:Now training epoch 104. LR=0.000156 +INFO:local_logger:----- Epoch[103/800], Train Loss: 0.6809, time: 856.80 +INFO:master_logger:----- Epoch[103/800], Train Loss: 0.6805, time: 856.80 +INFO:local_logger:----- Epoch[103/800], Train Loss: 0.6806, time: 860.28 +INFO:local_logger:Now training epoch 104. LR=0.000156 +INFO:local_logger:----- Epoch[103/800], Train Loss: 0.6801, time: 859.89 +INFO:local_logger:Now training epoch 104. LR=0.000156 +INFO:local_logger:----- Epoch[103/800], Train Loss: 0.6802, time: 859.91 +INFO:local_logger:Now training epoch 104. LR=0.000156 +INFO:local_logger:----- Epoch[103/800], Train Loss: 0.6809, time: 860.42 +INFO:local_logger:Now training epoch 104. LR=0.000156 +INFO:local_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-103-Loss-0.6808819352382769.pdparams +INFO:local_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-103-Loss-0.6808819352382769.pdopt +INFO:master_logger:----- Save model: ./output/train-20211219-17-07-40/MAE-Epoch-103-Loss-0.6808819352382769.pdparams +INFO:master_logger:----- Save optim: ./output/train-20211219-17-07-40/MAE-Epoch-103-Loss-0.6808819352382769.pdopt +INFO:local_logger:Now training epoch 104. LR=0.000156 +INFO:master_logger:Now training epoch 104. LR=0.000156 +INFO:local_logger:Epoch[104/800], Step[0000/0626], Avg Loss: 0.6583 +INFO:local_logger:Epoch[104/800], Step[0000/0626], Avg Loss: 0.6660 +INFO:local_logger:Epoch[104/800], Step[0000/0626], Avg Loss: 0.6799 +INFO:master_logger:Epoch[104/800], Step[0000/0626], Avg Loss: 0.6722 +INFO:local_logger:Epoch[104/800], Step[0000/0626], Avg Loss: 0.6668 +INFO:local_logger:Epoch[104/800], Step[0000/0626], Avg Loss: 0.6885 +INFO:local_logger:Epoch[104/800], Step[0000/0626], Avg Loss: 0.6790 +INFO:local_logger:Epoch[104/800], Step[0000/0626], Avg Loss: 0.6707 +INFO:local_logger:Epoch[104/800], Step[0000/0626], Avg Loss: 0.6680 + + +-------------------------------------- +C++ Traceback (most recent call last): +-------------------------------------- +0 paddle::platform::GpuMemcpySync(void*, void const*, unsigned long, cudaMemcpyKind) + +---------------------- +Error Message Summary: +---------------------- +FatalError: `Termination signal` is detected by the operating system. 
+ [TimeInfo: *** Aborted at 1639995159 (unix time) try "date -d @1639995159" if you are using GNU date ***]
+ [SignalInfo: *** SIGTERM (@0x84e5) received by PID 25456 (TID 0x7f771efbe700) from PID 34021 ***]
+
+
+
+--------------------------------------
+C++ Traceback (most recent call last):
+--------------------------------------
+0   paddle::platform::GpuMemcpySync(void*, void const*, unsigned long, cudaMemcpyKind)
+
+----------------------
+Error Message Summary:
+----------------------
+FatalError: `Termination signal` is detected by the operating system.
+ [TimeInfo: *** Aborted at 1639995171 (unix time) try "date -d @1639995171" if you are using GNU date ***]
+ [SignalInfo: *** SIGTERM (@0x84e5) received by PID 25537 (TID 0x7fcf37fc6700) from PID 34021 ***]
+
+Traceback (most recent call last):
+  File "main_multi_gpu_pretrain.py", line 416, in <module>
+    main()
+  File "main_multi_gpu_pretrain.py", line 412, in main
+    dist.spawn(main_worker, args=(config, dataset_train, ), nprocs=config.NGPUS)
+  File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/distributed/spawn.py", line 502, in spawn
+    while not context.join():
+  File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/distributed/spawn.py", line 312, in join
+    self._throw_exception(error_index)
+  File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/distributed/spawn.py", line 320, in _throw_exception
+    (error_index, name))
+Exception: Process 7 terminated with signal SIGTERM.
+/opt/conda/envs/py36/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown
+  len(cache))
+/opt/conda/envs/py36/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown
+  len(cache))
+/opt/conda/envs/py36/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown
+  len(cache))
+/opt/conda/envs/py36/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown
+  len(cache))
+/opt/conda/envs/py36/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown
+  len(cache))
+/opt/conda/envs/py36/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown
+  len(cache))
+/opt/conda/envs/py36/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown
+  len(cache))
+/opt/conda/envs/py36/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown
+  len(cache))
+/opt/conda/envs/py36/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 20 leaked semaphores to clean up at shutdown
+  len(cache))
+/opt/conda/envs/py36/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 20 leaked semaphores to clean up at shutdown
+  len(cache))
+/opt/conda/envs/py36/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 20 leaked semaphores to clean up at shutdown
+  len(cache))
+/opt/conda/envs/py36/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 20 leaked semaphores to clean up at shutdown + len(cache)) diff --git a/image_classification/MAE/run_finetune.sh b/image_classification/MAE/run_finetune.sh new file mode 100644 index 00000000..c4d60575 --- /dev/null +++ b/image_classification/MAE/run_finetune.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu_finetune.py \ +-cfg='./configs/vit_base_patch16_224_finetune.yaml' \ +-dataset='imagenet2012' \ +-batch_size=8 \ +-data_path='/dataset/imagenet' \ +-amp \ +-pretrained='./output/train-20211203-14-42-46/MAE-Epoch-10-Loss-0' diff --git a/image_classification/MAE/run_finetune_multi.sh b/image_classification/MAE/run_finetune_multi.sh new file mode 100644 index 00000000..719a5cd1 --- /dev/null +++ b/image_classification/MAE/run_finetune_multi.sh @@ -0,0 +1,7 @@ +CUDA_VISIBLE_DEVICES=0,1 \ +python main_multi_gpu_finetune.py \ +-cfg='./configs/vit_base_patch16_224_finetune.yaml' \ +-dataset='imagenet2012' \ +-batch_size=8 \ +-data_path='/dataset/imagenet' \ +-amp \ diff --git a/image_classification/MAE/run_pretrain.sh b/image_classification/MAE/run_pretrain.sh new file mode 100644 index 00000000..8c5b1b7b --- /dev/null +++ b/image_classification/MAE/run_pretrain.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu_pretrain.py \ +-cfg='./configs/vit_base_patch16_224_pretrain.yaml' \ +-dataset='imagenet2012' \ +-batch_size=8 \ +-data_path='/dataset/imagenet' \ +-mae_pretrain \ +#-amp diff --git a/image_classification/MAE/run_pretrain_multi.sh b/image_classification/MAE/run_pretrain_multi.sh new file mode 100644 index 00000000..6fb6b864 --- /dev/null +++ b/image_classification/MAE/run_pretrain_multi.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3,4 \ +python main_multi_gpu_pretrain.py \ +-cfg='./configs/vit_base_patch16_224_pretrain_dec1.yaml' \ +-dataset='imagenet2012' \ +-batch_size=8 \ +-data_path='/dataset/imagenet' \ +-mae_pretrain \ +#-amp diff --git a/image_classification/MAE/run_pretrain_multi_resume.sh b/image_classification/MAE/run_pretrain_multi_resume.sh new file mode 100644 index 00000000..1ff2fd94 --- /dev/null +++ b/image_classification/MAE/run_pretrain_multi_resume.sh @@ -0,0 +1,10 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ +python main_multi_gpu_pretrain.py \ +-cfg='./configs/vit_base_patch16_224_pretrain.yaml' \ +-dataset='imagenet2012' \ +-batch_size=256 \ +-data_path='/dataset/imagenet' \ +-resume='./output/train-20211210-08-41-14/MAE-Epoch-12-Loss-0.9377176860235059' \ +-last_epoch=12 \ +-mae_pretrain \ +-amp diff --git a/image_classification/MAE/stat_define.py b/image_classification/MAE/stat_define.py new file mode 100644 index 00000000..963482d7 --- /dev/null +++ b/image_classification/MAE/stat_define.py @@ -0,0 +1,61 @@ +import os +import glob +import paddle +from config import get_config +from transformer import build_mae_pretrain as build_model + +def count_gelu(layer, inputs, output): + activation_flops = 8 + x = inputs[0] + num = x.numel() + layer.total_ops += num * activation_flops + + +def count_softmax(layer, inputs, output): + softmax_flops = 5 # max/substract, exp, sum, divide + x = inputs[0] + num = x.numel() + layer.total_ops += num * softmax_flops + + +def count_layernorm(layer, inputs, output): + layer_norm_flops = 5 # get mean (sum), get variance (square and sum), scale(multiply) + x = inputs[0] + num = x.numel() + layer.total_ops += num * layer_norm_flops + + +cfg = 
'./configs/vit_large_patch32_384.yaml' +#input_size = (1, 3, 224, 224) +input_size = (1, 3, 384, 384) +config = get_config(cfg) +model = build_model(config) + +custom_ops = {paddle.nn.GELU: count_gelu, + paddle.nn.LayerNorm: count_layernorm, + paddle.nn.Softmax: count_softmax, + } +print(os.path.basename(cfg)) +paddle.flops(model, + input_size=input_size, + custom_ops=custom_ops, + print_detail=False) + + +#for cfg in glob.glob('./configs/*.yaml'): +# #cfg = './configs/swin_base_patch4_window7_224.yaml' +# input_size = (1, 3, int(cfg[-8:-5]), int(cfg[-8:-5])) +# config = get_config(cfg) +# model = build_model(config) +# +# +# custom_ops = {paddle.nn.GELU: count_gelu, +# paddle.nn.LayerNorm: count_layernorm, +# paddle.nn.Softmax: count_softmax, +# } +# print(os.path.basename(cfg)) +# paddle.flops(model, +# input_size=input_size, +# custom_ops=custom_ops, +# print_detail=False) +# print('-----------') diff --git a/image_classification/MAE/tests/__init__.py b/image_classification/MAE/tests/__init__.py new file mode 100644 index 00000000..84952a81 --- /dev/null +++ b/image_classification/MAE/tests/__init__.py @@ -0,0 +1 @@ +# init \ No newline at end of file diff --git a/image_classification/MAE/tests/test_config.py b/image_classification/MAE/tests/test_config.py new file mode 100644 index 00000000..6806e8a1 --- /dev/null +++ b/image_classification/MAE/tests/test_config.py @@ -0,0 +1,72 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import unittest +import argparse +from config import update_config, get_config + +class ConfigTest(unittest.TestCase): + def setUp(self): + parser = argparse.ArgumentParser('') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default="cifar10") + parser.add_argument('-batch_size', type=int, default=128) + parser.add_argument('-image_size', type=int, default=256) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-data_path', type=str, default='/cifar10/') + parser.add_argument('-eval', action='store_false') # enable eval + parser.add_argument('-pretrained', type=str, default='pretrained') + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + self.args = parser.parse_args() + + def tearDown(self): + pass + + def test_update_config(self): + config = get_config() + config = update_config(config, self.args) + + self.assertEqual(config.DATA.DATASET, 'cifar10') + self.assertEqual(config.DATA.BATCH_SIZE, 128) + self.assertEqual(config.DATA.IMAGE_SIZE, 256) + self.assertEqual(config.DATA.DATA_PATH, '/cifar10/') + self.assertEqual(config.EVAL, True) + self.assertEqual(config.DATA.BATCH_SIZE_EVAL, 128) + self.assertEqual(config.MODEL.PRETRAINED, 'pretrained') + + def test_update_config_from_file(self): + config = get_config() + self.args.cfg = './tests/test_config.yaml' + self.args.image_size = None + self.args.ngpus = None + config = update_config(config, self.args) + + self.assertEqual(config.DATA.IMAGE_SIZE, 384) + self.assertEqual(config.DATA.CROP_PCT, 1.0) + + self.assertEqual(config.MODEL.TRANS.PATCH_SIZE, 16) + self.assertEqual(config.MODEL.TRANS.EMBED_DIM, 768) + self.assertEqual(config.MODEL.TRANS.MLP_RATIO, 4.0) + self.assertEqual(config.MODEL.TRANS.DEPTH, 12) + self.assertEqual(config.MODEL.TRANS.NUM_HEADS, 12) + self.assertEqual(config.MODEL.TRANS.QKV_BIAS, True) + + self.assertEqual(config.MODEL.NAME, 'vit_base_patch16_224') + self.assertEqual(config.MODEL.TYPE, 'ViT') + + def test_get_config(self): + config1 = get_config() + config2 = get_config() + self.assertEqual(config1, config2) diff --git a/image_classification/MAE/tests/test_config.yaml b/image_classification/MAE/tests/test_config.yaml new file mode 100644 index 00000000..19709906 --- /dev/null +++ b/image_classification/MAE/tests/test_config.yaml @@ -0,0 +1,14 @@ +DATA: + IMAGE_SIZE: 384 + CROP_PCT: 1.0 +MODEL: + TYPE: ViT + NAME: vit_base_patch16_224 + TRANS: + PATCH_SIZE: 16 + EMBED_DIM: 768 + MLP_RATIO: 4.0 + DEPTH: 12 + NUM_HEADS: 12 + QKV_BIAS: true + diff --git a/image_classification/MAE/tests/test_datasets.py b/image_classification/MAE/tests/test_datasets.py new file mode 100644 index 00000000..79952137 --- /dev/null +++ b/image_classification/MAE/tests/test_datasets.py @@ -0,0 +1,147 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import unittest +import argparse +from config import * +from datasets import * +from paddle.io import DataLoader +#from multiprocessing import SimpleQueue + +#paddle.set_device('cpu') + +class DatasetTest(unittest.TestCase): + @classmethod + def setUpClass(cls): + parser = argparse.ArgumentParser('') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default='imagenet2012') + parser.add_argument('-batch_size', type=int, default=4) + parser.add_argument('-image_size', type=int, default=224) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-data_path', type=str, default='/dataset/imagenet') + parser.add_argument('-eval', action='store_true') + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + cls.args = parser.parse_args() + cls.config = get_config() + cls.config = update_config(cls.config, cls.args) + + cls.dataset_train = get_dataset(DatasetTest.config, mode='train') + cls.dataset_test = get_dataset(DatasetTest.config, mode='val') + + @classmethod + def tearDown(cls): + pass + + @unittest.skip('skip for debug') + def test_shape(self): + sample = next(iter(DatasetTest.dataset_train)) + self.assertEqual([3, 224, 224], sample[0].shape) + + sample = next(iter(DatasetTest.dataset_test)) + self.assertEqual([3, 224, 224], sample[0].shape) + + @unittest.skip('skip for debug') + def test_scaling(self): + sample = next(iter(DatasetTest.dataset_train))[0] + self.assertTrue(paddle.any(sample < 0)) + self.assertTrue(paddle.any(sample > 0)) + self.assertGreaterEqual(1, sample.max().cpu().numpy()) + self.assertLessEqual(-1, sample.min().cpu().numpy()) + + sample = next(iter(DatasetTest.dataset_test))[0] + self.assertGreaterEqual(1, sample.max().cpu().numpy()) + self.assertLessEqual(-1, sample.min().cpu().numpy()) + self.assertTrue(paddle.any(sample < 0)) + self.assertTrue(paddle.any(sample > 0)) + + @unittest.skip('skip for debug') + def test_single_process_dataloader(self): + self._test_loader(DatasetTest.dataset_train, 'train', False) + self._test_loader(DatasetTest.dataset_test, 'test', False) + + def _test_loader(self, dataset, mode, multi_process): + dataloader = get_dataloader(DatasetTest.config, + dataset, + mode=mode, + multi_process=multi_process) + for idx, _ in enumerate(dataloader): + if idx > 0 and idx % 1 == 0: + print(f'----- test single process dataloader: {idx}/{len(dataloader)}') + if idx == 10: + return + + @unittest.skip('skip for debug') + def test_multi_process_dataloader(self): + tester = Tester() + tester.run() + self.assertEqual(tester.n_samples, 50000) + + + + +class Tester: + def __init__(self): + parser = argparse.ArgumentParser('') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default='imagenet2012') + parser.add_argument('-batch_size', type=int, default=256) + parser.add_argument('-image_size', type=int, default=224) + parser.add_argument('-data_path', type=str, default='/dataset/imagenet/') + parser.add_argument('-eval', action='store_false') # set test batch size + parser.add_argument('-pretrained', type=str, default=None) + args = parser.parse_args() + self.config = get_config() + self.config = update_config(self.config, args) + self.dataset_train = get_dataset(self.config, mode='train') + self.dataset_test = get_dataset(self.config, mode='val') + self.n_samples = 0 + + def run(self, mode='test'): + # 
https://github.com/PaddlePaddle/Paddle/blob/5d8e4395b61929627151f6fd4a607589288a78bf/python/paddle/distributed/spawn.py#L272 + context = dist.spawn(self.main_worker, args=(mode,)) + self.n_samples = context.return_queues[0].get() + print(f'----- total samples: {self.n_samples}') + + def main_worker(self, *args): + mode = args[0] + dist.init_parallel_env() + local_rank = dist.get_rank() + if mode == 'train': + n_samples = self._test_loader(self.config, self.dataset_train, 'train', True) + else: + n_samples = self._test_loader(self.config, self.dataset_test, 'test', True) + + n_samples = paddle.to_tensor(np.array([n_samples])) + dist.reduce(n_samples, 0) + if local_rank == 0: + return n_samples.cpu().numpy() + + + def _test_loader(self, config, dataset, mode, multi_process): + n_samples = 0 + dataloader = get_dataloader(config, + dataset, + mode=mode, + multi_process=multi_process) + local_rank = dist.get_rank() + for idx, data in enumerate(dataloader): + if idx > 0 and idx % 1 == 0: + print(f'----- test single process({local_rank}) dataloader: {idx}/{len(dataloader)}') + #print(local_rank, data[1]) + n_samples += data[0].shape[0] + + return n_samples diff --git a/image_classification/MAE/tests/test_transformer.py b/image_classification/MAE/tests/test_transformer.py new file mode 100644 index 00000000..bbfefc49 --- /dev/null +++ b/image_classification/MAE/tests/test_transformer.py @@ -0,0 +1,115 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
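The transformer tests added below lean on a few fixed shapes. For reference, the arithmetic behind those assertions, using the ViT-Base style defaults that appear in them (224x224 input, 16x16 patches, embed_dim 768, 12 heads):

```python
# Shape bookkeeping behind the assertions in these tests (illustrative only).
image_size, patch_size, embed_dim, num_heads = 224, 16, 768, 12

patches_per_side = image_size // patch_size   # 14
num_patches = patches_per_side ** 2           # 196 patches (197 with the cls token)
patch_pixels = 3 * patch_size * patch_size    # 768 values per flattened RGB patch
head_dim = embed_dim // num_heads             # 64 dims per attention head

print(num_patches, patch_pixels, head_dim)
# Attention over N tokens therefore yields maps of shape [batch, 12, N, N],
# e.g. [4, 12, 50, 50] for the 50-token dummy input used in the attention test.
```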
+ +import unittest +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from config import * +from transformer import build_mae_pretrain +from transformer import PatchEmbedding +from transformer import Attention +from transformer import Mlp +from transformer import Encoder + + +class TransformerTest(unittest.TestCase): + @classmethod + def setUpClass(cls): + paddle.set_device('cpu') + cls.config = get_config() + cls.dummy_img = np.random.randn(4, 3, 224, 224).astype('float32') + cls.dummy_tensor = paddle.to_tensor(cls.dummy_img) + cls.mae = build_mae_pretrain(cls.config) + cls.mae.train() + + @classmethod + def tearDown(cls): + pass + + # @unittest.skip('skip for debug') + def test_out_shape(self): + reconstruct, mask = TransformerTest.mae(TransformerTest.dummy_tensor) + self.assertEqual(reconstruct.shape, [4, 49, 768]) + self.assertEqual(mask.shape, [4, 49, 768]) + + @unittest.skip('skip for debug') + def test_all_parameters_updated(self): + optim = paddle.optimizer.SGD(parameters=TransformerTest.mae.parameters(), learning_rate=0.1) + reconstruct, masked_image = TransformerTest.mae(TransformerTest.dummy_tensor) + loss = F.mse_loss(reconstruct, masked_image) + loss.backward() + + for name, param in TransformerTest.mae.named_parameters(): + if not param.stop_gradient: + self.assertIsNotNone(param.gradient()) + # self.assertNotEqual(0, np.sum(param.gradient() ** 2)) + + # @unittest.skip('skip for debug') + def test_embeddings(self): + embed = PatchEmbedding() + dummy_img = np.random.randn(4, 3, 224, 224).astype('float32') + dummy_tensor = paddle.to_tensor(dummy_img) + + patch_out = embed.patch_embedding(dummy_tensor) + embed_out = embed(dummy_tensor) + self.assertEqual(patch_out.shape, [4, 768, 14, 14]) + self.assertEqual(embed.cls_token.shape, [1, 1, 768]) + self.assertEqual(embed_out.shape, [4, 14 * 14 + 1, 768]) + + # @unittest.skip('skip for debug') + def test_attention(self): + attn_op = Attention( + TransformerTest.config.MODEL.TRANS.ENCODER.EMBED_DIM, + TransformerTest.config.MODEL.TRANS.ENCODER.NUM_HEADS, + TransformerTest.config.MODEL.TRANS.QKV_BIAS) + dummy_img = np.random.randn(4, 50, 768).astype('float32') + dummy_tensor = paddle.to_tensor(dummy_img) + + out, attn = attn_op(dummy_tensor) + self.assertEqual(attn.shape, [4, 12, 50, 50]) + self.assertEqual(out.shape, [4, 50, 768]) + + def test_mlp(self): + mlp_op = Mlp( + TransformerTest.config.MODEL.TRANS.ENCODER.EMBED_DIM, + TransformerTest.config.MODEL.TRANS.MLP_RATIO) + dummy_img = np.random.randn(4, 50, 768).astype('float32') + dummy_tensor = paddle.to_tensor(dummy_img) + + out = mlp_op(dummy_tensor) + self.assertEqual(out.shape, [4, 50, 768]) + + def test_position_embedding_not_update(self): + origin = TransformerTest.mae.position_embedding.get_encoder_embedding().clone() + optim = paddle.optimizer.SGD(parameters=TransformerTest.mae.parameters(), learning_rate=0.1) + reconstruct, masked_image = TransformerTest.mae(TransformerTest.dummy_tensor) + loss = F.mse_loss(reconstruct, masked_image) + loss.backward() + optim.step() + update = TransformerTest.mae.position_embedding.get_encoder_embedding().clone() + self.assertTrue((origin.numpy() == update.numpy()).all()) + + def test_encoder(self): + encoder_op = Encoder( + TransformerTest.config.MODEL.TRANS.ENCODER.EMBED_DIM, + TransformerTest.config.MODEL.TRANS.ENCODER.NUM_HEADS, + TransformerTest.config.MODEL.TRANS.ENCODER.DEPTH, + ) + dummy_img = np.random.randn(4, 50, 768).astype('float32') + dummy_tensor = paddle.to_tensor(dummy_img) + + 
out, _ = encoder_op(dummy_tensor) + self.assertEqual(out.shape, [4, 50, 768]) diff --git a/image_classification/MAE/tests/test_utils.py b/image_classification/MAE/tests/test_utils.py new file mode 100644 index 00000000..49366af4 --- /dev/null +++ b/image_classification/MAE/tests/test_utils.py @@ -0,0 +1,90 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import unittest +import paddle +import paddle.nn as nn +from utils import AverageMeter +from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn + + +class UtilTest(unittest.TestCase): + @classmethod + def setUpClass(cls): + pass + + @classmethod + def tearDown(cls): + pass + + def test_average_meter(self): + meter = AverageMeter() + for i in range(1, 101): + meter.update(i, 1) + self.assertEqual(meter.avg, 50.5) + + def test_warmup_cosine_scheduler(self): + sch = WarmupCosineScheduler(learning_rate=0.1, + warmup_start_lr=1e-5, + start_lr=0.1, + end_lr=0.0, + warmup_epochs=10, + total_epochs=100, + last_epoch=-1) + lrs = [] + for epoch in range(100): + lr = sch.get_lr() + lrs.append(lr) + sch.step() + lrs.append(sch.get_lr()) + + self.assertEqual(lrs[0], 1e-5) + self.assertEqual(lrs[10], 0.1) + self.assertEqual(lrs[-1], 0.0) + self.assertGreaterEqual(min(lrs[0:10]), 1e-5) + self.assertLessEqual(max(lrs[0:10]), 0.1) + self.assertGreaterEqual(min(lrs[10::]), 0.0) + self.assertLessEqual(max(lrs[10::]), 0.1) + + def test_warmup_cosine_scheduler_last_epoch(self): + sch = WarmupCosineScheduler(learning_rate=0.1, + warmup_start_lr=1e-5, + start_lr=0.1, + end_lr=0.0, + warmup_epochs=10, + total_epochs=100, + last_epoch=9) + lrs = [] + for epoch in range(10, 100): + lr = sch.get_lr() + lrs.append(lr) + sch.step() + lrs.append(sch.get_lr()) + + self.assertEqual(lrs[0], 0.1) + self.assertEqual(lrs[-1], 0.0) + self.assertGreaterEqual(min(lrs[::]), 0.0) + self.assertLessEqual(max(lrs[::]), 0.1) + + def test_get_exclude_from_weight_decay_fn(self): + model = nn.Linear(10, 100, bias_attr=True) + exclude_list = ['bias'] + fn = get_exclude_from_weight_decay_fn(exclude_list) + # should return false if name in exclude_list + for name, param in model.named_parameters(): + if name.endswith('weight'): + self.assertTrue(fn(name)) + elif name.endswith('bias'): + self.assertFalse(fn(name)) diff --git a/image_classification/MAE/transformer.py b/image_classification/MAE/transformer.py new file mode 100644 index 00000000..62704ed8 --- /dev/null +++ b/image_classification/MAE/transformer.py @@ -0,0 +1,661 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Implement Transformer Class for ViT +""" + +import copy +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from droppath import DropPath +from config import get_config + + +def get_position_encoding(seq_len, embed_dim): + """ sinusoid position encoding table""" + def get_position_angle_vec(embed_dim, position): + return [position / np.power(10000, 2 * (hid_j // 2) / embed_dim) for hid_j in range(embed_dim)] + + sinusoid_table = np.array([get_position_angle_vec(embed_dim, pos_i) for pos_i in range(seq_len)]) + sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2]) # dim 2i + sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2]) # dim 2i+1 + position_embedding = paddle.to_tensor([sinusoid_table]) + return position_embedding + + +class Identity(nn.Layer): + """ Identity layer + The output of this layer is the input without any change. + Use this layer to avoid using 'if' condition in forward methods + """ + def __init__(self): + super().__init__() + + def forward(self, x): + return x + + +class PositionalEmbedding(nn.Layer): + """Position Embedding + + Apply positional embedding on input images. + + Attributes: + position_embedding: sine-cosine version positional embedding + """ + def __init__(self, embed_dim, seq_len=197): + """ Sinusoid position encoding table """ + super().__init__() + self.seq_len = seq_len + + def get_position_angle_vec(embed_dim, position): + return [position / np.power(10000, 2 * (hid_j // 2) / embed_dim) for hid_j in range(embed_dim)] + + sinusoid_table = np.array([get_position_angle_vec( + embed_dim, pos_i) for pos_i in range(seq_len)]) + sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2]) # dim 2i + sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2]) # dim 2i+1 + position_embedding = paddle.to_tensor([sinusoid_table]) + + self.register_buffer('position_embedding', + position_embedding) + + def get_positional_embedding(self, seq_length=None): + if seq_length is None: + seq_length = self.seq_len + return self.position_embedding[:, :seq_length, :] + + +class PatchEmbedding(nn.Layer): + """Patch Embedding + + Apply patch embedding on input images. 
+ + Attributes: + patch_embddings: impl using a patch_size x patch_size Conv2D operation + cls_token: token insert to the patch feature for classification + dropout: dropout for embeddings + """ + def __init__(self, + image_size=224, + patch_size=16, + in_channels=3, + embed_dim=768, + dropout=0.): + super().__init__() + n_patches = (image_size // patch_size) * (image_size // patch_size) + + self.patch_embedding = nn.Conv2D(in_channels=in_channels, + out_channels=embed_dim, + kernel_size=patch_size, + stride=patch_size) + + self.cls_token = paddle.create_parameter( + shape=[1, 1, embed_dim], + dtype='float32', + default_initializer=paddle.nn.initializer.Constant(0)) + + self.dropout = nn.Dropout(dropout) + + def forward(self, x): + cls_tokens = self.cls_token.expand( + (x.shape[0], -1, -1)) + x = self.patch_embedding(x) + x = x.flatten(2) + x = x.transpose([0, 2, 1]) + x = paddle.concat((cls_tokens, x), axis=1) + embeddings = self.dropout(x) + return embeddings + + +class Attention(nn.Layer): + """ Attention module + + Attention module for ViT, here q, k, v are assumed the same. + The qkv mappings are stored as one single param. + + Attributes: + num_heads: number of heads + attn_head_size: feature dim of single head + all_head_size: feature dim of all heads + qkv: a nn.Linear for q, k, v mapping + scales: 1 / sqrt(single_head_feature_dim) + out: projection of multi-head attention + attn_dropout: dropout for attention + proj_dropout: final dropout before output + softmax: softmax op for attention + """ + def __init__(self, + embed_dim, + num_heads, + qkv_bias=True, + dropout=0., + attention_dropout=0.): + super().__init__() + self.num_heads = num_heads + self.attn_head_size = int(embed_dim / self.num_heads) + self.all_head_size = self.attn_head_size * self.num_heads + + w_attr_1, b_attr_1 = self._init_weights() + self.qkv = nn.Linear(embed_dim, + self.all_head_size * 3, # weights for q, k, and v + weight_attr=w_attr_1, + bias_attr=b_attr_1 if qkv_bias else False) + + self.scales = self.attn_head_size ** -0.5 + + w_attr_2, b_attr_2 = self._init_weights() + self.out = nn.Linear(embed_dim, + embed_dim, + weight_attr=w_attr_2, + bias_attr=b_attr_2) + + self.attn_dropout = nn.Dropout(attention_dropout) + self.proj_dropout = nn.Dropout(dropout) + self.softmax = nn.Softmax(axis=-1) + + def _init_weights(self): + weight_attr = paddle.ParamAttr( + initializer=nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr( + initializer=nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + + def transpose_multihead(self, x): + new_shape = x.shape[:-1] + [self.num_heads, self.attn_head_size] + x = x.reshape(new_shape) + x = x.transpose([0, 2, 1, 3]) + return x + + def forward(self, x): + qkv = self.qkv(x).chunk(3, axis=-1) + q, k, v = map(self.transpose_multihead, qkv) + + attn = paddle.matmul(q, k, transpose_y=True) + attn = attn * self.scales + attn = self.softmax(attn) + attn = self.attn_dropout(attn) + + z = paddle.matmul(attn, v) + z = z.transpose([0, 2, 1, 3]) + new_shape = z.shape[:-2] + [self.all_head_size] + z = z.reshape(new_shape) + # reshape + z = self.out(z) + z = self.proj_dropout(z) + return z + + +class Mlp(nn.Layer): + """ MLP module + + Impl using nn.Linear and activation is GELU, dropout is applied. 
+ Ops: fc -> act -> dropout -> fc -> dropout + + Attributes: + fc1: nn.Linear + fc2: nn.Linear + act: GELU + dropout1: dropout after fc1 + dropout2: dropout after fc2 + """ + def __init__(self, + embed_dim, + mlp_ratio, + dropout=0.): + super().__init__() + w_attr_1, b_attr_1 = self._init_weights() + self.fc1 = nn.Linear(embed_dim, + int(embed_dim * mlp_ratio), + weight_attr=w_attr_1, + bias_attr=b_attr_1) + + w_attr_2, b_attr_2 = self._init_weights() + self.fc2 = nn.Linear(int(embed_dim * mlp_ratio), + embed_dim, + weight_attr=w_attr_2, + bias_attr=b_attr_2) + self.act = nn.GELU() + self.dropout1 = nn.Dropout(dropout) + self.dropout2 = nn.Dropout(dropout) + + def _init_weights(self): + weight_attr = paddle.ParamAttr( + initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr( + initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + + def forward(self, x): + x = self.fc1(x) + x = self.act(x) + x = self.dropout1(x) + x = self.fc2(x) + x = self.dropout2(x) + return x + + +class TransformerLayer(nn.Layer): + """Transformer Layer + + Transformer Layer contains attention, norm, mlp and residual + + Attributes: + hidden_size: transformer feature dim + attn_norm: nn.LayerNorm before attention + mlp_norm: nn.LayerNorm before mlp + mlp: mlp modual + attn: attention modual + """ + def __init__(self, + embed_dim, + num_heads, + qkv_bias=True, + mlp_ratio=4., + dropout=0., + attention_dropout=0., + droppath=0.): + super().__init__() + + w_attr_1, b_attr_1 = self._init_weights() + self.attn_norm = nn.LayerNorm(embed_dim, + weight_attr=w_attr_1, + bias_attr=b_attr_1, + epsilon=1e-6) + + self.attn = Attention(embed_dim, + num_heads, + qkv_bias, + dropout, + attention_dropout) + self.drop_path = DropPath(droppath) if droppath > 0. else Identity() + + w_attr_2, b_attr_2 = self._init_weights() + self.mlp_norm = nn.LayerNorm(embed_dim, + weight_attr=w_attr_2, + bias_attr=b_attr_2, + epsilon=1e-6) + + self.mlp = Mlp(embed_dim, mlp_ratio, dropout) + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(1.0)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + + def forward(self, x): + h = x + x = self.attn_norm(x) + x = self.attn(x) + x = self.drop_path(x) + x = x + h + + h = x + x = self.mlp_norm(x) + x = self.mlp(x) + x = self.drop_path(x) + x = x + h + + return x + + +class Encoder(nn.Layer): + """Transformer encoder + + Encoder contains a list of TransformerLayer, and a LayerNorm. 
+ + Attributes: + layers: nn.LayerList contains multiple TransformerLayers + encoder_norm: nn.LayerNorm which is applied after last encoder layer + """ + def __init__(self, + embed_dim, + num_heads, + depth, + qkv_bias=True, + mlp_ratio=4.0, + dropout=0., + attention_dropout=0., + droppath=0.): + super().__init__() + # stochatic depth decay + depth_decay = [x.item() for x in paddle.linspace(0, droppath, depth)] + layer_list = [] + for i in range(depth): + layer_list.append(TransformerLayer(embed_dim, + num_heads, + qkv_bias, + mlp_ratio, + dropout, + attention_dropout, + droppath=depth_decay[i])) + # new paddle version fix this, deepcopy is no longer needed + # layer_list.append(copy.deepcopy(encoder_layer)) + self.layers = nn.LayerList(layer_list) + + w_attr, b_attr = self._init_weights() + self.encoder_norm = nn.LayerNorm(embed_dim, + weight_attr=w_attr, + bias_attr=b_attr, + epsilon=1e-6) + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(1.0)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + + def forward(self, x): + for layer in self.layers: + x = layer(x) + out = self.encoder_norm(x) + return out + + +class Decoder(nn.Layer): + """Transformer decoder + + Decoder contains a list of TransformerLayer, and a LayerNorm. + + Attributes: + layers: nn.LayerList contains multiple TransformerLayers + decoder_norm: nn.LayerNorm which is applied after last encoder layer + """ + + def __init__(self, + embed_dim, + num_heads, + depth, + qkv_bias=True, + mlp_ratio=4.0, + dropout=0., + attention_dropout=0., + droppath=0.): + super().__init__() + # stochatic depth decay + depth_decay = [x.item() for x in paddle.linspace(0, droppath, depth)] + + layer_list = [] + for i in range(depth): + layer_list.append(TransformerLayer(embed_dim, + num_heads, + qkv_bias, + mlp_ratio, + dropout, + attention_dropout, + droppath=depth_decay[i])) + # new paddle version fix this, deepcopy is no longer needed + # layer_list.append(copy.deepcopy(encoder_layer)) + self.layers = nn.LayerList(layer_list) + + w_attr, b_attr = self._init_weights() + self.decoder_norm = nn.LayerNorm(embed_dim, + weight_attr=w_attr, + bias_attr=b_attr, + epsilon=1e-6) + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(1.0)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + + def forward(self, x, mask_len=0): + for layer in self.layers: + x = layer(x) + if mask_len > 0: + # only sustain masked patches + out = self.decoder_norm(x[:, -mask_len:]) + else: + out = self.decoder_norm(x) + return out + + +class MAEPretrainTransformer(nn.Layer): + """ViT transformer + + ViT Transformer, classifier is a single Linear layer for finetune, + For training from scratch, two layer mlp should be used. + Classification is done using cls_token. 
+ + Args: + image_size: int, input image size, default: 224 + patch_size: int, patch size, default: 16 + in_channels: int, input image channels, default: 3 + num_classes: int, number of classes for classification, default: 1000 + encoder_embed_dim: int, embedding dimension (patch embed out dim), default: 768 + decoder_embed_dim: int, embedding dimension (patch embed out dim), default: 512 + encoder_depth: int, number ot transformer blocks, default: 12 + num_heads: int, number of attention heads, default: 12 + mlp_ratio: float, ratio of mlp hidden dim to embed dim(mlp in dim), default: 4.0 + qkv_bias: bool, If True, enable qkv(nn.Linear) layer with bias, default: True + dropout: float, dropout rate for linear layers, default: 0. + attention_dropout: float, dropout rate for attention layers default: 0. + droppath: float, droppath rate for droppath layers, default: 0. + """ + + def __init__(self, + image_size=224, + patch_size=16, + in_channels=3, + encoder_embed_dim=768, + decoder_embed_dim=512, + encoder_depth=12, + decoder_depth=8, + encoder_num_heads=12, + decoder_num_heads=8, + mlp_ratio=4, + qkv_bias=True, + dropout=0., + attention_dropout=0., + droppath=0.): + super().__init__() + self.patch_size = patch_size + self.num_patches = (image_size // patch_size) * (image_size // patch_size) + self.mask_token = paddle.create_parameter( + shape=[1, 1, decoder_embed_dim], + dtype='float32', + default_initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + self.perm = None + self.mask_num = None + # create positional embedding + self.encoder_position_embedding = get_position_encoding(seq_len=1 + self.num_patches, + embed_dim=encoder_embed_dim) + self.decoder_position_embedding = get_position_encoding(seq_len=1 + self.num_patches, + embed_dim=decoder_embed_dim) + # create patch embedding with positional embedding + self.patch_embedding = PatchEmbedding(image_size, + patch_size, + in_channels, + encoder_embed_dim, + dropout) + # create multi head self-attention encoder + self.encoder = Encoder(encoder_embed_dim, + encoder_num_heads, + encoder_depth, + qkv_bias, + mlp_ratio, + dropout, + attention_dropout, + droppath) + # the embed_dim is different in encoder and decoder, so add a linear layer + w_attr_1, b_attr_1 = self._init_weights() + self.linear_projection = nn.Linear(encoder_embed_dim, + decoder_embed_dim, + weight_attr=w_attr_1, + bias_attr=b_attr_1) + # create multi head self-attention decoder + self.decoder = Decoder(decoder_embed_dim, + decoder_num_heads, + decoder_depth, + qkv_bias, + mlp_ratio, + dropout, + attention_dropout, + droppath) + # create reconstruction layer + w_attr_2, b_attr_2 = self._init_weights() + self.reconstruction_layer = nn.Linear(decoder_embed_dim, + in_channels * patch_size * patch_size, + weight_attr=w_attr_2, + bias_attr=b_attr_2) + + def _init_weights(self): + weight_attr = paddle.ParamAttr( + initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr( + initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + + def forward(self, x, masks): + # x: [B, C, H, W] + x = self.patch_embedding(x) + # x: [B, num_patches, embed_dim] + B, N, C = x.shape # B: batch_size, N: num_patches, C: embed_dim + # mask: [B, num_patches], visible set to 0, masked set to 1 + + # add pos embed + x += self.encoder_position_embedding.clone().detach() + # get no mask patches + no_mask_x = x[~masks] # [B*0.25*L, embed_dim] + # index slicing needs reshape back in paddle: [B, 0.25L, embed_dim] + no_mask_x = no_mask_x.reshape([B, 
-1, C]) + # encoder + enc_out = self.encoder(no_mask_x) + # encoder to decoder linear proj + enc_out = self.linear_projection(enc_out) + # shuffle the position embedding is equivalent to unshuffling tokens + expand_pos_embed = self.decoder_position_embedding.expand([B, -1, -1]).clone().detach() + pos_embed_no_mask = expand_pos_embed[~masks].reshape([B, -1, enc_out.shape[-1]]) + pos_embed_mask = expand_pos_embed[masks].reshape([B, -1, enc_out.shape[-1]]) + # dec in put, here use broadcasting for mask_token + dec_in = paddle.concat([enc_out + pos_embed_no_mask, self.mask_token + pos_embed_mask], axis=1) + # decoder + mask_len = pos_embed_mask.shape[1] + dec_out = self.decoder(dec_in, mask_len) + # reconstruct patches + output = self.reconstruction_layer(dec_out) + return output + + +class MAEFinetuneTransformer(nn.Layer): + """ViT transformer + + ViT Transformer, classifier is a single Linear layer for finetune, + For training from scratch, two layer mlp should be used. + Classification is done using cls_token. + + Args: + image_size: int, input image size, default: 224 + patch_size: int, patch size, default: 16 + in_channels: int, input image channels, default: 3 + num_classes: int, number of classes for classification, default: 1000 + embed_dim: int, embedding dimension (patch embed out dim), default: 768 + depth: int, number ot transformer blocks, default: 12 + num_heads: int, number of attention heads, default: 12 + mlp_ratio: float, ratio of mlp hidden dim to embed dim(mlp in dim), default: 4.0 + qkv_bias: bool, If True, enable qkv(nn.Linear) layer with bias, default: True + dropout: float, dropout rate for linear layers, default: 0. + attention_dropout: float, dropout rate for attention layers default: 0. + droppath: float, droppath rate for droppath layers, default: 0. 
+ """ + + def __init__(self, + image_size=224, + patch_size=16, + in_channels=3, + num_classes=1000, + embed_dim=768, + depth=12, + num_heads=12, + mlp_ratio=4, + qkv_bias=True, + dropout=0., + attention_dropout=0., + droppath=0.): + super().__init__() + self.num_patches = (image_size // patch_size) * (image_size // patch_size) + # create positional embedding + self.encoder_position_embedding = get_position_encoding(seq_len=1 + self.num_patches, + embed_dim=embed_dim) + # create patch embedding with positional embedding + self.patch_embedding = PatchEmbedding(image_size, + patch_size, + in_channels, + embed_dim, + dropout) + # create multi head self-attention encoder + self.encoder = Encoder(embed_dim, + num_heads, + depth, + qkv_bias, + mlp_ratio, + dropout, + attention_dropout, + droppath) + + # classifier head (for finetuning) + w_attr_1, b_attr_1 = self._init_weights() + self.classifier = nn.Linear(embed_dim, + num_classes, + weight_attr=w_attr_1, + bias_attr=b_attr_1) + + def forward(self, x): + x = self.patch_embedding(x) + # add pos embed + x += self.encoder_position_embedding.clone().detach() + x = self.encoder(x) + logits = self.classifier(x[:, 0]) # take only cls_token as classifier + return logits + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0)) + return weight_attr, bias_attr + + +def build_mae_pretrain(config): + model = MAEPretrainTransformer(image_size=config.DATA.IMAGE_SIZE, + patch_size=config.MODEL.TRANS.PATCH_SIZE, + in_channels=3, + encoder_embed_dim=config.MODEL.TRANS.ENCODER.EMBED_DIM, + decoder_embed_dim=config.MODEL.TRANS.DECODER.EMBED_DIM, + encoder_depth=config.MODEL.TRANS.ENCODER.DEPTH, + decoder_depth=config.MODEL.TRANS.DECODER.DEPTH, + encoder_num_heads=config.MODEL.TRANS.ENCODER.NUM_HEADS, + decoder_num_heads=config.MODEL.TRANS.DECODER.NUM_HEADS, + mlp_ratio=config.MODEL.TRANS.MLP_RATIO, + qkv_bias=config.MODEL.TRANS.QKV_BIAS, + dropout=config.MODEL.DROPOUT, + attention_dropout=config.MODEL.ATTENTION_DROPOUT, + droppath=config.MODEL.DROPPATH) + return model + + +def build_mae_finetune(config): + model = MAEFinetuneTransformer(image_size=config.DATA.IMAGE_SIZE, + patch_size=config.MODEL.TRANS.PATCH_SIZE, + in_channels=3, + embed_dim=config.MODEL.TRANS.ENCODER.EMBED_DIM, + depth=config.MODEL.TRANS.ENCODER.DEPTH, + num_heads=config.MODEL.TRANS.ENCODER.NUM_HEADS, + mlp_ratio=config.MODEL.TRANS.MLP_RATIO, + qkv_bias=config.MODEL.TRANS.QKV_BIAS, + dropout=config.MODEL.DROPOUT, + attention_dropout=config.MODEL.ATTENTION_DROPOUT, + droppath=config.MODEL.DROPPATH) + return model diff --git a/image_classification/MAE/utils.py b/image_classification/MAE/utils.py new file mode 100644 index 00000000..44800527 --- /dev/null +++ b/image_classification/MAE/utils.py @@ -0,0 +1,120 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +"""utils for ViT + +Contains AverageMeter for monitoring, get_exclude_from_decay_fn for training +and WarmupCosineScheduler for training + +""" + +import math +from paddle.optimizer.lr import LRScheduler + + +class AverageMeter(): + """ Meter for monitoring losses""" + def __init__(self): + self.avg = 0 + self.sum = 0 + self.cnt = 0 + self.reset() + + def reset(self): + """reset all values to zeros""" + self.avg = 0 + self.sum = 0 + self.cnt = 0 + + def update(self, val, n=1): + """update avg by val and n, where val is the avg of n values""" + self.sum += val * n + self.cnt += n + self.avg = self.sum / self.cnt + + + +def get_exclude_from_weight_decay_fn(exclude_list=[]): + """ Set params with no weight decay during the training + + For certain params, e.g., positional encoding in ViT, weight decay + may not needed during the learning, this method is used to find + these params. + + Args: + exclude_list: a list of params names which need to exclude + from weight decay. + Returns: + exclude_from_weight_decay_fn: a function returns True if param + will be excluded from weight decay + """ + if len(exclude_list) == 0: + exclude_from_weight_decay_fn = None + else: + def exclude_fn(param): + for name in exclude_list: + if param.endswith(name): + return False + return True + exclude_from_weight_decay_fn = exclude_fn + return exclude_from_weight_decay_fn + + +class WarmupCosineScheduler(LRScheduler): + """Warmup Cosine Scheduler + + First apply linear warmup, then apply cosine decay schedule. + Linearly increase learning rate from "warmup_start_lr" to "start_lr" over "warmup_epochs" + Cosinely decrease learning rate from "start_lr" to "end_lr" over remaining + "total_epochs - warmup_epochs" + + Attributes: + learning_rate: the starting learning rate (without warmup), not used here! + warmup_start_lr: warmup starting learning rate + start_lr: the starting learning rate (without warmup) + end_lr: the ending learning rate after whole loop + warmup_epochs: # of epochs for warmup + total_epochs: # of total epochs (include warmup) + """ + def __init__(self, + learning_rate, + warmup_start_lr, + start_lr, + end_lr, + warmup_epochs, + total_epochs, + cycles=0.5, + last_epoch=-1, + verbose=False): + """init WarmupCosineScheduler """ + self.warmup_epochs = warmup_epochs + self.total_epochs = total_epochs + self.warmup_start_lr = warmup_start_lr + self.start_lr = start_lr + self.end_lr = end_lr + self.cycles = cycles + super(WarmupCosineScheduler, self).__init__(learning_rate, last_epoch, verbose) + + def get_lr(self): + """ return lr value """ + if self.last_epoch < self.warmup_epochs: + val = (self.start_lr - self.warmup_start_lr) * float( + self.last_epoch)/float(self.warmup_epochs) + self.warmup_start_lr + return val + + progress = float(self.last_epoch - self.warmup_epochs) / float( + max(1, self.total_epochs - self.warmup_epochs)) + val = max(0.0, 0.5 * (1. + math.cos(math.pi * float(self.cycles) * 2.0 * progress))) + val = max(0.0, val * (self.start_lr - self.end_lr) + self.end_lr) + return val diff --git a/image_classification/MLP-Mixer/README.md b/image_classification/MLP-Mixer/README.md index d10525cc..7894faef 100644 --- a/image_classification/MLP-Mixer/README.md +++ b/image_classification/MLP-Mixer/README.md @@ -13,13 +13,14 @@ This implementation is developed by [PaddleViT](https://github.com/BR-IDL/Paddle

### Update -Update (2021-08-11): Code is released and ported weights are uploaded. +- Update (2021-08-11): Model FLOPs and # params are uploaded. +- Update (2021-08-11): Code is released and ported weights are uploaded. ## Models Zoo -| Model | Acc@1 | Acc@5 | Image Size | Crop_pct | Interpolation | Link | -|--------------------------------|-------|-------|------------|----------|---------------|--------------| -| mlp_mixer_b16_224 | 76.60 | 92.23 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1ZcQEH92sEPvYuDc6eYZgssK5UjYomzUD/view?usp=sharing)/[baidu](https://pan.baidu.com/s/12nZaWGMOXwrCMOIBfUuUMA)(xh8x) | -| mlp_mixer_l16_224 | 72.06 | 87.67 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1mkmvqo5K7JuvqGm92a-AdycXIcsv1rdg/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1AmSVpwCaGR9Vjsj_boL7GA)(8q7r) | +| Model | Acc@1 | Acc@5 | #Params | FLOPs | Image Size | Crop_pct | Interpolation | Link | +|-------------------------------|-------|-------|---------|--------|------------|----------|---------------|--------------| +| mlp_mixer_b16_224 | 76.60 | 92.23 | 60.0M | 12.7G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1ZcQEH92sEPvYuDc6eYZgssK5UjYomzUD/view?usp=sharing)/[baidu](https://pan.baidu.com/s/12nZaWGMOXwrCMOIBfUuUMA)(xh8x) | +| mlp_mixer_l16_224 | 72.06 | 87.67 | 208.2M | 44.9G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1mkmvqo5K7JuvqGm92a-AdycXIcsv1rdg/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1AmSVpwCaGR9Vjsj_boL7GA)(8q7r) | > *The results are evaluated on ImageNet2012 validation set. @@ -68,8 +69,8 @@ from mlp_mixer import build_mlp_mixer as build_model config = get_config('./configs/mixer_b16_224.yaml') # build model model = build_model(config) -# load pretrained weights, .pdparams is NOT needed -model_state_dict = paddle.load('./mixer_b16_224') +# load pretrained weights +model_state_dict = paddle.load('./mixer_b16_224.pdparams') model.set_dict(model_state_dict) ``` @@ -82,12 +83,12 @@ or ```shell CUDA_VISIBLE_DEVICES=0 \ python main_single_gpu.py \ - -cfg='./configs/mixer_b16_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/mixer_b16_224.yaml \ + -dataset=imagenet2012 \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/val \ -eval \ - -pretrained='./mixer_b16_224' + -pretrained=/path/to/pretrained/model/mixer_b16_224 # .pdparams is NOT needed ```
@@ -104,12 +105,12 @@ or ```shell CUDA_VISIBLE_DEVICES=0,1,2,3 \ python main_multi_gpu.py \ - -cfg='./configs/mixer_b16_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/mixer_b16_224.yaml \ + -dataset=imagenet2012 \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/val \ -eval \ - -pretrained='./mixer_b16_224' + -pretrained=/path/to/pretrained/model/mixer_b16_224 # .pdparams is NOT needed ```
@@ -123,10 +124,10 @@ or ```shell CUDA_VISIBLE_DEVICES=0 \ python main_single_gpu.py \ - -cfg='./configs/mixer_b16_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/mixer_b16_224.yaml \ + -dataset=imagenet2012 \ -batch_size=32 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/train ```
@@ -143,10 +144,10 @@ or ```shell CUDA_VISIBLE_DEVICES=0,1,2,3 \ python main_multi_gpu.py \ - -cfg='./configs/mixer_b16_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/mixer_b16_224.yaml \ + -dataset=imagenet2012 \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/train ```
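The `#Params` and `FLOPs` columns added to the table above can be sanity-checked with `paddle.flops`, in the same way as the stats script earlier in this patch. A minimal sketch; the custom GELU/LayerNorm/Softmax counter helpers used by that script are omitted here, so the reported FLOPs may come out slightly lower:

```python
# Rough reproduction of the #Params / FLOPs numbers for mlp_mixer_b16_224.
# Assumes it is run from image_classification/MLP-Mixer with the config file present.
import paddle
from config import get_config
from mlp_mixer import build_mlp_mixer as build_model

config = get_config('./configs/mixer_b16_224.yaml')
model = build_model(config)
paddle.flops(model, input_size=(1, 3, 224, 224), print_detail=False)
```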
diff --git a/image_classification/MLP-Mixer/__init__.py b/image_classification/MLP-Mixer/__init__.py new file mode 100644 index 00000000..e2cbd538 --- /dev/null +++ b/image_classification/MLP-Mixer/__init__.py @@ -0,0 +1 @@ +#init diff --git a/image_classification/MLP-Mixer/augment.py b/image_classification/MLP-Mixer/augment.py new file mode 100644 index 00000000..7a7f081c --- /dev/null +++ b/image_classification/MLP-Mixer/augment.py @@ -0,0 +1,285 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Augmentation""" +""" Rand Augmentation """ +# reference: RandAugment: Practical automated data augmentation with a reduced search space +# https://arxiv.org/abs/1909.13719 + +""" Auto Augmentation """ +# reference: AutoAugment: Learning Augmentation Policies from Data +# https://arxiv.org/abs/1805.09501 + +import random +import numpy as np +from PIL import Image, ImageEnhance, ImageOps + + +def auto_augment_policy_original(): + """25 types of augment policies in original paper""" + policy = [ + [('Posterize', 0.4, 8), ('Rotate', 0.6, 9)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + [('Posterize', 0.6, 7), ('Posterize', 0.6, 6)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Equalize', 0.4, 4), ('Rotate', 0.8, 8)], + [('Solarize', 0.6, 3), ('Equalize', 0.6, 7)], + [('Posterize', 0.8, 5), ('Equalize', 1.0, 2)], + [('Rotate', 0.2, 3), ('Solarize', 0.6, 8)], + [('Equalize', 0.6, 8), ('Posterize', 0.4, 6)], + [('Rotate', 0.8, 8), ('Color', 0.4, 0)], + [('Rotate', 0.4, 9), ('Equalize', 0.6, 2)], + [('Equalize', 0.0, 7), ('Equalize', 0.8, 8)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Rotate', 0.8, 8), ('Color', 1.0, 2)], + [('Color', 0.8, 8), ('Solarize', 0.8, 7)], + [('Sharpness', 0.4, 7), ('Invert', 0.6, 8)], + [('ShearX', 0.6, 5), ('Equalize', 1.0, 9)], + [('Color', 0.4, 0), ('Equalize', 0.6, 3)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + ] + policy = [[SubPolicy(*args) for args in subpolicy] for subpolicy in policy] + return policy + + +def rand_augment_policy_original(magnitude_idx=9): + """ + 14 types of augment policies in original paper + Args: + magnitude_idx: M + """ + policy = [ + ('Posterize', 1, magnitude_idx), ('Rotate', 1, magnitude_idx), + ('Solarize', 1, magnitude_idx), ('AutoContrast', 1, magnitude_idx), + ('Equalize', 1, magnitude_idx), ('Contrast', 1, magnitude_idx), + ('Color', 1, magnitude_idx), ('Invert', 1, magnitude_idx), + ('Sharpness', 1, magnitude_idx), ('Brightness', 1, magnitude_idx), + ('ShearX', 1, magnitude_idx), ('ShearY', 1, magnitude_idx), + ('TranslateX', 1, magnitude_idx), ('TranslateY', 1, magnitude_idx), + ] + policy = [SubPolicy(*args) for args in policy] + return policy + + +class 
AutoAugment(): + """Auto Augment + Randomly choose a tuple of augment ops from a list of policy + Then apply the tuple of augment ops to input image + + Examples: + policy = auto_augment_policy_original() + augment = AutoAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy): + self.policy = policy + + def __call__(self, image, policy_idx=None): + if policy_idx is None: + policy_idx = random.randint(0, len(self.policy) - 1) + + sub_policy = self.policy[policy_idx] + for op in sub_policy: + image = op(image) + return image + + +class RandAugment(): + """Rand Augment + Randomly choose N augment ops from a list of K policies + Then apply the N ops to input image + + Examples: + policy = rand_augment_policy_original(magnitude_idx) + augment = RandAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy, num_layers=2): + """ + Args: + policy: list of SubPolicy + num_layers: int + """ + self.policy = policy + self.num_layers = num_layers + + def __call__(self, image): + selected_idx = np.random.choice(len(self.policy), self.num_layers) + + for policy_idx in selected_idx: + sub_policy = self.policy[policy_idx] + image = sub_policy(image) + return image + + +class SubPolicy: + """Subpolicy + Read augment name and magnitude, apply augment with probability + Args: + op_name: str, augment operation name + prob: float, if prob > random prob, apply augment + magnitude_idx: int, index of magnitude in preset magnitude ranges + """ + + def __init__(self, op_name, prob, magnitude_idx): + # ranges of operations' magnitude + ranges = { + 'ShearX': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'ShearY': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'TranslateX': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'TranslateY': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'Rotate': np.linspace(0, 30, 10), # [-30, 30] (by random negative) + 'Color': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Posterize': np.round(np.linspace(8, 4, 10), 0).astype(np.int), # [0, 4] + 'Solarize': np.linspace(256, 0, 10), # [0, 256] + 'Contrast': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Sharpness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Brightness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'AutoContrast': [0] * 10, # no range + 'Equalize': [0] * 10, # no range + 'Invert': [0] * 10, # no range + } + + # augmentation operations + # Lambda is not pickleable for DDP + # image_ops = { + # 'ShearX': lambda image, magnitude: shear_x(image, magnitude), + # 'ShearY': lambda image, magnitude: shear_y(image, magnitude), + # 'TranslateX': lambda image, magnitude: translate_x(image, magnitude), + # 'TranslateY': lambda image, magnitude: translate_y(image, magnitude), + # 'Rotate': lambda image, magnitude: rotate(image, magnitude), + # 'AutoContrast': lambda image, magnitude: auto_contrast(image, magnitude), + # 'Invert': lambda image, magnitude: invert(image, magnitude), + # 'Equalize': lambda image, magnitude: equalize(image, magnitude), + # 'Solarize': lambda image, magnitude: solarize(image, magnitude), + # 'Posterize': lambda image, magnitude: posterize(image, magnitude), + # 'Contrast': lambda image, magnitude: contrast(image, magnitude), + # 'Color': lambda image, magnitude: color(image, magnitude), + # 'Brightness': lambda image, magnitude: brightness(image, magnitude), + # 'Sharpness': lambda image, magnitude: 
sharpness(image, magnitude), + # } + image_ops = { + 'ShearX': shear_x, + 'ShearY': shear_y, + 'TranslateX': translate_x_relative, + 'TranslateY': translate_y_relative, + 'Rotate': rotate, + 'AutoContrast': auto_contrast, + 'Invert': invert, + 'Equalize': equalize, + 'Solarize': solarize, + 'Posterize': posterize, + 'Contrast': contrast, + 'Color': color, + 'Brightness': brightness, + 'Sharpness': sharpness, + } + + self.prob = prob + self.magnitude = ranges[op_name][magnitude_idx] + self.op = image_ops[op_name] + + def __call__(self, image): + if self.prob > random.random(): + image = self.op(image, self.magnitude) + return image + + +# PIL Image transforms +# https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.transform +def shear_x(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, factor, 0, 0, 1, 0), fillcolor=fillcolor) + + +def shear_y(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, factor, 1, 0), fillcolor=fillcolor) + + +def translate_x_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, pixels, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, pixels), fillcolor=fillcolor) + + +def translate_x_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, magnitude, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, magnitude), fillcolor=fillcolor) + + +def rotate(image, magnitude): + rot = image.convert("RGBA").rotate(magnitude) + return Image.composite(rot, + Image.new('RGBA', rot.size, (128,) * 4), + rot).convert(image.mode) + + +def auto_contrast(image, magnitude=None): + return ImageOps.autocontrast(image) + + +def invert(image, magnitude=None): + return ImageOps.invert(image) + + +def equalize(image, magnitude=None): + return ImageOps.equalize(image) + + +def solarize(image, magnitude): + return ImageOps.solarize(image, magnitude) + + +def posterize(image, magnitude): + return ImageOps.posterize(image, magnitude) + + +def contrast(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Contrast(image).enhance(1 + magnitude) + + +def color(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Color(image).enhance(1 + magnitude) + + +def brightness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Brightness(image).enhance(1 + magnitude) + + +def sharpness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Sharpness(image).enhance(1 + magnitude) + diff --git a/image_classification/MLP-Mixer/config.py 
b/image_classification/MLP-Mixer/config.py index 3dc24935..86a91247 100644 --- a/image_classification/MLP-Mixer/config.py +++ b/image_classification/MLP-Mixer/config.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -35,6 +35,8 @@ _C.DATA.IMAGE_SIZE = 224 # input image size: 224 for pretrain, 384 for finetune _C.DATA.CROP_PCT = 1.0 # input image scale ratio, scale is applied before centercrop in eval mode _C.DATA.NUM_WORKERS = 2 # number of data loading threads +_C.DATA.IMAGENET_MEAN = [0.5, 0.5, 0.5] # [0.485, 0.456, 0.406] +_C.DATA.IMAGENET_STD = [0.5, 0.5, 0.5] # [0.229, 0.224, 0.225] # model settings _C.MODEL = CN() @@ -43,8 +45,9 @@ _C.MODEL.RESUME = None _C.MODEL.PRETRAINED = None _C.MODEL.NUM_CLASSES = 1000 -_C.MODEL.DROPOUT = 0.1 -_C.MODEL.DROPPATH = 0.1 +_C.MODEL.DROPOUT = 0.0 +_C.MODEL.ATTENTION_DROPOUT = 0.0 +_C.MODEL.DROP_PATH = 0.1 # transformer settings _C.MODEL.MIXER = CN() @@ -56,13 +59,14 @@ _C.TRAIN = CN() _C.TRAIN.LAST_EPOCH = 0 _C.TRAIN.NUM_EPOCHS = 300 -_C.TRAIN.WARMUP_EPOCHS = 3 #34 # ~ 10k steps for 4096 batch size -_C.TRAIN.WEIGHT_DECAY = 0.01 #0.3 # 0.0 for finetune -_C.TRAIN.BASE_LR = 0.001 #0.003 for pretrain # 0.03 for finetune -_C.TRAIN.WARMUP_START_LR = 1e-6 #0.0 -_C.TRAIN.END_LR = 1e-5 -_C.TRAIN.GRAD_CLIP = 1.0 -_C.TRAIN.ACCUM_ITER = 2 #1 +_C.TRAIN.WARMUP_EPOCHS = 20 +_C.TRAIN.WEIGHT_DECAY = 0.05 +_C.TRAIN.BASE_LR = 0.001 +_C.TRAIN.WARMUP_START_LR = 5e-7 +_C.TRAIN.END_LR = 5e-6 +_C.TRAIN.GRAD_CLIP = 5.0 +_C.TRAIN.ACCUM_ITER = 1 +_C.TRAIN.LINEAR_SCALED_LR = None _C.TRAIN.LR_SCHEDULER = CN() _C.TRAIN.LR_SCHEDULER.NAME = 'warmupcosine' @@ -76,6 +80,24 @@ _C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999) # for adamW _C.TRAIN.OPTIMIZER.MOMENTUM = 0.9 +# train augmentation +_C.TRAIN.MIXUP_ALPHA = 0.8 +_C.TRAIN.CUTMIX_ALPHA = 1.0 +_C.TRAIN.CUTMIX_MINMAX = None +_C.TRAIN.MIXUP_PROB = 1.0 +_C.TRAIN.MIXUP_SWITCH_PROB = 0.5 +_C.TRAIN.MIXUP_MODE = 'batch' + +_C.TRAIN.SMOOTHING = 0.1 +_C.TRAIN.COLOR_JITTER = 0.4 +_C.TRAIN.AUTO_AUGMENT = False #'rand-m9-mstd0.5-inc1' +_C.TRAIN.RAND_AUGMENT = False + +_C.TRAIN.RANDOM_ERASE_PROB = 0.25 +_C.TRAIN.RANDOM_ERASE_MODE = 'pixel' +_C.TRAIN.RANDOM_ERASE_COUNT = 1 +_C.TRAIN.RANDOM_ERASE_SPLIT = False + # misc _C.SAVE = "./output" _C.TAG = "default" @@ -84,8 +106,9 @@ _C.VALIDATE_FREQ = 20 # freq to do validation _C.SEED = 0 _C.EVAL = False # run evaluation only +_C.AMP = False # mix precision training _C.LOCAL_RANK = 0 -_C.NGPUS = 1 +_C.NGPUS = -1 def _update_config_from_file(config, cfg_file): @@ -117,8 +140,12 @@ def update_config(config, args): config.DATA.BATCH_SIZE = args.batch_size if args.image_size: config.DATA.IMAGE_SIZE = args.image_size + if args.num_classes: + config.MODEL.NUM_CLASSES = args.num_classes if args.data_path: config.DATA.DATA_PATH = args.data_path + if args.output is not None: + config.SAVE = args.output if args.ngpus: config.NGPUS = args.ngpus if args.eval: @@ -130,6 +157,11 @@ def update_config(config, args): config.MODEL.RESUME = args.resume if args.last_epoch: config.TRAIN.LAST_EPOCH = args.last_epoch + if args.amp: # only during training + if config.EVAL is True: + config.AMP = False + else: + config.AMP = True #config.freeze() return config diff --git a/image_classification/MLP-Mixer/datasets.py b/image_classification/MLP-Mixer/datasets.py index e207f9ba..304df9a3 100644 --- 
a/image_classification/MLP-Mixer/datasets.py +++ b/image_classification/MLP-Mixer/datasets.py @@ -19,8 +19,20 @@ import os import math -from paddle.io import Dataset, DataLoader, DistributedBatchSampler -from paddle.vision import transforms, datasets, image_load +from PIL import Image +from paddle.io import Dataset +from paddle.io import DataLoader +from paddle.io import DistributedBatchSampler +from paddle.vision import transforms +from paddle.vision import datasets +from paddle.vision import image_load +from augment import auto_augment_policy_original +from augment import AutoAugment +from augment import rand_augment_policy_original +from augment import RandAugment +from transforms import RandomHorizontalFlip +from random_erasing import RandomErasing + class ImageNet2012Dataset(Dataset): """Build ImageNet2012 dataset @@ -80,13 +92,36 @@ def get_train_transforms(config): transforms_train: training transforms """ - transforms_train = transforms.Compose([ + aug_op_list = [] + # STEP1: random crop and resize + aug_op_list.append( transforms.RandomResizedCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE), - scale=(0.05, 1.0)), - transforms.ToTensor(), - transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]), - #transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]), - ]) + scale=(0.05, 1.0), interpolation='bicubic')) + # STEP2: auto_augment or color jitter + if config.TRAIN.AUTO_AUGMENT: + policy = auto_augment_policy_original() + auto_augment = AutoAugment(policy) + aug_op_list.append(auto_augment) + elif config.TRAIN.RAND_AUGMENT: + policy = rand_augment_policy_original() + rand_augment = RandAugment(policy) + aug_op_list.append(rand_augment) + else: + jitter = (float(config.TRAIN.COLOR_JITTER), ) * 3 + aug_op_list.append(transforms.ColorJitter(*jitter)) + # STEP3: other ops + aug_op_list.append(transforms.ToTensor()) + aug_op_list.append(transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, + std=config.DATA.IMAGENET_STD)) + # STEP4: random erasing + if config.TRAIN.RANDOM_ERASE_PROB > 0.: + random_erasing = RandomErasing(prob=config.TRAIN.RANDOM_ERASE_PROB, + mode=config.TRAIN.RANDOM_ERASE_MODE, + max_count=config.TRAIN.RANDOM_ERASE_COUNT, + num_splits=config.TRAIN.RANDOM_ERASE_SPLIT) + aug_op_list.append(random_erasing) + # Final: compose transforms and return + transforms_train = transforms.Compose(aug_op_list) return transforms_train @@ -106,11 +141,10 @@ def get_val_transforms(config): scale_size = int(math.floor(config.DATA.IMAGE_SIZE / config.DATA.CROP_PCT)) transforms_val = transforms.Compose([ - transforms.Resize(scale_size, 'bicubic'), # single int for resize shorter side of image + transforms.Resize(scale_size, interpolation='bicubic'), transforms.CenterCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE)), transforms.ToTensor(), - transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]), - #transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]), + transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD), ]) return transforms_val @@ -125,6 +159,7 @@ def get_dataset(config, mode='train'): Returns: dataset: dataset object """ + assert mode in ['train', 'val'] if config.DATA.DATASET == "cifar10": if mode == 'train': diff --git a/image_classification/MLP-Mixer/droppath.py b/image_classification/MLP-Mixer/droppath.py index fcff05e9..c8fe8048 100644 --- a/image_classification/MLP-Mixer/droppath.py +++ b/image_classification/MLP-Mixer/droppath.py @@ -32,6 +32,7 @@ def drop_path(inputs, drop_prob=0., 
training=False): if drop_prob == 0. or not training: return inputs keep_prob = 1 - drop_prob + keep_prob = paddle.to_tensor(keep_prob) shape = (inputs.shape[0], ) + (1, ) * (inputs.ndim - 1) # shape=(N, 1, 1, 1) random_tensor = keep_prob + paddle.rand(shape, dtype=inputs.dtype) random_tensor = random_tensor.floor() # mask diff --git a/image_classification/MLP-Mixer/losses.py b/image_classification/MLP-Mixer/losses.py new file mode 100644 index 00000000..082467a3 --- /dev/null +++ b/image_classification/MLP-Mixer/losses.py @@ -0,0 +1,123 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" Implement Loss functions """ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class LabelSmoothingCrossEntropyLoss(nn.Layer): + """ cross entropy loss for label smoothing + Args: + smoothing: float, smoothing rate + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, target label with shape [N] + Return: + loss: float, cross entropy loss value + """ + def __init__(self, smoothing=0.1): + super().__init__() + assert 0 <= smoothing < 1.0 + self.smoothing = smoothing + self.confidence = 1 - smoothing + + def forward(self, x, target): + log_probs = F.log_softmax(x) # [N, num_classes] + # target_index is used to get prob for each of the N samples + target_index = paddle.zeros([x.shape[0], 2], dtype='int64') # [N, 2] + target_index[:, 0] = paddle.arange(x.shape[0]) + target_index[:, 1] = target + + nll_loss = -log_probs.gather_nd(index=target_index) # index: [N] + smooth_loss = -log_probs.mean(axis=-1) + loss = self.confidence * nll_loss + self.smoothing * smooth_loss + return loss.mean() + + +class SoftTargetCrossEntropyLoss(nn.Layer): + """ cross entropy loss for soft target + Args: + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, soft target with shape [N, num_classes] + Returns: + loss: float, the mean loss value + """ + def __init__(self): + super().__init__() + + def forward(self, x, target): + loss = paddle.sum(-target * F.log_softmax(x, axis=-1), axis=-1) + return loss.mean() + + +class DistillationLoss(nn.Layer): + """Distillation loss function + This layer includes the orginal loss (criterion) and a extra + distillation loss (criterion), which computes the loss with + different type options, between current model and + a teacher model as its supervision. 
+
+    Args:
+        base_criterion: nn.Layer, the original criterion
+        teacher_model: nn.Layer, the teacher model as supervision
+        distillation_type: str, one of ['none', 'soft', 'hard']
+        alpha: float, ratio of base loss (* (1-alpha))
+            and distillation loss (* alpha)
+        tau: float, temperature in distillation
+    """
+    def __init__(self,
+                 base_criterion,
+                 teacher_model,
+                 distillation_type,
+                 alpha,
+                 tau):
+        super().__init__()
+        assert distillation_type in ['none', 'soft', 'hard']
+        self.base_criterion = base_criterion
+        self.teacher_model = teacher_model
+        self.type = distillation_type
+        self.alpha = alpha
+        self.tau = tau
+
+    def forward(self, inputs, outputs, targets):
+        """
+        Args:
+            inputs: tensor, the original model inputs
+            outputs: tensor, the outputs of the model
+            outputs_kd: tensor, the distillation outputs of the model,
+                this is usually obtained by a separate branch
+                in the last layer of the model
+            targets: tensor, the labels for the base criterion
+        """
+        outputs, outputs_kd = outputs[0], outputs[1]
+        base_loss = self.base_criterion(outputs, targets)
+        if self.type == 'none':
+            return base_loss
+
+        with paddle.no_grad():
+            teacher_outputs = self.teacher_model(inputs)
+
+        if self.type == 'soft':
+            distillation_loss = F.kl_div(
+                F.log_softmax(outputs_kd / self.tau, axis=1),
+                F.log_softmax(teacher_outputs / self.tau, axis=1),
+                reduction='sum') * (self.tau * self.tau) / outputs_kd.numel()
+        elif self.type == 'hard':
+            distillation_loss = F.cross_entropy(outputs_kd, teacher_outputs.argmax(axis=1))
+
+        loss = base_loss * (1 - self.alpha) + distillation_loss * self.alpha
+        return loss
+
+
diff --git a/image_classification/MLP-Mixer/main_multi_gpu.py b/image_classification/MLP-Mixer/main_multi_gpu.py
index b188a70f..e856e496 100644
--- a/image_classification/MLP-Mixer/main_multi_gpu.py
+++ b/image_classification/MLP-Mixer/main_multi_gpu.py
@@ -1,4 +1,4 @@
-# Copyright (c) 2021 PPViT Authors. All Rights Reserved. 
+# Copyright (c) 2021 PPViT Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
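For reference, a minimal sketch of how the loss classes added in `losses.py` above are expected to be called. The batch size, class count and random tensors below are illustrative assumptions and are not values taken from this change:

```python
import paddle
import paddle.nn.functional as F
from losses import LabelSmoothingCrossEntropyLoss, SoftTargetCrossEntropyLoss

paddle.seed(0)
logits = paddle.randn([4, 10])        # model outputs for 4 samples over 10 classes
labels = paddle.randint(0, 10, [4])   # hard integer labels, shape [N]

# label-smoothing CE consumes hard labels of shape [N]
smooth_loss = LabelSmoothingCrossEntropyLoss(smoothing=0.1)(logits, labels)

# soft-target CE consumes per-sample distributions of shape [N, num_classes],
# e.g. the mixed labels produced by Mixup
soft_labels = F.one_hot(labels, 10)
soft_loss = SoftTargetCrossEntropyLoss()(logits, soft_labels)
```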
@@ -25,54 +25,55 @@ import paddle.nn as nn import paddle.nn.functional as F import paddle.distributed as dist -from datasets import get_dataloader, get_dataset -from mlp_mixer import build_mlp_mixer as build_model +from datasets import get_dataloader +from datasets import get_dataset from utils import AverageMeter from utils import WarmupCosineScheduler from config import get_config from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from mlp_mixer import build_mlp_mixer as build_model -parser = argparse.ArgumentParser('MLP-Mixer') -parser.add_argument('-cfg', type=str, default=None) -parser.add_argument('-dataset', type=str, default=None) -parser.add_argument('-batch_size', type=int, default=None) -parser.add_argument('-image_size', type=int, default=None) -parser.add_argument('-data_path', type=str, default=None) -parser.add_argument('-ngpus', type=int, default=None) -parser.add_argument('-pretrained', type=str, default=None) -parser.add_argument('-resume', type=str, default=None) -parser.add_argument('-last_epoch', type=int, default=None) -parser.add_argument('-eval', action='store_true') -arguments = parser.parse_args() - - -log_format = "%(asctime)s %(message)s" -logging.basicConfig(stream=sys.stdout, level=logging.INFO, - format=log_format, datefmt="%m%d %I:%M:%S %p") - -# get default config -config = get_config() -# update config by arguments -config = update_config(config, arguments) - -# set output folder -if not config.EVAL: - config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) -else: - config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) - -#config.freeze() - -if not os.path.exists(config.SAVE): - os.makedirs(config.SAVE, exist_ok=True) - -# set logging format -logger = logging.getLogger() -fh = logging.FileHandler(os.path.join(config.SAVE, 'log.txt')) -fh.setFormatter(logging.Formatter(log_format)) -logger.addHandler(fh) -logger.info(f'config= {config}') +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('MLP-Mixer') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = 
logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger def train(dataloader, @@ -80,18 +81,28 @@ def train(dataloader, criterion, optimizer, epoch, + total_epochs, total_batch, debug_steps=100, - accum_iter=1): + accum_iter=1, + mixup_fn=None, + amp=False, + local_logger=None, + master_logger=None): """Training for one epoch Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion epoch: int, current epoch - total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info - accum_iter: int, num of iters for accumulating gradients + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None Returns: train_loss_meter.avg train_acc_meter.avg @@ -100,63 +111,120 @@ def train(dataloader, model.train() train_loss_meter = AverageMeter() train_acc_meter = AverageMeter() + master_train_loss_meter = AverageMeter() + master_train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) time_st = time.time() for batch_id, data in enumerate(dataloader): image = data[0] label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() - output = model(image) - loss = criterion(output, label) - - #NOTE: division may be needed depending on the loss function - # Here no division is needed: - # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' - # - #loss = loss / accum_iter + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() - loss.backward() + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) - if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): - optimizer.step() - optimizer.clear_grad() + batch_size = paddle.to_tensor(image.shape[0]) - pred = F.softmax(output) - acc = paddle.metric.accuracy(pred, label.unsqueeze(1)) + # sync from other gpus for overall loss and acc + master_loss = loss.clone() + master_acc = acc.clone() + master_batch_size = batch_size.clone() + dist.all_reduce(master_loss) + dist.all_reduce(master_acc) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc = master_acc / dist.get_world_size() + master_train_loss_meter.update(master_loss.numpy()[0], 
master_batch_size.numpy()[0]) + master_train_acc_meter.update(master_acc.numpy()[0], master_batch_size.numpy()[0]) - batch_size = image.shape[0] - train_loss_meter.update(loss.numpy()[0], batch_size) - train_acc_meter.update(acc.numpy()[0], batch_size) + train_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + train_acc_meter.update(acc.numpy()[0], batch_size.numpy()[0]) if batch_id % debug_steps == 0: - logger.info( - f"Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + - f"Step[{batch_id:04d}/{total_batch:04d}], " + - f"Avg Loss: {train_loss_meter.avg:.4f}, " + - f"Avg Acc: {train_acc_meter.avg:.4f}") + if local_logger: + local_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_train_loss_meter.avg:.4f}, " + + f"Avg Acc: {master_train_acc_meter.avg:.4f}") train_time = time.time() - time_st - return train_loss_meter.avg, train_acc_meter.avg, train_time - - -def validate(dataloader, model, criterion, total_batch, debug_steps=100): + return (train_loss_meter.avg, + train_acc_meter.avg, + master_train_loss_meter.avg, + master_train_acc_meter.avg, + train_time) + + +def validate(dataloader, + model, + criterion, + total_batch, + debug_steps=100, + local_logger=None, + master_logger=None): """Validation for whole dataset Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info + debug_steps: int, num of iters to log info, default: 100 + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None Returns: - val_loss_meter.avg - val_acc1_meter.avg - val_acc5_meter.avg - val_time + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + master_val_loss_meter.avg: float, average loss on all processes/gpus + master_val_acc1_meter.avg: float, average top1 accuracy on all processes/gpus + master_val_acc5_meter.avg: float, average top5 accuracy on all processes/gpus + val_time: float, validation time """ model.eval() val_loss_meter = AverageMeter() val_acc1_meter = AverageMeter() val_acc5_meter = AverageMeter() + master_val_loss_meter = AverageMeter() + master_val_acc1_meter = AverageMeter() + master_val_acc5_meter = AverageMeter() time_st = time.time() with paddle.no_grad(): @@ -171,56 +239,140 @@ def validate(dataloader, model, criterion, total_batch, debug_steps=100): acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) - dist.all_reduce(loss) - dist.all_reduce(acc1) - dist.all_reduce(acc5) - loss = loss / dist.get_world_size() - acc1 = acc1 / dist.get_world_size() - acc5 = acc5 / dist.get_world_size() - batch_size = paddle.to_tensor(image.shape[0]) - dist.all_reduce(batch_size) + + master_loss = loss.clone() + master_acc1 = acc1.clone() + master_acc5 = acc5.clone() + master_batch_size = batch_size.clone() + + dist.all_reduce(master_loss) + dist.all_reduce(master_acc1) + dist.all_reduce(master_acc5) + dist.all_reduce(master_batch_size) + 
master_loss = master_loss / dist.get_world_size() + master_acc1 = master_acc1 / dist.get_world_size() + master_acc5 = master_acc5 / dist.get_world_size() + + master_val_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc1_meter.update(master_acc1.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc5_meter.update(master_acc5.numpy()[0], master_batch_size.numpy()[0]) val_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) val_acc1_meter.update(acc1.numpy()[0], batch_size.numpy()[0]) val_acc5_meter.update(acc5.numpy()[0], batch_size.numpy()[0]) if batch_id % debug_steps == 0: - logger.info( - f"Val Step[{batch_id:04d}/{total_batch:04d}], " + - f"Avg Loss: {val_loss_meter.avg:.4f}, " + - f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + - f"Avg Acc@5: {val_acc5_meter.avg:.4f}") - + if local_logger: + local_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {master_val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {master_val_acc5_meter.avg:.4f}") val_time = time.time() - time_st - return val_loss_meter.avg, val_acc1_meter.avg, val_acc5_meter.avg, val_time + return (val_loss_meter.avg, + val_acc1_meter.avg, + val_acc5_meter.avg, + master_val_loss_meter.avg, + master_val_acc1_meter.avg, + master_val_acc5_meter.avg, + val_time) def main_worker(*args): - # 0. Preparation + # STEP 0: Preparation + config = args[0] dist.init_parallel_env() last_epoch = config.TRAIN.LAST_EPOCH - world_size = paddle.distributed.get_world_size() - local_rank = paddle.distributed.get_rank() - logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + world_size = dist.get_world_size() + local_rank = dist.get_rank() seed = config.SEED + local_rank paddle.seed(seed) np.random.seed(seed) random.seed(seed) - # 1. Create model + # logger for each process/gpu + local_logger = get_logger( + filename=os.path.join(config.SAVE, 'log_{}.txt'.format(local_rank)), + logger_name='local_logger') + # overall logger + if local_rank == 0: + master_logger = get_logger( + filename=os.path.join(config.SAVE, 'log.txt'), + logger_name='master_logger') + master_logger.info(f'\n{config}') + else: + master_logger = None + local_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + if local_rank == 0: + master_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + + # STEP 1: Create model model = build_model(config) model = paddle.DataParallel(model) - # 2. 
Create train and val dataloader - dataset_train, dataset_val = args[0], args[1] - dataloader_train = get_dataloader(config, dataset_train, 'train', True) + + # STEP 2: Create train and val dataloader + dataset_train, dataset_val = args[1], args[2] + # Create training dataloader + if not config.EVAL: + dataloader_train = get_dataloader(config, dataset_train, 'train', True) + total_batch_train = len(dataloader_train) + local_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + if local_rank == 0: + master_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + # Create validation dataloader dataloader_val = get_dataloader(config, dataset_val, 'test', True) - total_batch_train = len(dataloader_train) total_batch_val = len(dataloader_val) - logging.info(f'----- Total # of train batch (single gpu): {total_batch_train}') - logging.info(f'----- Total # of val batch (single gpu): {total_batch_val}') - # 3. Define criterion - criterion = nn.CrossEntropyLoss() - # 4. Define optimizer and lr_scheduler + local_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + if local_rank == 0: + master_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + scheduler = None if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, @@ -242,7 +394,9 @@ def main_worker(*args): gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, last_epoch=last_epoch) else: - logging.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + local_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") 
raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") if config.TRAIN.OPTIMIZER.NAME == "SGD": @@ -273,76 +427,120 @@ def main_worker(*args): # 'absolute_pos_embed', 'relative_position_bias_table']), ) else: - logging.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + local_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") - # 5. Load pretrained model / load resumt model and optimizer states + # STEP 6: Load pretrained model / load resumt model and optimizer states if config.MODEL.PRETRAINED: if (config.MODEL.PRETRAINED).endswith('.pdparams'): raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') model.set_dict(model_state) - logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + local_logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + if local_rank == 0: + master_logger.info( + f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") if config.MODEL.RESUME: - assert os.path.isfile(config.MODEL.RESUME+'.pdparams') is True - assert os.path.isfile(config.MODEL.RESUME+'.pdopt') is True - model_state = paddle.load(config.MODEL.RESUME+'.pdparams') + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') model.set_dict(model_state) opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') optimizer.set_state_dict(opt_state) - logger.info( - f"----- Resume Training: Load model and optmizer states from {config.MODEL.RESUME}") + local_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + if local_rank == 0: + master_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") - # 6. Validation + # STEP 7: Validation (eval mode) if config.EVAL: - logger.info('----- Start Validating') - val_loss, val_acc1, val_acc5, val_time = validate( + local_logger.info('----- Start Validating') + if local_rank == 0: + master_logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=criterion, + criterion=criterion_val, total_batch=total_batch_val, - debug_steps=config.REPORT_FREQ) - logger.info(f"Validation Loss: {val_loss:.4f}, " + - f"Validation Acc@1: {val_acc1:.4f}, " + - f"Validation Acc@5: {val_acc5:.4f}, " + - f"time: {val_time:.2f}") + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") return - # 6. 
Start training and validation - logging.info(f"Start training from epoch {last_epoch+1}.") + # STEP 8: Start training and validation (train mode) + local_logger.info(f"Start training from epoch {last_epoch+1}.") + if local_rank == 0: + master_logger.info(f"Start training from epoch {last_epoch+1}.") for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): # train - logging.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") - train_loss, train_acc, train_time = train(dataloader=dataloader_train, - model=model, - criterion=criterion, - optimizer=optimizer, - epoch=epoch, - total_batch=total_batch_train, - debug_steps=config.REPORT_FREQ, - accum_iter=config.TRAIN.ACCUM_ITER) + local_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + if local_rank == 0: + master_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + train_loss, train_acc, avg_loss, avg_acc, train_time = train( + dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=total_batch_train, + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + mixup_fn=mixup_fn, + amp=config.AMP, + local_logger=local_logger, + master_logger=master_logger) + scheduler.step() - logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + - f"Train Loss: {train_loss:.4f}, " + - f"Train Acc: {train_acc:.4f}, " + - f"time: {train_time:.2f}") + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {avg_loss:.4f}, " + + f"Train Acc: {avg_acc:.4f}, " + + f"time: {train_time:.2f}") + # validation if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: - logger.info(f'----- Validation after Epoch: {epoch}') - val_loss, val_acc1, val_acc5, val_time = validate( + local_logger.info(f'----- Validation after Epoch: {epoch}') + if local_rank == 0: + master_logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=criterion, + criterion=criterion_val, total_batch=total_batch_val, - debug_steps=config.REPORT_FREQ) - logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + - f"Validation Loss: {val_loss:.4f}, " + - f"Validation Acc@1: {val_acc1:.4f}, " + - f"Validation Acc@5: {val_acc5:.4f}, " + - f"time: {val_time:.2f}") + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") # model save if local_rank == 0: if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: @@ -350,15 +548,33 @@ def main_worker(*args): config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") paddle.save(model.state_dict(), model_path + '.pdparams') 
paddle.save(optimizer.state_dict(), model_path + '.pdopt') - logger.info(f"----- Save model: {model_path}.pdparams") - logger.info(f"----- Save optim: {model_path}.pdopt") + master_logger.info(f"----- Save model: {model_path}.pdparams") + master_logger.info(f"----- Save optim: {model_path}.pdopt") def main(): - dataset_train = get_dataset(config, mode='train') + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + + # get dataset and start DDP + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + else: + dataset_train = None dataset_val = get_dataset(config, mode='val') config.NGPUS = len(paddle.static.cuda_places()) if config.NGPUS == -1 else config.NGPUS - dist.spawn(main_worker, args=(dataset_train, dataset_val, ), nprocs=config.NGPUS) + dist.spawn(main_worker, args=(config, dataset_train, dataset_val, ), nprocs=config.NGPUS) if __name__ == "__main__": diff --git a/image_classification/MLP-Mixer/main_single_gpu.py b/image_classification/MLP-Mixer/main_single_gpu.py index 77b3c591..e4a82077 100644 --- a/image_classification/MLP-Mixer/main_single_gpu.py +++ b/image_classification/MLP-Mixer/main_single_gpu.py @@ -1,5 +1,4 @@ - -# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
@@ -27,53 +26,54 @@ import paddle.nn.functional as F from datasets import get_dataloader from datasets import get_dataset -from mlp_mixer import build_mlp_mixer as build_model from utils import AverageMeter from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn from config import get_config from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from mlp_mixer import build_mlp_mixer as build_model -parser = argparse.ArgumentParser('MLP-Mixer') -parser.add_argument('-cfg', type=str, default=None) -parser.add_argument('-dataset', type=str, default=None) -parser.add_argument('-batch_size', type=int, default=None) -parser.add_argument('-image_size', type=int, default=None) -parser.add_argument('-data_path', type=str, default=None) -parser.add_argument('-ngpus', type=int, default=None) -parser.add_argument('-pretrained', type=str, default=None) -parser.add_argument('-resume', type=str, default=None) -parser.add_argument('-last_epoch', type=int, default=None) -parser.add_argument('-eval', action='store_true') -args = parser.parse_args() - - -log_format = "%(asctime)s %(message)s" -logging.basicConfig(stream=sys.stdout, level=logging.INFO, - format=log_format, datefmt="%m%d %I:%M:%S %p") - -# get default config -config = get_config() -# update config by arguments -config = update_config(config, args) - -# set output folder -if not config.EVAL: - config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) -else: - config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) - -#config.freeze() - -if not os.path.exists(config.SAVE): - os.makedirs(config.SAVE, exist_ok=True) - -# set logging format -logger = logging.getLogger() -fh = logging.FileHandler(os.path.join(config.SAVE, 'log.txt')) -fh.setFormatter(logging.Formatter(log_format)) -logger.addHandler(fh) -logger.info(f'config= {config}') +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('MLP-Mixer') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + 
fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger def train(dataloader, @@ -81,56 +81,82 @@ def train(dataloader, criterion, optimizer, epoch, + total_epochs, total_batch, debug_steps=100, - accum_iter=1): + accum_iter=1, + mixup_fn=None, + amp=False, + logger=None): """Training for one epoch Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion epoch: int, current epoch - total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info - accum_iter: int, num of iters for accumulating gradients + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + logger: logger for logging, default: None Returns: - train_loss_meter.avg - train_acc_meter.avg - train_time + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + train_time: float, training time """ model.train() train_loss_meter = AverageMeter() train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) time_st = time.time() for batch_id, data in enumerate(dataloader): image = data[0] label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() - output = model(image) - loss = criterion(output, label) - - #NOTE: division may be needed depending on the loss function - # Here no division is needed: - # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' - #loss = loss / accum_iter - - loss.backward() - - if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): - optimizer.step() - optimizer.clear_grad() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() pred = F.softmax(output) - acc = paddle.metric.accuracy(pred, label.unsqueeze(1)) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) batch_size = image.shape[0] train_loss_meter.update(loss.numpy()[0], batch_size) train_acc_meter.update(acc.numpy()[0], batch_size) - if batch_id % debug_steps == 0: + if logger and batch_id % debug_steps == 0: logger.info( - f"Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + f"Step[{batch_id:04d}/{total_batch:04d}], " + f"Avg Loss: {train_loss_meter.avg:.4f}, " + f"Avg Acc: {train_acc_meter.avg:.4f}") @@ -139,19 +165,20 @@ def train(dataloader, return train_loss_meter.avg, train_acc_meter.avg, train_time -def 
validate(dataloader, model, criterion, total_batch, debug_steps=100): +def validate(dataloader, model, criterion, total_batch, debug_steps=100, logger=None): """Validation for whole dataset Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion - total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + logger: logger for logging, default: None Returns: - val_loss_meter.avg - val_acc1_meter.avg - val_acc5_meter.avg - val_time + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + val_time: float, valitaion time """ model.eval() val_loss_meter = AverageMeter() @@ -176,7 +203,7 @@ def validate(dataloader, model, criterion, total_batch, debug_steps=100): val_acc1_meter.update(acc1.numpy()[0], batch_size) val_acc5_meter.update(acc5.numpy()[0], batch_size) - if batch_id % debug_steps == 0: + if logger and batch_id % debug_steps == 0: logger.info( f"Val Step[{batch_id:04d}/{total_batch:04d}], " + f"Avg Loss: {val_loss_meter.avg:.4f}, " + @@ -188,24 +215,77 @@ def validate(dataloader, model, criterion, total_batch, debug_steps=100): def main(): - # 0. Preparation + # STEP 0: Preparation + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) last_epoch = config.TRAIN.LAST_EPOCH seed = config.SEED paddle.seed(seed) np.random.seed(seed) random.seed(seed) - #paddle.set_device('gpu:0') - # 1. Create model + logger = get_logger(filename=os.path.join(config.SAVE, 'log.txt')) + logger.info(f'\n{config}') + + # STEP 1: Create model model = build_model(config) - #model = paddle.DataParallel(model) - # 2. Create train and val dataloader - dataset_train = get_dataset(config, mode='train') + + # STEP 2: Create train and val dataloader + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + dataloader_train = get_dataloader(config, dataset_train, 'train', False) dataset_val = get_dataset(config, mode='val') - dataloader_train = get_dataloader(config, dataset_train, 'train', False) dataloader_val = get_dataloader(config, dataset_val, 'val', False) - # 3. Define criterion - criterion = nn.CrossEntropyLoss() - # 4. 
Define lr_scheduler + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + scheduler = None if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, @@ -214,8 +294,7 @@ def main(): end_lr=config.TRAIN.END_LR, warmup_epochs=config.TRAIN.WARMUP_EPOCHS, total_epochs=config.TRAIN.NUM_EPOCHS, - last_epoch=config.TRAIN.LAST_EPOCH, - ) + last_epoch=config.TRAIN.LAST_EPOCH) elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, T_max=config.TRAIN.NUM_EPOCHS, @@ -227,9 +306,9 @@ def main(): gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, last_epoch=last_epoch) else: - logging.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") - # 5. Define optimizer + if config.TRAIN.OPTIMIZER.NAME == "SGD": if config.TRAIN.GRAD_CLIP: clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) @@ -251,55 +330,65 @@ def main(): learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, beta1=config.TRAIN.OPTIMIZER.BETAS[0], beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, epsilon=config.TRAIN.OPTIMIZER.EPS, grad_clip=clip) else: - logging.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") - # 6. 
Load pretrained model or load resume model and optimizer states + + # STEP 6: Load pretrained model or load resume model and optimizer states if config.MODEL.PRETRAINED: - assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') model.set_dict(model_state) logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") if config.MODEL.RESUME: - assert os.path.isfile(config.MODEL.RESUME+'.pdparams') is True - assert os.path.isfile(config.MODEL.RESUME+'.pdopt') is True - model_state = paddle.load(config.MODEL.RESUME+'.pdparams') + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') model.set_dict(model_state) - opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') + opt_state = paddle.load(config.MODEL.RESUME + '.pdopt') optimizer.set_state_dict(opt_state) logger.info( f"----- Resume: Load model and optmizer from {config.MODEL.RESUME}") - # 7. Validation + + # STEP 7: Validation (eval mode) if config.EVAL: logger.info('----- Start Validating') val_loss, val_acc1, val_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=criterion, + criterion=criterion_val, total_batch=len(dataloader_val), - debug_steps=config.REPORT_FREQ) + debug_steps=config.REPORT_FREQ, + logger=logger) logger.info(f"Validation Loss: {val_loss:.4f}, " + f"Validation Acc@1: {val_acc1:.4f}, " + f"Validation Acc@5: {val_acc5:.4f}, " + f"time: {val_time:.2f}") return - # 8. Start training and validation - logging.info(f"Start training from epoch {last_epoch+1}.") + + # STEP 8: Start training and validation (train mode) + logger.info(f"Start training from epoch {last_epoch+1}.") for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): # train - logging.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") train_loss, train_acc, train_time = train(dataloader=dataloader_train, model=model, criterion=criterion, optimizer=optimizer, epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, total_batch=len(dataloader_train), debug_steps=config.REPORT_FREQ, accum_iter=config.TRAIN.ACCUM_ITER, - ) + mixup_fn=mixup_fn, + amp=config.AMP, + logger=logger) scheduler.step() logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + f"Train Loss: {train_loss:.4f}, " + @@ -311,9 +400,10 @@ def main(): val_loss, val_acc1, val_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=criterion, + criterion=criterion_val, total_batch=len(dataloader_val), - debug_steps=config.REPORT_FREQ) + debug_steps=config.REPORT_FREQ, + logger=logger) logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + f"Validation Loss: {val_loss:.4f}, " + f"Validation Acc@1: {val_acc1:.4f}, " + diff --git a/image_classification/MLP-Mixer/mixup.py b/image_classification/MLP-Mixer/mixup.py new file mode 100644 index 00000000..1d2db493 --- /dev/null +++ b/image_classification/MLP-Mixer/mixup.py @@ -0,0 +1,225 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""mixup and cutmix for batch data""" +import numpy as np +import paddle + + +def rand_bbox(image_shape, lam, count=None): + """ CutMix bbox by lam value + Generate 1 random bbox by value lam. lam is the cut size rate. + The cut_size is computed by sqrt(1-lam) * image_size. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + count: int, number of bbox to generate + """ + image_h, image_w = image_shape[-2:] + cut_rate = np.sqrt(1. - lam) + cut_h = int(cut_rate * image_h) + cut_w = int(cut_rate * image_w) + + # get random bbox center + cy = np.random.randint(0, image_h, size=count) + cx = np.random.randint(0, image_w, size=count) + + # get bbox coords + bbox_x1 = np.clip(cx - cut_w // 2, 0, image_w) + bbox_y1 = np.clip(cy - cut_h // 2, 0, image_h) + bbox_x2 = np.clip(cx + cut_w // 2, 0, image_w) + bbox_y2 = np.clip(cy + cut_h // 2, 0, image_h) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # while in pytorch, it will return [] tensor + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def rand_bbox_minmax(image_shape, minmax, count=None): + """ CutMix bbox by min and max value + Generate 1 random bbox by min and max percentage values. + Minmax is a tuple/list of min and max percentage vlaues + applied to the image width and height. + + Args: + image_shape: tuple/list, image height and width + minmax: tuple/list, min and max percentage values of image size + count: int, number of bbox to generate + """ + assert len(minmax) == 2 + image_h, image_w = image_shape[-2:] + min_ratio = minmax[0] + max_ratio = minmax[1] + cut_h = np.random.randint(int(image_h * min_ratio), int(image_h * max_ratio), size=count) + cut_w = np.random.randint(int(image_w * min_ratio), int(image_w * max_ratio), size=count) + + bbox_x1 = np.random.randint(0, image_w - cut_w, size=count) + bbox_y1 = np.random.randint(0, image_h - cut_h, size=count) + bbox_x2 = bbox_x1 + cut_w + bbox_y2 = bbox_y1 + cut_h + + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def cutmix_generate_bbox_adjust_lam(image_shape, lam, minmax=None, correct_lam=True, count=None): + """Generate bbox and apply correction for lambda + If the mimmax is None, apply the standard cutmix by lam value, + If the minmax is set, apply the cutmix by min and max percentage values. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + minmax: tuple/list, min and max percentage values of image size + correct_lam: bool, if True, correct the lam value by the generated bbox + count: int, number of bbox to generate + """ + if minmax is not None: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox_minmax(image_shape, minmax, count) + else: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox(image_shape, lam, count) + + if correct_lam or minmax is not None: + image_h, image_w = image_shape[-2:] + bbox_area = (bbox_y2 - bbox_y1) * (bbox_x2 - bbox_x1) + lam = 1. 
- bbox_area / float(image_h * image_w) + return (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam + + +def one_hot(x, num_classes, on_value=1., off_value=0.): + """ Generate one-hot vector for label smoothing + Args: + x: tensor, contains label/class indices + num_classes: int, num of classes (len of the one-hot vector) + on_value: float, the vector value at label index, default=1. + off_value: float, the vector value at non-label indices, default=0. + Returns: + one_hot: tensor, tensor with on value at label index and off value + at non-label indices. + """ + x = x.reshape_([-1, 1]) + x_smoothed = paddle.full((x.shape[0], num_classes), fill_value=off_value) + for i in range(x.shape[0]): + x_smoothed[i, x[i]] = on_value + return x_smoothed + + +def mixup_one_hot(label, num_classes, lam=1., smoothing=0.): + """ mixup and label smoothing in batch + label smoothing is firstly applied, then + mixup is applied by mixing the bacth and its flip, + with a mixup rate. + + Args: + label: tensor, label tensor with shape [N], contains the class indices + num_classes: int, num of all classes + lam: float, mixup rate, default=1.0 + smoothing: float, label smoothing rate + """ + off_value = smoothing / num_classes + on_value = 1. - smoothing + off_value + y1 = one_hot(label, num_classes, on_value, off_value) + y2 = one_hot(label.flip(axis=[0]), num_classes, on_value, off_value) + return y2 * (1 - lam) + y1 * lam + + +class Mixup: + """Mixup class + Args: + mixup_alpha: float, mixup alpha for beta distribution, default=1.0, + cutmix_alpha: float, cutmix alpha for beta distribution, default=0.0, + cutmix_minmax: list/tuple, min and max value for cutmix ratio, default=None, + prob: float, if random prob < prob, do not use mixup, default=1.0, + switch_prob: float, prob of switching mixup and cutmix, default=0.5, + mode: string, mixup up, now only 'batch' is supported, default='batch', + correct_lam: bool, if True, apply correction of lam, default=True, + label_smoothing: float, label smoothing rate, default=0.1, + num_classes: int, num of classes, default=1000 + """ + def __init__(self, + mixup_alpha=1.0, + cutmix_alpha=0.0, + cutmix_minmax=None, + prob=1.0, + switch_prob=0.5, + mode='batch', + correct_lam=True, + label_smoothing=0.1, + num_classes=1000): + self.mixup_alpha = mixup_alpha + self.cutmix_alpha = cutmix_alpha + self.cutmix_minmax = cutmix_minmax + if cutmix_minmax is not None: + assert len(cutmix_minmax) == 2 + self.cutmix_alpha = 1.0 + self.mix_prob = prob + self.switch_prob = switch_prob + self.label_smoothing = label_smoothing + self.num_classes = num_classes + self.mode = mode + self.correct_lam = correct_lam + assert mode == 'batch', 'Now only batch mode is supported!' + + def __call__(self, x, target): + assert x.shape[0] % 2 == 0, "Batch size should be even" + lam = self._mix_batch(x) + target = mixup_one_hot(target, self.num_classes, lam, self.label_smoothing) + return x, target + + def get_params(self): + """Decide to use cutmix or regular mixup by sampling and + sample lambda for mixup + """ + lam = 1. + use_cutmix = False + use_mixup = np.random.rand() < self.mix_prob + if use_mixup: + if self.mixup_alpha > 0. and self.cutmix_alpha > 0.: + use_cutmix = np.random.rand() < self.switch_prob + alpha = self.cutmix_alpha if use_cutmix else self.mixup_alpha + lam_mix = np.random.beta(alpha, alpha) + elif self.mixup_alpha == 0. and self.cutmix_alpha > 0.: + use_cutmix=True + lam_mix = np.random.beta(self.cutmix_alpha, self.cutmix_alpha) + elif self.mixup_alpha > 0. 
and self.cutmix_alpha == 0.:
+                lam_mix = np.random.beta(self.mixup_alpha, self.mixup_alpha)
+            else:
+                raise ValueError('mixup_alpha and cutmix_alpha cannot be all 0')
+            lam = float(lam_mix)
+        return lam, use_cutmix
+
+    def _mix_batch(self, x):
+        """mixup/cutmix by adding batch data and its flipped version"""
+        lam, use_cutmix = self.get_params()
+        if lam == 1.:
+            return lam
+        if use_cutmix:
+            (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam = cutmix_generate_bbox_adjust_lam(
+                x.shape,
+                lam,
+                minmax=self.cutmix_minmax,
+                correct_lam=self.correct_lam)
+
+            # NOTE: in paddle, tensor indexing e.g., a[x1:x2],
+            # if x1 == x2, paddle will raise ValueError,
+            # but in pytorch, it will return [] tensor without errors
+            if int(bbox_x1) != int(bbox_x2) and int(bbox_y1) != int(bbox_y2):
+                x[:, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] = x.flip(axis=[0])[
+                    :, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)]
+        else:
+            x_flipped = x.flip(axis=[0])
+            x_flipped = x_flipped * (1 - lam)
+            x.set_value(x * (lam) + x_flipped)
+        return lam
diff --git a/image_classification/MLP-Mixer/mlp_mixer.py b/image_classification/MLP-Mixer/mlp_mixer.py
index 287ff846..9985c8f1 100644
--- a/image_classification/MLP-Mixer/mlp_mixer.py
+++ b/image_classification/MLP-Mixer/mlp_mixer.py
@@ -239,5 +239,5 @@ def build_mlp_mixer(config):
                      embed_dim=config.MODEL.MIXER.HIDDEN_SIZE,
                      mlp_ratio=(0.5, 4.0),
                      dropout=config.MODEL.DROPOUT,
-                     droppath=config.MODEL.DROPPATH)
+                     droppath=config.MODEL.DROP_PATH)
     return model
diff --git a/image_classification/MLP-Mixer/port_weights/__init__.py b/image_classification/MLP-Mixer/port_weights/__init__.py
new file mode 100644
index 00000000..e69de29b
diff --git a/image_classification/MLP-Mixer/random_erasing.py b/image_classification/MLP-Mixer/random_erasing.py
new file mode 100644
index 00000000..31eea465
--- /dev/null
+++ b/image_classification/MLP-Mixer/random_erasing.py
@@ -0,0 +1,118 @@
+# Copyright (c) 2021 PPViT Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Random Erasing for image tensor"""
+
+import random
+import math
+import paddle
+
+
+def _get_pixels(per_pixel, rand_color, patch_size, dtype="float32"):
+    if per_pixel:
+        return paddle.normal(shape=patch_size).astype(dtype)
+    if rand_color:
+        return paddle.normal(shape=(patch_size[0], 1, 1)).astype(dtype)
+    return paddle.zeros((patch_size[0], 1, 1)).astype(dtype)
+
+
+class RandomErasing(object):
+    """
+    Args:
+        prob: probability of performing random erasing
+        min_area: Minimum percentage of erased area wrt input image area
+        max_area: Maximum percentage of erased area wrt input image area
+        min_aspect: Minimum aspect ratio of erased area
+        max_aspect: Maximum aspect ratio of erased area
+        mode: pixel color mode, in ['const', 'rand', 'pixel']
+            'const' - erase block is constant valued 0 for all channels
+            'rand' - erase block is valued random color (same per-channel)
+            'pixel' - erase block is valued random color per pixel
+        min_count: Minimum # of erasing blocks per image. 
+ max_count: Maximum # of ereasing blocks per image. Area per box is scaled by count + per-image count is randomly chosen between min_count to max_count + """ + def __init__(self, prob=0.5, min_area=0.02, max_area=1/3, min_aspect=0.3, max_aspect=None, + mode='const', min_count=1, max_count=None, num_splits=0): + self.prob = prob + self.min_area = min_area + self.max_area = max_area + max_aspect = max_aspect or 1 / min_aspect + self.log_aspect_ratio = (math.log(min_aspect), math.log(max_aspect)) + self.min_count = min_count + self.max_count = max_count or min_count + self.num_splits = num_splits + mode = mode.lower() + self.rand_color = False + self.per_pixel = False + if mode == "rand": + self.rand_color = True + elif mode == "pixel": + self.per_pixel = True + else: + assert not mode or mode == "const" + + def _erase(self, img, chan, img_h, img_w, dtype): + if random.random() > self.prob: + return + area = img_h * img_w + count = self.min_count if self.min_count == self.max_count else \ + random.randint(self.min_count, self.max_count) + for _ in range(count): + for attempt in range(10): + target_area = random.uniform(self.min_area, self.max_area) * area / count + aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio)) + h = int(round(math.sqrt(target_area * aspect_ratio))) + w = int(round(math.sqrt(target_area / aspect_ratio))) + if w < img_w and h < img_h: + top = random.randint(0, img_h - h) + left = random.randint(0, img_w - w) + img[:, top:top+h, left:left+w] = _get_pixels( + self.per_pixel, self.rand_color, (chan, h, w), + dtype=dtype) + break + + def __call__(self, input): + if len(input.shape) == 3: + self._erase(input, *input.shape, input.dtype) + else: + batch_size, chan, img_h, img_w = input.shape + batch_start = batch_size // self.num_splits if self.num_splits > 1 else 0 + for i in range(batch_start, batch_size): + self._erase(input[i], chan, img_h, img_w, input.dtype) + return input + + + +#def main(): +# re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='rand') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='const') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='pixel') +# import PIL.Image as Image +# import numpy as np +# paddle.set_device('cpu') +# img = paddle.to_tensor(np.asarray(Image.open('./lenna.png'))).astype('float32') +# img = img / 255.0 +# img = paddle.transpose(img, [2, 0, 1]) +# new_img = re(img) +# new_img = new_img * 255.0 +# new_img = paddle.transpose(new_img, [1, 2, 0]) +# new_img = new_img.cpu().numpy() +# new_img = Image.fromarray(new_img.astype('uint8')) +# new_img.save('./res.png') +# +# +# +#if __name__ == "__main__": +# main() diff --git a/image_classification/MLP-Mixer/run_train.sh b/image_classification/MLP-Mixer/run_train.sh index 725fd11d..ae309e17 100644 --- a/image_classification/MLP-Mixer/run_train.sh +++ b/image_classification/MLP-Mixer/run_train.sh @@ -2,5 +2,6 @@ CUDA_VISIBLE_DEVICES=7 \ python main_single_gpu.py \ -cfg='./configs/mixer_b16_224.yaml' \ -dataset='imagenet2012' \ --batch_size=32 \ +-batch_size=8 \ -data_path='/dataset/imagenet' \ +-amp diff --git a/image_classification/MLP-Mixer/run_train_multi.sh b/image_classification/MLP-Mixer/run_train_multi.sh index 5537081f..ebe56ec1 100644 --- a/image_classification/MLP-Mixer/run_train_multi.sh +++ b/image_classification/MLP-Mixer/run_train_multi.sh @@ -2,6 +2,6 @@ CUDA_VISIBLE_DEVICES=4,5,6,7 \ python main_multi_gpu.py \ -cfg='./configs/mixer_b16_224.yaml' \ -dataset='imagenet2012' \ --batch_size=32 \ +-batch_size=8 \ 
-data_path='/dataset/imagenet' \ --ngpus=4 +-amp diff --git a/image_classification/MLP-Mixer/transforms.py b/image_classification/MLP-Mixer/transforms.py new file mode 100644 index 00000000..5a046912 --- /dev/null +++ b/image_classification/MLP-Mixer/transforms.py @@ -0,0 +1,14 @@ +import random +import paddle +import paddle.nn +import paddle.vision.transforms as T + + +class RandomHorizontalFlip(): + def __init__(self, p=0.5): + self.p = p + + def __call__(self, image): + if random.random() < self.p: + return T.hflip(image) + return image diff --git a/image_classification/MobileFormer/README.md b/image_classification/MobileFormer/README.md new file mode 100644 index 00000000..4576e2ab --- /dev/null +++ b/image_classification/MobileFormer/README.md @@ -0,0 +1,217 @@ +# Mobile-Former: Bridging MobileNet and Transformer, [arxiv](https://arxiv.org/abs/2108.05895) + +PaddlePaddle training/validation code for MobileFormer. + +This implementation is developed by [PaddleViT](https://github.com/BR-IDL/PaddleViT.git). + + + + + + +
+*(figure: MobileFormer Model Overview)*
+
+*(figure: MobileFormer Sub-Block Overview)*
+
+ +### Update + +- Update(2021-11-26): Code is released. + +## Models Zoo + +| Model | Acc@1 | Acc@5 | #Params | FLOPs | Image Size | Crop_pct | Interpolation | Link | +|-------------------------------|-------|-------|---------|--------|------------|----------|---------------|--------------| +| mobileformer_26m | * | * | 3.227M | 26M± | 224 | 0.875 | bicubic | * | +| mobileformer_52m | * | * | 3.513M | 52M± | 224 | 0.875 | bicubic | * | +| mobileformer_96m | * | * | 4.595M | 96M± | 224 | 0.875 | bicubic | * | +| mobileformer_151m | * | * | 7.616M | 151M± | 224 | 0.875 | bicubic | * | +| mobileformer_214m | * | * | 9.416M | 214M± | 224 | 0.875 | bicubic | * | +| mobileformer_294m | * | * | 11.392M | 294M± | 224 | 0.875 | bicubic | * | +| mobileformer_508m | * | * | 14.013M | 508M± | 224 | 0.875 | bicubic | * | + +> *The results are evaluated on ImageNet2012 validation set. + + +### Models trained from scratch using PaddleViT + +**(coming soon)** + +## Notebooks +We provide a few notebooks in aistudio to help you get started: + +**(coming soon)** + +## Requirements +- Python>=3.6 +- yaml>=0.2.5 +- PaddlePaddle>=2.1.0 +- yacs>=0.1.8 + +## Data +`ImageNet2012 dataset` is used in the following folder structure: +``` +│imagenet/ +├──train/ +│ ├── n01440764 +│ │ ├── n01440764_10026.JPEG +│ │ ├── n01440764_10027.JPEG +│ │ ├── ...... +│ ├── ...... +├──val/ +│ ├── n01440764 +│ │ ├── ILSVRC2012_val_00000293.JPEG +│ │ ├── ILSVRC2012_val_00002138.JPEG +│ │ ├── ...... +│ ├── ...... +``` + +## Usage +To use the model with pretrained weights, download the .pdparam weight file and change related file paths in the following python scripts. The model config files are located in `./configs/`. + +For example, assume the downloaded weight file is stored in `./mobileformer_26m.pdparams`, to use the `mobileformer_26m` model in python: + +```python +from config import get_config +from mobileformer import build_mformer as build_model +# config files in ./configs/ +config = get_config('./configs/mobileformer_26m.yaml') +# build model +model = build_model(config) +# load pretrained weights +model_state_dict = paddle.load('./mobileformer_26m.pdparams') +model.set_dict(model_state_dict) +``` + +## Evaluation +To evaluate `MobileFormer` model performance on ImageNet2012 with a `single GPU`, run the following script using command line: + +```shell +sh run_eval.sh +``` + +or + +```shell +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg=./configs/mobileformer_26m.yaml \ + -dataset=imagenet2012 \ + -num_classes=1000 \ + -batch_size=64 \ + -image_size=224 \ + -data_path=/path/to/dataset/imagenet/val \ + -eval \ + -pretrained=/path/to/pretrained/model/mobileformer_26m # .pdparams is NOT needed +``` + +
+ + +Run evaluation using multi-GPUs: + + + +```shell +sh run_eval_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg=./configs/mobileformer_26m.yaml \ + -dataset=imagenet2012 \ + -num_classes=1000 \ + -batch_size=32 \ + -image_size=224 \ + -data_path=/path/to/dataset/imagenet/val \ + -eval \ + -pretrained=/path/to/pretrained/model/mobileformer_26m # .pdparams is NOT needed +``` + +
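Before launching a full evaluation run, it can be useful to confirm that the config, model, and weights all load and produce an output of the expected shape. The snippet below is a minimal sanity-check sketch (it is not part of the evaluation scripts); it reuses the entry points from the Usage section above, and the file paths are placeholders to be replaced with your own.

```python
import paddle

from config import get_config
from mobileformer import build_mformer as build_model

# placeholder paths -- replace with your own config / weight files
config = get_config('./configs/mobileformer_26m.yaml')
model = build_model(config)
model.set_dict(paddle.load('./mobileformer_26m.pdparams'))  # optional: skip to test random init
model.eval()

# dummy batch of 2 RGB images at the configured input resolution
images = paddle.randn([2, 3, config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE])
with paddle.no_grad():
    logits = model(images)

# expect a [2, num_classes] logits tensor (the exact output form may differ per model)
print(logits.shape)
```
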
+ +## Training +To train the `MobileFormer` model on ImageNet2012 with `single GPU`, run the following script using command line: +```shell +sh run_train.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg=./configs/mobileformer_26m.yaml \ + -dataset=imagenet2012 \ + -num_classes=1000 \ + -batch_size=32 \ + -image_size=224 \ + -data_path=/path/to/dataset/imagenet/train \ + -output=./output +``` + +
+ + +Run training using multi-GPUs: + + + +```shell +sh run_train_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_single_gpu.py \ + -cfg=./configs/mobileformer_26m.yaml \ + -dataset=imagenet2012 \ + -num_classes=1000 \ + -batch_size=4 \ + -image_size=224 \ + -data_path=/path/to/dataset/imagenet/train \ + -output=./output +``` + +
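The `run_train.sh` / `run_train_multi.sh` scripts drive the full pipeline in `main_single_gpu.py` / `main_multi_gpu.py`. As a rough guide to how the pieces added in this PR fit together (the dataset helpers, `Mixup`, the label-smoothing / soft-target losses, and optional AMP), a heavily simplified single-GPU training step might look like the sketch below. This is an illustrative outline under assumed defaults, not a substitute for the provided training scripts, which also handle LR schedulers, gradient accumulation, logging, and checkpointing.

```python
import paddle

from config import get_config
from datasets import get_dataset, get_dataloader
from mixup import Mixup
from losses import SoftTargetCrossEntropyLoss, LabelSmoothingCrossEntropyLoss
from mobileformer import build_mformer as build_model

config = get_config('./configs/mobileformer_26m.yaml')  # placeholder path
model = build_model(config)
model.train()

dataset = get_dataset(config, mode='train')
dataloader = get_dataloader(config, dataset, mode='train', multi_process=False)

# mixup/cutmix turn labels into soft targets, so the soft-target loss is used;
# otherwise plain label smoothing is applied (assumed to mirror the usual setup)
mixup_fn = None
if config.TRAIN.MIXUP_ALPHA > 0. or config.TRAIN.CUTMIX_ALPHA > 0.:
    mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA,
                     cutmix_alpha=config.TRAIN.CUTMIX_ALPHA,
                     prob=config.TRAIN.MIXUP_PROB,
                     switch_prob=config.TRAIN.MIXUP_SWITCH_PROB,
                     label_smoothing=config.TRAIN.SMOOTHING,
                     num_classes=config.MODEL.NUM_CLASSES)
criterion = (SoftTargetCrossEntropyLoss() if mixup_fn is not None
             else LabelSmoothingCrossEntropyLoss(config.TRAIN.SMOOTHING))

optimizer = paddle.optimizer.AdamW(learning_rate=config.TRAIN.BASE_LR,
                                   parameters=model.parameters(),
                                   weight_decay=config.TRAIN.WEIGHT_DECAY)
scaler = paddle.amp.GradScaler() if config.AMP else None

for images, labels in dataloader:
    if mixup_fn is not None:
        images, labels = mixup_fn(images, labels)
    if scaler is not None:                      # mixed-precision path (-amp)
        with paddle.amp.auto_cast():
            loss = criterion(model(images), labels)
        scaled = scaler.scale(loss)
        scaled.backward()
        scaler.minimize(optimizer, scaled)
    else:                                       # full-precision path
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    optimizer.clear_grad()
    break  # one step shown for illustration
```
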
+ +## Arguments +- *`-cfg`*: path of model config file (.yaml), stored in `./configs`. +- *`-dataset`*: dataset name, e.g., `imagenet2012`, `cifar10`, `cifar100`. +- *`-data_path`*: dataset folder path +- `-batch_size`: batch size,default: `32`. +- `-image_size`: input image size,default`224`. +- `-num_classes`: number of classes, default: `1000`. +- `-output`: output folder for storing weights and logs,default: `./output`. +- `-pretrained`: pretrain model weights file path, (`.pdparams` file ext is NOT needed) default: `None`. +- `-resume`: resume model weight and opt file path, (`.paparams` and `.pdopts` file ext are NOT needed, default: `None`. +- `-last_epoch`: start epoch,default: `None`. +- `-save_freq`: number of epochs to save checkpoint,default: `1`. +- `-log_freq`: number of iters to print logging,default: `100`. +- `-validate_freq`: number of epochs to do validation during training,default: `10`. +- `-accum_iter`: number of iteration for iter accumulation, default: 1. +- `-num_workers`: number of workers for data loading,default: `1`. +- `-ngpus`: number of GPUs to use,you can control GPUs by CUDA_VISIBLE_DEVICES, just set this to -1 default: `-1`. +- `-eval`: start eval mode. +- `-amp`: start amp training. + +> `-cfg`,`-dataset` and `-data_path` in `main_single_gpu.py` and `main_multi_gpu.py` are MUST-HAVE settings. + +## Visualization Attention Map +**(coming soon)** + +## Reference +``` +@misc{chen2021mobileformer, + title={Mobile-Former: Bridging MobileNet and Transformer}, + author={Yinpeng Chen and Xiyang Dai and Dongdong Chen and Mengchen Liu and Xiaoyi Dong and Lu Yuan and Zicheng Liu}, + year={2021}, + eprint={2108.05895}, + archivePrefix={arXiv}, + primaryClass={cs.CV} +} +``` diff --git a/image_classification/MobileFormer/__init__.py b/image_classification/MobileFormer/__init__.py new file mode 100644 index 00000000..e2cbd538 --- /dev/null +++ b/image_classification/MobileFormer/__init__.py @@ -0,0 +1 @@ +#init diff --git a/image_classification/MobileFormer/attention.py b/image_classification/MobileFormer/attention.py new file mode 100644 index 00000000..06fb9bce --- /dev/null +++ b/image_classification/MobileFormer/attention.py @@ -0,0 +1,94 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" +Implement Attention Layer +""" +import paddle +from paddle import nn + +class Attention(nn.Layer): + """Multi Head Attention + Params Info: + embed_dims: input token embed_dims + num_head: the number of head is in multi head attention + dropout_rate: the dropout rate of attention result + attn_dropout_rate: the dropout rate of attention distribution + qkv_bias: whether use the bias in qkv matrix + """ + def __init__(self, + embed_dims, + num_head=1, + dropout_rate=0., + attn_dropout_rate=0., + qkv_bias=True): + super(Attention, self).__init__( + name_scope="Attention") + self.num_head = num_head + self.head_dims = embed_dims // num_head + self.scale = self.head_dims ** -0.5 + + linear_weight_attr, linear_bias_attr = self._linear_init() + + self.qkv_proj = nn.Linear(in_features=embed_dims, + out_features=3*self.num_head*self.head_dims, + weight_attr=linear_weight_attr, + bias_attr=linear_bias_attr if qkv_bias else qkv_bias) + self.output = nn.Linear(in_features=self.num_head*self.head_dims, + out_features=embed_dims, + weight_attr=linear_weight_attr, + bias_attr=linear_bias_attr) + + self.softmax = nn.Softmax() + self.dropout = nn.Dropout(dropout_rate) + self.attn_dropout= nn.Dropout(attn_dropout_rate) + + def _linear_init(self): + weight_attr = nn.initializer.KaimingNormal() + bias_attr = nn.initializer.Constant(value=0.0) + return weight_attr, bias_attr + + def transfer_shape(self, q, k, v): + B, M, _ = q.shape + q = q.reshape(shape=[B, M, self.num_head, self.head_dims]) + q = q.transpose(perm=[0, 2, 1, 3]) # B, n_h, M, h_d + k = k.reshape(shape=[B, M, self.num_head, self.head_dims]) + k = k.transpose(perm=[0, 2, 1, 3]) # B, n_h, M, h_d + v = v.reshape(shape=[B, M, self.num_head, self.head_dims]) + v = v.transpose(perm=[0, 2, 1, 3]) # B, n_h, M, h_d + + return q, k, v + + def forward(self, inputs): + B, M, D = inputs.shape + assert D % self.num_head == 0, \ + "Erorr: Please make sure Token.D % "+\ + "num_head == 0(now:{0}).".format(D % self.num_head) + + qkv= self.qkv_proj(inputs) + q, k, v = qkv.chunk(3, axis=-1) + # B, n_h, M, h_d + q, k, v = self.transfer_shape(q, k, v) + + attn = paddle.matmul(q, k, transpose_y=True) # B, n_h, M, M + attn = attn * self.scale + attn = self.softmax(attn) + attn = self.attn_dropout(attn) + + z = paddle.matmul(attn, v) # B, n_h, M, h_d + z = z.transpose(perm=[0, 2, 1, 3]) # B, M, n_h, h_d + z = z.reshape(shape=[B, M, self.num_head*self.head_dims]) + z = self.output(z) + z = self.attn_dropout(z) + z = z + inputs + + return z \ No newline at end of file diff --git a/image_classification/MobileFormer/augment.py b/image_classification/MobileFormer/augment.py new file mode 100644 index 00000000..19276756 --- /dev/null +++ b/image_classification/MobileFormer/augment.py @@ -0,0 +1,285 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +"""Augmentation""" +""" Rand Augmentation """ +# reference: RandAugment: Practical automated data augmentation with a reduced search space +# https://arxiv.org/abs/1909.13719 + +""" Auto Augmentation """ +# reference: AutoAugment: Learning Augmentation Policies from Data +# https://arxiv.org/abs/1805.09501 + +import random +import numpy as np +from PIL import Image, ImageEnhance, ImageOps + + +def auto_augment_policy_original(): + """25 types of augment policies in original paper""" + policy = [ + [('Posterize', 0.4, 8), ('Rotate', 0.6, 9)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + [('Posterize', 0.6, 7), ('Posterize', 0.6, 6)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Equalize', 0.4, 4), ('Rotate', 0.8, 8)], + [('Solarize', 0.6, 3), ('Equalize', 0.6, 7)], + [('Posterize', 0.8, 5), ('Equalize', 1.0, 2)], + [('Rotate', 0.2, 3), ('Solarize', 0.6, 8)], + [('Equalize', 0.6, 8), ('Posterize', 0.4, 6)], + [('Rotate', 0.8, 8), ('Color', 0.4, 0)], + [('Rotate', 0.4, 9), ('Equalize', 0.6, 2)], + [('Equalize', 0.0, 7), ('Equalize', 0.8, 8)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Rotate', 0.8, 8), ('Color', 1.0, 2)], + [('Color', 0.8, 8), ('Solarize', 0.8, 7)], + [('Sharpness', 0.4, 7), ('Invert', 0.6, 8)], + [('ShearX', 0.6, 5), ('Equalize', 1.0, 9)], + [('Color', 0.4, 0), ('Equalize', 0.6, 3)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + ] + policy = [[SubPolicy(*args) for args in subpolicy] for subpolicy in policy] + return policy + + +def rand_augment_policy_original(magnitude_idx): + """ + 14 types of augment policies in original paper + Args: + magnitude_idx: M + """ + policy = [ + ('Posterize', 1, magnitude_idx), ('Rotate', 1, magnitude_idx), + ('Solarize', 1, magnitude_idx), ('AutoContrast', 1, magnitude_idx), + ('Equalize', 1, magnitude_idx), ('Contrast', 1, magnitude_idx), + ('Color', 1, magnitude_idx), ('Invert', 1, magnitude_idx), + ('Sharpness', 1, magnitude_idx), ('Brightness', 1, magnitude_idx), + ('ShearX', 1, magnitude_idx), ('ShearY', 1, magnitude_idx), + ('TranslateX', 1, magnitude_idx), ('TranslateY', 1, magnitude_idx), + ] + policy = [SubPolicy(*args) for args in policy] + return policy + + +class AutoAugment(): + """Auto Augment + Randomly choose a tuple of augment ops from a list of policy + Then apply the tuple of augment ops to input image + + Examples: + policy = auto_augment_policy_original() + augment = AutoAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy): + self.policy = policy + + def __call__(self, image, policy_idx=None): + if policy_idx is None: + policy_idx = random.randint(0, len(self.policy) - 1) + + sub_policy = self.policy[policy_idx] + for op in sub_policy: + image = op(image) + return image + + +class RandAugment(): + """Rand Augment + Randomly choose N augment ops from a list of K policies + Then apply the N ops to input image + + Examples: + policy = rand_augment_policy_original(magnitude_idx) + augment = RandAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy, num_layers): + """ + Args: + policy: list of SubPolicy + num_layers: int + """ + self.policy = policy + self.num_layers = num_layers + + def __call__(self, image): + selected_idx = 
np.random.choice(len(self.policy), self.num_layers) + + for policy_idx in selected_idx: + sub_policy = self.policy[policy_idx] + image = sub_policy(image) + return image + + +class SubPolicy: + """Subpolicy + Read augment name and magnitude, apply augment with probability + Args: + op_name: str, augment operation name + prob: float, if prob > random prob, apply augment + magnitude_idx: int, index of magnitude in preset magnitude ranges + """ + + def __init__(self, op_name, prob, magnitude_idx): + # ranges of operations' magnitude + ranges = { + 'ShearX': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'ShearY': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'TranslateX': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'TranslateY': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'Rotate': np.linspace(0, 30, 10), # [-30, 30] (by random negative) + 'Color': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Posterize': np.round(np.linspace(8, 4, 10), 0).astype(np.int), # [0, 4] + 'Solarize': np.linspace(256, 0, 10), # [0, 256] + 'Contrast': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Sharpness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Brightness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'AutoContrast': [0] * 10, # no range + 'Equalize': [0] * 10, # no range + 'Invert': [0] * 10, # no range + } + + # augmentation operations + # Lambda is not pickleable for DDP + # image_ops = { + # 'ShearX': lambda image, magnitude: shear_x(image, magnitude), + # 'ShearY': lambda image, magnitude: shear_y(image, magnitude), + # 'TranslateX': lambda image, magnitude: translate_x(image, magnitude), + # 'TranslateY': lambda image, magnitude: translate_y(image, magnitude), + # 'Rotate': lambda image, magnitude: rotate(image, magnitude), + # 'AutoContrast': lambda image, magnitude: auto_contrast(image, magnitude), + # 'Invert': lambda image, magnitude: invert(image, magnitude), + # 'Equalize': lambda image, magnitude: equalize(image, magnitude), + # 'Solarize': lambda image, magnitude: solarize(image, magnitude), + # 'Posterize': lambda image, magnitude: posterize(image, magnitude), + # 'Contrast': lambda image, magnitude: contrast(image, magnitude), + # 'Color': lambda image, magnitude: color(image, magnitude), + # 'Brightness': lambda image, magnitude: brightness(image, magnitude), + # 'Sharpness': lambda image, magnitude: sharpness(image, magnitude), + # } + image_ops = { + 'ShearX': shear_x, + 'ShearY': shear_y, + 'TranslateX': translate_x_relative, + 'TranslateY': translate_y_relative, + 'Rotate': rotate, + 'AutoContrast': auto_contrast, + 'Invert': invert, + 'Equalize': equalize, + 'Solarize': solarize, + 'Posterize': posterize, + 'Contrast': contrast, + 'Color': color, + 'Brightness': brightness, + 'Sharpness': sharpness, + } + + self.prob = prob + self.magnitude = ranges[op_name][magnitude_idx] + self.op = image_ops[op_name] + + def __call__(self, image): + if self.prob > random.random(): + image = self.op(image, self.magnitude) + return image + + +# PIL Image transforms +# https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.transform +def shear_x(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, factor, 0, 0, 1, 0), fillcolor=fillcolor) + + +def shear_y(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * 
random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, factor, 1, 0), fillcolor=fillcolor) + + +def translate_x_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, pixels, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, pixels), fillcolor=fillcolor) + + +def translate_x_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, magnitude, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, magnitude), fillcolor=fillcolor) + + +def rotate(image, magnitude): + rot = image.convert("RGBA").rotate(magnitude) + return Image.composite(rot, + Image.new('RGBA', rot.size, (128,) * 4), + rot).convert(image.mode) + + +def auto_contrast(image, magnitude=None): + return ImageOps.autocontrast(image) + + +def invert(image, magnitude=None): + return ImageOps.invert(image) + + +def equalize(image, magnitude=None): + return ImageOps.equalize(image) + + +def solarize(image, magnitude): + return ImageOps.solarize(image, magnitude) + + +def posterize(image, magnitude): + return ImageOps.posterize(image, magnitude) + + +def contrast(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Contrast(image).enhance(1 + magnitude) + + +def color(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Color(image).enhance(1 + magnitude) + + +def brightness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Brightness(image).enhance(1 + magnitude) + + +def sharpness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Sharpness(image).enhance(1 + magnitude) + diff --git a/image_classification/MobileFormer/config.py b/image_classification/MobileFormer/config.py new file mode 100644 index 00000000..eb46752f --- /dev/null +++ b/image_classification/MobileFormer/config.py @@ -0,0 +1,244 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Configuration +Configuration for data, model archtecture, and training, etc. 
+Config can be set by .yaml file or by argparser(limited usage) +""" +import os +from yacs.config import CfgNode as CN +import yaml + +_C = CN() +_C.BASE = [''] + +# data settings - is ok +_C.DATA = CN() +_C.DATA.BATCH_SIZE = 256 # train batch_size for single GPU +_C.DATA.BATCH_SIZE_EVAL = 128 # val batch_size for single GPU +_C.DATA.DATA_PATH = 'ILSVRC2012_val/' # path to dataset +_C.DATA.DATASET = 'imagenet2012' # dataset name +_C.DATA.IMAGE_SIZE = 224 # input image size: 224 for pretrain, 384 for finetune +_C.DATA.CROP_PCT = 0.875 # input image scale ratio, scale is applied before centercrop in eval mode +_C.DATA.NUM_WORKERS = 2 # number of data loading threads +_C.DATA.IMAGENET_MEAN = [0.485, 0.456, 0.406] # [0.5, 0.5, 0.5] +_C.DATA.IMAGENET_STD = [0.229, 0.224, 0.225] # [0.5, 0.5, 0.5] + +# model settings +_C.MODEL = CN() +_C.MODEL.TYPE = 'MobileFormer' +_C.MODEL.NAME = 'MobileFormer_26M' +_C.MODEL.RESUME = None +_C.MODEL.PRETRAINED = None +_C.MODEL.NUM_CLASSES = 1000 +_C.MODEL.DROPOUT = 0.1 +_C.MODEL.DROPPATH = 0.1 +_C.MODEL.ATTENTION_DROPOUT = 0.1 +_C.MODEL.MLP_DROPOUT = 0.1 + +# mobileformer architecture settings +_C.MODEL.MF = CN() +_C.MODEL.MF.IN_CHANNELS = 3 +_C.MODEL.MF.TOKENS = [3, 128] # token size +_C.MODEL.MF.NUM_HEAD = 4 +_C.MODEL.MF.MLP_RATIO= 2.0 +_C.MODEL.MF.ALPHA = 1.0 +_C.MODEL.MF.QKV_BIAS = True +_C.MODEL.MF.POINTWISECONV_GROUPS = 4 # the groups of pointwise 1x1conv + +# mobileformer architecture settings -- dyrelu +_C.MODEL.MF.DYRELU = CN() +_C.MODEL.MF.DYRELU.USE_DYRELU = True +_C.MODEL.MF.DYRELU.REDUCE = 6.0 +_C.MODEL.MF.DYRELU.DYRELU_K = 2 +_C.MODEL.MF.DYRELU.COEFS = [1.0, 0.5] +_C.MODEL.MF.DYRELU.CONSTS = [1.0, 0.0] + +# mobileformer architecture settings -- stem +_C.MODEL.MF.STEM = CN() +_C.MODEL.MF.STEM.OUT_CHANNELS = 8 +_C.MODEL.MF.STEM.KERNELS = 3 +_C.MODEL.MF.STEM.STRIEDS = 2 +_C.MODEL.MF.STEM.PADDINGS = 1 + +# mobileformer architecture settings -- lite_bottleneck +_C.MODEL.MF.LITE_BNECK = CN() +_C.MODEL.MF.LITE_BNECK.IN_CHANNEL = 8 +_C.MODEL.MF.LITE_BNECK.HIDDEN_CHANNEL = 24 +_C.MODEL.MF.LITE_BNECK.OUT_CHANNEL = 12 +_C.MODEL.MF.LITE_BNECK.KERNEL = 3 +_C.MODEL.MF.LITE_BNECK.STRIED = 2 +_C.MODEL.MF.LITE_BNECK.PADDING = 1 + +# mobileformer architecture settings -- block, defualt 26m +_C.MODEL.MF.BLOCK = CN() +_C.MODEL.MF.BLOCK.IN_CHANNELS = [12, 12, 24, 24, 48, 48, 64, 96] +_C.MODEL.MF.BLOCK.HIDDEN_CHANNELS = [36, 72, 72, 144, 192, 288, 384, 576] +_C.MODEL.MF.BLOCK.OUT_CHANNELS = [12, 24, 24, 48, 48, 64, 96, 96] +_C.MODEL.MF.BLOCK.KERNELS = [3, 3, 3, 3, 3, 3, 3, 3] +_C.MODEL.MF.BLOCK.STRIEDS = [1, 2, 1, 2, 1, 1, 2, 1] +_C.MODEL.MF.BLOCK.PADDINGS = [1, 1, 1, 1, 1, 1, 1, 1] + +# mobileformer architecture settings -- channel conv1x1 +_C.MODEL.MF.CHANNEL_CONV = CN() +_C.MODEL.MF.CHANNEL_CONV.IN_CHANNEL = 96 +_C.MODEL.MF.CHANNEL_CONV.OUT_CHANNEL = 576 + +# mobileformer architecture settings -- conv1x1 +_C.MODEL.MF.HEAD = CN() +_C.MODEL.MF.HEAD.IN_CHANNEL = 96 +_C.MODEL.MF.HEAD.HIDDEN_FEATURE = 576 + + +# training settings +_C.TRAIN = CN() +_C.TRAIN.LAST_EPOCH = 0 +_C.TRAIN.NUM_EPOCHS = 450 +_C.TRAIN.WARMUP_EPOCHS = 30 +_C.TRAIN.WEIGHT_DECAY = 0.08 +_C.TRAIN.BASE_LR = 8e-4 +_C.TRAIN.WARMUP_START_LR = 1e-7 +_C.TRAIN.END_LR = 1e-5 +_C.TRAIN.GRAD_CLIP = 2.0 # Clip gradient norm +_C.TRAIN.ACCUM_ITER = 1 # Gradient accumulation steps + +_C.TRAIN.LR_SCHEDULER = CN() +_C.TRAIN.LR_SCHEDULER.NAME = 'warmupcosine' +_C.TRAIN.LR_SCHEDULER.MILESTONES = "30, 60, 90" # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_EPOCHS = 30 # only used in StepLRScheduler 
+_C.TRAIN.LR_SCHEDULER.DECAY_RATE = 0.1 # only used in StepLRScheduler + +_C.TRAIN.OPTIMIZER = CN() +_C.TRAIN.OPTIMIZER.NAME = 'AdamW' +_C.TRAIN.OPTIMIZER.EPS = 1e-8 +_C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999) # for adamW +_C.TRAIN.OPTIMIZER.MOMENTUM = 0.9 + +# ----------------------------------------------------------------------------- +# Augmentation settings +# ----------------------------------------------------------------------------- +_C.TRAIN.MIXUP_ALPHA = 0.8 +_C.TRAIN.CUTMIX_ALPHA = 1.0 +_C.TRAIN.CUTMIX_MINMAX = None +_C.TRAIN.MIXUP_PROB = 1.0 +_C.TRAIN.MIXUP_SWITCH_PROB = 0.5 +_C.TRAIN.MIXUP_MODE = 'batch' + +_C.TRAIN.SMOOTHING = 0.1 +_C.TRAIN.COLOR_JITTER = 0.4 +_C.TRAIN.AUTO_AUGMENT = True #'rand-m9-mstd0.5-inc1' + +_C.TRAIN.RANDOM_ERASE_PROB = 0.25 +_C.TRAIN.RANDOM_ERASE_MODE = 'pixel' # How to apply mixup/cutmix params. Per "batch", "pair", or "elem" +_C.TRAIN.RANDOM_ERASE_COUNT = 1 +_C.TRAIN.RANDOM_ERASE_SPLIT = False + +# ----------------------------------------------------------------------------- +# Misc +# ----------------------------------------------------------------------------- +_C.TEST = CN() +_C.TEST.CROP = True # 预测时,是否使用裁剪 + +# ----------------------------------------------------------------------------- +# Misc +# ----------------------------------------------------------------------------- +_C.AMP = False +_C.SAVE = "./output" +_C.TAG = 'default' +_C.SAVE_FREQ = 1 # Frequency to save checkpoint +_C.REPORT_FREQ = 100 # Frequency to logging info +_C.VALIDATE_FREQ = 10 # freq to do validation +_C.SEED = 0 # Fixed random seed +_C.EVAL = False +_C.THROUGHPUT_MODE = False +_C.LOCAL_RANK = 0 +_C.NGPUS = -1 + + +def _update_config_from_file(config, cfg_file): + config.defrost() + with open(cfg_file, 'r') as f: + yaml_cfg = yaml.load(f, Loader=yaml.FullLoader) + + for cfg in yaml_cfg.setdefault('BASE', ['']): + if cfg: + _update_config_from_file( + config, os.path.join(os.path.dirname(cfg_file), cfg) + ) + print('=> merge config from {}'.format(cfg_file)) + config.merge_from_file(cfg_file) + #config.freeze() + +def update_config(config, args): + """Update config by ArgumentParser + Args: + args: ArgumentParser contains options + Return: + config: updated config + """ + if args.cfg: + _update_config_from_file(config, args.cfg) + if args.model_type: + config.MODEL.MF.TYPE = args.model_type + if args.dataset: + config.DATA.DATASET = args.dataset + if args.batch_size: + config.DATA.BATCH_SIZE = args.batch_size + if args.image_size: + config.DATA.IMAGE_SIZE = args.image_size + if args.num_classes: + config.MODEL.NUM_CLASSES = args.num_classes + if args.data_path: + config.DATA.DATA_PATH = args.data_path + if args.ngpus: + config.NGPUS = args.ngpus + if args.eval: + config.EVAL = True + config.DATA.BATCH_SIZE_EVAL = args.batch_size + if args.pretrained: + config.MODEL.PRETRAINED = args.pretrained + if args.resume: + config.MODEL.RESUME = args.resume + if args.last_epoch: + config.TRAIN.LAST_EPOCH = args.last_epoch + if args.output is not None: + config.SAVE = args.output + if args.save_freq: + config.SAVE_FREQ = args.save_freq + if args.log_freq: + config.REPORT_FREQ = args.log_freq + if args.validate_freq: + config.VALIDATE_FREQ = args.validate_freq + if args.num_workers: + config.DATA.NUM_WORKERS = args.num_workers + if args.accum_iter: + config.TRAIN.ACCUM_ITER = args.accum_iter + if args.amp: # only during training + if config.EVAL is True: + config.AMP = False + else: + config.AMP = True + + # output folder + config.SAVE = os.path.join(config.SAVE, config.MODEL.NAME, 
config.TAG) + + return config + + +def get_config(cfg_file=None): + """Return a clone of config or load from yaml file""" + config = _C.clone() + if cfg_file: + _update_config_from_file(config, cfg_file) + return config diff --git a/image_classification/MobileFormer/configs/mobileformer_151m.yaml b/image_classification/MobileFormer/configs/mobileformer_151m.yaml new file mode 100644 index 00000000..2e8e7778 --- /dev/null +++ b/image_classification/MobileFormer/configs/mobileformer_151m.yaml @@ -0,0 +1,49 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.9 +MODEL: + TYPE: MobileFormer + NAME: MobileFormer_151M + DROPPATH: 0.1 + DROPOUT: 0.1 + MLP_DROPOUT: 0.1 + ATTENTION_DROPOUT: 0.1 + MF: + IN_CHANNELS: 3 + TOKENS: [6, 192] + NUM_HEAD: 4 + MLP_RATIO: 2.0 + ALPHA: 1.0 + QKV_BIAS: True + POINTWISECONV_GROUPS: 1 + STEM: + OUT_CHANNELS: 12 + KERNELS: 3 + STRIEDS: 2 + PADDINGS: 1 + LITE_BNECK: + IN_CHANNEL: 12 + HIDDEN_CHANNEL: 24 + OUT_CHANNEL: 12 + KERNEL: 3 + STRIED: 1 + PADDING: 1 + BLOCK: + IN_CHANNELS: [12, 16, 16, 32, 32, 64, 64, 88, 88, 128, 128] + HIDDEN_CHANNELS: [72, 48, 96, 96, 192, 256, 384, 528, 528, 768, 768] + OUT_CHANNELS: [16, 16, 32, 32, 64, 64, 88, 88, 128, 128, 128] + KERNELS: [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3] + STRIEDS: [2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1] + PADDINGS: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] + CHANNEL_CONV: + IN_CHANNEL: 128 + OUT_CHANNEL: 768 + HEAD: + IN_CHANNEL: 768 + HIDDEN_FEATURE: 1280 + DYRELU: + USE_DYRELU: True + REDUCE: 5.5 + DYRELU_K: 2 + COEFS: [1.0, 0.5] + CONSTS: [1.0, 0.0] \ No newline at end of file diff --git a/image_classification/MobileFormer/configs/mobileformer_214m.yaml b/image_classification/MobileFormer/configs/mobileformer_214m.yaml new file mode 100644 index 00000000..2f54101f --- /dev/null +++ b/image_classification/MobileFormer/configs/mobileformer_214m.yaml @@ -0,0 +1,49 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.9 +MODEL: + TYPE: MobileFormer + NAME: MobileFormer_214M + DROPPATH: 0.1 + DROPOUT: 0.1 + MLP_DROPOUT: 0.1 + ATTENTION_DROPOUT: 0.1 + MF: + IN_CHANNELS: 3 + TOKENS: [6, 192] + NUM_HEAD: 4 + MLP_RATIO: 2.0 + ALPHA: 1.0 + QKV_BIAS: True + POINTWISECONV_GROUPS: 1 + STEM: + OUT_CHANNELS: 12 + KERNELS: 3 + STRIEDS: 2 + PADDINGS: 1 + LITE_BNECK: + IN_CHANNEL: 12 + HIDDEN_CHANNEL: 24 + OUT_CHANNEL: 12 + KERNEL: 3 + STRIED: 1 + PADDING: 1 + BLOCK: + IN_CHANNELS: [12, 20, 20, 40, 40, 80, 80, 112, 112, 160, 160] + HIDDEN_CHANNELS: [72, 60, 120, 160, 240, 320, 480, 672, 672, 960, 960] + OUT_CHANNELS: [20, 20, 40, 40, 80, 80, 112, 112, 160, 160, 160] + KERNELS: [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3] + STRIEDS: [2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1] + PADDINGS: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] + CHANNEL_CONV: + IN_CHANNEL: 160 + OUT_CHANNEL: 960 + HEAD: + IN_CHANNEL: 960 + HIDDEN_FEATURE: 1600 + DYRELU: + USE_DYRELU: True + REDUCE: 4.65 + DYRELU_K: 2 + COEFS: [1.0, 0.5] + CONSTS: [1.0, 0.0] \ No newline at end of file diff --git a/image_classification/MobileFormer/configs/mobileformer_26m.yaml b/image_classification/MobileFormer/configs/mobileformer_26m.yaml new file mode 100644 index 00000000..e1e84c7d --- /dev/null +++ b/image_classification/MobileFormer/configs/mobileformer_26m.yaml @@ -0,0 +1,49 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.9 +MODEL: + TYPE: MobileFormer + NAME: MobileFormer_26M + DROPPATH: 0.1 + DROPOUT: 0.1 + MLP_DROPOUT: 0.1 + ATTENTION_DROPOUT: 0.1 + MF: + IN_CHANNELS: 3 + TOKENS: [3, 128] + NUM_HEAD: 4 + MLP_RATIO: 2.0 + ALPHA: 1.0 + QKV_BIAS: True + POINTWISECONV_GROUPS: 4 + STEM: + OUT_CHANNELS: 8 + KERNELS: 3 + STRIEDS: 2 + PADDINGS: 
1 + LITE_BNECK: + IN_CHANNEL: 8 + HIDDEN_CHANNEL: 24 + OUT_CHANNEL: 12 + KERNEL: 3 + STRIED: 2 + PADDING: 1 + BLOCK: + IN_CHANNELS: [12, 12, 24, 24, 48, 48, 64, 96] + HIDDEN_CHANNELS: [36, 72, 72, 144, 192, 288, 384, 576] + OUT_CHANNELS: [12, 24, 24, 48, 48, 64, 96, 96] + KERNELS: [3, 3, 3, 3, 3, 3, 3, 3] + STRIEDS: [1, 2, 1, 2, 1, 1, 2, 1] + PADDINGS: [1, 1, 1, 1, 1, 1, 1, 1] + CHANNEL_CONV: + IN_CHANNEL: 96 + OUT_CHANNEL: 576 + HEAD: + IN_CHANNEL: 576 + HIDDEN_FEATURE: 1024 + DYRELU: + USE_DYRELU: True + REDUCE: 6.0 + DYRELU_K: 2 + COEFS: [1.0, 0.5] + CONSTS: [1.0, 0.0] \ No newline at end of file diff --git a/image_classification/MobileFormer/configs/mobileformer_294m.yaml b/image_classification/MobileFormer/configs/mobileformer_294m.yaml new file mode 100644 index 00000000..567048e3 --- /dev/null +++ b/image_classification/MobileFormer/configs/mobileformer_294m.yaml @@ -0,0 +1,49 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.9 +MODEL: + TYPE: MobileFormer + NAME: MobileFormer_294M + DROPPATH: 0.1 + DROPOUT: 0.1 + MLP_DROPOUT: 0.1 + ATTENTION_DROPOUT: 0.1 + MF: + IN_CHANNELS: 3 + TOKENS: [6, 192] + NUM_HEAD: 8 + MLP_RATIO: 2.0 + ALPHA: 1.0 + QKV_BIAS: True + POINTWISECONV_GROUPS: 1 + STEM: + OUT_CHANNELS: 16 + KERNELS: 3 + STRIEDS: 2 + PADDINGS: 1 + LITE_BNECK: + IN_CHANNEL: 16 + HIDDEN_CHANNEL: 32 + OUT_CHANNEL: 16 + KERNEL: 3 + STRIED: 1 + PADDING: 1 + BLOCK: + IN_CHANNELS: [16, 24, 24, 48, 48, 96, 96, 128, 128, 192, 192] + HIDDEN_CHANNELS: [96, 96, 144, 192, 288, 384, 576, 768, 768, 1152, 1152] + OUT_CHANNELS: [24, 24, 48, 48, 96, 96, 128, 128, 192, 192, 192] + KERNELS: [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3] + STRIEDS: [2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1] + PADDINGS: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] + CHANNEL_CONV: + IN_CHANNEL: 192 + OUT_CHANNEL: 1152 + HEAD: + IN_CHANNEL: 1152 + HIDDEN_FEATURE: 1920 + DYRELU: + USE_DYRELU: True + REDUCE: 3.8 + DYRELU_K: 2 + COEFS: [1.0, 0.5] + CONSTS: [1.0, 0.0] \ No newline at end of file diff --git a/image_classification/MobileFormer/configs/mobileformer_508m.yaml b/image_classification/MobileFormer/configs/mobileformer_508m.yaml new file mode 100644 index 00000000..b1f9c60d --- /dev/null +++ b/image_classification/MobileFormer/configs/mobileformer_508m.yaml @@ -0,0 +1,49 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.9 +MODEL: + TYPE: MobileFormer + NAME: MobileFormer_508M + DROPPATH: 0.1 + DROPOUT: 0.1 + MLP_DROPOUT: 0.1 + ATTENTION_DROPOUT: 0.1 + MF: + IN_CHANNELS: 3 + TOKENS: [6, 192] + NUM_HEAD: 8 + MLP_RATIO: 2.0 + ALPHA: 1.0 + QKV_BIAS: True + POINTWISECONV_GROUPS: 1 + STEM: + OUT_CHANNELS: 24 + KERNELS: 3 + STRIEDS: 2 + PADDINGS: 1 + LITE_BNECK: + IN_CHANNEL: 24 + HIDDEN_CHANNEL: 48 + OUT_CHANNEL: 24 + KERNEL: 3 + STRIED: 1 + PADDING: 1 + BLOCK: + IN_CHANNELS: [24, 40, 40, 72, 72, 128, 128, 176, 176, 240, 240] + HIDDEN_CHANNELS: [144, 120, 240, 216, 432, 512, 768, 1056, 1056, 1440, 1440] + OUT_CHANNELS: [40, 40, 72, 72, 128, 128, 176, 176, 240, 240, 240] + KERNELS: [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3] + STRIEDS: [2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1] + PADDINGS: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] + CHANNEL_CONV: + IN_CHANNEL: 240 + OUT_CHANNEL: 1440 + HEAD: + IN_CHANNEL: 1440 + HIDDEN_FEATURE: 1920 + DYRELU: + USE_DYRELU: True + REDUCE: 3.25 + DYRELU_K: 2 + COEFS: [1.0, 0.5] + CONSTS: [1.0, 0.0] \ No newline at end of file diff --git a/image_classification/MobileFormer/configs/mobileformer_52m.yaml b/image_classification/MobileFormer/configs/mobileformer_52m.yaml new file mode 100644 index 00000000..5933fbd3 --- /dev/null +++ 
b/image_classification/MobileFormer/configs/mobileformer_52m.yaml @@ -0,0 +1,49 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.9 +MODEL: + TYPE: MobileFormer + NAME: MobileFormer_52M + DROPPATH: 0.1 + DROPOUT: 0.1 + MLP_DROPOUT: 0.1 + ATTENTION_DROPOUT: 0.1 + MF: + IN_CHANNELS: 3 + TOKENS: [3, 128] + NUM_HEAD: 4 + MLP_RATIO: 2.0 + ALPHA: 1.0 + QKV_BIAS: True + POINTWISECONV_GROUPS: 1 + STEM: + OUT_CHANNELS: 8 + KERNELS: 3 + STRIEDS: 2 + PADDINGS: 1 + LITE_BNECK: + IN_CHANNEL: 8 + HIDDEN_CHANNEL: 24 + OUT_CHANNEL: 12 + KERNEL: 3 + STRIED: 1 + PADDING: 1 + BLOCK: + IN_CHANNELS: [12, 12, 24, 24, 48, 48, 64, 96] + HIDDEN_CHANNELS: [36, 72, 72, 144, 192, 288, 384, 576] + OUT_CHANNELS: [12, 24, 24, 48, 48, 64, 96, 96] + KERNELS: [3, 3, 3, 3, 3, 3, 3, 3] + STRIEDS: [1, 2, 1, 2, 1, 1, 2, 1] + PADDINGS: [1, 1, 1, 1, 1, 1, 1, 1] + CHANNEL_CONV: + IN_CHANNEL: 96 + OUT_CHANNEL: 576 + HEAD: + IN_CHANNEL: 576 + HIDDEN_FEATURE: 1024 + DYRELU: + USE_DYRELU: True + REDUCE: 4.4 + DYRELU_K: 2 + COEFS: [1.0, 0.5] + CONSTS: [1.0, 0.0] \ No newline at end of file diff --git a/image_classification/MobileFormer/configs/mobileformer_96m.yaml b/image_classification/MobileFormer/configs/mobileformer_96m.yaml new file mode 100644 index 00000000..ba601f89 --- /dev/null +++ b/image_classification/MobileFormer/configs/mobileformer_96m.yaml @@ -0,0 +1,49 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.9 +MODEL: + TYPE: MobileFormer + NAME: MobileFormer_96M + DROPPATH: 0.1 + DROPOUT: 0.1 + MLP_DROPOUT: 0.1 + ATTENTION_DROPOUT: 0.1 + MF: + IN_CHANNELS: 3 + TOKENS: [4, 128] + NUM_HEAD: 4 + MLP_RATIO: 2.0 + ALPHA: 1.0 + QKV_BIAS: True + POINTWISECONV_GROUPS: 1 + STEM: + OUT_CHANNELS: 12 + KERNELS: 3 + STRIEDS: 2 + PADDINGS: 1 + LITE_BNECK: + IN_CHANNEL: 12 + HIDDEN_CHANNEL: 24 + OUT_CHANNEL: 12 + KERNEL: 3 + STRIED: 1 + PADDING: 1 + BLOCK: + IN_CHANNELS: [12, 16, 32, 32, 64, 64, 88, 128] + HIDDEN_CHANNELS: [72, 96, 96, 192, 256, 384, 528, 768] + OUT_CHANNELS: [16, 32, 32, 64, 64, 88, 128, 128] + KERNELS: [3, 3, 3, 3, 3, 3, 3, 3] + STRIEDS: [2, 2, 1, 2, 1, 1, 2, 1] + PADDINGS: [1, 1, 1, 1, 1, 1, 1, 1] + CHANNEL_CONV: + IN_CHANNEL: 128 + OUT_CHANNEL: 768 + HEAD: + IN_CHANNEL: 768 + HIDDEN_FEATURE: 1280 + DYRELU: + USE_DYRELU: True + REDUCE: 4.0 + DYRELU_K: 2 + COEFS: [1.0, 0.5] + CONSTS: [1.0, 0.0] \ No newline at end of file diff --git a/image_classification/MobileFormer/datasets.py b/image_classification/MobileFormer/datasets.py new file mode 100644 index 00000000..daacccff --- /dev/null +++ b/image_classification/MobileFormer/datasets.py @@ -0,0 +1,207 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" +Dataset related classes and methods for ViT training and validation +Cifar10, Cifar100 and ImageNet2012 are supported +""" + +import os +import math +from paddle.io import Dataset +from paddle.io import DataLoader +from paddle.io import DistributedBatchSampler +from paddle.vision import transforms +from paddle.vision import datasets +from paddle.vision import image_load +from augment import auto_augment_policy_original +from augment import AutoAugment +from transforms import RandomHorizontalFlip +from random_erasing import RandomErasing + +class ImageNet2012Dataset(Dataset): + """Build ImageNet2012 dataset + This class gets train/val imagenet datasets, which loads transfomed data and labels. + Attributes: + file_folder: path where imagenet images are stored + transform: preprocessing ops to apply on image + img_path_list: list of full path of images in whole dataset + label_list: list of labels of whole dataset + """ + + def __init__(self, file_folder, mode="train", transform=None): + """Init ImageNet2012 Dataset with dataset file path, mode(train/val), and transform""" + super(ImageNet2012Dataset, self).__init__() + assert mode in ["train", "val"] + self.file_folder = file_folder + self.transform = transform + self.img_path_list = [] + self.label_list = [] + + if mode == "train": + self.list_file = os.path.join(self.file_folder, "train_list.txt") + else: + self.list_file = os.path.join(self.file_folder, "val_list.txt") + + with open(self.list_file, 'r') as infile: + for line in infile: + img_path = line.strip().split()[0] + img_label = int(line.strip().split()[1]) + self.img_path_list.append(os.path.join(self.file_folder, img_path)) + self.label_list.append(img_label) + print(f'----- Imagenet2012 image {mode} list len = {len(self.label_list)}') + + def __len__(self): + return len(self.label_list) + + def __getitem__(self, index): + data = image_load(self.img_path_list[index]).convert('RGB') + data = self.transform(data) + label = self.label_list[index] + + return data, label + + +def get_train_transforms(config): + """ Get training transforms + For training, a RandomResizedCrop is applied, then normalization is applied with + [0.5, 0.5, 0.5] mean and std. The input pixel values must be rescaled to [0, 1.] 
+ Outputs is converted to tensor + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + + aug_op_list = [] + # STEP1: random crop and resize + aug_op_list.append( + transforms.RandomResizedCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE), + scale=(0.05, 1.0), interpolation='bicubic')) + # STEP2: auto_augment or color jitter + if config.TRAIN.AUTO_AUGMENT: + policy = auto_augment_policy_original() + auto_augment = AutoAugment(policy) + aug_op_list.append(auto_augment) + else: + jitter = (float(config.TRAIN.COLOR_JITTER), ) * 3 + aug_op_list.append(transforms.ColorJitter(*jitter)) + # STEP3: other ops + aug_op_list.append(transforms.ToTensor()) + aug_op_list.append(transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, + std=config.DATA.IMAGENET_STD)) + # STEP4: random erasing + if config.TRAIN.RANDOM_ERASE_PROB > 0.: + random_erasing = RandomErasing(prob=config.TRAIN.RANDOM_ERASE_PROB, + mode=config.TRAIN.RANDOM_ERASE_MODE, + max_count=config.TRAIN.RANDOM_ERASE_COUNT, + num_splits=config.TRAIN.RANDOM_ERASE_SPLIT) + aug_op_list.append(random_erasing) + # Final: compose transforms and return + transforms_train = transforms.Compose(aug_op_list) + return transforms_train + + +def get_val_transforms(config): + """ Get training transforms + For validation, image is first Resize then CenterCrop to image_size. + Then normalization is applied with [0.5, 0.5, 0.5] mean and std. + The input pixel values must be rescaled to [0, 1.] + Outputs is converted to tensor + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + if config.TEST.CROP: + scale_size = int(math.floor(config.DATA.IMAGE_SIZE / config.DATA.CROP_PCT)) + transforms_val = transforms.Compose([ + transforms.Resize(scale_size, interpolation='bicubic'), + transforms.CenterCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE)), + transforms.ToTensor(), + transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD), + ]) + else: + transforms_val = transforms.Compose([ + transforms.Resize(config.DATA.IMAGE_SIZE, interpolation='bicubic'), + transforms.ToTensor(), + transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD), + ]) + return transforms_val + + +def get_dataset(config, mode='train'): + """ Get dataset from config and mode (train/val) + Returns the related dataset object according to configs and mode(train/val) + Args: + config: configs contains dataset related settings. 
see config.py for details + Returns: + dataset: dataset object + """ + + assert mode in ['train', 'val'] + if config.DATA.DATASET == "cifar10": + if mode == 'train': + dataset = datasets.Cifar10(mode=mode, transform=get_train_transforms(config)) + else: + dataset = datasets.Cifar10(mode='test', transform=get_val_transforms(config)) + elif config.DATA.DATASET == "cifar100": + if mode == 'train': + dataset = datasets.Cifar100(mode=mode, transform=get_train_transforms(config)) + else: + dataset = datasets.Cifar100(mode='test', transform=get_val_transforms(config)) + elif config.DATA.DATASET == "imagenet2012": + if mode == 'train': + dataset = ImageNet2012Dataset(config.DATA.DATA_PATH, + mode=mode, + transform=get_train_transforms(config)) + else: + dataset = ImageNet2012Dataset(config.DATA.DATA_PATH, + mode=mode, + transform=get_val_transforms(config)) + else: + raise NotImplementedError( + "[{config.DATA.DATASET}] Only cifar10, cifar100, imagenet2012 are supported now") + return dataset + + +def get_dataloader(config, dataset, mode='train', multi_process=False): + """Get dataloader with config, dataset, mode as input, allows multiGPU settings. + Multi-GPU loader is implements as distributedBatchSampler. + Args: + config: see config.py for details + dataset: paddle.io.dataset object + mode: train/val + multi_process: if True, use DistributedBatchSampler to support multi-processing + Returns: + dataloader: paddle.io.DataLoader object. + """ + + if mode == 'train': + batch_size = config.DATA.BATCH_SIZE + else: + batch_size = config.DATA.BATCH_SIZE_EVAL + + if multi_process is True: + sampler = DistributedBatchSampler(dataset, + batch_size=batch_size, + shuffle=(mode == 'train')) + dataloader = DataLoader(dataset, + batch_sampler=sampler, + num_workers=config.DATA.NUM_WORKERS) + else: + dataloader = DataLoader(dataset, + batch_size=batch_size, + num_workers=config.DATA.NUM_WORKERS, + shuffle=(mode == 'train')) + return dataloader \ No newline at end of file diff --git a/image_classification/MobileFormer/droppath.py b/image_classification/MobileFormer/droppath.py new file mode 100644 index 00000000..617d187e --- /dev/null +++ b/image_classification/MobileFormer/droppath.py @@ -0,0 +1,42 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Implement Multi-Branch Dropout Layer +""" +import paddle +from paddle import nn + +class DropPath(nn.Layer): + """Multi-branch dropout layer -- Along the axis of Batch + Params Info: + p: droppath rate + """ + def __init__(self, + p=0.): + super(DropPath, self).__init__( + name_scope="DropPath") + self.p = p + + def forward(self, inputs): + if self.p > 0. and self.training: + keep_p = 1 - self.p + keep_p = paddle.to_tensor([keep_p]) + # B, 1, 1.... 
+ shape = [inputs.shape[0]] + [1] * (inputs.ndim-1) + random_dr = keep_p + paddle.rand(shape=shape, dtype='float32') + random_sample = random_dr.floor() # floor to int--B + output = inputs.divide(keep_p) * random_sample + return output + + return inputs diff --git a/image_classification/MobileFormer/dyrelu.py b/image_classification/MobileFormer/dyrelu.py new file mode 100644 index 00000000..d0435431 --- /dev/null +++ b/image_classification/MobileFormer/dyrelu.py @@ -0,0 +1,119 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Implement MLP And DYReLU Layer +""" +import paddle +from paddle import nn + +class MLP(nn.Layer): + """Multi Layer Perceptron + Params Info: + in_features: input token feature size + out_features: output token feature size + mlp_ratio: the scale of hidden feature size + mlp_dropout_rate: the dropout rate of mlp layer output + """ + def __init__(self, + in_features, + out_features=None, + mlp_ratio=2, + mlp_dropout_rate=0., + act=nn.GELU): + super(MLP, self).__init__(name_scope="MLP") + self.out_features = in_features if out_features is None else \ + out_features + linear_weight_attr, linear_bias_attr = self._linear_init() + + self.fc1 = nn.Linear(in_features=in_features, + out_features=int(mlp_ratio*in_features), + weight_attr=linear_weight_attr, + bias_attr=linear_bias_attr) + self.fc2 = nn.Linear(in_features=int(mlp_ratio*in_features), + out_features=self.out_features, + weight_attr=linear_weight_attr, + bias_attr=linear_bias_attr) + + self.act = act() + self.dropout = nn.Dropout(mlp_dropout_rate) + + def _linear_init(self): + weight_attr = nn.initializer.KaimingNormal() + bias_attr = nn.initializer.Constant(value=0.0) + return weight_attr, bias_attr + + def forward(self, inputs): + x = self.fc1(inputs) + x = self.act(x) + x = self.dropout(x) + x = self.fc2(x) + x = self.dropout(x) + return x + +class DyReLU(nn.Layer): + """Dynamic ReLU activation function -- use one MLP + Params Info: + in_channels: input feature map channels + embed_dims: input token embed_dims + k: the number of parameters is in Dynamic ReLU + coefs: the init value of coefficient parameters + consts: the init value of constant parameters + reduce: the mlp hidden scale, + means 1/reduce = mlp_ratio + """ + def __init__(self, + in_channels, + embed_dims, + k=2, # a_1, a_2 coef, b_1, b_2 bias + coefs=[1.0, 0.5], # coef init value + consts=[1.0, 0.0], # const init value + reduce=4): + super(DyReLU, self).__init__( + name_scope="DyReLU") + self.embed_dims = embed_dims + self.in_channels = in_channels + self.k = k + + self.mid_channels = 2*k*in_channels + + # 4 values + # a_k = alpha_k + coef_k*x, 2 + # b_k = belta_k + coef_k*x, 2 + self.coef = paddle.to_tensor([coefs[0]]*k + [coefs[1]]*k) + self.const = paddle.to_tensor([consts[0]] + [consts[1]]*(2*k-1)) + + self.project = nn.Sequential( + MLP(in_features=embed_dims, + out_features=self.mid_channels, + mlp_ratio=1/reduce, + act=nn.ReLU), + nn.BatchNorm(self.mid_channels) + ) + + def 
forward(self, feature_map, tokens): + B, M, D = tokens.shape + dy_params = self.project(tokens[:, 0]) # B, mid_channels + # B, IN_CHANNELS, 2*k + dy_params = dy_params.reshape(shape=[B, self.in_channels, 2*self.k]) + + # B, IN_CHANNELS, 2*k -- a_1, a_2, b_1, b_2 + dy_init_params = dy_params * self.coef + self.const + f = feature_map.transpose(perm=[2, 3, 0, 1]).unsqueeze(axis=-1) # H, W, B, C, 1 + + # output shape: H, W, B, C, k + output = f * dy_init_params[:, :, :self.k] + dy_init_params[:, :, self.k:] + output = paddle.max(output, axis=-1) # H, W, B, C + output = output.transpose(perm=[2, 3, 0, 1]) # B, C, H, W + + return output diff --git a/image_classification/MobileFormer/losses.py b/image_classification/MobileFormer/losses.py new file mode 100644 index 00000000..09a8ef28 --- /dev/null +++ b/image_classification/MobileFormer/losses.py @@ -0,0 +1,119 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Implement Loss functions """ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class LabelSmoothingCrossEntropyLoss(nn.Layer): + """ cross entropy loss for label smoothing + Args: + smoothing: float, smoothing rate + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, target label with shape [N] + Return: + loss: float, cross entropy loss value + """ + def __init__(self, smoothing=0.1): + super().__init__() + assert 0 <= smoothing < 1.0 + self.smoothing = smoothing + self.confidence = 1 - smoothing + + def forward(self, x, target): + log_probs = F.log_softmax(x) # [N, num_classes] + # target_index is used to get prob for each of the N samples + target_index = paddle.zeros([x.shape[0], 2], dtype='int64') # [N, 2] + target_index[:, 0] = paddle.arange(x.shape[0]) + target_index[:, 1] = target + + nll_loss = -log_probs.gather_nd(index=target_index) # index: [N] + smooth_loss = -log_probs.mean(axis=-1) + loss = self.confidence * nll_loss + self.smoothing * smooth_loss + return loss.mean() + + +class SoftTargetCrossEntropyLoss(nn.Layer): + """ cross entropy loss for soft target + Args: + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, soft target with shape [N, num_classes] + Returns: + loss: float, the mean loss value + """ + def __init__(self): + super().__init__() + + def forward(self, x, target): + loss = paddle.sum(-target * F.log_softmax(x, axis=-1), axis=-1) + return loss.mean() + + +class DistillationLoss(nn.Layer): + """Distillation loss function + This layer includes the orginal loss (criterion) and a extra + distillation loss (criterion), which computes the loss with + different type options, between current model and + a teacher model as its supervision. 
+ Args: + base_criterion: nn.Layer, the original criterion + teacher_model: nn.Layer, the teacher model as supervision + distillation_type: str, one of ['none', 'soft', 'hard'] + alpha: float, ratio of base loss (* (1-alpha)) + and distillation loss( * alpha) + tao: float, temperature in distillation + """ + def __init__(self, + base_criterion, + teacher_model, + distillation_type, + alpha, + tau): + super().__init__() + assert distillation_type in ['none', 'soft', 'hard'] + self.base_criterion = base_criterion + self.teacher_model = teacher_model + self.type = distillation_type + self.alpha = alpha + self.tau = tau + + def forward(self, inputs, outputs, targets): + """ + Args: + inputs: tensor, the orginal model inputs + outputs: tensor, the outputs of the model + outputds_kd: tensor, the distillation outputs of the model, + this is usually obtained by a separate branch + in the last layer of the model + targets: tensor, the labels for the base criterion + """ + outputs, outputs_kd = outputs[0], outputs[1] + base_loss = self.base_criterion(outputs, targets) + if self.type == 'none': + return base_loss + + with paddle.no_grad(): + teacher_outputs = self.teacher_model(inputs) + + if self.type == 'soft': + distillation_loss = F.kl_div( + F.log_softmax(outputs_kd / self.tau, axis=1), + F.log_softmax(teacher_outputs / self.tau, axis=1), + reduction='sum') * (self.tau * self.tau) / outputs_kd.numel() + elif self.type == 'hard': + distillation_loss = F.cross_entropy(outputs_kd, teacher_outputs.argmax(axis=1)) + + loss = base_loss * (1 - self.alpha) + distillation_loss * self.alpha + return loss \ No newline at end of file diff --git a/image_classification/MobileFormer/main_multi_gpu.py b/image_classification/MobileFormer/main_multi_gpu.py new file mode 100644 index 00000000..47eef57b --- /dev/null +++ b/image_classification/MobileFormer/main_multi_gpu.py @@ -0,0 +1,594 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
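For reference, a minimal standalone sketch of how a soft distillation term of this kind is usually combined with the base loss. The logits, labels, `tau` and `alpha` below are made up for illustration, and the teacher side is passed as probabilities because paddle's `F.kl_div` takes its label argument as probabilities:

```python
import paddle
import paddle.nn.functional as F

tau, alpha = 1.0, 0.5
student_logits = paddle.randn([4, 10])   # e.g. the model's distillation-head outputs
teacher_logits = paddle.randn([4, 10])   # teacher outputs, computed under no_grad in practice
labels = paddle.randint(0, 10, [4])

base_loss = F.cross_entropy(student_logits, labels)
# temperature-scaled KL between student and teacher distributions
soft_loss = F.kl_div(
    F.log_softmax(student_logits / tau, axis=1),
    F.softmax(teacher_logits / tau, axis=1),
    reduction='sum') * (tau * tau) / student_logits.numel()
loss = base_loss * (1 - alpha) + soft_loss * alpha
```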
+ +"""MobileFormer training/validation using multiple GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +import paddle.distributed as dist +from datasets import get_dataloader +from datasets import get_dataset +from utils import AverageMeter +from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn +from config import get_config +from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from losses import DistillationLoss +from mobileformer import build_mformer as build_model + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('Focal Transformer') + parser.add_argument('-model_type', type=str, default=None) + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=32) + parser.add_argument('-image_size', type=int, default=224) + parser.add_argument('-num_classes', type=int, default=1000) + parser.add_argument('-data_path', type=str, default=None) + + parser.add_argument('-output', type=str, default=None) + + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + + parser.add_argument('-save_freq', type=int, default=1) + parser.add_argument('-log_freq', type=int, default=100) + parser.add_argument('-validate_freq', type=int, default=10) + parser.add_argument('-accum_iter', type=int, default=1) + parser.add_argument('-num_workers', type=int, default=1) + parser.add_argument('-ngpus', type=int, default=-1) + + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + mixup_fn=None, + amp=False, + local_logger=None, + master_logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + local_logger: logger for local process/gpu, default: None + master_logger: 
logger for main process, default: None + Returns: + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + master_train_loss_meter.avg: float, average loss on all processes/gpus + master_train_acc_meter.avg: float, average top1 accuracy on all processes/gpus + train_time: float, training time + """ + model.train() + train_loss_meter = AverageMeter() + train_acc_meter = AverageMeter() + master_train_loss_meter = AverageMeter() + master_train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(image, output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() + + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = paddle.to_tensor(image.shape[0]) + + # sync from other gpus for overall loss and acc + master_loss = loss.clone() + master_acc = acc.clone() + master_batch_size = batch_size.clone() + dist.all_reduce(master_loss) + dist.all_reduce(master_acc) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc = master_acc / dist.get_world_size() + master_train_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_train_acc_meter.update(master_acc.numpy()[0], master_batch_size.numpy()[0]) + + train_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + train_acc_meter.update(acc.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_train_loss_meter.avg:.4f}, " + + f"Avg Acc: {master_train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return (train_loss_meter.avg, + train_acc_meter.avg, + master_train_loss_meter.avg, + master_train_acc_meter.avg, + train_time) + + +def validate(dataloader, + model, + criterion, + total_batch, + debug_steps=100, + local_logger=None, + master_logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + total_epoch: int, total num of epoch, for logging + debug_steps: int, num of iters to log info, default: 
100 + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + master_val_loss_meter.avg: float, average loss on all processes/gpus + master_val_acc1_meter.avg: float, average top1 accuracy on all processes/gpus + master_val_acc5_meter.avg: float, average top5 accuracy on all processes/gpus + val_time: float, validation time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = AverageMeter() + master_val_loss_meter = AverageMeter() + master_val_acc1_meter = AverageMeter() + master_val_acc5_meter = AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = paddle.to_tensor(image.shape[0]) + + master_loss = loss.clone() + master_acc1 = acc1.clone() + master_acc5 = acc5.clone() + master_batch_size = batch_size.clone() + + dist.all_reduce(master_loss) + dist.all_reduce(master_acc1) + dist.all_reduce(master_acc5) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc1 = master_acc1 / dist.get_world_size() + master_acc5 = master_acc5 / dist.get_world_size() + + master_val_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc1_meter.update(master_acc1.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc5_meter.update(master_acc5.numpy()[0], master_batch_size.numpy()[0]) + + val_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + val_acc1_meter.update(acc1.numpy()[0], batch_size.numpy()[0]) + val_acc5_meter.update(acc5.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {master_val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {master_val_acc5_meter.avg:.4f}") + val_time = time.time() - time_st + return (val_loss_meter.avg, + val_acc1_meter.avg, + val_acc5_meter.avg, + master_val_loss_meter.avg, + master_val_acc1_meter.avg, + master_val_acc5_meter.avg, + val_time) + + +def main_worker(*args): + # STEP 0: Preparation + config = args[0] + dist.init_parallel_env() + last_epoch = config.TRAIN.LAST_EPOCH + world_size = dist.get_world_size() + local_rank = dist.get_rank() + seed = config.SEED + local_rank + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + # logger for each process/gpu + local_logger = get_logger( + filename=os.path.join(config.SAVE, 'log_{}.txt'.format(local_rank)), + logger_name='local_logger') + # overall logger + if local_rank == 0: + master_logger = get_logger( + filename=os.path.join(config.SAVE, 'log.txt'), + logger_name='master_logger') + master_logger.info(f'\n{config}') + else: + master_logger = None + 
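+    # NOTE: each process writes its own log_<rank>.txt via local_logger, while only
+    # rank 0 creates master_logger (log.txt); this is why every master_logger call
+    # below is guarded by `if local_rank == 0`.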
local_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + if local_rank == 0: + master_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + + # STEP 1: Create model + model = build_model(config) + model = paddle.DataParallel(model) + + # STEP 2: Create train and val dataloader + dataset_train, dataset_val = args[1], args[2] + # Create training dataloader + if not config.EVAL: + dataloader_train = get_dataloader(config, dataset_train, 'train', True) + total_batch_train = len(dataloader_train) + local_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + if local_rank == 0: + master_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + # Create validation dataloader + dataloader_val = get_dataloader(config, dataset_val, 'test', True) + total_batch_val = len(dataloader_val) + local_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + if local_rank == 0: + master_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from official code) + linear_scaled_lr = (config.TRAIN.BASE_LR * + config.DATA.BATCH_SIZE * dist.get_world_size()) / 512.0 + linear_scaled_warmup_start_lr = (config.TRAIN.WARMUP_START_LR * + config.DATA.BATCH_SIZE * dist.get_world_size()) / 512.0 + linear_scaled_end_lr = (config.TRAIN.END_LR * + config.DATA.BATCH_SIZE * dist.get_world_size()) / 512.0 + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH, + ) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + 
milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + local_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip, + apply_decay_param_fun=get_exclude_from_weight_decay_fn([ + 'absolute_pos_embed', 'relative_position_bias_table']), + ) + else: + local_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 6: Load pretrained model / load resumt model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + model.set_dict(model_state) + local_logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + if local_rank == 0: + master_logger.info( + f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME+'.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME+'.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME+'.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') + optimizer.set_state_dict(opt_state) + local_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + if local_rank == 0: + master_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + + # STEP 7: Validation (eval mode) + if config.EVAL: + local_logger.info('----- Start Validating') + if local_rank == 0: + master_logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"Validation Loss: {avg_loss:.4f}, " + + 
f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 8: Start training and validation (train mode) + local_logger.info(f"Start training from epoch {last_epoch+1}.") + if local_rank == 0: + master_logger.info(f"Start training from epoch {last_epoch+1}.") + for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + local_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + if local_rank == 0: + master_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + train_loss, train_acc, avg_loss, avg_acc, train_time = train( + dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=total_batch_train, + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + mixup_fn=mixup_fn, + amp=config.AMP, + local_logger=local_logger, + master_logger=master_logger) + + scheduler.step() + + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {avg_loss:.4f}, " + + f"Train Acc: {avg_acc:.4f}, " + + f"time: {train_time:.2f}") + + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + local_logger.info(f'----- Validation after Epoch: {epoch}') + if local_rank == 0: + master_logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if local_rank == 0: + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + master_logger.info(f"----- Save model: {model_path}.pdparams") + master_logger.info(f"----- Save optim: {model_path}.pdopt") + + +def main(): + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + + # get dataset and start DDP + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + else: + dataset_train = None + dataset_val = get_dataset(config, mode='val') + 
config.NGPUS = len(paddle.static.cuda_places()) if config.NGPUS == -1 else config.NGPUS + dist.spawn(main_worker, args=(config, dataset_train, dataset_val, ), nprocs=config.NGPUS) + + +if __name__ == "__main__": + main() diff --git a/image_classification/MobileFormer/main_single_gpu.py b/image_classification/MobileFormer/main_single_gpu.py new file mode 100644 index 00000000..d7b53720 --- /dev/null +++ b/image_classification/MobileFormer/main_single_gpu.py @@ -0,0 +1,433 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""MobileFormer training/validation using single GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from datasets import get_dataloader +from datasets import get_dataset +from utils import AverageMeter +from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn +from config import get_config +from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from losses import DistillationLoss +from mobileformer import build_mformer as build_model + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('Focal Transformer') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-model_type', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=32) + parser.add_argument('-image_size', type=int, default=224) + parser.add_argument('-num_classes', type=int, default=1000) + parser.add_argument('-data_path', type=str, default=None) + + parser.add_argument('-output', type=str, default=None) + + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + + parser.add_argument('-save_freq', type=int, default=1) + parser.add_argument('-log_freq', type=int, default=100) + parser.add_argument('-validate_freq', type=int, default=10) + parser.add_argument('-accum_iter', type=int, default=1) + parser.add_argument('-num_workers', type=int, default=1) + parser.add_argument('-ngpus', type=int, default=-1) + + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # 
different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + mixup_fn=None, + amp=False, + logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + logger: logger for logging, default: None + Returns: + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + train_time: float, training time + """ + model.train() + train_loss_meter = AverageMeter() + train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(image, output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() + + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = image.shape[0] + train_loss_meter.update(loss.numpy()[0], batch_size) + train_acc_meter.update(acc.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return train_loss_meter.avg, train_acc_meter.avg, train_time + + +def validate(dataloader, model, criterion, total_batch, debug_steps=100, logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + logger: logger for logging, default: None + Returns: + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: 
float, average top5 accuracy on current process/gpu + val_time: float, valitaion time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = image.shape[0] + val_loss_meter.update(loss.numpy()[0], batch_size) + val_acc1_meter.update(acc1.numpy()[0], batch_size) + val_acc5_meter.update(acc5.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + + val_time = time.time() - time_st + return val_loss_meter.avg, val_acc1_meter.avg, val_acc5_meter.avg, val_time + + +def main(): + # STEP 0: Preparation + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + last_epoch = config.TRAIN.LAST_EPOCH + seed = config.SEED + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + logger = get_logger(filename=os.path.join(config.SAVE, 'log.txt')) + logger.info(f'\n{config}') + + # STEP 1: Create model + model = build_model(config) + + # STEP 2: Create train and val dataloader + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + dataloader_train = get_dataloader(config, dataset_train, 'train', False) + dataset_val = get_dataset(config, mode='val') + dataloader_val = get_dataloader(config, dataset_val, 'val', False) + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from official code) + linear_scaled_lr = (config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE) / 512.0 + linear_scaled_warmup_start_lr = (config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE) / 512.0 + linear_scaled_end_lr = (config.TRAIN.END_LR * config.DATA.BATCH_SIZE) / 512.0 + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * 
config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH, + ) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip, + apply_decay_param_fun=get_exclude_from_weight_decay_fn([ + 'absolute_pos_embed', 'relative_position_bias_table']), + ) + else: + logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 6: Load pretrained model or load resume model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + model.set_dict(model_state) + logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME+'.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME+'.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME+'.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') + optimizer.set_state_dict(opt_state) + logger.info( + f"----- Resume: Load model and optmizer from {config.MODEL.RESUME}") + + # STEP 7: Validation (eval mode) + if config.EVAL: + logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, val_time 
= validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 8: Start training and validation (train mode) + logger.info(f"Start training from epoch {last_epoch+1}.") + for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + train_loss, train_acc, train_time = train(dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=len(dataloader_train), + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + mixup_fn=mixup_fn, + amp=config.AMP, + logger=logger) + scheduler.step() + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + logger.info(f"----- Save model: {model_path}.pdparams") + logger.info(f"----- Save optim: {model_path}.pdopt") + + +if __name__ == "__main__": + main() diff --git a/image_classification/MobileFormer/mbileformer_block.png b/image_classification/MobileFormer/mbileformer_block.png new file mode 100644 index 00000000..7c4a03b0 Binary files /dev/null and b/image_classification/MobileFormer/mbileformer_block.png differ diff --git a/image_classification/MobileFormer/mixup.py b/image_classification/MobileFormer/mixup.py new file mode 100644 index 00000000..7dea0867 --- /dev/null +++ b/image_classification/MobileFormer/mixup.py @@ -0,0 +1,221 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""mixup and cutmix for batch data""" +import numpy as np +import paddle + + +def rand_bbox(image_shape, lam, count=None): + """ CutMix bbox by lam value + Generate 1 random bbox by value lam. lam is the cut size rate. 
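+    The larger lam is, the smaller the cut region; lam = 0.75 on a 224x224 input,
+    for example, gives a cut of roughly 112x112.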
+ The cut_size is computed by sqrt(1-lam) * image_size. + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + count: int, number of bbox to generate + """ + image_h, image_w = image_shape[-2:] + cut_rate = np.sqrt(1. - lam) + cut_h = int(cut_rate * image_h) + cut_w = int(cut_rate * image_w) + + # get random bbox center + cy = np.random.randint(0, image_h, size=count) + cx = np.random.randint(0, image_w, size=count) + + # get bbox coords + bbox_x1 = np.clip(cx - cut_w // 2, 0, image_w) + bbox_y1 = np.clip(cy - cut_h // 2, 0, image_h) + bbox_x2 = np.clip(cx + cut_w // 2, 0, image_w) + bbox_y2 = np.clip(cy + cut_h // 2, 0, image_h) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # while in pytorch, it will return [] tensor + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def rand_bbox_minmax(image_shape, minmax, count=None): + """ CutMix bbox by min and max value + Generate 1 random bbox by min and max percentage values. + Minmax is a tuple/list of min and max percentage vlaues + applied to the image width and height. + Args: + image_shape: tuple/list, image height and width + minmax: tuple/list, min and max percentage values of image size + count: int, number of bbox to generate + """ + assert len(minmax) == 2 + image_h, image_w = image_shape[-2:] + min_ratio = minmax[0] + max_ratio = minmax[1] + cut_h = np.random.randint(int(image_h * min_ratio), int(image_h * max_ratio), size=count) + cut_w = np.random.randint(int(image_w * min_ratio), int(image_w * max_ratio), size=count) + + bbox_x1 = np.random.randint(0, image_w - cut_w, size=count) + bbox_y1 = np.random.randint(0, image_h - cut_h, size=count) + bbox_x2 = bbox_x1 + cut_w + bbox_y2 = bbox_y1 + cut_h + + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def cutmix_generate_bbox_adjust_lam(image_shape, lam, minmax=None, correct_lam=True, count=None): + """Generate bbox and apply correction for lambda + If the mimmax is None, apply the standard cutmix by lam value, + If the minmax is set, apply the cutmix by min and max percentage values. + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + minmax: tuple/list, min and max percentage values of image size + correct_lam: bool, if True, correct the lam value by the generated bbox + count: int, number of bbox to generate + """ + if minmax is not None: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox_minmax(image_shape, minmax, count) + else: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox(image_shape, lam, count) + + if correct_lam or minmax is not None: + image_h, image_w = image_shape[-2:] + bbox_area = (bbox_y2 - bbox_y1) * (bbox_x2 - bbox_x1) + lam = 1. - bbox_area / float(image_h * image_w) + return (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam + + +def one_hot(x, num_classes, on_value=1., off_value=0.): + """ Generate one-hot vector for label smoothing + Args: + x: tensor, contains label/class indices + num_classes: int, num of classes (len of the one-hot vector) + on_value: float, the vector value at label index, default=1. + off_value: float, the vector value at non-label indices, default=0. + Returns: + one_hot: tensor, tensor with on value at label index and off value + at non-label indices. 
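+    Example: x = [2], num_classes = 4, on_value = 0.925, off_value = 0.025
+    gives [[0.025, 0.025, 0.925, 0.025]].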
+ """ + x = x.reshape_([-1, 1]) + x_smoothed = paddle.full((x.shape[0], num_classes), fill_value=off_value) + for i in range(x.shape[0]): + x_smoothed[i, x[i]] = on_value + return x_smoothed + + +def mixup_one_hot(label, num_classes, lam=1., smoothing=0.): + """ mixup and label smoothing in batch + label smoothing is firstly applied, then + mixup is applied by mixing the bacth and its flip, + with a mixup rate. + Args: + label: tensor, label tensor with shape [N], contains the class indices + num_classes: int, num of all classes + lam: float, mixup rate, default=1.0 + smoothing: float, label smoothing rate + """ + off_value = smoothing / num_classes + on_value = 1. - smoothing + off_value + y1 = one_hot(label, num_classes, on_value, off_value) + y2 = one_hot(label.flip(axis=[0]), num_classes, on_value, off_value) + return y2 * (1 - lam) + y1 * lam + + +class Mixup: + """Mixup class + Args: + mixup_alpha: float, mixup alpha for beta distribution, default=1.0, + cutmix_alpha: float, cutmix alpha for beta distribution, default=0.0, + cutmix_minmax: list/tuple, min and max value for cutmix ratio, default=None, + prob: float, if random prob < prob, do not use mixup, default=1.0, + switch_prob: float, prob of switching mixup and cutmix, default=0.5, + mode: string, mixup up, now only 'batch' is supported, default='batch', + correct_lam: bool, if True, apply correction of lam, default=True, + label_smoothing: float, label smoothing rate, default=0.1, + num_classes: int, num of classes, default=1000 + """ + def __init__(self, + mixup_alpha=1.0, + cutmix_alpha=0.0, + cutmix_minmax=None, + prob=1.0, + switch_prob=0.5, + mode='batch', + correct_lam=True, + label_smoothing=0.1, + num_classes=1000): + self.mixup_alpha = mixup_alpha + self.cutmix_alpha = cutmix_alpha + self.cutmix_minmax = cutmix_minmax + if cutmix_minmax is not None: + assert len(cutmix_minmax) == 2 + self.cutmix_alpha = 1.0 + self.mix_prob = prob + self.switch_prob = switch_prob + self.label_smoothing = label_smoothing + self.num_classes = num_classes + self.mode = mode + self.correct_lam = correct_lam + assert mode == 'batch', 'Now only batch mode is supported!' + + def __call__(self, x, target): + assert x.shape[0] % 2 == 0, "Batch size should be even" + lam = self._mix_batch(x) + target = mixup_one_hot(target, self.num_classes, lam, self.label_smoothing) + return x, target + + def get_params(self): + """Decide to use cutmix or regular mixup by sampling and + sample lambda for mixup + """ + lam = 1. + use_cutmix = False + use_mixup = np.random.rand() < self.mix_prob + if use_mixup: + if self.mixup_alpha > 0. and self.cutmix_alpha > 0.: + use_cutmix = np.random.rand() < self.switch_prob + alpha = self.cutmix_alpha if use_cutmix else self.mixup_alpha + lam_mix = np.random.beta(alpha, alpha) + elif self.mixup_alpha == 0. and self.cutmix_alpha > 0.: + use_cutmix=True + lam_mix = np.random.beta(self.cutmix_alpha, self.cutmix_alpha) + elif self.mixup_alpha > 0. 
and self.cutmix_alpha == 0.: + lam_mix = np.random.beta(self.mixup_alpha, self.mixup_alpha) + else: + raise ValueError('mixup_alpha and cutmix_alpha cannot be all 0') + lam = float(lam_mix) + return lam, use_cutmix + + def _mix_batch(self, x): + """mixup/cutmix by adding batch data and its flipped version""" + lam, use_cutmix = self.get_params() + if lam == 1.: + return lam + if use_cutmix: + (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam = cutmix_generate_bbox_adjust_lam( + x.shape, + lam, + minmax=self.cutmix_minmax, + correct_lam=self.correct_lam) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # but in pytorch, it will return [] tensor without errors + if int(bbox_x1) != int(bbox_x2) and int(bbox_y1) != int(bbox_y2): + x[:, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] = x.flip(axis=[0])[ + :, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] + else: + x_flipped = x.flip(axis=[0]) + x_flipped = x_flipped * (1 - lam) + x.set_value(x * (lam) + x_flipped) + return lam \ No newline at end of file diff --git a/image_classification/MobileFormer/mobileformer.py b/image_classification/MobileFormer/mobileformer.py new file mode 100644 index 00000000..7a5d2697 --- /dev/null +++ b/image_classification/MobileFormer/mobileformer.py @@ -0,0 +1,970 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
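For reference, a rough usage sketch of the `Mixup` helper and `SoftTargetCrossEntropyLoss` defined above; the hyper-parameters and random tensors are only illustrative, and `logits` stands in for real model outputs:

```python
import paddle
from mixup import Mixup
from losses import SoftTargetCrossEntropyLoss

mixup_fn = Mixup(mixup_alpha=0.8, cutmix_alpha=1.0,
                 label_smoothing=0.1, num_classes=1000)
criterion = SoftTargetCrossEntropyLoss()

images = paddle.randn([8, 3, 224, 224])          # batch size must be even
labels = paddle.randint(0, 1000, [8])
images, soft_targets = mixup_fn(images, labels)  # soft_targets: [8, 1000]
logits = paddle.randn([8, 1000])                 # stand-in for model(images)
loss = criterion(logits, soft_targets)
```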
+""" +Implement MobileFormer Arch +""" +import paddle +from paddle import nn + +from dyrelu import DyReLU, MLP +from droppath import DropPath +from attention import Attention + +class Stem(nn.Layer): + """Stem + """ + def __init__(self, + in_channels, + out_channels, + kernel_size=3, + stride=1, + padding=0, + act=nn.Hardswish): + super(Stem, self).__init__(name_scope="Stem") + conv_weight_attr, conv_bias_attr = self._conv_init() + self.conv = nn.Conv2D(in_channels=in_channels, + out_channels=out_channels, + kernel_size=kernel_size, + stride=stride, + padding=padding, + weight_attr=conv_weight_attr, + bias_attr=conv_bias_attr) + self.bn = nn.BatchNorm2D(out_channels) + self.act = act() + + def _conv_init(self): + weight_attr = nn.initializer.KaimingNormal() + bias_attr = nn.initializer.Constant(value=0.0) + return weight_attr, bias_attr + + def forward(self, inputs): + x = self.conv(inputs) + x = self.bn(x) + x = self.act(x) + return x + + +class DepthWiseConv(nn.Layer): + """DepthWise Conv -- support lite weight dw_conv + Params Info: + is_lite: use lite weight dw_conv + """ + def __init__(self, + in_channels, + kernel_size=3, + stride=1, + padding=0, + is_lite=False): + super(DepthWiseConv, self).__init__(name_scope="DepthWiseConv") + self.is_lite = is_lite + conv_weight_attr, conv_bias_attr = self._conv_init() + if is_lite is False: + self.conv = nn.Conv2D(in_channels=in_channels, + out_channels=in_channels, + groups=in_channels, + kernel_size=kernel_size, + stride=stride, + padding=padding, + weight_attr=conv_weight_attr, + bias_attr=conv_bias_attr) + else: + self.conv = nn.Sequential( + # [[0, 1, 2]] -- [3, 1] + nn.Conv2D(in_channels=in_channels, + out_channels=in_channels, + kernel_size=[kernel_size, 1], + stride=[stride, 1], + padding=[padding, 0], + groups=in_channels, + weight_attr=conv_weight_attr, + bias_attr=conv_bias_attr), + nn.BatchNorm2D(in_channels), + # [[0], [1], [2]] -- [1, 3] + nn.Conv2D(in_channels=in_channels, + out_channels=in_channels, + kernel_size=[1, kernel_size], + stride=[1, stride], + padding=[0, padding], + groups=in_channels, + weight_attr=conv_weight_attr, + bias_attr=conv_bias_attr) + ) + + def _conv_init(self): + weight_attr = nn.initializer.KaimingNormal() + bias_attr = nn.initializer.Constant(value=0.0) + return weight_attr, bias_attr + + def forward(self, inputs): + x = self.conv(inputs) + return x + + +class PointWiseConv(nn.Layer): + """PointWise 1x1Conv -- support group conv + Params Info: + groups: the number of groups + """ + def __init__(self, + in_channels, + out_channels, + groups=1): + super(PointWiseConv, self).__init__(name_scope="PointWiseConv") + conv_weight_attr, conv_bias_attr = self._conv_init() + self.conv = nn.Conv2D(in_channels=in_channels, + out_channels=out_channels, + kernel_size=1, + stride=1, + padding=0, + groups=groups, + weight_attr=conv_weight_attr, + bias_attr=conv_bias_attr) + + def _conv_init(self): + weight_attr = nn.initializer.KaimingNormal() + bias_attr = nn.initializer.Constant(value=0.0) + return weight_attr, bias_attr + + def forward(self, inputs): + x = self.conv(inputs) + return x + + +class BottleNeck(nn.Layer): + """BottleNeck + Params Info: + groups: the number of groups, by 1x1conv + embed_dims: input token embed_dims + k: the number of parameters is in Dynamic ReLU + coefs: the init value of coefficient parameters + consts: the init value of constant parameters + reduce: the mlp hidden scale, + means 1/reduce = mlp_ratio + use_dyrelu: whether use dyrelu + is_lite: whether use lite dw_conv + """ + def 
__init__(self, + in_channels, + hidden_channels, + out_channels, + groups=1, + kernel_size=3, + stride=1, + padding=0, + embed_dims=None, + k=2, # the number of dyrelu-params + coefs=[1.0, 0.5], + consts=[1.0, 0.0], + reduce=4, + use_dyrelu=False, + is_lite=False): + super(BottleNeck, self).__init__(name_scope="BottleNeck") + self.is_lite = is_lite + self.use_dyrelu = use_dyrelu + assert use_dyrelu is False or (use_dyrelu is True and embed_dims is not None), \ + "Error: Please make sure while the use_dyrelu is True,"+\ + " embed_dims(now:{0})>0.".format(embed_dims) + + self.in_pw = PointWiseConv(in_channels=in_channels, + out_channels=hidden_channels, + groups=groups) + self.in_pw_bn = nn.BatchNorm2D(hidden_channels) + self.dw = DepthWiseConv(in_channels=hidden_channels, + kernel_size=kernel_size, + stride=stride, + padding=padding, + is_lite=is_lite) + self.dw_bn = nn.BatchNorm2D(hidden_channels) + self.out_pw = PointWiseConv(in_channels=hidden_channels, + out_channels=out_channels, + groups=groups) + self.out_pw_bn = nn.BatchNorm2D(out_channels) + + if use_dyrelu is False: + self.act = nn.ReLU() + else: + self.act = DyReLU(in_channels=hidden_channels, + embed_dims=embed_dims, + k=k, + coefs=coefs, + consts=consts, + reduce=reduce) + + def forward(self, feature_map, tokens): + x = self.in_pw(feature_map) + x = self.in_pw_bn(x) + if self.use_dyrelu: + x = self.act(x, tokens) + + x = self.dw(x) + x = self.dw_bn(x) + if self.use_dyrelu: + x = self.act(x, tokens) + + x = self.out_pw(x) + x = self.out_pw_bn(x) + + return x + + +class Classifier_Head(nn.Layer): + """Classifier Head + Params Info: + in_channels: input feature map channels + embed_dims: input token embed_dims + hidden_features: the fc layer hidden feature size + num_classes: the number of classes + """ + def __init__(self, + in_channels, + embed_dims, + hidden_features, + num_classes=1000, + act=nn.Hardswish): + super(Classifier_Head, self).__init__(name_scope="Classifier_Head") + linear_weight_attr, linear_bias_attr = self._linear_init() + self.avg_pool = nn.AdaptiveAvgPool2D(output_size=1) + self.flatten = nn.Flatten() + self.fc1 = nn.Linear(in_features=in_channels+embed_dims, + out_features=hidden_features, + weight_attr=linear_weight_attr, + bias_attr=linear_bias_attr) + linear_weight_attr, linear_bias_attr = self._linear_init() + self.fc2 = nn.Linear(in_features=hidden_features, + out_features=num_classes, + weight_attr=linear_weight_attr, + bias_attr=linear_bias_attr) + self.act = act() + self.softmax = nn.Softmax() + + def _linear_init(self): + weight_attr = nn.initializer.KaimingNormal() + bias_attr = nn.initializer.Constant(value=0.0) + return weight_attr, bias_attr + + def forward(self, feature_map, tokens): + x = self.avg_pool(feature_map) # B, C, 1, 1 + x = self.flatten(x) # B, C + + z = tokens[:, 0] # B, 1, D + x = paddle.concat([x, z], axis=-1) + + x = self.fc1(x) + x = self.act(x) + x = self.fc2(x) + x = self.softmax(x) + + return x + + +class Mobile(nn.Layer): + """Mobile Sub-block + Params Info: + in_channels: input feature map channels + hidden_channels: the dw layer hidden channel size + groups: the number of groups, by 1x1conv + embed_dims: input token embed_dims + k: the number of parameters is in Dynamic ReLU + coefs: the init value of coefficient parameters + consts: the init value of constant parameters + reduce: the mlp hidden scale, + means 1/reduce = mlp_ratio + use_dyrelu: whether use dyrelu + """ + def __init__(self, + in_channels, + hidden_channels, + out_channels, + kernel_size=3, + stride=1, + 
padding=0, + groups=1, + embed_dims=None, + k=2, + coefs=[1.0, 0.5], + consts=[1.0, 0.0], + reduce=4, + use_dyrelu=False): + super(Mobile, self).__init__(name_scope="Mobile") + self.add_dw = (stride == 2) + self.bneck = BottleNeck(in_channels=in_channels, + hidden_channels=hidden_channels, + out_channels=out_channels, + kernel_size=kernel_size, + stride=1, + padding=1, + groups=groups, + embed_dims=embed_dims, + k=k, + coefs=coefs, + consts=consts, + reduce=reduce, + use_dyrelu=use_dyrelu) + + if self.add_dw: # stride==2 + self.downsample_dw = nn.Sequential( + DepthWiseConv(in_channels=in_channels, + kernel_size=kernel_size, + stride=stride, + padding=padding), + nn.BatchNorm2D(in_channels) + #, nn.ReLU() + ) + + def forward(self, feature_map, tokens): + if self.add_dw: + feature_map = self.downsample_dw(feature_map) + + x = self.bneck(feature_map, tokens) + return x + + +class ToFormer_Bridge(nn.Layer): + """Mobile to Former Bridge + Params Info: + in_channels: input feature map channels + embed_dims: input token embed_dims + num_head: the number of head is in multi head attention + dropout_rate: the dropout rate of attention result + attn_dropout_rate: the dropout rate of attention distribution + """ + def __init__(self, + embed_dims, + in_channels, + num_head=1, + dropout_rate=0., + attn_dropout_rate=0.): + super(ToFormer_Bridge, self).__init__(name_scope="ToFormer_Bridge") + self.num_head = num_head + self.head_dims = in_channels // num_head + self.scale = self.head_dims ** -0.5 + linear_weight_attr, linear_bias_attr = self._linear_init() + # split head to project + self.heads_q_proj = [] + for i in range(num_head): # n linear + self.heads_q_proj.append( + nn.Linear(in_features=embed_dims // num_head, + out_features=self.head_dims, + weight_attr=linear_weight_attr, + bias_attr=linear_bias_attr) + ) + self.heads_q_proj = nn.LayerList(self.heads_q_proj) + + self.output = nn.Linear(in_features=self.num_head*self.head_dims, + out_features=embed_dims, + weight_attr=linear_weight_attr, + bias_attr=linear_bias_attr) + + self.softmax = nn.Softmax() + self.dropout = nn.Dropout(dropout_rate) + self.attn_dropout = nn.Dropout(attn_dropout_rate) + + def _linear_init(self): + weight_attr = nn.initializer.KaimingNormal() + bias_attr = nn.initializer.Constant(value=0.0) + return weight_attr, bias_attr + + def transfer_shape(self, feature_map, tokens): + B, C, H, W = feature_map.shape + assert C % self.num_head == 0, \ + "Erorr: Please make sure feature_map.channels % "+\ + "num_head == 0(now:{0}).".format(C % self.num_head) + fm = feature_map.reshape(shape=[B, C, H*W]) # B, C, L + fm = fm.transpose(perm=[0, 2, 1]) # B, L, C -- C = num_head * head_dims + fm = fm.reshape(shape=[B, H*W, self.num_head, self.head_dims]) + fm = fm.transpose(perm=[0, 2, 1, 3]) # B, n_h, L, h_d + + B, M, D = tokens.shape + h_token = tokens.reshape(shape=[B, M, self.num_head, D // self.num_head]) + h_token = h_token.transpose(perm=[0, 2, 1, 3]) # B, n_h, M, D // n_h + + return fm, h_token + + def _multi_head_q_forward(self, token, B, M): + q_list = [] + for i in range(self.num_head): + q_list.append( + # B, 1, M, head_dims + self.heads_q_proj[i](token[:, i, :, :]).reshape([B, 1, M, self.head_dims]) + ) + q = paddle.concat(q_list, axis=1) # B, num_head, M, head_dims + return q + + def forward(self, feature_map, tokens): + B, M, D = tokens.shape + # fm(key/value) to shape: B, n_h, L, h_d + # token to shape: B, n_h, M, D // n_h + fm, token = self.transfer_shape(feature_map, tokens) + + q = self._multi_head_q_forward(token, B, M) 
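+        # q: [B, num_head, M, head_dims] token queries; fm: [B, num_head, H*W, head_dims]
+        # flattened feature-map keys/values. The matmul below therefore lets every global
+        # token attend over all H*W spatial positions (the Mobile -> Former direction).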
+ + # attention distribution + attn = paddle.matmul(q, fm, transpose_y=True) # B, n_h, M, L + attn = attn * self.scale + attn = self.softmax(attn) + attn = self.attn_dropout(attn) + + # attention result + z = paddle.matmul(attn, fm) # B, n_h, M, h_d + z = z.transpose(perm=[0, 2, 1, 3]) + z = z.reshape(shape=[B, M, self.num_head*self.head_dims]) + z = self.output(z) # B, M, D + z = self.dropout(z) + z = z + tokens + + return z + + +class ToMobile_Bridge(nn.Layer): + """Former to Mobile Bridge + Params Info: + in_channels: input feature map channels + embed_dims: input token embed_dims + num_head: the number of head is in multi head attention + dropout_rate: the dropout rate of attention result + attn_dropout_rate: the dropout rate of attention distribution + """ + def __init__(self, + embed_dims, + in_channels, + num_head=1, + dropout_rate=0., + attn_dropout_rate=0.): + super(ToMobile_Bridge, self).__init__(name_scope="ToMobile_Bridge") + self.num_head = num_head + self.head_dims = in_channels // num_head + self.scale = self.head_dims ** -0.5 + + linear_weight_attr, linear_bias_attr = self._linear_init() + + self.heads_k_proj = [] + self.heads_v_proj = [] + for i in range(num_head): # n linear + self.heads_k_proj.append( + nn.Linear(in_features=embed_dims // num_head, + out_features=self.head_dims, + weight_attr=linear_weight_attr, + bias_attr=linear_bias_attr) + ) + self.heads_v_proj.append( + nn.Linear(in_features=embed_dims // num_head, + out_features=self.head_dims, + weight_attr=linear_weight_attr, + bias_attr=linear_bias_attr) + ) + self.heads_k_proj = nn.LayerList(self.heads_k_proj) + self.heads_v_proj = nn.LayerList(self.heads_v_proj) + + self.softmax = nn.Softmax() + self.dropout = nn.Dropout(dropout_rate) + self.attn_dropout = nn.Dropout(attn_dropout_rate) + + def _linear_init(self): + weight_attr = nn.initializer.KaimingNormal() + bias_attr = nn.initializer.Constant(value=0.0) + return weight_attr, bias_attr + + def transfer_shape(self, feature_map, tokens): + B, C, H, W = feature_map.shape + assert C % self.num_head == 0, \ + "Erorr: Please make sure feature_map.channels % "+\ + "num_head == 0(now:{0}).".format(C % self.num_head) + fm = feature_map.reshape(shape=[B, C, H*W]) # B, C, L + fm = fm.transpose(perm=[0, 2, 1]) # B, L, C -- C = num_head * head_dims + fm = fm.reshape(shape=[B, H*W, self.num_head, self.head_dims]) + fm = fm.transpose(perm=[0, 2, 1, 3]) # B, n_h, L, h_d + + B, M, D = tokens.shape + k = tokens.reshape(shape=[B, M, self.num_head, D // self.num_head]) + k = k.transpose(perm=[0, 2, 1, 3]) # B, n_h, M, D // n_h + v = tokens.reshape(shape=[B, M, self.num_head, D // self.num_head]) + v = v.transpose(perm=[0, 2, 1, 3]) # B, n_h, M, D // n_h + + return fm, k, v + + def _multi_head_kv_forward(self, k_, v_, B, M): + k_list = [] + v_list = [] + for i in range(self.num_head): + k_list.append( + # B, 1, M, head_dims + self.heads_k_proj[i](k_[:, i, :, :]).reshape([B, 1, M, self.head_dims]) + ) + v_list.append( + # B, 1, M, head_dims + self.heads_v_proj[i](v_[:, i, :, :]).reshape([B, 1, M, self.head_dims]) + ) + k = paddle.concat(k_list, axis=1) # B, num_head, M, head_dims + v = paddle.concat(v_list, axis=1) # B, num_head, M, head_dims + return k, v + + def forward(self, feature_map, tokens): + B, C, H, W = feature_map.shape + B, M, D = tokens.shape + + # fm(q) to shape: B, n_h, L, h_d + # k/v to shape: B, n_h, M, D // n_h + q, k_, v_ = self.transfer_shape(feature_map, tokens) + + k, v = self._multi_head_kv_forward(k_, v_, B, M) + + # attention distribution + attn = 
paddle.matmul(q, k, transpose_y=True) # B, n_h, L, M + attn = attn * self.scale + attn = self.softmax(attn) + attn = self.attn_dropout(attn) + + # attention result + z = paddle.matmul(attn, v) # B, n_h, L, h_d + z = z.transpose(perm=[0, 1, 3, 2]) # B, n_h, h_d, L + # B, n_h*h_d, H, W + z = z.reshape(shape=[B, self.num_head*self.head_dims, H, W]) + z = self.dropout(z) + z = z + feature_map + + return z + + +class Former(nn.Layer): + """Former Sub-block + Params Info: + embed_dims: input token embed_dims + num_head: the number of head is in multi head attention + mlp_ratio: the scale of hidden feature size + dropout_rate: the dropout rate of attention result + droppath_rate: the droppath rate of attention output + attn_dropout_rate: the dropout rate of attention distribution + mlp_dropout_rate: the dropout rate of mlp layer output + qkv_bias: whether use the bias in qkv matrix + """ + def __init__(self, + embed_dims, + num_head=1, + mlp_ratio=2, + dropout_rate=0., + droppath_rate=0., + attn_dropout_rate=0., + mlp_dropout_rate=0., + norm=nn.LayerNorm, + act=nn.GELU, + qkv_bias=True): + super(Former, self).__init__(name_scope="Former") + + self.attn = Attention(embed_dims=embed_dims, + num_head=num_head, + dropout_rate=dropout_rate, + attn_dropout_rate=attn_dropout_rate, + qkv_bias=qkv_bias) + self.attn_ln = norm(embed_dims) + self.attn_droppath = DropPath(droppath_rate) + + self.mlp = MLP(in_features=embed_dims, + mlp_ratio=mlp_ratio, + mlp_dropout_rate=mlp_dropout_rate, + act=act) + self.mlp_ln = norm(embed_dims) + self.mlp_droppath = DropPath(droppath_rate) + + def forward(self, inputs): + res = inputs + x = self.attn(inputs) + x = self.attn_ln(x) + x = self.attn_droppath(x) + x = x + res + + res = x + x = self.mlp(x) + x = self.mlp_ln(x) + x = self.mlp_droppath(x) + x = x + res + + return x + + +class MFBlock(nn.Layer): + """MobileFormer Basic Block + Params Info: + in_channels: the number of input feature map channel + hidden_channels: the number of hidden(dw_conv) feature map channel + out_channels: the number of output feature map channel + embed_dims: input token embed_dims + num_head: the number of head is in multi head attention + groups: the number of groups in 1x1 conv + k: the number of parameters is in Dynamic ReLU + coefs: the init value of coefficient parameters + consts: the init value of constant parameters + reduce: the mlp hidden scale, + means 1/reduce = mlp_ratio + use_dyrelu: whether use dyrelu + mlp_ratio: the scale of hidden feature size + dropout_rate: the dropout rate of attention result + droppath_rate: the droppath rate of attention output + attn_dropout_rate: the dropout rate of attention distribution + mlp_dropout_rate: the dropout rate of mlp layer output + qkv_bias: whether use the bias in qkv matrix + """ + def __init__(self, + in_channels, + hidden_channels, + out_channels, + embed_dims, + kernel_size=3, + stride=1, + padding=0, + groups=1, + k=2, + coefs=[1.0, 0.5], + consts=[1.0, 0.0], + reduce=4, + use_dyrelu=False, + num_head=1, + mlp_ratio=2, + dropout_rate=0., + droppath_rate=0., + attn_dropout_rate=0., + mlp_dropout_rate=0., + norm=nn.LayerNorm, + act=nn.GELU, + qkv_bias=True): + super(MFBlock, self).__init__(name_scope="MFBlock") + self.mobile = Mobile(in_channels=in_channels, + hidden_channels=hidden_channels, + out_channels=out_channels, + kernel_size=kernel_size, + stride=stride, + padding=padding, + groups=groups, + embed_dims=embed_dims, + k=k, + coefs=coefs, + consts=consts, + reduce=reduce, + use_dyrelu=use_dyrelu) + + self.toformer_bridge = 
ToFormer_Bridge(embed_dims=embed_dims, + in_channels=in_channels, + num_head=num_head, + dropout_rate=dropout_rate, + attn_dropout_rate=attn_dropout_rate) + self.toformer_norm = norm(embed_dims) + + self.former = Former(embed_dims=embed_dims, + num_head=num_head, + mlp_ratio=mlp_ratio, + dropout_rate=droppath_rate, + mlp_dropout_rate=mlp_dropout_rate, + attn_dropout_rate=attn_dropout_rate, + droppath_rate=droppath_rate, + norm=norm, + act=act) + + self.tomobile_bridge = ToMobile_Bridge(in_channels=out_channels, + embed_dims=embed_dims, + num_head=num_head, + dropout_rate=dropout_rate, + attn_dropout_rate=attn_dropout_rate) + self.tomobile_norm = nn.BatchNorm2D(out_channels) + + + def forward(self, feature_map, tokens): + z_h = self.toformer_bridge(feature_map, tokens) + z_h = self.toformer_norm(z_h) + z_out = self.former(z_h) + + f_h = self.mobile(feature_map, z_out) + f_out = self.tomobile_bridge(f_h, z_out) + f_out = self.tomobile_norm(f_out) + + return f_out, z_out + + +class MobileFormer(nn.Layer): + """MobileFormer + Params Info: + num_classes: the number of classes + in_channels: the number of input feature map channel + tokens: the shape of former token + num_head: the number of head is in multi head attention + groups: the number of groups in 1x1 conv + k: the number of parameters is in Dynamic ReLU + coefs: the init value of coefficient parameters + consts: the init value of constant parameters + reduce: the mlp hidden scale, + means 1/reduce = mlp_ratio + use_dyrelu: whether use dyrelu + mlp_ratio: the scale of hidden feature size + dropout_rate: the dropout rate of attention result + droppath_rate: the droppath rate of attention output + attn_dropout_rate: the dropout rate of attention distribution + mlp_dropout_rate: the dropout rate of mlp layer output + alpha: the scale of model size + qkv_bias: whether use the bias in qkv matrix + config: total model config + """ + def __init__(self, num_classes=1000, in_channels=3, + tokens=[3, 128], num_head=4, mlp_ratio=2, + use_dyrelu=True, k=2, reduce=4.0, + coefs=[1.0, 0.5], consts=[1.0, 0.0], + dropout_rate=0, droppath_rate=0, + attn_dropout_rate=0, mlp_dropout_rate=0, + norm=nn.LayerNorm, act=nn.GELU, + alpha=1.0, qkv_bias=True, + config=None): + super(MobileFormer, self).__init__() + self.num_token, self.embed_dims = tokens[0], tokens[1] + self.num_head = num_head + self.num_classes = num_classes + self.in_channels = in_channels + self.mlp_ratio = mlp_ratio + self.alpha = alpha + self.qkv_bias = qkv_bias + self.dropout_rate = dropout_rate + self.droppath_rate = droppath_rate + self.attn_dropout_rate = attn_dropout_rate + self.mlp_dropout_rate = mlp_dropout_rate + + assert config is not None, \ + "Error: Please enter the config(now: {0})".format(config)+\ + " in the __init__." 
+ + # create learnable tokens: self.tokens + self._create_token(num_token=self.num_token, + embed_dims=self.embed_dims) + + # create total model + self._create_model(use_dyrelu=use_dyrelu, + reduce=reduce, dyrelu_k=k, + coefs=coefs, consts=consts, + alpha=alpha, norm=norm, act=act, + config=config) + + def _create_token(self, num_token, embed_dims): + # B(1), token_size, embed_dims + shape = [1] + [num_token, embed_dims] + self.tokens = self.create_parameter(shape=shape, dtype='float32') + + def _create_stem(self, + in_channels, + out_channels, + kernel_size, + stride, padding, + alpha): + self.stem = Stem(in_channels=in_channels, + out_channels=int(alpha * out_channels), + kernel_size=kernel_size, + stride=stride, + padding=padding) + + def _create_lite_bneck(self, + in_channels, + hidden_channels, + out_channels, + kernel_size, + stride, + padding, + alpha, + pointwiseconv_groups): + self.bneck_lite = BottleNeck(in_channels=int(alpha * in_channels), + hidden_channels=int(alpha * hidden_channels), + out_channels=int(alpha * out_channels), + groups=pointwiseconv_groups, + kernel_size=kernel_size, + stride=stride, + padding=padding, + use_dyrelu=False, + is_lite=True) + + def _create_mf_blocks(self, + in_channel_list, + hidden_channel_list, + out_channel_list, + kernel_list, + stride_list, + padding_list, + alpha, + use_dyrelu, + reduce, + dyrelu_k, + coefs, + consts, + norm, + act, + pointwiseconv_groups): + self.blocks = [] + for i, _ in enumerate(in_channel_list): + self.blocks.append( + MFBlock( + in_channels=int(alpha * in_channel_list[i]), + hidden_channels=int(alpha * hidden_channel_list[i]), + out_channels=int(alpha * out_channel_list[i]), + embed_dims=self.embed_dims, + kernel_size=kernel_list[i], + stride=stride_list[i], + padding=padding_list[i], + groups=pointwiseconv_groups, + k=dyrelu_k, + coefs=coefs, + consts=consts, + reduce=reduce, + use_dyrelu=use_dyrelu, + num_head=self.num_head, + mlp_ratio=self.mlp_ratio, + dropout_rate=self.dropout_rate, + droppath_rate=self.droppath_rate, + attn_dropout_rate=self.attn_dropout_rate, + mlp_dropout_rate=self.mlp_dropout_rate, + norm=norm, + act=act + ) + ) + self.blocks = nn.LayerList(self.blocks) + + def _create_former_end_bridge(self, + in_channels, + norm, + alpha): + self.end_toformer_bridge = ToFormer_Bridge( + embed_dims=self.embed_dims, + in_channels=int(alpha * in_channels), + num_head=self.num_head, + dropout_rate=self.dropout_rate, + attn_dropout_rate=self.attn_dropout_rate) + self.former_bridge_norm = norm(self.embed_dims) + + def _create_channel_conv(self, + in_channels, + out_channels, + alpha, + pointwiseconv_groups): + self.channel_conv = nn.Sequential( + PointWiseConv(in_channels=int(alpha * in_channels), + out_channels=out_channels, + groups=pointwiseconv_groups), + nn.BatchNorm2D(out_channels), + nn.ReLU() + ) + + def _create_head(self, + in_channels, + hidden_features): + self.head = Classifier_Head(in_channels=in_channels, + embed_dims=self.embed_dims, + hidden_features=hidden_features, + num_classes=self.num_classes) + + def _create_model(self, + use_dyrelu, + reduce, + dyrelu_k, + coefs, + consts, + norm, + act, + alpha, + config): + # create stem: self.stem + self._create_stem(in_channels=self.in_channels, + out_channels=config.MODEL.MF.STEM.OUT_CHANNELS, + kernel_size=config.MODEL.MF.STEM.KERNELS, + stride=config.MODEL.MF.STEM.STRIEDS, + padding=config.MODEL.MF.STEM.PADDINGS, + alpha=alpha) + # create lite-bottleneck: self.bneck_lite + self._create_lite_bneck(in_channels=config.MODEL.MF.LITE_BNECK.IN_CHANNEL, + 
hidden_channels=config.MODEL.MF.LITE_BNECK.HIDDEN_CHANNEL, + out_channels=config.MODEL.MF.LITE_BNECK.OUT_CHANNEL, + kernel_size=config.MODEL.MF.LITE_BNECK.KERNEL, + stride=config.MODEL.MF.LITE_BNECK.STRIED, + padding=config.MODEL.MF.LITE_BNECK.PADDING, + alpha=alpha, + pointwiseconv_groups=config.MODEL.MF.POINTWISECONV_GROUPS) + # create mobileformer blocks: self.blocks + self._create_mf_blocks(in_channel_list=config.MODEL.MF.BLOCK.IN_CHANNELS, + hidden_channel_list=config.MODEL.MF.BLOCK.HIDDEN_CHANNELS, + out_channel_list=config.MODEL.MF.BLOCK.OUT_CHANNELS, + kernel_list=config.MODEL.MF.BLOCK.KERNELS, + stride_list=config.MODEL.MF.BLOCK.STRIEDS, + padding_list=config.MODEL.MF.BLOCK.PADDINGS, + alpha=alpha, + use_dyrelu=use_dyrelu, + reduce=reduce, + dyrelu_k=dyrelu_k, + coefs=coefs, + consts=consts, + norm=norm, + act=act, + pointwiseconv_groups=config.MODEL.MF.POINTWISECONV_GROUPS) + # create final toformer_bridge: self.toformer_bridge + self._create_former_end_bridge(in_channels=config.MODEL.MF.CHANNEL_CONV.IN_CHANNEL, + norm=norm, + alpha=alpha) + # create channel 1x1 conv: self.channel_conv + self._create_channel_conv(in_channels=config.MODEL.MF.CHANNEL_CONV.IN_CHANNEL, + out_channels=config.MODEL.MF.CHANNEL_CONV.OUT_CHANNEL, + alpha=alpha, + pointwiseconv_groups=config.MODEL.MF.POINTWISECONV_GROUPS) + # create classifier head: self.head + self._create_head(in_channels=config.MODEL.MF.HEAD.IN_CHANNEL, + hidden_features=config.MODEL.MF.HEAD.HIDDEN_FEATURE) + + def _to_batch_tokens(self, batch_size): + # B, token_size, embed_dims + return paddle.concat([self.tokens] * batch_size, axis=0) + + def bridge_forward(self, inputs): + B, _, _, _ = inputs.shape + feature_map = self.stem(inputs) + # create batch tokens + tokens = self._to_batch_tokens(B) # B, token_size, embed_dims + feature_map = self.bneck_lite(feature_map, tokens) + + for b in self.blocks: + feature_map, tokens = b(feature_map, tokens) + + tokens = self.end_toformer_bridge(feature_map, tokens) + tokens = self.former_bridge_norm(tokens) + + return feature_map, tokens + + def forward(self, inputs): + feature_map, tokens = self.bridge_forward(inputs) + + feature_map = self.channel_conv(feature_map) + output = self.head(feature_map, tokens) + + return output + + +def build_mformer(config): + """build model + """ + model = MobileFormer(num_classes=config.MODEL.NUM_CLASSES, + in_channels=config.MODEL.MF.IN_CHANNELS, + tokens=config.MODEL.MF.TOKENS, + num_head=config.MODEL.MF.NUM_HEAD, + mlp_ratio=config.MODEL.MF.MLP_RATIO, + k=config.MODEL.MF.DYRELU.DYRELU_K, + reduce=config.MODEL.MF.DYRELU.REDUCE, + coefs=config.MODEL.MF.DYRELU.COEFS, + consts=config.MODEL.MF.DYRELU.CONSTS, + use_dyrelu=config.MODEL.MF.DYRELU.USE_DYRELU, + dropout_rate=config.MODEL.DROPOUT, + droppath_rate=config.MODEL.DROPPATH, + attn_dropout_rate=config.MODEL.ATTENTION_DROPOUT, + mlp_dropout_rate=config.MODEL.MLP_DROPOUT, + alpha=config.MODEL.MF.ALPHA, + qkv_bias=config.MODEL.MF.QKV_BIAS, + config=config) + + return model diff --git a/image_classification/MobileFormer/mobileformer_arch.png b/image_classification/MobileFormer/mobileformer_arch.png new file mode 100644 index 00000000..8db374bd Binary files /dev/null and b/image_classification/MobileFormer/mobileformer_arch.png differ diff --git a/image_classification/MobileFormer/random_erasing.py b/image_classification/MobileFormer/random_erasing.py new file mode 100644 index 00000000..162c512d --- /dev/null +++ b/image_classification/MobileFormer/random_erasing.py @@ -0,0 +1,94 @@ +# Copyright (c) 2021 PPViT 
Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Random Erasing for image tensor"""
+
+import random
+import math
+import paddle
+
+
+def _get_pixels(per_pixel, rand_color, patch_size, dtype="float32"):
+    if per_pixel:
+        return paddle.normal(shape=patch_size).astype(dtype)
+    if rand_color:
+        return paddle.normal(shape=(patch_size[0], 1, 1)).astype(dtype)
+    return paddle.zeros((patch_size[0], 1, 1)).astype(dtype)
+
+
+class RandomErasing(object):
+    """
+    Args:
+        prob: probability of performing random erasing
+        min_area: Minimum percentage of erased area wrt input image area
+        max_area: Maximum percentage of erased area wrt input image area
+        min_aspect: Minimum aspect ratio of erased area
+        max_aspect: Maximum aspect ratio of erased area
+        mode: pixel color mode, in ['const', 'rand', 'pixel']
+            'const' - erase block is constant valued 0 for all channels
+            'rand' - erase block is valued random color (same per-channel)
+            'pixel' - erase block is valued random color per pixel
+        min_count: Minimum # of erasing blocks per image.
+        max_count: Maximum # of erasing blocks per image. Area per box is scaled by count;
+            per-image count is randomly chosen between min_count and max_count
+    """
+    def __init__(self, prob=0.5, min_area=0.02, max_area=1/3, min_aspect=0.3, max_aspect=None,
+                 mode='const', min_count=1, max_count=None, num_splits=0):
+        self.prob = prob
+        self.min_area = min_area
+        self.max_area = max_area
+        max_aspect = max_aspect or 1 / min_aspect
+        self.log_aspect_ratio = (math.log(min_aspect), math.log(max_aspect))
+        self.min_count = min_count
+        self.max_count = max_count or min_count
+        self.num_splits = num_splits
+        mode = mode.lower()
+        self.rand_color = False
+        self.per_pixel = False
+        if mode == "rand":
+            self.rand_color = True
+        elif mode == "pixel":
+            self.per_pixel = True
+        else:
+            assert not mode or mode == "const"
+
+    def _erase(self, img, chan, img_h, img_w, dtype):
+        if random.random() > self.prob:
+            return
+        area = img_h * img_w
+        count = self.min_count if self.min_count == self.max_count else \
+            random.randint(self.min_count, self.max_count)
+        for _ in range(count):
+            for attempt in range(10):
+                target_area = random.uniform(self.min_area, self.max_area) * area / count
+                aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio))
+                h = int(round(math.sqrt(target_area * aspect_ratio)))
+                w = int(round(math.sqrt(target_area / aspect_ratio)))
+                if w < img_w and h < img_h:
+                    top = random.randint(0, img_h - h)
+                    left = random.randint(0, img_w - w)
+                    img[:, top:top+h, left:left+w] = _get_pixels(
+                        self.per_pixel, self.rand_color, (chan, h, w),
+                        dtype=dtype)
+                    break
+
+    def __call__(self, input_):
+        if len(input_.shape) == 3:
+            self._erase(input_, *input_.shape, input_.dtype)
+        else:
+            batch_size, chan, img_h, img_w = input_.shape
+            batch_start = batch_size // self.num_splits if self.num_splits > 1 else 0
+            for i in range(batch_start, batch_size):
+                self._erase(input_[i], chan, img_h, img_w, input_.dtype)
+        return input_
\ No newline at end of file
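For reference, a minimal usage sketch of this `RandomErasing` op (not part of the patch): it modifies an image tensor in place, so in the training pipeline it is applied after `ToTensor`/`Normalize`, as `get_train_transforms` in `datasets.py` later in this patch does. The tensor shape and the `prob`/`mode`/`max_count` values below are illustrative assumptions that simply mirror the defaults used elsewhere in this patch.

```python
# Illustrative sketch only; assumes random_erasing.py is importable from the working directory.
import paddle
from random_erasing import RandomErasing

# Values mirror the defaults used in this patch's config (prob=0.25, mode='pixel', count=1).
random_erase = RandomErasing(prob=0.25, mode='pixel', max_count=1)

image = paddle.uniform([3, 224, 224])  # stand-in for a normalized (C, H, W) image tensor
image = random_erase(image)            # with probability 0.25, fills a random rectangle with per-pixel noise
print(image.shape)                     # [3, 224, 224]
```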
diff --git a/image_classification/MobileFormer/run_eval.sh b/image_classification/MobileFormer/run_eval.sh new file mode 100644 index 00000000..a714da41 --- /dev/null +++ b/image_classification/MobileFormer/run_eval.sh @@ -0,0 +1,10 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg='./configs/mobileformer_26m.yaml' \ + -dataset='imagenet2012' \ + -num_classes=1000 \ + -batch_size=64 \ + -image_size=224 \ + -data_path='/dataset/imagenet' \ + -eval \ + -pretrained='./mobileformer_26m' \ No newline at end of file diff --git a/image_classification/MobileFormer/run_eval_multi.sh b/image_classification/MobileFormer/run_eval_multi.sh new file mode 100644 index 00000000..05a14933 --- /dev/null +++ b/image_classification/MobileFormer/run_eval_multi.sh @@ -0,0 +1,10 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg='./configs/mobileformer_26m.yaml' \ + -dataset='imagenet2012' \ + -num_classes=1000 \ + -batch_size=32 \ + -image_size=224 \ + -data_path='/dataset/imagenet' \ + -eval \ + -pretrained='./mobileformer_26m' \ No newline at end of file diff --git a/image_classification/MobileFormer/run_train.sh b/image_classification/MobileFormer/run_train.sh new file mode 100644 index 00000000..8ea4a3af --- /dev/null +++ b/image_classification/MobileFormer/run_train.sh @@ -0,0 +1,9 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg='./configs/mobileformer_26m.yaml' \ + -dataset='imagenet2012' \ + -num_classes=1000 \ + -batch_size=4 \ + -image_size=224 \ + -data_path='/dataset/imagenet' \ + -output='./output' diff --git a/image_classification/MobileFormer/run_train_amp.sh b/image_classification/MobileFormer/run_train_amp.sh new file mode 100644 index 00000000..254643f3 --- /dev/null +++ b/image_classification/MobileFormer/run_train_amp.sh @@ -0,0 +1,10 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ +-cfg='./configs/mobileformer_26m.yaml' \ +-dataset='imagenet2012' \ +-num_classes=1000 \ +-batch_size=4 \ +-image_size=224 \ +-data_path='/dataset/imagenet' \ +-output='./output' \ +-amp \ No newline at end of file diff --git a/image_classification/MobileFormer/run_train_multi.sh b/image_classification/MobileFormer/run_train_multi.sh new file mode 100644 index 00000000..5890a787 --- /dev/null +++ b/image_classification/MobileFormer/run_train_multi.sh @@ -0,0 +1,9 @@ +CUDA_VISIBLE_DEVICES=4,5,6,7 \ +python main_multi_gpu.py \ + -cfg='./configs/mobileformer_26m.yaml' \ + -dataset='imagenet2012' \ + -num_classes=1000 \ + -batch_size=256 \ + -image_size=224 \ + -data_path='/dataset/imagenet' \ + -output='./output' diff --git a/image_classification/MobileFormer/transforms.py b/image_classification/MobileFormer/transforms.py new file mode 100644 index 00000000..296d7c8b --- /dev/null +++ b/image_classification/MobileFormer/transforms.py @@ -0,0 +1,26 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import random +import paddle +import paddle.nn +import paddle.vision.transforms as T + + +class RandomHorizontalFlip(): + def __init__(self, p=0.5): + self.p = p + + def __call__(self, image): + if random.random() < self.p: + return T.hflip(image) + return image \ No newline at end of file diff --git a/image_classification/MobileFormer/utils.py b/image_classification/MobileFormer/utils.py new file mode 100644 index 00000000..54bd3a13 --- /dev/null +++ b/image_classification/MobileFormer/utils.py @@ -0,0 +1,114 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""utils for ViT +Contains AverageMeter for monitoring, get_exclude_from_decay_fn for training +and WarmupCosineScheduler for training +""" + +import math +from paddle.optimizer.lr import LRScheduler + + +class AverageMeter(): + """ Meter for monitoring losses""" + def __init__(self): + self.avg = 0 + self.sum = 0 + self.cnt = 0 + self.reset() + + def reset(self): + """reset all values to zeros""" + self.avg = 0 + self.sum = 0 + self.cnt = 0 + + def update(self, val, n=1): + """update avg by val and n, where val is the avg of n values""" + self.sum += val * n + self.cnt += n + self.avg = self.sum / self.cnt + + + +def get_exclude_from_weight_decay_fn(exclude_list=[]): + """ Set params with no weight decay during the training + For certain params, e.g., positional encoding in ViT, weight decay + may not needed during the learning, this method is used to find + these params. + Args: + exclude_list: a list of params names which need to exclude + from weight decay. + Returns: + exclude_from_weight_decay_fn: a function returns True if param + will be excluded from weight decay + """ + if len(exclude_list) == 0: + exclude_from_weight_decay_fn = None + else: + def exclude_fn(param): + for name in exclude_list: + if param.endswith(name): + return False + return True + exclude_from_weight_decay_fn = exclude_fn + return exclude_from_weight_decay_fn + + +class WarmupCosineScheduler(LRScheduler): + """Warmup Cosine Scheduler + First apply linear warmup, then apply cosine decay schedule. + Linearly increase learning rate from "warmup_start_lr" to "start_lr" over "warmup_epochs" + Cosinely decrease learning rate from "start_lr" to "end_lr" over remaining + "total_epochs - warmup_epochs" + Attributes: + learning_rate: the starting learning rate (without warmup), not used here! 
+ warmup_start_lr: warmup starting learning rate + start_lr: the starting learning rate (without warmup) + end_lr: the ending learning rate after whole loop + warmup_epochs: # of epochs for warmup + total_epochs: # of total epochs (include warmup) + """ + def __init__(self, + learning_rate, + warmup_start_lr, + start_lr, + end_lr, + warmup_epochs, + total_epochs, + cycles=0.5, + last_epoch=-1, + verbose=False): + """init WarmupCosineScheduler """ + self.warmup_epochs = warmup_epochs + self.total_epochs = total_epochs + self.warmup_start_lr = warmup_start_lr + self.start_lr = start_lr + self.end_lr = end_lr + self.cycles = cycles + super(WarmupCosineScheduler, self).__init__(learning_rate, last_epoch, verbose) + + def get_lr(self): + """ return lr value """ + if self.last_epoch < self.warmup_epochs: + val = (self.start_lr - self.warmup_start_lr) * float( + self.last_epoch)/float(self.warmup_epochs) + self.warmup_start_lr + return val + + progress = float(self.last_epoch - self.warmup_epochs) / float( + max(1, self.total_epochs - self.warmup_epochs)) + val = max(0.0, 0.5 * (1. + math.cos(math.pi * float(self.cycles) * 2.0 * progress))) + val = max(0.0, val * (self.start_lr - self.end_lr) + self.end_lr) + return val \ No newline at end of file diff --git a/image_classification/MobileViT/README.md b/image_classification/MobileViT/README.md new file mode 100644 index 00000000..4bffe115 --- /dev/null +++ b/image_classification/MobileViT/README.md @@ -0,0 +1,177 @@ +# MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer, [arxiv](https://arxiv.org/abs/2110.02178) + +PaddlePaddle training/validation code and pretrained models for **MobileViT**. + +The official apple implementation is [here](https://github.com/apple/ml-cvnets). + +This implementation is developed by [PaddleViT](https://github.com/BR-IDL/PaddleViT.git). + + +

+*(figure: MobileViT Transformer Model Overview)*

+ +### Update +* Update (2021-12-30): Add multi scale sampler DDP and update mobilevit_s model weights. +* Update (2021-11-02): Pretrained model weights (mobilevit_s) is released. +* Update (2021-11-02): Pretrained model weights (mobilevit_xs) is released. +* Update (2021-10-29): Pretrained model weights (mobilevit_xxs) is released. +* Update (2021-10-20): Initial code is released. + +## Models Zoo +| Model | Acc@1 | Acc@5 | #Params | FLOPs | Image Size | Crop_pct | Interpolation | Link | +|-------------------------------|-------|-------|---------|--------|------------|----------|---------------|--------------| +| mobilevit_xxs | 70.31| 89.68 | 1.32M | 0.44G | 256 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1l3L-_TxS3QisRUIb8ohcv318vrnrHnWA/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1KFZ5G834_-XXN33W67k8eg)(axpc) | +| mobilevit_xs | 74.47| 92.02 | 2.33M | 0.95G | 256 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1oRMA4pNs2Ba0LYDbPufC842tO4OFcgwq/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1IP8S-S6ZAkiL0OEsiBWNkw)(hfhm) | +| mobilevit_s | 76.74| 93.08 | 5.59M | 1.88G | 256 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1ibkhsswGYWvZwIRjwfgNA4-Oo2stKi0m/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1-rI6hiCHZaI7os2siFASNg)(34bg) | +| mobilevit_s $\dag$ | 77.83| 93.83 | 5.59M | 1.88G | 256 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1BztBJ5jzmqgDWfQk-FB_ywDWqyZYu2yG/view?usp=sharing)/[baidu](https://pan.baidu.com/s/19YepMAO-sveBOLA4aSjIEQ?pwd=92ic)(92ic) | + + + +> The results are evaluated on ImageNet2012 validation set. +> +> All models are trained from scratch using PaddleViT. +> +> $\dag$ means model is trained from scratch using PaddleViT using multi scale sampler DDP. + + +## Notebooks +We provide a few notebooks in aistudio to help you get started: + +**\*(coming soon)\*** + + +## Requirements +- Python>=3.6 +- yaml>=0.2.5 +- [PaddlePaddle](https://www.paddlepaddle.org.cn/documentation/docs/en/install/index_en.html)>=2.1.0 +- [yacs](https://github.com/rbgirshick/yacs)>=0.1.8 + +## Data +ImageNet2012 dataset is used in the following folder structure: +``` +│imagenet/ +├──train/ +│ ├── n01440764 +│ │ ├── n01440764_10026.JPEG +│ │ ├── n01440764_10027.JPEG +│ │ ├── ...... +│ ├── ...... +├──val/ +│ ├── n01440764 +│ │ ├── ILSVRC2012_val_00000293.JPEG +│ │ ├── ILSVRC2012_val_00002138.JPEG +│ │ ├── ...... +│ ├── ...... +``` + +## Usage +To use the model with pretrained weights, download the `.pdparam` weight file and change related file paths in the following python scripts. The model config files are located in `./configs/`. 
+ +For example, assume the downloaded weight file is stored in `./mobilevit_xxs.pdparams`, to use the `mobilevit_xxs` model in python: +```python +from config import get_config +from mobile_vit import build_mobile_vit as build_model +import paddle +# config files in ./configs/ +config = get_config('./configs/mobilevit_xxs.yaml') +# build model +model = build_model(config) +# load pretrained weights +model_state_dict = paddle.load('./mobilevit_xxs.pdparams') +model.set_dict(model_state_dict) +``` + +## Evaluation +To evaluate MobileViT model performance on ImageNet2012 with a single GPU, run the following script using command line: +```shell +sh run_eval.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg=./configs/mobilevit_xxs.yaml \ + -dataset=imagenet2012 \ + -batch_size=16 \ + -data_path=/path/to/dataset/imagenet/val \ + -eval \ + -pretrained=/path/to/pretrained/model/mobilevit_xxs # .pdparams is NOT needed +``` + +
+ + +Run evaluation using multi-GPUs: + + + +```shell +sh run_eval_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg=./configs/mobilevit_xxs.yaml \ + -dataset=imagenet2012 \ + -batch_size=16 \ + -data_path=/path/to/dataset/imagenet/val \ + -eval \ + -pretrained=/path/to/pretrained/model/mobilevit_xxs # .pdparams is NOT needed +``` + +
+ + +## Training +To train the MobileVit XXS model on ImageNet2012 with single GPU, run the following script using command line: +```shell +sh run_train.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0 \ +python main_singel_gpu.py \ + -cfg=./configs/mobilevit_xxs.yaml \ + -dataset=imagenet2012 \ + -batch_size=32 \ + -data_path=/path/to/dataset/imagenet/train +``` + +
+ + +Run training using multi-GPUs: + + + +```shell +sh run_train_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg=./configs/mobilevit_xxs.yaml \ + -dataset=imagenet2012 \ + -batch_size=16 \ + -data_path=/path/to/dataset/imagenet/train +``` + +
+ + +## Visualization Attention Map +**(coming soon)** + +## Reference +``` +@article{mehta2021mobilevit, + title={MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer}, + author={Mehta, Sachin and Rastegari, Mohammad}, + journal={arXiv preprint arXiv:2110.02178}, + year={2021} +} +``` diff --git a/image_classification/MobileViT/__init__.py b/image_classification/MobileViT/__init__.py new file mode 100644 index 00000000..e2cbd538 --- /dev/null +++ b/image_classification/MobileViT/__init__.py @@ -0,0 +1 @@ +#init diff --git a/image_classification/MobileViT/augment.py b/image_classification/MobileViT/augment.py new file mode 100644 index 00000000..7a7f081c --- /dev/null +++ b/image_classification/MobileViT/augment.py @@ -0,0 +1,285 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Augmentation""" +""" Rand Augmentation """ +# reference: RandAugment: Practical automated data augmentation with a reduced search space +# https://arxiv.org/abs/1909.13719 + +""" Auto Augmentation """ +# reference: AutoAugment: Learning Augmentation Policies from Data +# https://arxiv.org/abs/1805.09501 + +import random +import numpy as np +from PIL import Image, ImageEnhance, ImageOps + + +def auto_augment_policy_original(): + """25 types of augment policies in original paper""" + policy = [ + [('Posterize', 0.4, 8), ('Rotate', 0.6, 9)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + [('Posterize', 0.6, 7), ('Posterize', 0.6, 6)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Equalize', 0.4, 4), ('Rotate', 0.8, 8)], + [('Solarize', 0.6, 3), ('Equalize', 0.6, 7)], + [('Posterize', 0.8, 5), ('Equalize', 1.0, 2)], + [('Rotate', 0.2, 3), ('Solarize', 0.6, 8)], + [('Equalize', 0.6, 8), ('Posterize', 0.4, 6)], + [('Rotate', 0.8, 8), ('Color', 0.4, 0)], + [('Rotate', 0.4, 9), ('Equalize', 0.6, 2)], + [('Equalize', 0.0, 7), ('Equalize', 0.8, 8)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Rotate', 0.8, 8), ('Color', 1.0, 2)], + [('Color', 0.8, 8), ('Solarize', 0.8, 7)], + [('Sharpness', 0.4, 7), ('Invert', 0.6, 8)], + [('ShearX', 0.6, 5), ('Equalize', 1.0, 9)], + [('Color', 0.4, 0), ('Equalize', 0.6, 3)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + ] + policy = [[SubPolicy(*args) for args in subpolicy] for subpolicy in policy] + return policy + + +def rand_augment_policy_original(magnitude_idx=9): + """ + 14 types of augment policies in original paper + Args: + magnitude_idx: M + """ + policy = [ + ('Posterize', 1, magnitude_idx), ('Rotate', 1, magnitude_idx), + ('Solarize', 1, magnitude_idx), ('AutoContrast', 1, magnitude_idx), + ('Equalize', 1, magnitude_idx), ('Contrast', 1, magnitude_idx), + ('Color', 1, magnitude_idx), 
('Invert', 1, magnitude_idx), + ('Sharpness', 1, magnitude_idx), ('Brightness', 1, magnitude_idx), + ('ShearX', 1, magnitude_idx), ('ShearY', 1, magnitude_idx), + ('TranslateX', 1, magnitude_idx), ('TranslateY', 1, magnitude_idx), + ] + policy = [SubPolicy(*args) for args in policy] + return policy + + +class AutoAugment(): + """Auto Augment + Randomly choose a tuple of augment ops from a list of policy + Then apply the tuple of augment ops to input image + + Examples: + policy = auto_augment_policy_original() + augment = AutoAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy): + self.policy = policy + + def __call__(self, image, policy_idx=None): + if policy_idx is None: + policy_idx = random.randint(0, len(self.policy) - 1) + + sub_policy = self.policy[policy_idx] + for op in sub_policy: + image = op(image) + return image + + +class RandAugment(): + """Rand Augment + Randomly choose N augment ops from a list of K policies + Then apply the N ops to input image + + Examples: + policy = rand_augment_policy_original(magnitude_idx) + augment = RandAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy, num_layers=2): + """ + Args: + policy: list of SubPolicy + num_layers: int + """ + self.policy = policy + self.num_layers = num_layers + + def __call__(self, image): + selected_idx = np.random.choice(len(self.policy), self.num_layers) + + for policy_idx in selected_idx: + sub_policy = self.policy[policy_idx] + image = sub_policy(image) + return image + + +class SubPolicy: + """Subpolicy + Read augment name and magnitude, apply augment with probability + Args: + op_name: str, augment operation name + prob: float, if prob > random prob, apply augment + magnitude_idx: int, index of magnitude in preset magnitude ranges + """ + + def __init__(self, op_name, prob, magnitude_idx): + # ranges of operations' magnitude + ranges = { + 'ShearX': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'ShearY': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'TranslateX': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'TranslateY': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'Rotate': np.linspace(0, 30, 10), # [-30, 30] (by random negative) + 'Color': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Posterize': np.round(np.linspace(8, 4, 10), 0).astype(np.int), # [0, 4] + 'Solarize': np.linspace(256, 0, 10), # [0, 256] + 'Contrast': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Sharpness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Brightness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'AutoContrast': [0] * 10, # no range + 'Equalize': [0] * 10, # no range + 'Invert': [0] * 10, # no range + } + + # augmentation operations + # Lambda is not pickleable for DDP + # image_ops = { + # 'ShearX': lambda image, magnitude: shear_x(image, magnitude), + # 'ShearY': lambda image, magnitude: shear_y(image, magnitude), + # 'TranslateX': lambda image, magnitude: translate_x(image, magnitude), + # 'TranslateY': lambda image, magnitude: translate_y(image, magnitude), + # 'Rotate': lambda image, magnitude: rotate(image, magnitude), + # 'AutoContrast': lambda image, magnitude: auto_contrast(image, magnitude), + # 'Invert': lambda image, magnitude: invert(image, magnitude), + # 'Equalize': lambda image, magnitude: equalize(image, magnitude), + # 'Solarize': lambda image, magnitude: solarize(image, magnitude), + # 
'Posterize': lambda image, magnitude: posterize(image, magnitude), + # 'Contrast': lambda image, magnitude: contrast(image, magnitude), + # 'Color': lambda image, magnitude: color(image, magnitude), + # 'Brightness': lambda image, magnitude: brightness(image, magnitude), + # 'Sharpness': lambda image, magnitude: sharpness(image, magnitude), + # } + image_ops = { + 'ShearX': shear_x, + 'ShearY': shear_y, + 'TranslateX': translate_x_relative, + 'TranslateY': translate_y_relative, + 'Rotate': rotate, + 'AutoContrast': auto_contrast, + 'Invert': invert, + 'Equalize': equalize, + 'Solarize': solarize, + 'Posterize': posterize, + 'Contrast': contrast, + 'Color': color, + 'Brightness': brightness, + 'Sharpness': sharpness, + } + + self.prob = prob + self.magnitude = ranges[op_name][magnitude_idx] + self.op = image_ops[op_name] + + def __call__(self, image): + if self.prob > random.random(): + image = self.op(image, self.magnitude) + return image + + +# PIL Image transforms +# https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.transform +def shear_x(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, factor, 0, 0, 1, 0), fillcolor=fillcolor) + + +def shear_y(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, factor, 1, 0), fillcolor=fillcolor) + + +def translate_x_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, pixels, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, pixels), fillcolor=fillcolor) + + +def translate_x_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, magnitude, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, magnitude), fillcolor=fillcolor) + + +def rotate(image, magnitude): + rot = image.convert("RGBA").rotate(magnitude) + return Image.composite(rot, + Image.new('RGBA', rot.size, (128,) * 4), + rot).convert(image.mode) + + +def auto_contrast(image, magnitude=None): + return ImageOps.autocontrast(image) + + +def invert(image, magnitude=None): + return ImageOps.invert(image) + + +def equalize(image, magnitude=None): + return ImageOps.equalize(image) + + +def solarize(image, magnitude): + return ImageOps.solarize(image, magnitude) + + +def posterize(image, magnitude): + return ImageOps.posterize(image, magnitude) + + +def contrast(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Contrast(image).enhance(1 + magnitude) + + +def color(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Color(image).enhance(1 + magnitude) + + +def brightness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative 
+ return ImageEnhance.Brightness(image).enhance(1 + magnitude) + + +def sharpness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Sharpness(image).enhance(1 + magnitude) + diff --git a/image_classification/MobileViT/config.py b/image_classification/MobileViT/config.py new file mode 100644 index 00000000..4bc7c431 --- /dev/null +++ b/image_classification/MobileViT/config.py @@ -0,0 +1,196 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Configuration + +Configuration for data, model archtecture, and training, etc. +Config can be set by .yaml file or by argparser(limited usage) + +""" + +import os +from yacs.config import CfgNode as CN +import yaml + +_C = CN() +_C.BASE = [''] + +# data settings +_C.DATA = CN() +_C.DATA.BATCH_SIZE = 8 #1024 batch_size for single GPU +_C.DATA.BATCH_SIZE_EVAL = 8 #1024 batch_size for single GPU +_C.DATA.DATA_PATH = '/dataset/imagenet/' # path to dataset +_C.DATA.DATASET = 'imagenet2012' # dataset name +_C.DATA.IMAGE_SIZE = 224 # input image size +_C.DATA.CROP_PCT = 0.9 # input image scale ratio, scale is applied before centercrop in eval mode +_C.DATA.NUM_WORKERS = 4 # number of data loading threads +_C.DATA.IMAGENET_MEAN = [0.485, 0.456, 0.406] # [0.5, 0.5, 0.5] +_C.DATA.IMAGENET_STD = [0.229, 0.224, 0.225] # [0.5, 0.5, 0.5] + +# model settings +_C.MODEL = CN() +_C.MODEL.TYPE = 'MobileViT' +_C.MODEL.NAME = 'mobilevit' +_C.MODEL.RESUME = None +_C.MODEL.RESUME_EMA = None +_C.MODEL.PRETRAINED = None +_C.MODEL.NUM_CLASSES = 1000 +_C.MODEL.DROPOUT = 0.0 +_C.MODEL.DROPPATH = 0.0 +_C.MODEL.ATTENTION_DROPOUT = 0.0 + +_C.MODEL.PATCH_SIZE = 2 +_C.MODEL.IN_CHANNELS = 3 +_C.MODEL.DIMS = [16, 32, 48, 48, 48, 64, 80, 96, 384] # conv + mobv2 block out dims +_C.MODEL.HIDDEN_DIMS = [96, 120, 133] # mobile vit block hidden dims +_C.MODEL.DEPTH = [2, 4, 3] +_C.MODEL.NUM_HEADS = 8 +_C.MODEL.MLP_RATIO = 2. 
+_C.MODEL.QKV_BIAS = True +_C.MODEL.QK_SCALE = None + +# training settings +_C.TRAIN = CN() +_C.TRAIN.LAST_EPOCH = 0 +_C.TRAIN.NUM_EPOCHS = 300 +_C.TRAIN.WARMUP_EPOCHS = 3 +_C.TRAIN.WEIGHT_DECAY = 0.01 +_C.TRAIN.BASE_LR = 0.002 +_C.TRAIN.WARMUP_START_LR = 0.0002 +_C.TRAIN.END_LR = 0.0002 +_C.TRAIN.GRAD_CLIP = None +_C.TRAIN.ACCUM_ITER = 1 +_C.TRAIN.MODEL_EMA = True +_C.TRAIN.MODEL_EMA_DECAY = 0.99996 +_C.TRAIN.LINEAR_SCALED_LR = None +_C.TRAIN.MULTI_SCALE_SAMPLER_DDP = True # for mobilevit only + +_C.TRAIN.LR_SCHEDULER = CN() +_C.TRAIN.LR_SCHEDULER.NAME = 'warmupcosine' +_C.TRAIN.LR_SCHEDULER.MILESTONES = "30, 60, 90" # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_EPOCHS = 30 # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_RATE = 0.1 # only used in StepLRScheduler + +_C.TRAIN.OPTIMIZER = CN() +_C.TRAIN.OPTIMIZER.NAME = 'AdamW' +_C.TRAIN.OPTIMIZER.EPS = 1e-8 +_C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999) # for adamW +_C.TRAIN.OPTIMIZER.MOMENTUM = 0.9 + +# train augmentation +_C.TRAIN.MIXUP_ALPHA = 0.8 +_C.TRAIN.CUTMIX_ALPHA = 1.0 +_C.TRAIN.CUTMIX_MINMAX = None +_C.TRAIN.MIXUP_PROB = 1.0 +_C.TRAIN.MIXUP_SWITCH_PROB = 0.5 +_C.TRAIN.MIXUP_MODE = 'batch' + +_C.TRAIN.SMOOTHING = 0.1 +_C.TRAIN.COLOR_JITTER = 0.4 +_C.TRAIN.AUTO_AUGMENT = True #'rand-m9-mstd0.5-inc1' + +_C.TRAIN.RANDOM_ERASE_PROB = 0.25 +_C.TRAIN.RANDOM_ERASE_MODE = 'pixel' +_C.TRAIN.RANDOM_ERASE_COUNT = 1 +_C.TRAIN.RANDOM_ERASE_SPLIT = False + +# augmentation +_C.AUG = CN() +_C.AUG.COLOR_JITTER = 0.4 # color jitter factor +_C.AUG.AUTO_AUGMENT = 'rand-m9-mstd0.5-inc1' +_C.AUG.RE_PROB = 0.25 # random earse prob +_C.AUG.RE_MODE = 'pixel' # random earse mode +_C.AUG.RE_COUNT = 1 # random earse count +_C.AUG.MIXUP = 0.8 # mixup alpha, enabled if >0 +_C.AUG.CUTMIX = 1.0 # cutmix alpha, enabled if >0 +_C.AUG.CUTMIX_MINMAX = None # cutmix min/max ratio, overrides alpha +_C.AUG.MIXUP_PROB = 1.0 # prob of mixup or cutmix when either/both is enabled +_C.AUG.MIXUP_SWITCH_PROB = 0.5 # prob of switching cutmix when both mixup and cutmix enabled +_C.AUG.MIXUP_MODE = 'batch' #how to apply mixup/curmix params, per 'batch', 'pair', or 'elem' + +# misc +_C.SAVE = "./output" +_C.TAG = "default" +_C.SAVE_FREQ = 1 # freq to save chpt +_C.REPORT_FREQ = 50 # freq to logging info +_C.VALIDATE_FREQ = 10 # freq to do validation +_C.SEED = 0 +_C.EVAL = False # run evaluation only +_C.AMP = False # mix precision training +_C.LOCAL_RANK = 0 +_C.NGPUS = -1 + + +def _update_config_from_file(config, cfg_file): + config.defrost() + with open(cfg_file, 'r') as infile: + yaml_cfg = yaml.load(infile, Loader=yaml.FullLoader) + for cfg in yaml_cfg.setdefault('BASE', ['']): + if cfg: + _update_config_from_file( + config, os.path.join(os.path.dirname(cfg_file), cfg) + ) + print('merging config from {}'.format(cfg_file)) + config.merge_from_file(cfg_file) + config.freeze() + +def update_config(config, args): + """Update config by ArgumentParser + Args: + args: ArgumentParser contains options + Return: + config: updated config + """ + if args.cfg: + _update_config_from_file(config, args.cfg) + config.defrost() + if args.dataset: + config.DATA.DATASET = args.dataset + if args.batch_size: + config.DATA.BATCH_SIZE = args.batch_size + if args.image_size: + config.DATA.IMAGE_SIZE = args.image_size + if args.num_classes: + config.MODEL.NUM_CLASSES = args.num_classes + if args.data_path: + config.DATA.DATA_PATH = args.data_path + if args.output is not None: + config.SAVE = args.output + if args.ngpus: + config.NGPUS = args.ngpus + if args.eval: + config.EVAL = 
True + config.DATA.BATCH_SIZE_EVAL = args.batch_size + if args.pretrained: + config.MODEL.PRETRAINED = args.pretrained + if args.resume: + config.MODEL.RESUME = args.resume + if args.last_epoch: + config.TRAIN.LAST_EPOCH = args.last_epoch + if args.amp: # only during training + if config.EVAL is True: + config.AMP = False + else: + config.AMP = True + + #config.freeze() + return config + + +def get_config(cfg_file=None): + """Return a clone of config or load from yaml file""" + config = _C.clone() + if cfg_file: + _update_config_from_file(config, cfg_file) + return config diff --git a/image_classification/MobileViT/configs/mobilevit_s.yaml b/image_classification/MobileViT/configs/mobilevit_s.yaml new file mode 100644 index 00000000..52b73fbc --- /dev/null +++ b/image_classification/MobileViT/configs/mobilevit_s.yaml @@ -0,0 +1,22 @@ +DATA: + IMAGE_SIZE: 256 + CROP_PCT: 1.0 +MODEL: + TYPE: MobileViT + NAME: mobilevit_s + DIMS: [16, 32, 64, 64, 64, 96, 128, 160, 640] # conv3x3 + mobile v2 block output channels + HIDDEN_DIMS: [144, 192, 240] # mobile vit block hidden dim +TRAIN: + NUM_EPOCHS: 300 + WARMUP_EPOCHS: 30 + BASE_LR: 0.002 + WARMUP_START_LR: 0.0002 + END_LR: 0.0002 + WEIGHT_DECAY: 0.01 + RANDOM_ERASE_PROB: 0.0 + AUTO_AUGMENT: False + MIXUP_PROB: 0.0 + CUTMIX_ALPHA: 0.0 + CUTMIX_MINMAX: None + LINEAR_SCALED_LR: None + ACCUM_ITER: 4 diff --git a/image_classification/MobileViT/configs/mobilevit_xs.yaml b/image_classification/MobileViT/configs/mobilevit_xs.yaml new file mode 100644 index 00000000..f73ed5e9 --- /dev/null +++ b/image_classification/MobileViT/configs/mobilevit_xs.yaml @@ -0,0 +1,21 @@ +DATA: + IMAGE_SIZE: 256 + CROP_PCT: 1.0 +MODEL: + TYPE: MobileViT + NAME: mobilevit_xs + DIMS: [16, 32, 48, 48, 48, 64, 80, 96, 384] # mobile v2 block output channels + HIDDEN_DIMS: [96, 120, 144] # mobile vit block hidden dim +TRAIN: + NUM_EPOCHS: 300 + WARMUP_EPOCHS: 30 + BASE_LR: 0.002 + WARMUP_START_LR: 0.0002 + END_LR: 0.0002 + WEIGHT_DECAY: 0.01 + RANDOM_ERASE_PROB: 0.0 + AUTO_AUGMENT: False + MIXUP_PROB: 0.0 + CUTMIX_ALPHA: 0.0 + CUTMIX_MINMAX: None + LINEAR_SCALED_LR: None diff --git a/image_classification/MobileViT/configs/mobilevit_xxs.yaml b/image_classification/MobileViT/configs/mobilevit_xxs.yaml new file mode 100644 index 00000000..2ab3de45 --- /dev/null +++ b/image_classification/MobileViT/configs/mobilevit_xxs.yaml @@ -0,0 +1,21 @@ +DATA: + IMAGE_SIZE: 256 + CROP_PCT: 1.0 +MODEL: + TYPE: MobileViT + NAME: mobilevit_xxs + DIMS: [16, 16, 24, 24, 24, 48, 64, 80, 320] # mobile v2 block output channels + HIDDEN_DIMS: [64, 80, 96] # mobile vit block hidden dim +TRAIN: + NUM_EPOCHS: 300 + WARMUP_EPOCHS: 30 + BASE_LR: 0.002 + WARMUP_START_LR: 0.0002 + END_LR: 0.0002 + WEIGHT_DECAY: 0.01 + RANDOM_ERASE_PROB: 0.0 + AUTO_AUGMENT: False + MIXUP_PROB: 0.0 + CUTMIX_ALPHA: 0.0 + CUTMIX_MINMAX: None + LINEAR_SCALED_LR: None diff --git a/image_classification/MobileViT/datasets.py b/image_classification/MobileViT/datasets.py new file mode 100644 index 00000000..00fb9294 --- /dev/null +++ b/image_classification/MobileViT/datasets.py @@ -0,0 +1,259 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Dataset related classes and methods for ViT training and validation +Cifar10, Cifar100 and ImageNet2012 are supported +""" + +import os +import math +from PIL import Image +from paddle.io import Dataset +from paddle.io import DataLoader +from paddle.io import DistributedBatchSampler +from paddle.vision import transforms +from paddle.vision import datasets +from paddle.vision import image_load +from augment import auto_augment_policy_original +from augment import AutoAugment +from augment import rand_augment_policy_original +from augment import RandAugment +from transforms import RandomHorizontalFlip +from random_erasing import RandomErasing +from multi_scale_sampler import MultiScaleSamplerDDP + + +class ImageNet2012Dataset(Dataset): + """Build ImageNet2012 dataset + + This class gets train/val imagenet datasets, which loads transfomed data and labels. + + Attributes: + file_folder: path where imagenet images are stored + transform: preprocessing ops to apply on image + img_path_list: list of full path of images in whole dataset + label_list: list of labels of whole dataset + """ + + def __init__(self, file_folder, mode="train", transform=None): + """Init ImageNet2012 Dataset with dataset file path, mode(train/val), and transform""" + super(ImageNet2012Dataset, self).__init__() + assert mode in ["train", "val"] + self.file_folder = file_folder + self.transform = transform + self.img_path_list = [] + self.label_list = [] + + if mode == "train": + self.list_file = os.path.join(self.file_folder, "train_list.txt") + else: + self.list_file = os.path.join(self.file_folder, "val_list.txt") + + with open(self.list_file, 'r') as infile: + for line in infile: + img_path = line.strip().split()[0] + img_label = int(line.strip().split()[1]) + self.img_path_list.append(os.path.join(self.file_folder, img_path)) + self.label_list.append(img_label) + print(f'----- Imagenet2012 image {mode} list len = {len(self.label_list)}') + + def __len__(self): + return len(self.label_list) + + def __getitem__(self, index): + if isinstance(index, (list, tuple)): + w, h, idx = index + w = int(w) + h = int(h) + data = Image.open(self.img_path_list[idx]).convert('RGB') + data = self.transform(data, image_size=(w, h)) + label = self.label_list[idx] + else: + data = Image.open(self.img_path_list[index]).convert('RGB') + data = self.transform(data) + label = self.label_list[index] + + return data, label + + +def get_train_transforms(config): + """ Get training transforms + + For training, a RandomResizedCrop is applied, then normalization is applied with + [0.5, 0.5, 0.5] mean and std. The input pixel values must be rescaled to [0, 1.] 
+ Outputs is converted to tensor + + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + + aug_op_list = [] + # STEP1: random crop and resize + aug_op_list.append(RandomHorizontalFlip(0.5)) + aug_op_list.append( + transforms.RandomResizedCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE), + scale=(0.05, 1.0), interpolation='bicubic')) + # STEP2: auto_augment or color jitter + if config.TRAIN.AUTO_AUGMENT: + policy = auto_augment_policy_original() + auto_augment = AutoAugment(policy) + aug_op_list.append(auto_augment) + elif config.TRAIN.RAND_AUGMENT: + policy = rand_augment_policy_original() + rand_augment = RandAugment(policy) + aug_op_list.append(rand_augment) + else: + jitter = (float(config.TRAIN.COLOR_JITTER), ) * 3 + aug_op_list.append(transforms.ColorJitter(*jitter)) + # STEP3: other ops + aug_op_list.append(transforms.ToTensor()) + aug_op_list.append(transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, + std=config.DATA.IMAGENET_STD)) + # STEP4: random erasing + if config.TRAIN.RANDOM_ERASE_PROB > 0.: + random_erasing = RandomErasing(prob=config.TRAIN.RANDOM_ERASE_PROB, + mode=config.TRAIN.RANDOM_ERASE_MODE, + max_count=config.TRAIN.RANDOM_ERASE_COUNT, + num_splits=config.TRAIN.RANDOM_ERASE_SPLIT) + aug_op_list.append(random_erasing) + # Final: compose transforms and return + transforms_train = Compose(aug_op_list) + #transforms_train = transforms.Compose(aug_op_list) + return transforms_train + + +def get_val_transforms(config): + """ Get training transforms + + For validation, image is first Resize then CenterCrop to image_size. + Then normalization is applied with [0.5, 0.5, 0.5] mean and std. + The input pixel values must be rescaled to [0, 1.] + Outputs is converted to tensor + + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + + scale_size = int(math.floor(config.DATA.IMAGE_SIZE / config.DATA.CROP_PCT)) + #transforms_val = transforms.Compose([ + transforms_val = Compose([ + transforms.Resize(scale_size, interpolation='bicubic'), + transforms.CenterCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE)), + transforms.ToTensor(), + transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD), + ]) + return transforms_val + + +class Compose(): + def __init__(self, transforms): + self.transforms = transforms + + def __call__(self, image, image_size=None): + if image_size is not None: + if isinstance(self.transforms[1], transforms.RandomResizedCrop): + self.transforms[1] = transforms.RandomResizedCrop( + image_size, scale=(0.05, 1.0), interpolation='bicubic') + for t in self.transforms: + image = t(image) + return image + + + +def get_dataset(config, mode='train'): + """ Get dataset from config and mode (train/val) + + Returns the related dataset object according to configs and mode(train/val) + + Args: + config: configs contains dataset related settings. 
see config.py for details + Returns: + dataset: dataset object + """ + + assert mode in ['train', 'val'] + if config.DATA.DATASET == "cifar10": + if mode == 'train': + dataset = datasets.Cifar10(mode=mode, transform=get_train_transforms(config)) + else: + mode = 'test' + dataset = datasets.Cifar10(mode=mode, transform=get_val_transforms(config)) + elif config.DATA.DATASET == "cifar100": + if mode == 'train': + dataset = datasets.Cifar100(mode=mode, transform=get_train_transforms(config)) + else: + mode = 'test' + dataset = datasets.Cifar100(mode=mode, transform=get_val_transforms(config)) + elif config.DATA.DATASET == "imagenet2012": + if mode == 'train': + dataset = ImageNet2012Dataset(config.DATA.DATA_PATH, + mode=mode, + transform=get_train_transforms(config)) + else: + dataset = ImageNet2012Dataset(config.DATA.DATA_PATH, + mode=mode, + transform=get_val_transforms(config)) + else: + raise NotImplementedError( + "[{config.DATA.DATASET}] Only cifar10, cifar100, imagenet2012 are supported now") + return dataset + + +def get_dataloader(config, dataset, mode='train', multi_process=False): + """Get dataloader with config, dataset, mode as input, allows multiGPU settings. + + Multi-GPU loader is implements as distributedBatchSampler. + + Args: + config: see config.py for details + dataset: paddle.io.dataset object + mode: train/val + multi_process: if True, use DistributedBatchSampler to support multi-processing + Returns: + dataloader: paddle.io.DataLoader object. + """ + + if mode == 'train': + batch_size = config.DATA.BATCH_SIZE + else: + batch_size = config.DATA.BATCH_SIZE_EVAL + + if multi_process is True: + if config.TRAIN.MULTI_SCALE_SAMPLER_DDP: + sampler = MultiScaleSamplerDDP(base_image_width=config.DATA.IMAGE_SIZE, + base_image_height=config.DATA.IMAGE_SIZE, + base_batch_size=batch_size, + num_data_samples=len(dataset), + is_training=(mode == 'train')) + dataloader = DataLoader(dataset, + batch_sampler=sampler, + num_workers=config.DATA.NUM_WORKERS) + else: + sampler = DistributedBatchSampler(dataset, + batch_size=batch_size, + shuffle=(mode == 'train')) + dataloader = DataLoader(dataset, + batch_sampler=sampler, + num_workers=config.DATA.NUM_WORKERS) + else: + dataloader = DataLoader(dataset, + batch_size=batch_size, + num_workers=config.DATA.NUM_WORKERS, + shuffle=(mode == 'train')) + return dataloader diff --git a/image_classification/MobileViT/droppath.py b/image_classification/MobileViT/droppath.py new file mode 100644 index 00000000..08472aea --- /dev/null +++ b/image_classification/MobileViT/droppath.py @@ -0,0 +1,62 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
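For reference, a minimal single-process sketch of how the dataset and dataloader helpers above are typically combined (config field names such as DATA.BATCH_SIZE_EVAL follow config.py in this PR; the concrete values come from the yaml/CLI overrides, and this snippet is illustrative only, not part of the patch):

# Illustrative usage sketch: build the validation pipeline on a single
# process, i.e. without the DistributedBatchSampler branch.
from config import get_config
from datasets import get_dataset, get_dataloader

config = get_config()                                    # defaults from config.py
dataset_val = get_dataset(config, mode='val')            # e.g. ImageNet2012Dataset
dataloader_val = get_dataloader(config, dataset_val,
                                mode='val', multi_process=False)

for images, labels in dataloader_val:
    # images: [BATCH_SIZE_EVAL, 3, IMAGE_SIZE, IMAGE_SIZE], labels: [BATCH_SIZE_EVAL]
    print(images.shape, labels.shape)
    break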
+ +""" +Droppath, reimplement from https://github.com/yueatsprograms/Stochastic_Depth +""" + +import numpy as np +import paddle +import paddle.nn as nn + + +class DropPath(nn.Layer): + """DropPath class""" + def __init__(self, drop_prob=None): + super(DropPath, self).__init__() + self.drop_prob = drop_prob + + def drop_path(self, inputs): + """drop path op + Args: + input: tensor with arbitrary shape + drop_prob: float number of drop path probability, default: 0.0 + training: bool, if current mode is training, default: False + Returns: + output: output tensor after drop path + """ + # if prob is 0 or eval mode, return original input + if self.drop_prob == 0. or not self.training: + return inputs + keep_prob = 1 - self.drop_prob + keep_prob = paddle.to_tensor(keep_prob, dtype='float32') + shape = (inputs.shape[0], ) + (1, ) * (inputs.ndim - 1) # shape=(N, 1, 1, 1) + random_tensor = keep_prob + paddle.rand(shape, dtype=inputs.dtype) + random_tensor = random_tensor.floor() # mask + output = inputs.divide(keep_prob) * random_tensor # divide is to keep same output expectation + return output + + def forward(self, inputs): + return self.drop_path(inputs) + + +#def main(): +# tmp = paddle.to_tensor(np.random.rand(8, 16, 8, 8), dtype='float32') +# dp = DropPath(0.5) +# for i in range(100): +# out = dp(tmp) +# print(out) +# +#if __name__ == "__main__": +# main() diff --git a/image_classification/MobileViT/losses.py b/image_classification/MobileViT/losses.py new file mode 100644 index 00000000..082467a3 --- /dev/null +++ b/image_classification/MobileViT/losses.py @@ -0,0 +1,123 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
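A quick sanity check of the expectation-preserving behavior noted in the drop_path comment above (outputs are divided by keep_prob, so the mean over many stochastic passes stays close to the input mean); this snippet is illustrative only and not part of the PR:

import paddle
from droppath import DropPath

paddle.seed(0)
x = paddle.ones([64, 16, 8, 8])
dp = DropPath(drop_prob=0.2)
dp.train()                      # drop path is a no-op in eval mode

acc = paddle.zeros_like(x)
num_runs = 200
for _ in range(num_runs):
    acc += dp(x)
print(float((acc / num_runs).mean()))   # should be close to 1.0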
+ +""" Implement Loss functions """ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class LabelSmoothingCrossEntropyLoss(nn.Layer): + """ cross entropy loss for label smoothing + Args: + smoothing: float, smoothing rate + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, target label with shape [N] + Return: + loss: float, cross entropy loss value + """ + def __init__(self, smoothing=0.1): + super().__init__() + assert 0 <= smoothing < 1.0 + self.smoothing = smoothing + self.confidence = 1 - smoothing + + def forward(self, x, target): + log_probs = F.log_softmax(x) # [N, num_classes] + # target_index is used to get prob for each of the N samples + target_index = paddle.zeros([x.shape[0], 2], dtype='int64') # [N, 2] + target_index[:, 0] = paddle.arange(x.shape[0]) + target_index[:, 1] = target + + nll_loss = -log_probs.gather_nd(index=target_index) # index: [N] + smooth_loss = -log_probs.mean(axis=-1) + loss = self.confidence * nll_loss + self.smoothing * smooth_loss + return loss.mean() + + +class SoftTargetCrossEntropyLoss(nn.Layer): + """ cross entropy loss for soft target + Args: + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, soft target with shape [N, num_classes] + Returns: + loss: float, the mean loss value + """ + def __init__(self): + super().__init__() + + def forward(self, x, target): + loss = paddle.sum(-target * F.log_softmax(x, axis=-1), axis=-1) + return loss.mean() + + +class DistillationLoss(nn.Layer): + """Distillation loss function + This layer includes the orginal loss (criterion) and a extra + distillation loss (criterion), which computes the loss with + different type options, between current model and + a teacher model as its supervision. 
+ + Args: + base_criterion: nn.Layer, the original criterion + teacher_model: nn.Layer, the teacher model as supervision + distillation_type: str, one of ['none', 'soft', 'hard'] + alpha: float, ratio of base loss (* (1-alpha)) + and distillation loss( * alpha) + tao: float, temperature in distillation + """ + def __init__(self, + base_criterion, + teacher_model, + distillation_type, + alpha, + tau): + super().__init__() + assert distillation_type in ['none', 'soft', 'hard'] + self.base_criterion = base_criterion + self.teacher_model = teacher_model + self.type = distillation_type + self.alpha = alpha + self.tau = tau + + def forward(self, inputs, outputs, targets): + """ + Args: + inputs: tensor, the orginal model inputs + outputs: tensor, the outputs of the model + outputds_kd: tensor, the distillation outputs of the model, + this is usually obtained by a separate branch + in the last layer of the model + targets: tensor, the labels for the base criterion + """ + outputs, outputs_kd = outputs[0], outputs[1] + base_loss = self.base_criterion(outputs, targets) + if self.type == 'none': + return base_loss + + with paddle.no_grad(): + teacher_outputs = self.teacher_model(inputs) + + if self.type == 'soft': + distillation_loss = F.kl_div( + F.log_softmax(outputs_kd / self.tau, axis=1), + F.log_softmax(teacher_outputs / self.tau, axis=1), + reduction='sum') * (self.tau * self.tau) / outputs_kd.numel() + elif self.type == 'hard': + distillation_loss = F.cross_entropy(outputs_kd, teacher_outputs.argmax(axis=1)) + + loss = base_loss * (1 - self.alpha) + distillation_loss * self.alpha + return loss + + diff --git a/image_classification/MobileViT/main_multi_gpu.py b/image_classification/MobileViT/main_multi_gpu.py new file mode 100644 index 00000000..472e0392 --- /dev/null +++ b/image_classification/MobileViT/main_multi_gpu.py @@ -0,0 +1,608 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
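As a quick, illustrative cross-check (not part of the PR) of the loss classes above: with smoothing set to 0.0, LabelSmoothingCrossEntropyLoss reduces to the standard cross entropy, and SoftTargetCrossEntropyLoss fed a one-hot target yields the same value:

import paddle
import paddle.nn as nn
import paddle.nn.functional as F
from losses import LabelSmoothingCrossEntropyLoss, SoftTargetCrossEntropyLoss

paddle.seed(0)
logits = paddle.randn([4, 10])                 # [N, num_classes]
labels = paddle.randint(0, 10, [4])            # [N]

ce = nn.CrossEntropyLoss()(logits, labels)
ls = LabelSmoothingCrossEntropyLoss(smoothing=0.0)(logits, labels)
soft = SoftTargetCrossEntropyLoss()(logits, F.one_hot(labels, num_classes=10))
print(float(ce), float(ls), float(soft))       # all three should agree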
+ +"""MobileViT training/validation using multiple GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +import paddle.distributed as dist +from datasets import get_dataloader +from datasets import get_dataset +from utils import AverageMeter +from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn +from config import get_config +from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from losses import DistillationLoss +from model_ema import ModelEma +from mobile_vit import build_mobile_vit as build_model + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('MobileViT') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + model_ema=None, + mixup_fn=None, + amp=False, + local_logger=None, + master_logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + model_ema: ModelEma, model moving average instance + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + master_train_loss_meter.avg: float, average loss on all processes/gpus + 
master_train_acc_meter.avg: float, average top1 accuracy on all processes/gpus + train_time: float, training time + """ + model.train() + train_loss_meter = AverageMeter() + train_acc_meter = AverageMeter() + master_train_loss_meter = AverageMeter() + master_train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() + + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + if model_ema is not None and dist.get_rank() == 0: + model_ema.update(model) + + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = paddle.to_tensor(image.shape[0]) + + # sync from other gpus for overall loss and acc + master_loss = loss.clone() + master_acc = acc.clone() + master_batch_size = batch_size.clone() + dist.all_reduce(master_loss) + dist.all_reduce(master_acc) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc = master_acc / dist.get_world_size() + master_train_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_train_acc_meter.update(master_acc.numpy()[0], master_batch_size.numpy()[0]) + + train_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + train_acc_meter.update(acc.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_train_loss_meter.avg:.4f}, " + + f"Avg Acc: {master_train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return (train_loss_meter.avg, + train_acc_meter.avg, + master_train_loss_meter.avg, + master_train_acc_meter.avg, + train_time) + + +def validate(dataloader, + model, + criterion, + total_batch, + debug_steps=100, + local_logger=None, + master_logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + total_epoch: int, total num of epoch, for logging + debug_steps: int, num of iters to log info, default: 100 + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + val_loss_meter.avg: float, average loss on current 
process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + master_val_loss_meter.avg: float, average loss on all processes/gpus + master_val_acc1_meter.avg: float, average top1 accuracy on all processes/gpus + master_val_acc5_meter.avg: float, average top5 accuracy on all processes/gpus + val_time: float, validation time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = AverageMeter() + master_val_loss_meter = AverageMeter() + master_val_acc1_meter = AverageMeter() + master_val_acc5_meter = AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = paddle.to_tensor(image.shape[0]) + + master_loss = loss.clone() + master_acc1 = acc1.clone() + master_acc5 = acc5.clone() + master_batch_size = batch_size.clone() + + dist.all_reduce(master_loss) + dist.all_reduce(master_acc1) + dist.all_reduce(master_acc5) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc1 = master_acc1 / dist.get_world_size() + master_acc5 = master_acc5 / dist.get_world_size() + + master_val_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc1_meter.update(master_acc1.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc5_meter.update(master_acc5.numpy()[0], master_batch_size.numpy()[0]) + + val_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + val_acc1_meter.update(acc1.numpy()[0], batch_size.numpy()[0]) + val_acc5_meter.update(acc5.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {master_val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {master_val_acc5_meter.avg:.4f}") + val_time = time.time() - time_st + return (val_loss_meter.avg, + val_acc1_meter.avg, + val_acc5_meter.avg, + master_val_loss_meter.avg, + master_val_acc1_meter.avg, + master_val_acc5_meter.avg, + val_time) + + +def main_worker(*args): + # STEP 0: Preparation + config = args[0] + dist.init_parallel_env() + last_epoch = config.TRAIN.LAST_EPOCH + world_size = dist.get_world_size() + local_rank = dist.get_rank() + seed = config.SEED + local_rank + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + # logger for each process/gpu + local_logger = get_logger( + filename=os.path.join(config.SAVE, 'log_{}.txt'.format(local_rank)), + logger_name='local_logger') + # overall logger + if local_rank == 0: + master_logger = get_logger( + filename=os.path.join(config.SAVE, 'log.txt'), + logger_name='master_logger') + master_logger.info(f'\n{config}') + else: + master_logger = None + local_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + if local_rank == 0: + master_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + + # 
STEP 1: Create model + model = build_model(config) + # define model ema + model_ema = None + if not config.EVAL and config.TRAIN.MODEL_EMA and local_rank == 0: + model_ema = ModelEma(model, decay=config.TRAIN.MODEL_EMA_DECAY) + model = paddle.DataParallel(model) + + # STEP 2: Create train and val dataloader + dataset_train, dataset_val = args[1], args[2] + # Create training dataloader + if not config.EVAL: + dataloader_train = get_dataloader(config, dataset_train, 'train', True) + total_batch_train = len(dataloader_train) + local_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + if local_rank == 0: + master_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + # Create validation dataloader + dataloader_val = get_dataloader(config, dataset_val, 'test', True) + total_batch_val = len(dataloader_val) + local_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + if local_rank == 0: + master_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH, + ) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in 
config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + local_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip, + #apply_decay_param_fun=get_exclude_from_weight_decay_fn([ + # 'absolute_pos_embed', 'relative_position_bias_table']), + ) + else: + local_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 6: Load pretrained model / load resumt model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + model.set_dict(model_state) + local_logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + if local_rank == 0: + master_logger.info( + f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') + optimizer.set_state_dict(opt_state) + local_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + if local_rank == 0: + master_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + # load ema model + if model_ema is not None and os.path.isfile(config.MODEL.RESUME + '-EMA.pdparams'): + model_ema_state = paddle.load(config.MODEL.RESUME + '-EMA.pdparams') + model_ema.module.set_state_dict(model_ema_state) + local_logger.info(f'----- Load model ema from {config.MODEL.RESUME}-EMA.pdparams') + if local_rank == 0: + master_logger.info(f'----- Load model ema from {config.MODEL.RESUME}-EMA.pdparams') + + # STEP 7: Validation (eval mode) + if config.EVAL: + local_logger.info('----- Start Validating') + if local_rank == 0: + 
master_logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 8: Start training and validation (train mode) + local_logger.info(f"Start training from epoch {last_epoch+1}.") + if local_rank == 0: + master_logger.info(f"Start training from epoch {last_epoch+1}.") + for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + local_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + if local_rank == 0: + master_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + train_loss, train_acc, avg_loss, avg_acc, train_time = train( + dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=total_batch_train, + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + model_ema=model_ema, + mixup_fn=mixup_fn, + amp=config.AMP, + local_logger=local_logger, + master_logger=master_logger) + + scheduler.step() + + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {avg_loss:.4f}, " + + f"Train Acc: {avg_acc:.4f}, " + + f"time: {train_time:.2f}") + + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + local_logger.info(f'----- Validation after Epoch: {epoch}') + if local_rank == 0: + master_logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if local_rank == 0: + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + master_logger.info(f"----- Save model: {model_path}.pdparams") + master_logger.info(f"----- Save optim: {model_path}.pdopt") + if model_ema is not None: + model_ema_path = os.path.join( + 
config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}-EMA") + paddle.save(model_ema.state_dict(), model_ema_path + '.pdparams') + master_logger.info(f"----- Save ema model: {model_ema_path}.pdparams") + + +def main(): + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + + # get dataset and start DDP + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + else: + dataset_train = None + dataset_val = get_dataset(config, mode='val') + config.NGPUS = len(paddle.static.cuda_places()) if config.NGPUS == -1 else config.NGPUS + dist.spawn(main_worker, args=(config, dataset_train, dataset_val, ), nprocs=config.NGPUS) + + +if __name__ == "__main__": + main() diff --git a/image_classification/MobileViT/main_single_gpu.py b/image_classification/MobileViT/main_single_gpu.py new file mode 100644 index 00000000..9d4e95e9 --- /dev/null +++ b/image_classification/MobileViT/main_single_gpu.py @@ -0,0 +1,450 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
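To make the linear learning-rate scaling applied in STEP 5 of main_multi_gpu.py concrete, here is a small worked sketch of the same arithmetic; the numeric values below are hypothetical placeholders, not defaults shipped with this PR:

# effective_lr = BASE_LR * BATCH_SIZE (per GPU) * world_size / LINEAR_SCALED_LR,
# then multiplied by ACCUM_ITER when gradients are accumulated.
base_lr = 0.002            # hypothetical TRAIN.BASE_LR
batch_size = 128           # hypothetical DATA.BATCH_SIZE per GPU
world_size = 8             # number of GPUs
linear_scaled_ref = 1024   # hypothetical TRAIN.LINEAR_SCALED_LR (reference total batch size)
accum_iter = 2             # hypothetical TRAIN.ACCUM_ITER

effective_lr = base_lr * batch_size * world_size / linear_scaled_ref   # 0.002
effective_lr *= accum_iter                                             # 0.004
print(effective_lr)

The same scaling, without the world_size factor, is applied in main_single_gpu.py below.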
+ +"""MobileViT training/validation using single GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from datasets import get_dataloader +from datasets import get_dataset +from utils import AverageMeter +from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn +from config import get_config +from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from losses import DistillationLoss +from model_ema import ModelEma +from mobile_vit import build_mobile_vit as build_model + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('MobileViT') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + model_ema=None, + mixup_fn=None, + amp=False, + logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + model_ema: ModelEma, model moving average instance + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + logger: logger for logging, default: None + Returns: + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + train_time: float, training time + """ + model.train() + train_loss_meter = AverageMeter() + train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st 
= time.time() + + + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() + + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + if model_ema is not None: + model_ema.update(model) + + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = image.shape[0] + train_loss_meter.update(loss.numpy()[0], batch_size) + train_acc_meter.update(acc.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return train_loss_meter.avg, train_acc_meter.avg, train_time + + +def validate(dataloader, model, criterion, total_batch, debug_steps=100, logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + logger: logger for logging, default: None + Returns: + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + val_time: float, valitaion time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = image.shape[0] + val_loss_meter.update(loss.numpy()[0], batch_size) + val_acc1_meter.update(acc1.numpy()[0], batch_size) + val_acc5_meter.update(acc5.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + + val_time = time.time() - time_st + return val_loss_meter.avg, val_acc1_meter.avg, val_acc5_meter.avg, val_time + + +def main(): + # STEP 0: Preparation + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + # 
set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + last_epoch = config.TRAIN.LAST_EPOCH + seed = config.SEED + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + logger = get_logger(filename=os.path.join(config.SAVE, 'log.txt')) + logger.info(f'\n{config}') + + # STEP 1: Create model + model = build_model(config) + # define model ema + model_ema = None + if not config.EVAL and config.TRAIN.MODEL_EMA: + model_ema = ModelEma(model, decay=config.TRAIN.MODEL_EMA_DECAY) + + # STEP 2: Create train and val dataloader + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + dataloader_train = get_dataloader(config, dataset_train, 'train', False) + dataset_val = get_dataset(config, mode='val') + dataloader_val = get_dataloader(config, dataset_val, 'val', False) + + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = 
paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip, + #apply_decay_param_fun=get_exclude_from_weight_decay_fn([ + # 'absolute_pos_embed', 'relative_position_bias_table']), + ) + else: + logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 6: Load pretrained model or load resume model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + model.set_dict(model_state) + logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') + optimizer.set_state_dict(opt_state) + logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + # load ema model + if model_ema is not None and os.path.isfile(config.MODEL.RESUME + '-EMA.pdparams'): + model_ema_state = paddle.load(config.MODEL.RESUME + '-EMA.pdparams') + model_ema.module.set_state_dict(model_ema_state) + logger.info(f'----- Load model ema from {config.MODEL.RESUME}-EMA.pdparams') + + # STEP 7: Validation (eval mode) + if config.EVAL: + logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 8: Start training and validation (train mode) + logger.info(f"Start training from epoch {last_epoch+1}.") + for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + logger.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") + train_loss, train_acc, train_time = train(dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=len(dataloader_train), + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + model_ema=model_ema, + mixup_fn=mixup_fn, + amp=config.AMP, + logger=logger) + scheduler.step() + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + logger.info(f"----- Save model: {model_path}.pdparams") + logger.info(f"----- Save optim: {model_path}.pdopt") + if model_ema is not None: + model_ema_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}-EMA") + paddle.save(model_ema.state_dict(), model_ema_path + '.pdparams') + logger.info(f"----- Save ema model: {model_ema_path}.pdparams") + + +if __name__ == "__main__": + main() diff --git a/image_classification/MobileViT/mixup.py b/image_classification/MobileViT/mixup.py new file mode 100644 index 00000000..1d2db493 --- /dev/null +++ b/image_classification/MobileViT/mixup.py @@ -0,0 +1,225 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""mixup and cutmix for batch data""" +import numpy as np +import paddle + + +def rand_bbox(image_shape, lam, count=None): + """ CutMix bbox by lam value + Generate 1 random bbox by value lam. lam is the cut size rate. + The cut_size is computed by sqrt(1-lam) * image_size. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + count: int, number of bbox to generate + """ + image_h, image_w = image_shape[-2:] + cut_rate = np.sqrt(1. 
- lam) + cut_h = int(cut_rate * image_h) + cut_w = int(cut_rate * image_w) + + # get random bbox center + cy = np.random.randint(0, image_h, size=count) + cx = np.random.randint(0, image_w, size=count) + + # get bbox coords + bbox_x1 = np.clip(cx - cut_w // 2, 0, image_w) + bbox_y1 = np.clip(cy - cut_h // 2, 0, image_h) + bbox_x2 = np.clip(cx + cut_w // 2, 0, image_w) + bbox_y2 = np.clip(cy + cut_h // 2, 0, image_h) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # while in pytorch, it will return [] tensor + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def rand_bbox_minmax(image_shape, minmax, count=None): + """ CutMix bbox by min and max value + Generate 1 random bbox by min and max percentage values. + Minmax is a tuple/list of min and max percentage vlaues + applied to the image width and height. + + Args: + image_shape: tuple/list, image height and width + minmax: tuple/list, min and max percentage values of image size + count: int, number of bbox to generate + """ + assert len(minmax) == 2 + image_h, image_w = image_shape[-2:] + min_ratio = minmax[0] + max_ratio = minmax[1] + cut_h = np.random.randint(int(image_h * min_ratio), int(image_h * max_ratio), size=count) + cut_w = np.random.randint(int(image_w * min_ratio), int(image_w * max_ratio), size=count) + + bbox_x1 = np.random.randint(0, image_w - cut_w, size=count) + bbox_y1 = np.random.randint(0, image_h - cut_h, size=count) + bbox_x2 = bbox_x1 + cut_w + bbox_y2 = bbox_y1 + cut_h + + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def cutmix_generate_bbox_adjust_lam(image_shape, lam, minmax=None, correct_lam=True, count=None): + """Generate bbox and apply correction for lambda + If the mimmax is None, apply the standard cutmix by lam value, + If the minmax is set, apply the cutmix by min and max percentage values. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + minmax: tuple/list, min and max percentage values of image size + correct_lam: bool, if True, correct the lam value by the generated bbox + count: int, number of bbox to generate + """ + if minmax is not None: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox_minmax(image_shape, minmax, count) + else: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox(image_shape, lam, count) + + if correct_lam or minmax is not None: + image_h, image_w = image_shape[-2:] + bbox_area = (bbox_y2 - bbox_y1) * (bbox_x2 - bbox_x1) + lam = 1. - bbox_area / float(image_h * image_w) + return (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam + + +def one_hot(x, num_classes, on_value=1., off_value=0.): + """ Generate one-hot vector for label smoothing + Args: + x: tensor, contains label/class indices + num_classes: int, num of classes (len of the one-hot vector) + on_value: float, the vector value at label index, default=1. + off_value: float, the vector value at non-label indices, default=0. + Returns: + one_hot: tensor, tensor with on value at label index and off value + at non-label indices. + """ + x = x.reshape_([-1, 1]) + x_smoothed = paddle.full((x.shape[0], num_classes), fill_value=off_value) + for i in range(x.shape[0]): + x_smoothed[i, x[i]] = on_value + return x_smoothed + + +def mixup_one_hot(label, num_classes, lam=1., smoothing=0.): + """ mixup and label smoothing in batch + label smoothing is firstly applied, then + mixup is applied by mixing the bacth and its flip, + with a mixup rate. 
+ + Args: + label: tensor, label tensor with shape [N], contains the class indices + num_classes: int, num of all classes + lam: float, mixup rate, default=1.0 + smoothing: float, label smoothing rate + """ + off_value = smoothing / num_classes + on_value = 1. - smoothing + off_value + y1 = one_hot(label, num_classes, on_value, off_value) + y2 = one_hot(label.flip(axis=[0]), num_classes, on_value, off_value) + return y2 * (1 - lam) + y1 * lam + + +class Mixup: + """Mixup class + Args: + mixup_alpha: float, mixup alpha for beta distribution, default=1.0, + cutmix_alpha: float, cutmix alpha for beta distribution, default=0.0, + cutmix_minmax: list/tuple, min and max value for cutmix ratio, default=None, + prob: float, if random prob < prob, do not use mixup, default=1.0, + switch_prob: float, prob of switching mixup and cutmix, default=0.5, + mode: string, mixup up, now only 'batch' is supported, default='batch', + correct_lam: bool, if True, apply correction of lam, default=True, + label_smoothing: float, label smoothing rate, default=0.1, + num_classes: int, num of classes, default=1000 + """ + def __init__(self, + mixup_alpha=1.0, + cutmix_alpha=0.0, + cutmix_minmax=None, + prob=1.0, + switch_prob=0.5, + mode='batch', + correct_lam=True, + label_smoothing=0.1, + num_classes=1000): + self.mixup_alpha = mixup_alpha + self.cutmix_alpha = cutmix_alpha + self.cutmix_minmax = cutmix_minmax + if cutmix_minmax is not None: + assert len(cutmix_minmax) == 2 + self.cutmix_alpha = 1.0 + self.mix_prob = prob + self.switch_prob = switch_prob + self.label_smoothing = label_smoothing + self.num_classes = num_classes + self.mode = mode + self.correct_lam = correct_lam + assert mode == 'batch', 'Now only batch mode is supported!' + + def __call__(self, x, target): + assert x.shape[0] % 2 == 0, "Batch size should be even" + lam = self._mix_batch(x) + target = mixup_one_hot(target, self.num_classes, lam, self.label_smoothing) + return x, target + + def get_params(self): + """Decide to use cutmix or regular mixup by sampling and + sample lambda for mixup + """ + lam = 1. + use_cutmix = False + use_mixup = np.random.rand() < self.mix_prob + if use_mixup: + if self.mixup_alpha > 0. and self.cutmix_alpha > 0.: + use_cutmix = np.random.rand() < self.switch_prob + alpha = self.cutmix_alpha if use_cutmix else self.mixup_alpha + lam_mix = np.random.beta(alpha, alpha) + elif self.mixup_alpha == 0. and self.cutmix_alpha > 0.: + use_cutmix=True + lam_mix = np.random.beta(self.cutmix_alpha, self.cutmix_alpha) + elif self.mixup_alpha > 0. 
and self.cutmix_alpha == 0.: + lam_mix = np.random.beta(self.mixup_alpha, self.mixup_alpha) + else: + raise ValueError('mixup_alpha and cutmix_alpha cannot be all 0') + lam = float(lam_mix) + return lam, use_cutmix + + def _mix_batch(self, x): + """mixup/cutmix by adding batch data and its flipped version""" + lam, use_cutmix = self.get_params() + if lam == 1.: + return lam + if use_cutmix: + (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam = cutmix_generate_bbox_adjust_lam( + x.shape, + lam, + minmax=self.cutmix_minmax, + correct_lam=self.correct_lam) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # but in pytorch, it will return [] tensor without errors + if int(bbox_x1) != int(bbox_x2) and int(bbox_y1) != int(bbox_y2): + x[:, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] = x.flip(axis=[0])[ + :, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] + else: + x_flipped = x.flip(axis=[0]) + x_flipped = x_flipped * (1 - lam) + x.set_value(x * (lam) + x_flipped) + return lam diff --git a/image_classification/MobileViT/mobile_vit.py b/image_classification/MobileViT/mobile_vit.py new file mode 100644 index 00000000..fdd5971d --- /dev/null +++ b/image_classification/MobileViT/mobile_vit.py @@ -0,0 +1,406 @@ +import paddle +import paddle.nn as nn + + +def _init_weights_linear(): + weight_attr = paddle.ParamAttr(initializer=nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + + +def _init_weights_layernorm(): + weight_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(1.0)) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + + +# DONE +class ConvNormAct(nn.Layer): + def __init__(self, + in_channels, + out_channels, + kernel_size=3, + stride=1, + padding=0, + bias_attr=False, + groups=1): + super().__init__() + self.conv = nn.Conv2D(in_channels=in_channels, + out_channels=out_channels, + kernel_size=kernel_size, + stride=stride, + padding=padding, + groups=groups, + weight_attr=paddle.ParamAttr(initializer=nn.initializer.KaimingUniform()), + bias_attr=bias_attr) + self.norm = nn.BatchNorm2D(out_channels) + self.act = nn.Silu() + + def forward(self, inputs): + out = self.conv(inputs) + out = self.norm(out) + out = self.act(out) + return out + + +# DONE +class Identity(nn.Layer): + """ Identity layer""" + def __init__(self): + super().__init__() + + def forward(self, inputs): + return inputs + + +#DONE +class Mlp(nn.Layer): + def __init__(self, + embed_dim, + mlp_ratio, + dropout=0.): + super().__init__() + w_attr_1, b_attr_1 = _init_weights_linear() + self.fc1 = nn.Linear(embed_dim, + int(embed_dim * mlp_ratio), + weight_attr=w_attr_1, + bias_attr=b_attr_1) + + w_attr_2, b_attr_2 = _init_weights_linear() + self.fc2 = nn.Linear(int(embed_dim * mlp_ratio), + embed_dim, + weight_attr=w_attr_2, + bias_attr=b_attr_2) + + self.act = nn.Silu() + self.dropout1 = nn.Dropout(dropout) + self.dropout2 = nn.Dropout(dropout) + + + def forward(self, x): + x = self.fc1(x) + x = self.act(x) + x = self.dropout1(x) + x = self.fc2(x) + x = self.dropout2(x) + return x + + +class Attention(nn.Layer): + def __init__(self, + embed_dim, + num_heads, + qkv_bias=True, + dropout=0., + attention_dropout=0.): + super().__init__() + self.num_heads = num_heads + self.attn_head_dim = int(embed_dim / self.num_heads) + self.all_head_dim = self.attn_head_dim * self.num_heads + + w_attr_1, b_attr_1 = 
_init_weights_linear() + self.qkv = nn.Linear(embed_dim, + self.all_head_dim * 3, # weights for q, k, v + weight_attr=w_attr_1, + bias_attr=b_attr_1 if qkv_bias else False) + + self.scales = self.attn_head_dim ** -0.5 + + w_attr_2, b_attr_2 = _init_weights_linear() + self.proj = nn.Linear(embed_dim, + embed_dim, + weight_attr=w_attr_2, + bias_attr=b_attr_2) + + self.attn_dropout = nn.Dropout(attention_dropout) + self.proj_dropout = nn.Dropout(dropout) + self.softmax = nn.Softmax(axis=-1) + + def transpose_multihead(self, x): + # in_shape: [batch_size, P, N, hd] + B, P, N, d = x.shape + x = x.reshape([B, P, N, self.num_heads, -1]) + x = x.transpose([0, 1, 3, 2, 4]) + # out_shape: [batch_size, P, num_heads, N, d] + return x + + def forward(self, x): + # [B, 2x2, 256, 96]: [B, P, N, d] + qkv = self.qkv(x).chunk(3, axis=-1) + q, k, v = map(self.transpose_multihead, qkv) + + attn = paddle.matmul(q, k, transpose_y=True) + attn = attn * self.scales + attn = self.softmax(attn) + attn = self.attn_dropout(attn) + # [batch_size, P, num_heads, N, N] + + z = paddle.matmul(attn, v) + # [batch_size, P, num_heads, N, d] + z = z.transpose([0, 1, 3, 2, 4]) + B, P, N, H, D = z.shape + z = z.reshape([B, P, N, H * D]) + z = self.proj(z) + z = self.proj_dropout(z) + return z + + +# DONE +class EncoderLayer(nn.Layer): + def __init__(self, + embed_dim, + num_heads=8, + qkv_bias=True, + mlp_ratio=2.0, + dropout=0., + attention_dropout=0., + droppath=0.): + super().__init__() + w_attr_1, b_attr_1 = _init_weights_layernorm() + w_attr_2, b_attr_2 = _init_weights_layernorm() + + self.attn_norm = nn.LayerNorm(embed_dim, weight_attr=w_attr_1, bias_attr=b_attr_1) + self.attn = Attention(embed_dim, num_heads, qkv_bias, attention_dropout, dropout) + self.drop_path = DropPath(droppath) if droppath > 0. 
else Identity() + self.mlp_norm = nn.LayerNorm(embed_dim, weight_attr=w_attr_2, bias_attr=b_attr_2) + self.mlp = Mlp(embed_dim, mlp_ratio, dropout) + + def forward(self, x): + h = x + x = self.attn_norm(x) + x = self.attn(x) + x = self.drop_path(x) + x = h + x + + h = x + x = self.mlp_norm(x) + x = self.mlp(x) + x = self.drop_path(x) + x = x + h + return x + + +# DONE +class Transformer(nn.Layer): + def __init__(self, + embed_dim, + num_heads, + depth, + qkv_bias=True, + mlp_ratio=2.0, + dropout=0., + attention_dropout=0., + droppath=0.): + super().__init__() + depth_decay = [x.item() for x in paddle.linspace(0, droppath, depth)] + + layer_list = [] + for i in range(depth): + layer_list.append(EncoderLayer(embed_dim, + num_heads, + qkv_bias, + mlp_ratio, + dropout, + attention_dropout, + droppath)) + self.layers = nn.LayerList(layer_list) + + w_attr_1, b_attr_1 = _init_weights_layernorm() + self.norm = nn.LayerNorm(embed_dim, + weight_attr=w_attr_1, + bias_attr=b_attr_1, + epsilon=1e-6) + + def forward(self, x): + for layer in self.layers: + x = layer(x) + out = self.norm(x) + return out + + +# DONE +class MobileV2Block(nn.Layer): + """Mobilenet v2 InvertedResidual block, hacked from torchvision""" + def __init__(self, inp, oup, stride=1, expansion=4): + super().__init__() + self.stride = stride + assert stride in [1, 2] + + hidden_dim = int(round(inp * expansion)) + self.use_res_connect = self.stride == 1 and inp == oup + + layers = [] + if expansion != 1: + layers.append(ConvNormAct(inp, hidden_dim, kernel_size=1)) + + layers.extend([ + # dw + ConvNormAct(hidden_dim, hidden_dim, stride=stride, groups=hidden_dim, padding=1), + # pw-linear + nn.Conv2D(hidden_dim, oup, 1, 1, 0, bias_attr=False), + nn.BatchNorm2D(oup), + ]) + + self.conv = nn.Sequential(*layers) + self.out_channels = oup + + def forward(self, x): + if self.use_res_connect: + return x + self.conv(x) + return self.conv(x) + + +# DONE +class MobileViTBlock(nn.Layer): + def __init__(self, + dim, + hidden_dim, + depth, + num_heads=8, + qkv_bias=True, + mlp_ratio=2.0, + dropout=0., + attention_dropout=0., + droppath=0., + patch_size=(2, 2)): + super().__init__() + self.patch_h, self.patch_w = patch_size + + # local representations + self.conv1 = ConvNormAct(dim, dim, padding=1) + self.conv2 = ConvNormAct(dim, hidden_dim, kernel_size=1) + + # global representations + self.transformer = Transformer(embed_dim=hidden_dim, + num_heads=num_heads, + depth=depth, + qkv_bias=qkv_bias, + mlp_ratio=mlp_ratio, + dropout=dropout, + attention_dropout=attention_dropout, + droppath=droppath) + + # fusion + self.conv3 = ConvNormAct(hidden_dim, dim, kernel_size=1) + # last conv-nxn, the input is concat of input tensor and conv3 output tensor + self.conv4 = ConvNormAct(2 * dim, dim, padding=1) + + def forward(self, x): + h = x + x = self.conv1(x) + x = self.conv2(x) + # [B, 96, 32, 32] + + B, C, H, W = x.shape + x = x.reshape([B, C, H//self.patch_h, self.patch_w, W//self.patch_w, self.patch_w]) + # [4, 96, 16, 2, 16, 2] + x = x.transpose([0, 1, 3, 5, 2, 4]) + # [4, 96, 2, 2, 16, 16] + x = x.reshape([B, C, (self.patch_h * self.patch_w), -1]) #[B, C, ws**2, n_windows**2] + x = x.transpose([0, 2, 3, 1]) #[B, ws**2, n_windows**2, C] + # [4, 4, 256, 96] + x = self.transformer(x) + x = x.reshape([B, self.patch_h, self.patch_w, H//self.patch_h, W//self.patch_w, C]) + x = x.transpose([0, 5, 3, 1, 4, 2]) + x = x.reshape([B, C, H, W]) + + x = self.conv3(x) + x = paddle.concat((h, x), axis=1) + x = self.conv4(x) + return x + + +# DONE +class 
MobileViT(nn.Layer): + def __init__(self, + in_channels=3, + dims=[16, 32, 48, 48, 48, 64, 80, 96, 384], # XS + hidden_dims=[96, 120, 144], # d: hidden dims in mobilevit block + num_classes=1000): + super().__init__() + # [B, 3, 256, 256] + self.conv3x3 = ConvNormAct(in_channels, dims[0], kernel_size=3, stride=2, padding=1) + # [B, 16, 128, 128] + self.mv2_block_1 = MobileV2Block(dims[0], dims[1]) + + # [B, 32, 128, 128] + self.mv2_block_2 = MobileV2Block(dims[1], dims[2], stride=2) + # [B, 48, 64, 64] + self.mv2_block_3 = MobileV2Block(dims[2], dims[3]) + # [B, 48, 64, 64] + self.mv2_block_4 = MobileV2Block(dims[3], dims[4]) # repeat = 2 + + # [B, 48, 64, 64] + self.mv2_block_5 = MobileV2Block(dims[4], dims[5], stride=2) + # [B, 64, 32, 32] + self.mvit_block_1 = MobileViTBlock(dims[5], hidden_dims[0], depth=2) + + # [B, 64, 32, 32] + self.mv2_block_6 = MobileV2Block(dims[5], dims[6], stride=2) + # [B, 80, 16, 16] + self.mvit_block_2 = MobileViTBlock(dims[6], hidden_dims[1], depth=4) + + # [B, 80, 16, 16] + self.mv2_block_7 = MobileV2Block(dims[6], dims[7], stride=2) + # [B, 96, 8, 8] + self.mvit_block_3 = MobileViTBlock(dims[7], hidden_dims[2], depth=3) + + # [B, 96, 8, 8] + self.conv1x1 = ConvNormAct(dims[7], dims[8], kernel_size=1) + + # [B, 384, 8, 8] + self.pool = nn.AdaptiveAvgPool2D(1) + # [B, 384, 1, 1] + self.linear = nn.Linear(dims[8], num_classes) + # [B, 1000] + + def forward(self, x): + x = self.conv3x3(x) + x = self.mv2_block_1(x) + + x = self.mv2_block_2(x) + x = self.mv2_block_3(x) + x = self.mv2_block_4(x) + + x = self.mv2_block_5(x) + x = self.mvit_block_1(x) + + x = self.mv2_block_6(x) + x = self.mvit_block_2(x) + + x = self.mv2_block_7(x) + x = self.mvit_block_3(x) + x = self.conv1x1(x) + + x = self.pool(x) + x = x.reshape(x.shape[:2]) + x = self.linear(x) + + return x + +# DONE +def build_mobile_vit(config): + """Build MobileViT by reading options in config object + Args: + config: config instance contains setting options + Returns: + model: MobileViT model + """ + model = MobileViT(in_channels=config.MODEL.IN_CHANNELS, + dims=config.MODEL.DIMS, # XS: [16, 32, 48, 48, 48, 64, 80, 96, 384] + hidden_dims=config.MODEL.HIDDEN_DIMS, # XS: [96, 120, 144], # d: hidden dims in mobilevit block + num_classes=config.MODEL.NUM_CLASSES) + return model + + + +#def main(): +# paddle.set_device('cpu') +# model = MobileViT() +# print(model) +# t = paddle.randn([4, 3, 256, 256]) +# out = model(t) +# print(out.shape) +# +#if __name__ == "__main__": +# main() diff --git a/image_classification/MobileViT/mobilevit.png b/image_classification/MobileViT/mobilevit.png new file mode 100644 index 00000000..64298b95 Binary files /dev/null and b/image_classification/MobileViT/mobilevit.png differ diff --git a/image_classification/MobileViT/model_ema.py b/image_classification/MobileViT/model_ema.py new file mode 100644 index 00000000..d12383b2 --- /dev/null +++ b/image_classification/MobileViT/model_ema.py @@ -0,0 +1,62 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +""" Implement the Exponential Model Averaging +This is paddle hack from: +https://github.com/rwightman/pytorch-image-models/blob/master/timm/utils/model_ema.py +""" + +import copy +from collections import OrderedDict +import paddle +import paddle.nn as nn + + +class ModelEma: + """Model Ema + A moving average is kept of model weights and buffers. + Note that for multiple gpu, ema must be defined after mode init, + but before DataParallel. + + Args: + model: nn.Layer, original modela with learnable params + decay: float, decay rate for each update, default: 0.999 + """ + def __init__(self, model, decay=0.999): + self.module = copy.deepcopy(model) + self.module.eval() + self.module.to('cpu') + self.decay = decay + + @paddle.no_grad() + def _update(self, model, update_fn): + # update ema model parameters by model parameters + for (_, ema_param), (_, model_param) in zip( + self.module.named_parameters(), model.named_parameters()): + ema_param.set_value(copy.deepcopy(update_fn(ema_param, model_param))) + + # update ema model buffers by model buffers + for (_, ema_buf), (_, model_buf) in zip( + self.module.named_buffers(), model.named_buffers()): + ema_buf.set_value(copy.deepcopy(update_fn(ema_buf, model_buf))) + + def update(self, model): + self._update(model, update_fn=lambda e, m: self.decay * e + (1 - self.decay) * m) + + def set(self, model): + self._update(model, update_fn=lambda e, m: m) + + def state_dict(self): + return self.module.state_dict() + diff --git a/image_classification/MobileViT/multi_scale_samper.py b/image_classification/MobileViT/multi_scale_samper.py new file mode 100644 index 00000000..957a152c --- /dev/null +++ b/image_classification/MobileViT/multi_scale_samper.py @@ -0,0 +1,75 @@ +from paddle.io import Sampler +import paddle.distributed as dist +import math +import random +import numpy as np + +class MultiScaleSamplerDDP(Sampler): + def __init__(self, base_im_w: int, base_im_h: int, base_batch_size: int, n_data_samples: int, + min_scale_mult: float = 0.5, max_scale_mult: float = 1.5, n_scales: int = 5, + is_training: bool = False): + # min. and max. spatial dimensions + min_im_w, max_im_w = int(base_im_w * min_scale_mult), int(base_im_w * max_scale_mult) + min_im_h, max_im_h = int(base_im_h * min_scale_mult), int(base_im_h * max_scale_mult) + + # Get the GPU and node related information + num_replicas =dist.get_world_size() + rank = dist.get_rank() + + # adjust the total samples to avoid batch dropping + num_samples_per_replica = int(math.ceil(n_data_samples * 1.0 / num_replicas)) + total_size = num_samples_per_replica * num_replicas + img_indices = [idx for idx in range(n_data_samples)] + assert len(img_indices) == total_size + + self.shuffle = False + if is_training: + # compute the spatial dimensions and corresponding batch size + width_dims = list(np.linspace(min_im_w, max_im_w, n_scales)) + height_dims = list(np.linspace(min_im_h, max_im_h, n_scales)) + # ImageNet models down-sample images by a factor of 32. + # Ensure that width and height dimensions are multiples are multiple of 32. 
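            # For example (assuming base_im_w = base_im_h = 256 with the default scale
            # multipliers 0.5/1.5 and n_scales = 5): np.linspace gives widths
            # [128, 192, 256, 320, 384]; flooring to a multiple of 32 leaves these
            # unchanged, while a value such as 200 would be snapped down to 192.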
+ width_dims = [(w // 32) * 32 for w in width_dims] + height_dims = [(h // 32) * 32 for h in height_dims] + + img_batch_pairs = list() + base_elements = base_im_w * base_im_h * base_batch_size + for (h, w) in zip(height_dims, width_dims): + batch_size = max(1, (base_elements / (h * w))) + img_batch_pairs.append((h, w, batch_size)) + self.img_batch_pairs = img_batch_pairs + self.shuffle = True + else: + self.img_batch_pairs = [(base_im_h , base_im_w , base_batch_size)] + + self.img_indices = img_indices + self.n_samples_per_replica = num_samples_per_replica + self.epoch = 0 + self.rank = rank + self.num_replicas = num_replicas + + def __iter__(self): + if self.shuffle: + random.seed(self.epoch) + random.shuffle(self.img_indices) + random.shuffle(self.img_batch_pairs) + indices_rank_i = self.img_indices[self.rank : len(self.img_indices) : self.num_replicas] + else: + indices_rank_i = self.img_indices[self.rank : len(self.img_indices) : self.num_replicas] + + start_index = 0 + while start_index < self.n_samples_per_replica: + curr_h, curr_w, curr_bsz = random.choice(self.img_batch_pairs) + + end_index = min(start_index + curr_bsz, self.n_samples_per_replica) + batch_ids = indices_rank_i[start_index:end_index] + n_batch_samples = len(batch_ids) + if n_batch_samples != curr_bsz: + batch_ids += indices_rank_i[:(curr_bsz - n_batch_samples)] + start_index += curr_bsz + + if len(batch_ids) > 0: + batch = [(curr_h, curr_w, b_id) for b_id in batch_ids] + yield batch + def set_epoch(self, epoch: int): + self.epoch = epoch \ No newline at end of file diff --git a/image_classification/MobileViT/random_erasing.py b/image_classification/MobileViT/random_erasing.py new file mode 100644 index 00000000..31eea465 --- /dev/null +++ b/image_classification/MobileViT/random_erasing.py @@ -0,0 +1,118 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Random Erasing for image tensor""" + +import random +import math +import paddle + + +def _get_pixels(per_pixel, rand_color, patch_size, dtype="float32"): + if per_pixel: + return paddle.normal(shape=patch_size).astype(dtype) + if rand_color: + return paddle.normal(shape=(patch_size[0], 1, 1)).astype(dtype) + return paddle.zeros((patch_size[0], 1, 1)).astype(dtype) + + +class RandomErasing(object): + """ + Args: + prob: probability of performing random erasing + min_area: Minimum percentage of erased area wrt input image area + max_area: Maximum percentage of erased area wrt input image area + min_aspect: Minimum aspect ratio of earsed area + max_aspect: Maximum aspect ratio of earsed area + mode: pixel color mode, in ['const', 'rand', 'pixel'] + 'const' - erase block is constant valued 0 for all channels + 'rand' - erase block is valued random color (same per-channel) + 'pixel' - erase block is vauled random color per pixel + min_count: Minimum # of ereasing blocks per image. + max_count: Maximum # of ereasing blocks per image. 
Area per box is scaled by count + per-image count is randomly chosen between min_count to max_count + """ + def __init__(self, prob=0.5, min_area=0.02, max_area=1/3, min_aspect=0.3, max_aspect=None, + mode='const', min_count=1, max_count=None, num_splits=0): + self.prob = prob + self.min_area = min_area + self.max_area = max_area + max_aspect = max_aspect or 1 / min_aspect + self.log_aspect_ratio = (math.log(min_aspect), math.log(max_aspect)) + self.min_count = min_count + self.max_count = max_count or min_count + self.num_splits = num_splits + mode = mode.lower() + self.rand_color = False + self.per_pixel = False + if mode == "rand": + self.rand_color = True + elif mode == "pixel": + self.per_pixel = True + else: + assert not mode or mode == "const" + + def _erase(self, img, chan, img_h, img_w, dtype): + if random.random() > self.prob: + return + area = img_h * img_w + count = self.min_count if self.min_count == self.max_count else \ + random.randint(self.min_count, self.max_count) + for _ in range(count): + for attempt in range(10): + target_area = random.uniform(self.min_area, self.max_area) * area / count + aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio)) + h = int(round(math.sqrt(target_area * aspect_ratio))) + w = int(round(math.sqrt(target_area / aspect_ratio))) + if w < img_w and h < img_h: + top = random.randint(0, img_h - h) + left = random.randint(0, img_w - w) + img[:, top:top+h, left:left+w] = _get_pixels( + self.per_pixel, self.rand_color, (chan, h, w), + dtype=dtype) + break + + def __call__(self, input): + if len(input.shape) == 3: + self._erase(input, *input.shape, input.dtype) + else: + batch_size, chan, img_h, img_w = input.shape + batch_start = batch_size // self.num_splits if self.num_splits > 1 else 0 + for i in range(batch_start, batch_size): + self._erase(input[i], chan, img_h, img_w, input.dtype) + return input + + + +#def main(): +# re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='rand') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='const') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='pixel') +# import PIL.Image as Image +# import numpy as np +# paddle.set_device('cpu') +# img = paddle.to_tensor(np.asarray(Image.open('./lenna.png'))).astype('float32') +# img = img / 255.0 +# img = paddle.transpose(img, [2, 0, 1]) +# new_img = re(img) +# new_img = new_img * 255.0 +# new_img = paddle.transpose(new_img, [1, 2, 0]) +# new_img = new_img.cpu().numpy() +# new_img = Image.fromarray(new_img.astype('uint8')) +# new_img.save('./res.png') +# +# +# +#if __name__ == "__main__": +# main() diff --git a/image_classification/MobileViT/run_eval_s_multi.sh b/image_classification/MobileViT/run_eval_s_multi.sh new file mode 100644 index 00000000..0743e7cd --- /dev/null +++ b/image_classification/MobileViT/run_eval_s_multi.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ +python main_multi_gpu.py \ +-cfg='./configs/mobilevit_s.yaml' \ +-dataset='imagenet2012' \ +-batch_size=64 \ +-eval \ +-data_path='/dataset/imagenet' \ +-pretrained='./output/train-20211103-21-27-27/MobileViT-Epoch-300-Loss-1.8248680274084845-EMA' diff --git a/image_classification/MobileViT/run_eval_xxs_multi.sh b/image_classification/MobileViT/run_eval_xxs_multi.sh new file mode 100644 index 00000000..7dd4b0f4 --- /dev/null +++ b/image_classification/MobileViT/run_eval_xxs_multi.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ +python main_multi_gpu.py \ +-cfg='./configs/mobilevit_xxs.yaml' \ +-dataset='imagenet2012' \ 
+-batch_size=128 \ +-eval \ +-data_path='/dataset/imagenet' \ +-pretrained='output/train-20211027-22-10-32/MobileViT-Epoch-300-Loss-2.4265256190596847-EMA' diff --git a/image_classification/MobileViT/run_train_s_multi.sh b/image_classification/MobileViT/run_train_s_multi.sh new file mode 100644 index 00000000..b1f3515d --- /dev/null +++ b/image_classification/MobileViT/run_train_s_multi.sh @@ -0,0 +1,7 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ +python main_multi_gpu.py \ +-cfg='./configs/mobilevit_s.yaml' \ +-dataset='imagenet2012' \ +-batch_size=32 \ +-data_path='/dataset/imagenet' \ +-amp diff --git a/image_classification/MobileViT/run_train_xxs.sh b/image_classification/MobileViT/run_train_xxs.sh new file mode 100644 index 00000000..f7fb514f --- /dev/null +++ b/image_classification/MobileViT/run_train_xxs.sh @@ -0,0 +1,7 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ +-cfg='./configs/mobilevit_xxs.yaml' \ +-dataset='imagenet2012' \ +-batch_size=4 \ +-data_path='/dataset/imagenet' \ +-amp diff --git a/image_classification/MobileViT/run_train_xxs_multi.sh b/image_classification/MobileViT/run_train_xxs_multi.sh new file mode 100644 index 00000000..85224ecb --- /dev/null +++ b/image_classification/MobileViT/run_train_xxs_multi.sh @@ -0,0 +1,7 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ +python main_multi_gpu.py \ +-cfg='./configs/mobilevit_xxs.yaml' \ +-dataset='imagenet2012' \ +-batch_size=128 \ +-data_path='/dataset/imagenet' \ +-amp diff --git a/image_classification/MobileViT/run_train_xxs_multi_resume.sh b/image_classification/MobileViT/run_train_xxs_multi_resume.sh new file mode 100644 index 00000000..0d8ccf9c --- /dev/null +++ b/image_classification/MobileViT/run_train_xxs_multi_resume.sh @@ -0,0 +1,9 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ +python main_multi_gpu.py \ +-cfg='./configs/mobilevit_xxs.yaml' \ +-dataset='imagenet2012' \ +-batch_size=128 \ +-data_path='/dataset/imagenet' \ +-resume='output/train-20211026-20-05-16/MobileViT-Epoch-109-Loss-2.623020797677231' \ +-last_epoch=109 \ +-amp diff --git a/image_classification/MobileViT/stat.py b/image_classification/MobileViT/stat.py new file mode 100644 index 00000000..0646ea84 --- /dev/null +++ b/image_classification/MobileViT/stat.py @@ -0,0 +1,65 @@ +import os +import glob +import paddle +from config import get_config +from mobile_vit import build_mobile_vit as build_model + +def count_gelu(layer, input, output): + activation_flops = 8 + x = input[0] + num = x.numel() + layer.total_ops += num * activation_flops + + +def count_softmax(layer, input, output): + softmax_flops = 5 # max/substract, exp, sum, divide + x = input[0] + num = x.numel() + layer.total_ops += num * softmax_flops + + +def count_layernorm(layer, input, output): + layer_norm_flops = 5 # get mean (sum), get variance (square and sum), scale(multiply) + x = input[0] + num = x.numel() + layer.total_ops += num * layer_norm_flops + + +for cfg in glob.glob('./configs/*.yaml'): + #cfg = './configs/pvtv2_b0.yaml' + #input_size = (1, 3, 512, 512) + #input_size = (1, 3, 448, 448) + #input_size = (1, 3, 384, 384) + input_size = (1, 3, 256, 256) + #input_size = (1, 3, 224, 224) + config = get_config(cfg) + model = build_model(config) + + custom_ops = {paddle.nn.GELU: count_gelu, + paddle.nn.LayerNorm: count_layernorm, + paddle.nn.Softmax: count_softmax, + } + print(os.path.basename(cfg)) + paddle.flops(model, + input_size=input_size, + custom_ops=custom_ops, + print_detail=False) + + +#for cfg in glob.glob('./configs/*.yaml'): +# #cfg = 
'./configs/swin_base_patch4_window7_224.yaml' +# input_size = (1, 3, int(cfg[-8:-5]), int(cfg[-8:-5])) +# config = get_config(cfg) +# model = build_model(config) +# +# +# custom_ops = {paddle.nn.GELU: count_gelu, +# paddle.nn.LayerNorm: count_layernorm, +# paddle.nn.Softmax: count_softmax, +# } +# print(os.path.basename(cfg)) +# paddle.flops(model, +# input_size=input_size, +# custom_ops=custom_ops, +# print_detail=False) +# print('-----------') diff --git a/image_classification/MobileViT/transforms.py b/image_classification/MobileViT/transforms.py new file mode 100644 index 00000000..5a046912 --- /dev/null +++ b/image_classification/MobileViT/transforms.py @@ -0,0 +1,14 @@ +import random +import paddle +import paddle.nn +import paddle.vision.transforms as T + + +class RandomHorizontalFlip(): + def __init__(self, p=0.5): + self.p = p + + def __call__(self, image): + if random.random() < self.p: + return T.hflip(image) + return image diff --git a/image_classification/MobileViT/utils.py b/image_classification/MobileViT/utils.py new file mode 100644 index 00000000..44800527 --- /dev/null +++ b/image_classification/MobileViT/utils.py @@ -0,0 +1,120 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""utils for ViT + +Contains AverageMeter for monitoring, get_exclude_from_decay_fn for training +and WarmupCosineScheduler for training + +""" + +import math +from paddle.optimizer.lr import LRScheduler + + +class AverageMeter(): + """ Meter for monitoring losses""" + def __init__(self): + self.avg = 0 + self.sum = 0 + self.cnt = 0 + self.reset() + + def reset(self): + """reset all values to zeros""" + self.avg = 0 + self.sum = 0 + self.cnt = 0 + + def update(self, val, n=1): + """update avg by val and n, where val is the avg of n values""" + self.sum += val * n + self.cnt += n + self.avg = self.sum / self.cnt + + + +def get_exclude_from_weight_decay_fn(exclude_list=[]): + """ Set params with no weight decay during the training + + For certain params, e.g., positional encoding in ViT, weight decay + may not needed during the learning, this method is used to find + these params. + + Args: + exclude_list: a list of params names which need to exclude + from weight decay. + Returns: + exclude_from_weight_decay_fn: a function returns True if param + will be excluded from weight decay + """ + if len(exclude_list) == 0: + exclude_from_weight_decay_fn = None + else: + def exclude_fn(param): + for name in exclude_list: + if param.endswith(name): + return False + return True + exclude_from_weight_decay_fn = exclude_fn + return exclude_from_weight_decay_fn + + +class WarmupCosineScheduler(LRScheduler): + """Warmup Cosine Scheduler + + First apply linear warmup, then apply cosine decay schedule. 
+ Linearly increase learning rate from "warmup_start_lr" to "start_lr" over "warmup_epochs" + Cosinely decrease learning rate from "start_lr" to "end_lr" over remaining + "total_epochs - warmup_epochs" + + Attributes: + learning_rate: the starting learning rate (without warmup), not used here! + warmup_start_lr: warmup starting learning rate + start_lr: the starting learning rate (without warmup) + end_lr: the ending learning rate after whole loop + warmup_epochs: # of epochs for warmup + total_epochs: # of total epochs (include warmup) + """ + def __init__(self, + learning_rate, + warmup_start_lr, + start_lr, + end_lr, + warmup_epochs, + total_epochs, + cycles=0.5, + last_epoch=-1, + verbose=False): + """init WarmupCosineScheduler """ + self.warmup_epochs = warmup_epochs + self.total_epochs = total_epochs + self.warmup_start_lr = warmup_start_lr + self.start_lr = start_lr + self.end_lr = end_lr + self.cycles = cycles + super(WarmupCosineScheduler, self).__init__(learning_rate, last_epoch, verbose) + + def get_lr(self): + """ return lr value """ + if self.last_epoch < self.warmup_epochs: + val = (self.start_lr - self.warmup_start_lr) * float( + self.last_epoch)/float(self.warmup_epochs) + self.warmup_start_lr + return val + + progress = float(self.last_epoch - self.warmup_epochs) / float( + max(1, self.total_epochs - self.warmup_epochs)) + val = max(0.0, 0.5 * (1. + math.cos(math.pi * float(self.cycles) * 2.0 * progress))) + val = max(0.0, val * (self.start_lr - self.end_lr) + self.end_lr) + return val diff --git a/image_classification/PVTv2/README.md b/image_classification/PVTv2/README.md index 925af848..e9eff0e8 100644 --- a/image_classification/PVTv2/README.md +++ b/image_classification/PVTv2/README.md @@ -14,19 +14,19 @@ This implementation is developed by [PaddleViT](https://github.com/BR-IDL/Paddle ### Update -Update (2021-08-11): Code is released and ported weights are uploaded. +- Update (2021-09-27): Model FLOPs and # params are uploaded. +- Update (2021-08-11): Code is released and ported weights are uploaded. 
## Models Zoo -| Model | Acc@1 | Acc@5 | Image Size | Crop_pct | Interpolation | Link | -|--------------------------------|-------|-------|------------|----------|---------------|--------------| -| pvtv2_b0 | 70.47 | 90.16 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1wkx4un6y7V87Rp_ZlD4_pV63QRst-1AE/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1mab4dOtBB-HsdzFJYrvgjA)(dxgb) | -| pvtv2_b1 | 78.70 | 94.49 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/11hqLxL2MTSnKPb-gp2eMZLAzT6q2UsmG/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1Ur0s4SEOxVqggmgq6AM-sQ)(2e5m) | -| pvtv2_b2 | 82.02 | 95.99 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1-KY6NbS3Y3gCaPaUam0v_Xlk1fT-N1Mz/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1FWx0QB7_8_ikrPIOlL7ung)(are2) | -| pvtv2_b3 | 83.14 | 96.47 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/16yYV8x7aKssGYmdE-YP99GMg4NKGR5j1/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1ge0rBsCqIcpIjrVxsrFhnw)(nc21) | -| pvtv2_b4 | 83.61 | 96.69 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1gvPdvDeq0VchOUuriTnnGUKh0N2lj-fA/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1VMSD_Kr_hduCZ5dxmDbLoA)(tthf) | -| pvtv2_b5 | 83.77 | 96.61 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1OHaHiHN_AjsGYBN2gxFcQCDhBbTvZ02g/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1ey4agxI2Nb0F6iaaX3zAbA)(9v6n) | -| pvtv2_b2_linear | 82.06 | 96.04 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1hC8wE_XanMPi0_y9apEBKzNc4acZW5Uy/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1IAhiiaJPe-Lg1Qjxp2p30w)(a4c8) | - +| Model | Acc@1 | Acc@5 | #Params | FLOPs | Image Size | Crop_pct | Interpolation | Link | +|-------------------------------|-------|-------|---------|--------|------------|----------|---------------|--------------| +| pvtv2_b0 | 70.47 | 90.16 | 3.7M | 0.6G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1wkx4un6y7V87Rp_ZlD4_pV63QRst-1AE/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1mab4dOtBB-HsdzFJYrvgjA)(dxgb) | +| pvtv2_b1 | 78.70 | 94.49 | 14.0M | 2.1G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/11hqLxL2MTSnKPb-gp2eMZLAzT6q2UsmG/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1Ur0s4SEOxVqggmgq6AM-sQ)(2e5m) | +| pvtv2_b2 | 82.02 | 95.99 | 25.4M | 4.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1-KY6NbS3Y3gCaPaUam0v_Xlk1fT-N1Mz/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1FWx0QB7_8_ikrPIOlL7ung)(are2) | +| pvtv2_b2_linear | 82.06 | 96.04 | 22.6M | 3.9G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1hC8wE_XanMPi0_y9apEBKzNc4acZW5Uy/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1IAhiiaJPe-Lg1Qjxp2p30w)(a4c8) | +| pvtv2_b3 | 83.14 | 96.47 | 45.2M | 6.8G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/16yYV8x7aKssGYmdE-YP99GMg4NKGR5j1/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1ge0rBsCqIcpIjrVxsrFhnw)(nc21) | +| pvtv2_b4 | 83.61 | 96.69 | 62.6M | 10.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1gvPdvDeq0VchOUuriTnnGUKh0N2lj-fA/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1VMSD_Kr_hduCZ5dxmDbLoA)(tthf) | +| pvtv2_b5 | 83.77 | 96.61 | 82.0M | 11.5G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1OHaHiHN_AjsGYBN2gxFcQCDhBbTvZ02g/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1ey4agxI2Nb0F6iaaX3zAbA)(9v6n) | > *The results are evaluated on ImageNet2012 
validation set. ## Notebooks We provide a few notebooks in aistudio to help you get started: @@ -69,8 +69,8 @@ from pvtv2 import build_pvtv2 as build_model config = get_config('./configs/pvtv2_b0.yaml') # build model model = build_model(config) -# load pretrained weights, .pdparams is NOT needed -model_state_dict = paddle.load('./pvtv2_b0') +# load pretrained weights +model_state_dict = paddle.load('./pvtv2_b0.pdparams') model.set_dict(model_state_dict) ``` @@ -83,12 +83,12 @@ or ```shell CUDA_VISIBLE_DEVICES=0 \ python main_single_gpu.py \ - -cfg='./configs/pvtv2_b0.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/pvtv2_b0.yaml \ + -dataset=imagenet2012 \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/val \ -eval \ - -pretrained='./pvtv2_b0' + -pretrained=./path/to/pretrained/model/pvtv2_b0 # .pdparams is NOT needed ```
@@ -105,12 +105,12 @@ or ```shell CUDA_VISIBLE_DEVICES=0,1,2,3 \ python main_multi_gpu.py \ - -cfg='./configs/pvtv2_b0.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/pvtv2_b0.yaml \ + -dataset=imagenet2012 \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/val \ -eval \ - -pretrained='./pvtv2_b0' + -pretrained=/path/to/pretrained/model/pvtv2_b0 # .pdparams is NOT needed ```
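Internally, the evaluation scripts accumulate per-batch metrics with the `AverageMeter` from `utils.py`; the snippet below is an illustrative, single-process sketch of that pattern, not the exact code in `main_multi_gpu.py`:

```python
# Illustrative validation loop using AverageMeter (simplified, single process).
import paddle
import paddle.nn.functional as F
from utils import AverageMeter

def validate(dataloader, model):
    acc_meter = AverageMeter()
    model.eval()
    with paddle.no_grad():
        for images, labels in dataloader:
            logits = model(images)
            acc = paddle.metric.accuracy(F.softmax(logits), labels.unsqueeze(1))
            acc_meter.update(float(acc), n=images.shape[0])  # weight by batch size
    return acc_meter.avg
```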
@@ -125,10 +125,10 @@ or ```shell CUDA_VISIBLE_DEVICES=0 \ python main_single_gpu.py \ - -cfg='./configs/pvtv2_b0.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/pvtv2_b0.yaml \ + -dataset=imagenet2012 \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/train ```
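During training, the new `mixup.py` and `losses.py` modules are combined roughly as follows (a simplified sketch of one training step; `config`, `model`, `train_loader`, and `optimizer` are assumed to be already built):

```python
# Simplified training-step sketch with mixup/cutmix and soft-target loss.
from mixup import Mixup
from losses import SoftTargetCrossEntropyLoss

mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA,
                 cutmix_alpha=config.TRAIN.CUTMIX_ALPHA,
                 cutmix_minmax=config.TRAIN.CUTMIX_MINMAX,
                 prob=config.TRAIN.MIXUP_PROB,
                 switch_prob=config.TRAIN.MIXUP_SWITCH_PROB,
                 mode=config.TRAIN.MIXUP_MODE,
                 label_smoothing=config.TRAIN.SMOOTHING,
                 num_classes=config.MODEL.NUM_CLASSES)
criterion = SoftTargetCrossEntropyLoss()

for images, labels in train_loader:
    images, soft_labels = mixup_fn(images, labels)  # labels -> [N, num_classes] soft targets
    loss = criterion(model(images), soft_labels)
    loss.backward()
    optimizer.step()
    optimizer.clear_grad()
```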
@@ -145,10 +145,10 @@ or ```shell CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ python main_multi_gpu.py \ - -cfg='./configs/pvtv2_b0.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/pvtv2_b0.yaml \ + -dataset=imagenet2012 \ -batch_size=32 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/train ```
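The learning-rate schedule used by these training commands is the `WarmupCosineScheduler` from `utils.py`: linear warmup for `TRAIN.WARMUP_EPOCHS`, then cosine decay. With the default PVTv2 hyper-parameters it behaves roughly as in this illustrative sketch:

```python
# Illustrative use of WarmupCosineScheduler with the default PVTv2 settings.
from utils import WarmupCosineScheduler

scheduler = WarmupCosineScheduler(learning_rate=5e-4,    # TRAIN.BASE_LR
                                  warmup_start_lr=1e-6,  # TRAIN.WARMUP_START_LR
                                  start_lr=5e-4,
                                  end_lr=1e-5,           # TRAIN.END_LR
                                  warmup_epochs=5,       # TRAIN.WARMUP_EPOCHS
                                  total_epochs=300)      # TRAIN.NUM_EPOCHS

for epoch in range(300):
    lr = scheduler.get_lr()  # ~1e-6 at epoch 0, 5e-4 right after warmup, ~1e-5 at the end
    # ... train one epoch with this lr ...
    scheduler.step()
```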
diff --git a/image_classification/PVTv2/__init__.py b/image_classification/PVTv2/__init__.py new file mode 100644 index 00000000..e2cbd538 --- /dev/null +++ b/image_classification/PVTv2/__init__.py @@ -0,0 +1 @@ +#init diff --git a/image_classification/PVTv2/augment.py b/image_classification/PVTv2/augment.py new file mode 100644 index 00000000..7a7f081c --- /dev/null +++ b/image_classification/PVTv2/augment.py @@ -0,0 +1,285 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Augmentation""" +""" Rand Augmentation """ +# reference: RandAugment: Practical automated data augmentation with a reduced search space +# https://arxiv.org/abs/1909.13719 + +""" Auto Augmentation """ +# reference: AutoAugment: Learning Augmentation Policies from Data +# https://arxiv.org/abs/1805.09501 + +import random +import numpy as np +from PIL import Image, ImageEnhance, ImageOps + + +def auto_augment_policy_original(): + """25 types of augment policies in original paper""" + policy = [ + [('Posterize', 0.4, 8), ('Rotate', 0.6, 9)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + [('Posterize', 0.6, 7), ('Posterize', 0.6, 6)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Equalize', 0.4, 4), ('Rotate', 0.8, 8)], + [('Solarize', 0.6, 3), ('Equalize', 0.6, 7)], + [('Posterize', 0.8, 5), ('Equalize', 1.0, 2)], + [('Rotate', 0.2, 3), ('Solarize', 0.6, 8)], + [('Equalize', 0.6, 8), ('Posterize', 0.4, 6)], + [('Rotate', 0.8, 8), ('Color', 0.4, 0)], + [('Rotate', 0.4, 9), ('Equalize', 0.6, 2)], + [('Equalize', 0.0, 7), ('Equalize', 0.8, 8)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Rotate', 0.8, 8), ('Color', 1.0, 2)], + [('Color', 0.8, 8), ('Solarize', 0.8, 7)], + [('Sharpness', 0.4, 7), ('Invert', 0.6, 8)], + [('ShearX', 0.6, 5), ('Equalize', 1.0, 9)], + [('Color', 0.4, 0), ('Equalize', 0.6, 3)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + ] + policy = [[SubPolicy(*args) for args in subpolicy] for subpolicy in policy] + return policy + + +def rand_augment_policy_original(magnitude_idx=9): + """ + 14 types of augment policies in original paper + Args: + magnitude_idx: M + """ + policy = [ + ('Posterize', 1, magnitude_idx), ('Rotate', 1, magnitude_idx), + ('Solarize', 1, magnitude_idx), ('AutoContrast', 1, magnitude_idx), + ('Equalize', 1, magnitude_idx), ('Contrast', 1, magnitude_idx), + ('Color', 1, magnitude_idx), ('Invert', 1, magnitude_idx), + ('Sharpness', 1, magnitude_idx), ('Brightness', 1, magnitude_idx), + ('ShearX', 1, magnitude_idx), ('ShearY', 1, magnitude_idx), + ('TranslateX', 1, magnitude_idx), ('TranslateY', 1, magnitude_idx), + ] + policy = [SubPolicy(*args) for args in policy] + return policy + + +class AutoAugment(): + """Auto Augment + 
Randomly choose a tuple of augment ops from a list of policy + Then apply the tuple of augment ops to input image + + Examples: + policy = auto_augment_policy_original() + augment = AutoAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy): + self.policy = policy + + def __call__(self, image, policy_idx=None): + if policy_idx is None: + policy_idx = random.randint(0, len(self.policy) - 1) + + sub_policy = self.policy[policy_idx] + for op in sub_policy: + image = op(image) + return image + + +class RandAugment(): + """Rand Augment + Randomly choose N augment ops from a list of K policies + Then apply the N ops to input image + + Examples: + policy = rand_augment_policy_original(magnitude_idx) + augment = RandAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy, num_layers=2): + """ + Args: + policy: list of SubPolicy + num_layers: int + """ + self.policy = policy + self.num_layers = num_layers + + def __call__(self, image): + selected_idx = np.random.choice(len(self.policy), self.num_layers) + + for policy_idx in selected_idx: + sub_policy = self.policy[policy_idx] + image = sub_policy(image) + return image + + +class SubPolicy: + """Subpolicy + Read augment name and magnitude, apply augment with probability + Args: + op_name: str, augment operation name + prob: float, if prob > random prob, apply augment + magnitude_idx: int, index of magnitude in preset magnitude ranges + """ + + def __init__(self, op_name, prob, magnitude_idx): + # ranges of operations' magnitude + ranges = { + 'ShearX': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'ShearY': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'TranslateX': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'TranslateY': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'Rotate': np.linspace(0, 30, 10), # [-30, 30] (by random negative) + 'Color': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Posterize': np.round(np.linspace(8, 4, 10), 0).astype(np.int), # [0, 4] + 'Solarize': np.linspace(256, 0, 10), # [0, 256] + 'Contrast': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Sharpness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Brightness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'AutoContrast': [0] * 10, # no range + 'Equalize': [0] * 10, # no range + 'Invert': [0] * 10, # no range + } + + # augmentation operations + # Lambda is not pickleable for DDP + # image_ops = { + # 'ShearX': lambda image, magnitude: shear_x(image, magnitude), + # 'ShearY': lambda image, magnitude: shear_y(image, magnitude), + # 'TranslateX': lambda image, magnitude: translate_x(image, magnitude), + # 'TranslateY': lambda image, magnitude: translate_y(image, magnitude), + # 'Rotate': lambda image, magnitude: rotate(image, magnitude), + # 'AutoContrast': lambda image, magnitude: auto_contrast(image, magnitude), + # 'Invert': lambda image, magnitude: invert(image, magnitude), + # 'Equalize': lambda image, magnitude: equalize(image, magnitude), + # 'Solarize': lambda image, magnitude: solarize(image, magnitude), + # 'Posterize': lambda image, magnitude: posterize(image, magnitude), + # 'Contrast': lambda image, magnitude: contrast(image, magnitude), + # 'Color': lambda image, magnitude: color(image, magnitude), + # 'Brightness': lambda image, magnitude: brightness(image, magnitude), + # 'Sharpness': lambda image, magnitude: sharpness(image, magnitude), + # } + 
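        # Worked example: SubPolicy('Rotate', 0.8, 8) resolves to ranges['Rotate'][8]
        # = 30 * 8 / 9 = 26.67 degrees and applies the rotation with probability 0.8.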
image_ops = { + 'ShearX': shear_x, + 'ShearY': shear_y, + 'TranslateX': translate_x_relative, + 'TranslateY': translate_y_relative, + 'Rotate': rotate, + 'AutoContrast': auto_contrast, + 'Invert': invert, + 'Equalize': equalize, + 'Solarize': solarize, + 'Posterize': posterize, + 'Contrast': contrast, + 'Color': color, + 'Brightness': brightness, + 'Sharpness': sharpness, + } + + self.prob = prob + self.magnitude = ranges[op_name][magnitude_idx] + self.op = image_ops[op_name] + + def __call__(self, image): + if self.prob > random.random(): + image = self.op(image, self.magnitude) + return image + + +# PIL Image transforms +# https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.transform +def shear_x(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, factor, 0, 0, 1, 0), fillcolor=fillcolor) + + +def shear_y(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, factor, 1, 0), fillcolor=fillcolor) + + +def translate_x_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, pixels, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, pixels), fillcolor=fillcolor) + + +def translate_x_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, magnitude, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, magnitude), fillcolor=fillcolor) + + +def rotate(image, magnitude): + rot = image.convert("RGBA").rotate(magnitude) + return Image.composite(rot, + Image.new('RGBA', rot.size, (128,) * 4), + rot).convert(image.mode) + + +def auto_contrast(image, magnitude=None): + return ImageOps.autocontrast(image) + + +def invert(image, magnitude=None): + return ImageOps.invert(image) + + +def equalize(image, magnitude=None): + return ImageOps.equalize(image) + + +def solarize(image, magnitude): + return ImageOps.solarize(image, magnitude) + + +def posterize(image, magnitude): + return ImageOps.posterize(image, magnitude) + + +def contrast(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Contrast(image).enhance(1 + magnitude) + + +def color(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Color(image).enhance(1 + magnitude) + + +def brightness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Brightness(image).enhance(1 + magnitude) + + +def sharpness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Sharpness(image).enhance(1 + magnitude) + diff --git a/image_classification/PVTv2/config.py b/image_classification/PVTv2/config.py index d63aea34..f816565c 
100644 --- a/image_classification/PVTv2/config.py +++ b/image_classification/PVTv2/config.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -29,24 +29,26 @@ # data settings _C.DATA = CN() -_C.DATA.BATCH_SIZE = 4 #1024 batch_size for single GPU -_C.DATA.BATCH_SIZE_EVAL = 4 #1024 batch_size for single GPU +_C.DATA.BATCH_SIZE = 8 #1024 batch_size for single GPU +_C.DATA.BATCH_SIZE_EVAL = 8 #1024 batch_size for single GPU _C.DATA.DATA_PATH = '/dataset/imagenet/' # path to dataset _C.DATA.DATASET = 'imagenet2012' # dataset name _C.DATA.IMAGE_SIZE = 224 # input image size _C.DATA.CROP_PCT = 0.875 # input image scale ratio, scale is applied before centercrop in eval mode -_C.DATA.NUM_WORKERS = 2 # number of data loading threads +_C.DATA.NUM_WORKERS = 4 # number of data loading threads +_C.DATA.IMAGENET_MEAN = [0.485, 0.456, 0.406] # [0.5, 0.5, 0.5] +_C.DATA.IMAGENET_STD = [0.229, 0.224, 0.225] # [0.5, 0.5, 0.5] # model settings _C.MODEL = CN() _C.MODEL.TYPE = 'PVTv2' -_C.MODEL.NAME = 'pvtv2_tiny_224' +_C.MODEL.NAME = 'pvtv2_b0' _C.MODEL.RESUME = None _C.MODEL.PRETRAINED = None _C.MODEL.NUM_CLASSES = 1000 _C.MODEL.DROPOUT = 0.0 +_C.MODEL.DROPPATH = 0.1 _C.MODEL.ATTENTION_DROPOUT = 0.0 -_C.MODEL.DROP_PATH = 0.1 # transformer settings _C.MODEL.TRANS = CN() @@ -59,19 +61,22 @@ _C.MODEL.TRANS.SR_RATIO = [8, 4, 2, 1] _C.MODEL.TRANS.QKV_BIAS = True _C.MODEL.TRANS.QK_SCALE = None -_C.MODEL.TRANS.LINEAR = False +_C.MODEL.TRANS.LINEAR = None # training settings _C.TRAIN = CN() _C.TRAIN.LAST_EPOCH = 0 _C.TRAIN.NUM_EPOCHS = 300 -_C.TRAIN.WARMUP_EPOCHS = 20 +_C.TRAIN.WARMUP_EPOCHS = 5 _C.TRAIN.WEIGHT_DECAY = 0.05 -_C.TRAIN.BASE_LR = 0.001 -_C.TRAIN.WARMUP_START_LR = 0.0 -_C.TRAIN.END_LR = 0.0 -_C.TRAIN.GRAD_CLIP = 1.0 -_C.TRAIN.ACCUM_ITER = 2 +_C.TRAIN.BASE_LR = 0.0005 +_C.TRAIN.WARMUP_START_LR = 1e-6 +_C.TRAIN.END_LR = 1e-5 +_C.TRAIN.GRAD_CLIP = None +_C.TRAIN.ACCUM_ITER = 1 +_C.TRAIN.LINEAR_SCALED_LR = None +_C.TRAIN.MODEL_EMA = False +_C.TRAIN.MODEL_EMA_DECAY = 0.99992 _C.TRAIN.LR_SCHEDULER = CN() _C.TRAIN.LR_SCHEDULER.NAME = 'warmupcosine' @@ -80,20 +85,38 @@ _C.TRAIN.LR_SCHEDULER.DECAY_RATE = 0.1 # only used in StepLRScheduler _C.TRAIN.OPTIMIZER = CN() -_C.TRAIN.OPTIMIZER.NAME = 'SGD' +_C.TRAIN.OPTIMIZER.NAME = 'AdamW' _C.TRAIN.OPTIMIZER.EPS = 1e-8 -_C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999) +_C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999) # for adamW _C.TRAIN.OPTIMIZER.MOMENTUM = 0.9 +# train augmentation +_C.TRAIN.MIXUP_ALPHA = 0.8 +_C.TRAIN.CUTMIX_ALPHA = 1.0 +_C.TRAIN.CUTMIX_MINMAX = None +_C.TRAIN.MIXUP_PROB = 1.0 +_C.TRAIN.MIXUP_SWITCH_PROB = 0.5 +_C.TRAIN.MIXUP_MODE = 'batch' + +_C.TRAIN.SMOOTHING = 0.1 +_C.TRAIN.COLOR_JITTER = 0.4 +_C.TRAIN.AUTO_AUGMENT = False #'rand-m9-mstd0.5-inc1' +_C.TRAIN.RAND_AUGMENT = True + +_C.TRAIN.RANDOM_ERASE_PROB = 0.25 +_C.TRAIN.RANDOM_ERASE_MODE = 'pixel' +_C.TRAIN.RANDOM_ERASE_COUNT = 1 +_C.TRAIN.RANDOM_ERASE_SPLIT = False # misc _C.SAVE = "./output" _C.TAG = "default" -_C.SAVE_FREQ = 20 # freq to save chpt +_C.SAVE_FREQ = 1 # freq to save chpt _C.REPORT_FREQ = 50 # freq to logging info -_C.VALIDATE_FREQ = 20 # freq to do validation +_C.VALIDATE_FREQ = 10 # freq to do validation _C.SEED = 0 _C.EVAL = False # run evaluation only +_C.AMP = False _C.LOCAL_RANK = 0 _C.NGPUS = -1 @@ -111,7 +134,6 @@ def _update_config_from_file(config, cfg_file): 
config.merge_from_file(cfg_file) config.freeze() - def update_config(config, args): """Update config by ArgumentParser Args: @@ -128,8 +150,12 @@ def update_config(config, args): config.DATA.BATCH_SIZE = args.batch_size if args.image_size: config.DATA.IMAGE_SIZE = args.image_size + if args.num_classes: + config.MODEL.NUM_CLASSES = args.num_classes if args.data_path: config.DATA.DATA_PATH = args.data_path + if args.output is not None: + config.SAVE = args.output if args.ngpus: config.NGPUS = args.ngpus if args.eval: @@ -141,6 +167,11 @@ def update_config(config, args): config.MODEL.RESUME = args.resume if args.last_epoch: config.TRAIN.LAST_EPOCH = args.last_epoch + if args.amp: # only during training + if config.EVAL is True: + config.AMP = False + else: + config.AMP = True #config.freeze() return config diff --git a/image_classification/PVTv2/configs/pvtv2_b0.yaml b/image_classification/PVTv2/configs/pvtv2_b0.yaml index c8854b95..69ab355c 100644 --- a/image_classification/PVTv2/configs/pvtv2_b0.yaml +++ b/image_classification/PVTv2/configs/pvtv2_b0.yaml @@ -12,7 +12,7 @@ MODEL: MLP_RATIO: [8, 8, 4, 4] SR_RATIO: [8, 4, 2, 1] QKV_BIAS: True - DROP_PATH: 0.1 + DROPPATH: 0.1 TRAIN: GRAD_CLIP: None diff --git a/image_classification/PVTv2/configs/pvtv2_b1.yaml b/image_classification/PVTv2/configs/pvtv2_b1.yaml index 95135935..69a7a1ba 100644 --- a/image_classification/PVTv2/configs/pvtv2_b1.yaml +++ b/image_classification/PVTv2/configs/pvtv2_b1.yaml @@ -12,7 +12,7 @@ MODEL: MLP_RATIO: [8, 8, 4, 4] SR_RATIO: [8, 4, 2, 1] QKV_BIAS: True - DROP_PATH: 0.1 + DROPPATH: 0.1 TRAIN: GRAD_CLIP: None diff --git a/image_classification/PVTv2/configs/pvtv2_b2.yaml b/image_classification/PVTv2/configs/pvtv2_b2.yaml index 5102f3d3..b6871317 100644 --- a/image_classification/PVTv2/configs/pvtv2_b2.yaml +++ b/image_classification/PVTv2/configs/pvtv2_b2.yaml @@ -12,7 +12,7 @@ MODEL: MLP_RATIO: [8, 8, 4, 4] SR_RATIO: [8, 4, 2, 1] QKV_BIAS: True - DROP_PATH: 0.1 + DROPPATH: 0.1 TRAIN: GRAD_CLIP: None diff --git a/image_classification/PVTv2/configs/pvtv2_b2_linear.yaml b/image_classification/PVTv2/configs/pvtv2_b2_linear.yaml index 10e8384c..82bcd1b3 100644 --- a/image_classification/PVTv2/configs/pvtv2_b2_linear.yaml +++ b/image_classification/PVTv2/configs/pvtv2_b2_linear.yaml @@ -13,7 +13,7 @@ MODEL: SR_RATIO: [8, 4, 2, 1] LINEAR: True QKV_BIAS: True - DROP_PATH: 0.1 + DROPPATH: 0.1 TRAIN: GRAD_CLIP: None diff --git a/image_classification/PVTv2/configs/pvtv2_b3.yaml b/image_classification/PVTv2/configs/pvtv2_b3.yaml index 823a1889..75a21f47 100644 --- a/image_classification/PVTv2/configs/pvtv2_b3.yaml +++ b/image_classification/PVTv2/configs/pvtv2_b3.yaml @@ -12,7 +12,7 @@ MODEL: MLP_RATIO: [8, 8, 4, 4] SR_RATIO: [8, 4, 2, 1] QKV_BIAS: True - DROP_PATH: 0.3 + DROPPATH: 0.3 TRAIN: GRAD_CLIP: 1.0 diff --git a/image_classification/PVTv2/configs/pvtv2_b4.yaml b/image_classification/PVTv2/configs/pvtv2_b4.yaml index f8f3472e..ce0aef13 100644 --- a/image_classification/PVTv2/configs/pvtv2_b4.yaml +++ b/image_classification/PVTv2/configs/pvtv2_b4.yaml @@ -12,7 +12,7 @@ MODEL: MLP_RATIO: [8, 8, 4, 4] SR_RATIO: [8, 4, 2, 1] QKV_BIAS: True - DROP_PATH: 0.3 + DROPPATH: 0.3 TRAIN: GRAD_CLIP: 1.0 diff --git a/image_classification/PVTv2/configs/pvtv2_b5.yaml b/image_classification/PVTv2/configs/pvtv2_b5.yaml index fea21eb1..0c2a9766 100644 --- a/image_classification/PVTv2/configs/pvtv2_b5.yaml +++ b/image_classification/PVTv2/configs/pvtv2_b5.yaml @@ -12,7 +12,7 @@ MODEL: MLP_RATIO: [4, 4, 4, 4] SR_RATIO: [8, 4, 2, 1] QKV_BIAS: 
True - DROP_PATH: 0.3 + DROPPATH: 0.3 TRAIN: GRAD_CLIP: 1.0 diff --git a/image_classification/PVTv2/datasets.py b/image_classification/PVTv2/datasets.py index 10ba78fe..23c0b1f3 100644 --- a/image_classification/PVTv2/datasets.py +++ b/image_classification/PVTv2/datasets.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -13,14 +13,26 @@ # limitations under the License. """ -Dataset related classes and methods for PvTv2 training and validation +Dataset related classes and methods for ViT training and validation Cifar10, Cifar100 and ImageNet2012 are supported """ import os import math -from paddle.io import Dataset, DataLoader, DistributedBatchSampler -from paddle.vision import transforms, datasets, image_load +from PIL import Image +from paddle.io import Dataset +from paddle.io import DataLoader +from paddle.io import DistributedBatchSampler +from paddle.vision import transforms +from paddle.vision import datasets +from paddle.vision import image_load +from augment import auto_augment_policy_original +from augment import AutoAugment +from augment import rand_augment_policy_original +from augment import RandAugment +from transforms import RandomHorizontalFlip +from random_erasing import RandomErasing + class ImageNet2012Dataset(Dataset): """Build ImageNet2012 dataset @@ -60,7 +72,7 @@ def __len__(self): return len(self.label_list) def __getitem__(self, index): - data = image_load(self.img_path_list[index]).convert('RGB') + data = Image.open(self.img_path_list[index]).convert('RGB') data = self.transform(data) label = self.label_list[index] @@ -71,8 +83,7 @@ def get_train_transforms(config): """ Get training transforms For training, a RandomResizedCrop is applied, then normalization is applied with - [0.485, 0.456, 0.406], mean and [0.229, 0.224, 0.225] std. - The input pixel values must be rescaled to [0, 1.] + [0.5, 0.5, 0.5] mean and std. The input pixel values must be rescaled to [0, 1.] 
Outputs is converted to tensor Args: @@ -81,12 +92,36 @@ def get_train_transforms(config): transforms_train: training transforms """ - transforms_train = transforms.Compose([ + aug_op_list = [] + # STEP1: random crop and resize + aug_op_list.append( transforms.RandomResizedCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE), - scale=(0.05, 1.0)), - transforms.ToTensor(), - transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]), - ]) + scale=(0.05, 1.0), interpolation='bicubic')) + # STEP2: auto_augment or color jitter + if config.TRAIN.AUTO_AUGMENT: + policy = auto_augment_policy_original() + auto_augment = AutoAugment(policy) + aug_op_list.append(auto_augment) + elif config.TRAIN.RAND_AUGMENT: + policy = rand_augment_policy_original() + rand_augment = RandAugment(policy) + aug_op_list.append(rand_augment) + else: + jitter = (float(config.TRAIN.COLOR_JITTER),) * 3 + aug_op_list.append(transforms.ColorJitter(*jitter)) + # STEP3: other ops + aug_op_list.append(transforms.ToTensor()) + aug_op_list.append(transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, + std=config.DATA.IMAGENET_STD)) + # STEP4: random erasing + if config.TRAIN.RANDOM_ERASE_PROB > 0.: + random_erasing = RandomErasing(prob=config.TRAIN.RANDOM_ERASE_PROB, + mode=config.TRAIN.RANDOM_ERASE_MODE, + max_count=config.TRAIN.RANDOM_ERASE_COUNT, + num_splits=config.TRAIN.RANDOM_ERASE_SPLIT) + aug_op_list.append(random_erasing) + # Final: compose transforms and return + transforms_train = transforms.Compose(aug_op_list) return transforms_train @@ -94,7 +129,7 @@ def get_val_transforms(config): """ Get training transforms For validation, image is first Resize then CenterCrop to image_size. - Then normalization is applied with [0.485, 0.456, 0.406] mean and [0.229, 0.224, 0.225] std. + Then normalization is applied with [0.5, 0.5, 0.5] mean and std. The input pixel values must be rescaled to [0, 1.] Outputs is converted to tensor @@ -109,7 +144,7 @@ def get_val_transforms(config): transforms.Resize(scale_size, interpolation='bicubic'), transforms.CenterCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE)), transforms.ToTensor(), - transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]), + transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD), ]) return transforms_val diff --git a/image_classification/PVTv2/losses.py b/image_classification/PVTv2/losses.py new file mode 100644 index 00000000..082467a3 --- /dev/null +++ b/image_classification/PVTv2/losses.py @@ -0,0 +1,123 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +""" Implement Loss functions """ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class LabelSmoothingCrossEntropyLoss(nn.Layer): + """ cross entropy loss for label smoothing + Args: + smoothing: float, smoothing rate + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, target label with shape [N] + Return: + loss: float, cross entropy loss value + """ + def __init__(self, smoothing=0.1): + super().__init__() + assert 0 <= smoothing < 1.0 + self.smoothing = smoothing + self.confidence = 1 - smoothing + + def forward(self, x, target): + log_probs = F.log_softmax(x) # [N, num_classes] + # target_index is used to get prob for each of the N samples + target_index = paddle.zeros([x.shape[0], 2], dtype='int64') # [N, 2] + target_index[:, 0] = paddle.arange(x.shape[0]) + target_index[:, 1] = target + + nll_loss = -log_probs.gather_nd(index=target_index) # index: [N] + smooth_loss = -log_probs.mean(axis=-1) + loss = self.confidence * nll_loss + self.smoothing * smooth_loss + return loss.mean() + + +class SoftTargetCrossEntropyLoss(nn.Layer): + """ cross entropy loss for soft target + Args: + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, soft target with shape [N, num_classes] + Returns: + loss: float, the mean loss value + """ + def __init__(self): + super().__init__() + + def forward(self, x, target): + loss = paddle.sum(-target * F.log_softmax(x, axis=-1), axis=-1) + return loss.mean() + + +class DistillationLoss(nn.Layer): + """Distillation loss function + This layer includes the orginal loss (criterion) and a extra + distillation loss (criterion), which computes the loss with + different type options, between current model and + a teacher model as its supervision. 
+class DistillationLoss(nn.Layer):
+    """Distillation loss function
+    This layer includes the original loss (criterion) and an extra
+    distillation loss (criterion), which computes the loss with
+    different type options, between the current model and
+    a teacher model as its supervision.
+
+    Args:
+        base_criterion: nn.Layer, the original criterion
+        teacher_model: nn.Layer, the teacher model as supervision
+        distillation_type: str, one of ['none', 'soft', 'hard']
+        alpha: float, ratio of base loss (* (1-alpha))
+            and distillation loss (* alpha)
+        tau: float, temperature in distillation
+    """
+    def __init__(self,
+                 base_criterion,
+                 teacher_model,
+                 distillation_type,
+                 alpha,
+                 tau):
+        super().__init__()
+        assert distillation_type in ['none', 'soft', 'hard']
+        self.base_criterion = base_criterion
+        self.teacher_model = teacher_model
+        self.type = distillation_type
+        self.alpha = alpha
+        self.tau = tau
+
+    def forward(self, inputs, outputs, targets):
+        """
+        Args:
+            inputs: tensor, the original model inputs
+            outputs: tensor, the outputs of the model
+            outputs_kd: tensor, the distillation outputs of the model,
+                this is usually obtained by a separate branch
+                in the last layer of the model
+            targets: tensor, the labels for the base criterion
+        """
+        outputs, outputs_kd = outputs[0], outputs[1]
+        base_loss = self.base_criterion(outputs, targets)
+        if self.type == 'none':
+            return base_loss
+
+        with paddle.no_grad():
+            teacher_outputs = self.teacher_model(inputs)
+
+        if self.type == 'soft':
+            distillation_loss = F.kl_div(
+                F.log_softmax(outputs_kd / self.tau, axis=1),
+                F.log_softmax(teacher_outputs / self.tau, axis=1),
+                reduction='sum') * (self.tau * self.tau) / outputs_kd.numel()
+        elif self.type == 'hard':
+            distillation_loss = F.cross_entropy(outputs_kd, teacher_outputs.argmax(axis=1))
+
+        loss = base_loss * (1 - self.alpha) + distillation_loss * self.alpha
+        return loss
+
+
diff --git a/image_classification/PVTv2/main_multi_gpu.py b/image_classification/PVTv2/main_multi_gpu.py
index cb8dc67e..ee32f276 100644
--- a/image_classification/PVTv2/main_multi_gpu.py
+++ b/image_classification/PVTv2/main_multi_gpu.py
@@ -1,4 +1,4 @@
-# Copyright (c) 2021 PPViT Authors. All Rights Reserved. 
+# Copyright (c) 2021 PPViT Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -25,54 +25,58 @@ import paddle.nn as nn import paddle.nn.functional as F import paddle.distributed as dist -from datasets import get_dataloader, get_dataset -from pvtv2 import build_pvtv2 as build_model +from datasets import get_dataloader +from datasets import get_dataset from utils import AverageMeter from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn from config import get_config from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from losses import DistillationLoss +from model_ema import ModelEma +from pvtv2 import build_pvtv2 as build_model -parser = argparse.ArgumentParser('PVTv2') -parser.add_argument('-cfg', type=str, default=None) -parser.add_argument('-dataset', type=str, default=None) -parser.add_argument('-batch_size', type=int, default=None) -parser.add_argument('-image_size', type=int, default=None) -parser.add_argument('-data_path', type=str, default=None) -parser.add_argument('-ngpus', type=int, default=None) -parser.add_argument('-pretrained', type=str, default=None) -parser.add_argument('-resume', type=str, default=None) -parser.add_argument('-last_epoch', type=int, default=None) -parser.add_argument('-eval', action='store_true') -arguments = parser.parse_args() - - -log_format = "%(asctime)s %(message)s" -logging.basicConfig(stream=sys.stdout, level=logging.INFO, - format=log_format, datefmt="%m%d %I:%M:%S %p") - -# get default config -config = get_config() -# update config by arguments -config = update_config(config, arguments) - -# set output folder -if not config.EVAL: - config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) -else: - config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) - -#config.freeze() - -if not os.path.exists(config.SAVE): - os.makedirs(config.SAVE, exist_ok=True) - -# set logging format -logger = logging.getLogger() -fh = logging.FileHandler(os.path.join(config.SAVE, 'log.txt')) -fh.setFormatter(logging.Formatter(log_format)) -logger.addHandler(fh) -logger.info(f'config= {config}') +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('PVTv2') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed 
when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger def train(dataloader, @@ -80,83 +84,157 @@ def train(dataloader, criterion, optimizer, epoch, + total_epochs, total_batch, debug_steps=100, - accum_iter=1): + accum_iter=1, + model_ema=None, + mixup_fn=None, + amp=False, + local_logger=None, + master_logger=None): """Training for one epoch Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion epoch: int, current epoch - total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info - accum_iter: int, num of iters for accumulating gradients + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + model_ema: ModelEma, model moving average instance + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None Returns: - train_loss_meter.avg - train_acc_meter.avg - train_time + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + master_train_loss_meter.avg: float, average loss on all processes/gpus + master_train_acc_meter.avg: float, average top1 accuracy on all processes/gpus + train_time: float, training time """ model.train() train_loss_meter = AverageMeter() train_acc_meter = AverageMeter() + master_train_loss_meter = AverageMeter() + master_train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) time_st = time.time() for batch_id, data in enumerate(dataloader): image = data[0] label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() - output = model(image) - loss = criterion(output, label) + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() - #NOTE: division may be needed depending on the loss function - # Here no division is needed: - # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' - # - #loss = loss / accum_iter + if model_ema is not None and dist.get_rank() == 0: + model_ema.update(model) - loss.backward() + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) - if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): - optimizer.step() - optimizer.clear_grad() + 
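Condensed, the AMP-plus-gradient-accumulation branch above reduces to the following pattern (a sketch; `model`, `criterion`, `optimizer`, `dataloader`, and `accum_iter` are assumed from the surrounding code):

```python
import paddle

scaler = paddle.amp.GradScaler(init_loss_scaling=1024)
for batch_id, (image, label) in enumerate(dataloader):
    with paddle.amp.auto_cast():
        output = model(image)
        loss = criterion(output, label)
    scaled = scaler.scale(loss)   # scale the loss so fp16 gradients do not underflow
    scaled.backward()             # gradients keep accumulating until the step below
    if (batch_id + 1) % accum_iter == 0 or batch_id + 1 == len(dataloader):
        scaler.minimize(optimizer, scaled)   # unscale gradients and apply the update
        optimizer.clear_grad()
```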
batch_size = paddle.to_tensor(image.shape[0]) - pred = F.softmax(output) - acc = paddle.metric.accuracy(pred, label.unsqueeze(1)) + # sync from other gpus for overall loss and acc + master_loss = loss.clone() + master_acc = acc.clone() + master_batch_size = batch_size.clone() + dist.all_reduce(master_loss) + dist.all_reduce(master_acc) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc = master_acc / dist.get_world_size() + master_train_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_train_acc_meter.update(master_acc.numpy()[0], master_batch_size.numpy()[0]) - batch_size = image.shape[0] - train_loss_meter.update(loss.numpy()[0], batch_size) - train_acc_meter.update(acc.numpy()[0], batch_size) + train_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + train_acc_meter.update(acc.numpy()[0], batch_size.numpy()[0]) if batch_id % debug_steps == 0: - logger.info( - f"Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + - f"Step[{batch_id:04d}/{total_batch:04d}], " + - f"Avg Loss: {train_loss_meter.avg:.4f}, " + - f"Avg Acc: {train_acc_meter.avg:.4f}") + if local_logger: + local_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_train_loss_meter.avg:.4f}, " + + f"Avg Acc: {master_train_acc_meter.avg:.4f}") train_time = time.time() - time_st - return train_loss_meter.avg, train_acc_meter.avg, train_time - - -def validate(dataloader, model, criterion, total_batch, debug_steps=100): + return (train_loss_meter.avg, + train_acc_meter.avg, + master_train_loss_meter.avg, + master_train_acc_meter.avg, + train_time) + + +def validate(dataloader, + model, + criterion, + total_batch, + debug_steps=100, + local_logger=None, + master_logger=None): """Validation for whole dataset Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info + debug_steps: int, num of iters to log info, default: 100 + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None Returns: - val_loss_meter.avg - val_acc1_meter.avg - val_acc5_meter.avg - val_time + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + master_val_loss_meter.avg: float, average loss on all processes/gpus + master_val_acc1_meter.avg: float, average top1 accuracy on all processes/gpus + master_val_acc5_meter.avg: float, average top5 accuracy on all processes/gpus + val_time: float, validation time """ model.eval() val_loss_meter = AverageMeter() val_acc1_meter = AverageMeter() val_acc5_meter = AverageMeter() + master_val_loss_meter = AverageMeter() + master_val_acc1_meter = AverageMeter() + master_val_acc5_meter = AverageMeter() time_st = time.time() with paddle.no_grad(): @@ -171,56 +249,144 @@ def validate(dataloader, model, criterion, total_batch, debug_steps=100): acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) - 
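Both train() above and validate() below average their per-batch metrics across GPUs with the same clone/all-reduce/divide pattern; condensed (assumes `dist.init_parallel_env()` has run and `loss`, `acc`, `batch_size` are tensors on each rank):

```python
import paddle.distributed as dist

master_loss = loss.clone()
master_acc = acc.clone()
master_batch_size = batch_size.clone()
dist.all_reduce(master_loss)        # sums the per-rank values
dist.all_reduce(master_acc)
dist.all_reduce(master_batch_size)
master_loss = master_loss / dist.get_world_size()   # back to a per-rank average
master_acc = master_acc / dist.get_world_size()
# the master_* meters are then updated with these averaged values on rank 0
```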
dist.all_reduce(loss) - dist.all_reduce(acc1) - dist.all_reduce(acc5) - loss = loss / dist.get_world_size() - acc1 = acc1 / dist.get_world_size() - acc5 = acc5 / dist.get_world_size() - batch_size = paddle.to_tensor(image.shape[0]) - dist.all_reduce(batch_size) + + master_loss = loss.clone() + master_acc1 = acc1.clone() + master_acc5 = acc5.clone() + master_batch_size = batch_size.clone() + + dist.all_reduce(master_loss) + dist.all_reduce(master_acc1) + dist.all_reduce(master_acc5) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc1 = master_acc1 / dist.get_world_size() + master_acc5 = master_acc5 / dist.get_world_size() + + master_val_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc1_meter.update(master_acc1.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc5_meter.update(master_acc5.numpy()[0], master_batch_size.numpy()[0]) val_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) val_acc1_meter.update(acc1.numpy()[0], batch_size.numpy()[0]) val_acc5_meter.update(acc5.numpy()[0], batch_size.numpy()[0]) if batch_id % debug_steps == 0: - logger.info( - f"Val Step[{batch_id:04d}/{total_batch:04d}], " + - f"Avg Loss: {val_loss_meter.avg:.4f}, " + - f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + - f"Avg Acc@5: {val_acc5_meter.avg:.4f}") - + if local_logger: + local_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {master_val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {master_val_acc5_meter.avg:.4f}") val_time = time.time() - time_st - return val_loss_meter.avg, val_acc1_meter.avg, val_acc5_meter.avg, val_time + return (val_loss_meter.avg, + val_acc1_meter.avg, + val_acc5_meter.avg, + master_val_loss_meter.avg, + master_val_acc1_meter.avg, + master_val_acc5_meter.avg, + val_time) def main_worker(*args): - # 0. Preparation + # STEP 0: Preparation + config = args[0] dist.init_parallel_env() last_epoch = config.TRAIN.LAST_EPOCH - world_size = paddle.distributed.get_world_size() - local_rank = paddle.distributed.get_rank() - logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + world_size = dist.get_world_size() + local_rank = dist.get_rank() seed = config.SEED + local_rank paddle.seed(seed) np.random.seed(seed) random.seed(seed) - # 1. Create model + # logger for each process/gpu + local_logger = get_logger( + filename=os.path.join(config.SAVE, 'log_{}.txt'.format(local_rank)), + logger_name='local_logger') + # overall logger + if local_rank == 0: + master_logger = get_logger( + filename=os.path.join(config.SAVE, 'log.txt'), + logger_name='master_logger') + master_logger.info(f'\n{config}') + else: + master_logger = None + local_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + if local_rank == 0: + master_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + + # STEP 1: Create model model = build_model(config) + # define model ema + model_ema = None + if not config.EVAL and config.TRAIN.MODEL_EMA and local_rank == 0: + model_ema = ModelEma(model, decay=config.TRAIN.MODEL_EMA_DECAY) model = paddle.DataParallel(model) - # 2. 
Create train and val dataloader - dataset_train, dataset_val = args[0], args[1] - dataloader_train = get_dataloader(config, dataset_train, 'train', True) + + # STEP 2: Create train and val dataloader + dataset_train, dataset_val = args[1], args[2] + # Create training dataloader + if not config.EVAL: + dataloader_train = get_dataloader(config, dataset_train, 'train', True) + total_batch_train = len(dataloader_train) + local_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + if local_rank == 0: + master_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + # Create validation dataloader dataloader_val = get_dataloader(config, dataset_val, 'test', True) - total_batch_train = len(dataloader_train) total_batch_val = len(dataloader_val) - logging.info(f'----- Total # of train batch (single gpu): {total_batch_train}') - logging.info(f'----- Total # of val batch (single gpu): {total_batch_val}') - # 3. Define criterion - criterion = nn.CrossEntropyLoss() - # 4. Define optimizer and lr_scheduler + local_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + if local_rank == 0: + master_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + scheduler = None if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, @@ -242,7 +408,9 @@ def main_worker(*args): gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, last_epoch=last_epoch) else: - logging.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + local_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") 
raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") if config.TRAIN.OPTIMIZER.NAME == "SGD": @@ -269,80 +437,132 @@ def main_worker(*args): weight_decay=config.TRAIN.WEIGHT_DECAY, epsilon=config.TRAIN.OPTIMIZER.EPS, grad_clip=clip, - apply_decay_param_fun=get_exclude_from_weight_decay_fn([ - 'absolute_pos_embed', 'relative_position_bias_table']), + #apply_decay_param_fun=get_exclude_from_weight_decay_fn([ + # 'absolute_pos_embed', 'relative_position_bias_table']), ) else: - logging.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + local_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") - # 5. Load pretrained model / load resumt model and optimizer states + # STEP 6: Load pretrained model / load resumt model and optimizer states if config.MODEL.PRETRAINED: if (config.MODEL.PRETRAINED).endswith('.pdparams'): raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') model.set_dict(model_state) - logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + local_logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + if local_rank == 0: + master_logger.info( + f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") if config.MODEL.RESUME: - assert os.path.isfile(config.MODEL.RESUME+'.pdparams') is True - assert os.path.isfile(config.MODEL.RESUME+'.pdopt') is True - model_state = paddle.load(config.MODEL.RESUME+'.pdparams') + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') model.set_dict(model_state) opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') optimizer.set_state_dict(opt_state) - logger.info( - f"----- Resume Training: Load model and optmizer states from {config.MODEL.RESUME}") + local_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + if local_rank == 0: + master_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + # load ema model + if model_ema is not None and os.path.isfile(config.MODEL.RESUME + '-EMA.pdparams'): + model_ema_state = paddle.load(config.MODEL.RESUME + '-EMA.pdparams') + model_ema.module.set_state_dict(model_ema_state) + local_logger.info(f'----- Load model ema from {config.MODEL.RESUME}-EMA.pdparams') + if local_rank == 0: + master_logger.info(f'----- Load model ema from {config.MODEL.RESUME}-EMA.pdparams') - # 6. 
Validation + # STEP 7: Validation (eval mode) if config.EVAL: - logger.info('----- Start Validating') - val_loss, val_acc1, val_acc5, val_time = validate( + local_logger.info('----- Start Validating') + if local_rank == 0: + master_logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=criterion, + criterion=criterion_val, total_batch=total_batch_val, - debug_steps=config.REPORT_FREQ) - logger.info(f"Validation Loss: {val_loss:.4f}, " + - f"Validation Acc@1: {val_acc1:.4f}, " + - f"Validation Acc@5: {val_acc5:.4f}, " + - f"time: {val_time:.2f}") + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") return - # 6. Start training and validation - logging.info(f"Start training from epoch {last_epoch+1}.") + # STEP 8: Start training and validation (train mode) + local_logger.info(f"Start training from epoch {last_epoch+1}.") + if local_rank == 0: + master_logger.info(f"Start training from epoch {last_epoch+1}.") for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): # train - logging.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") - train_loss, train_acc, train_time = train(dataloader=dataloader_train, - model=model, - criterion=criterion, - optimizer=optimizer, - epoch=epoch, - total_batch=total_batch_train, - debug_steps=config.REPORT_FREQ, - accum_iter=config.TRAIN.ACCUM_ITER) + local_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + if local_rank == 0: + master_logger.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") + train_loss, train_acc, avg_loss, avg_acc, train_time = train( + dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=total_batch_train, + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + model_ema=model_ema, + mixup_fn=mixup_fn, + amp=config.AMP, + local_logger=local_logger, + master_logger=master_logger) + scheduler.step() - logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + - f"Train Loss: {train_loss:.4f}, " + - f"Train Acc: {train_acc:.4f}, " + - f"time: {train_time:.2f}") + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {avg_loss:.4f}, " + + f"Train Acc: {avg_acc:.4f}, " + + f"time: {train_time:.2f}") + # validation if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: - logger.info(f'----- Validation after Epoch: {epoch}') - val_loss, val_acc1, val_acc5, val_time = validate( + local_logger.info(f'----- Validation after Epoch: {epoch}') + if local_rank == 0: + master_logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=criterion, + criterion=criterion_val, total_batch=total_batch_val, - debug_steps=config.REPORT_FREQ) - logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + - f"Validation Loss: {val_loss:.4f}, " + - f"Validation Acc@1: {val_acc1:.4f}, " + - f"Validation Acc@5: {val_acc5:.4f}, " + - f"time: {val_time:.2f}") + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") # model save if local_rank == 0: if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: @@ -350,15 +570,38 @@ def main_worker(*args): config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") paddle.save(model.state_dict(), model_path + '.pdparams') paddle.save(optimizer.state_dict(), model_path + '.pdopt') - logger.info(f"----- Save model: {model_path}.pdparams") - logger.info(f"----- Save optim: {model_path}.pdopt") + master_logger.info(f"----- Save model: {model_path}.pdparams") + master_logger.info(f"----- Save optim: {model_path}.pdopt") + if model_ema is not None: + model_ema_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}-EMA") + paddle.save(model_ema.state_dict(), model_ema_path + '.pdparams') + master_logger.info(f"----- Save ema model: {model_ema_path}.pdparams") def main(): - dataset_train = get_dataset(config, mode='train') + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + + # set output 
folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + + # get dataset and start DDP + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + else: + dataset_train = None dataset_val = get_dataset(config, mode='val') config.NGPUS = len(paddle.static.cuda_places()) if config.NGPUS == -1 else config.NGPUS - dist.spawn(main_worker, args=(dataset_train, dataset_val, ), nprocs=config.NGPUS) + dist.spawn(main_worker, args=(config, dataset_train, dataset_val, ), nprocs=config.NGPUS) if __name__ == "__main__": diff --git a/image_classification/PVTv2/main_single_gpu.py b/image_classification/PVTv2/main_single_gpu.py index f397191a..0b282077 100644 --- a/image_classification/PVTv2/main_single_gpu.py +++ b/image_classification/PVTv2/main_single_gpu.py @@ -1,5 +1,4 @@ - -# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -27,53 +26,56 @@ import paddle.nn.functional as F from datasets import get_dataloader from datasets import get_dataset -from pvtv2 import build_pvtv2 as build_model from utils import AverageMeter from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn from config import get_config from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from losses import DistillationLoss +from model_ema import ModelEma +from pvtv2 import build_pvtv2 as build_model -parser = argparse.ArgumentParser('PVTv2') -parser.add_argument('-cfg', type=str, default=None) -parser.add_argument('-dataset', type=str, default=None) -parser.add_argument('-batch_size', type=int, default=None) -parser.add_argument('-image_size', type=int, default=None) -parser.add_argument('-data_path', type=str, default=None) -parser.add_argument('-ngpus', type=int, default=None) -parser.add_argument('-pretrained', type=str, default=None) -parser.add_argument('-resume', type=str, default=None) -parser.add_argument('-last_epoch', type=int, default=None) -parser.add_argument('-eval', action='store_true') -args = parser.parse_args() - - -log_format = "%(asctime)s %(message)s" -logging.basicConfig(stream=sys.stdout, level=logging.INFO, - format=log_format, datefmt="%m%d %I:%M:%S %p") - -# get default config -config = get_config() -# update config by arguments -config = update_config(config, args) - -# set output folder -if not config.EVAL: - config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) -else: - config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) - -#config.freeze() - -if not os.path.exists(config.SAVE): - os.makedirs(config.SAVE, exist_ok=True) - -# set logging format -logger = logging.getLogger() -fh = logging.FileHandler(os.path.join(config.SAVE, 'log.txt')) -fh.setFormatter(logging.Formatter(log_format)) -logger.addHandler(fh) -logger.info(f'config= {config}') +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('PVTv2') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', 
type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger def train(dataloader, @@ -81,56 +83,87 @@ def train(dataloader, criterion, optimizer, epoch, + total_epochs, total_batch, debug_steps=100, - accum_iter=1): + accum_iter=1, + model_ema=None, + mixup_fn=None, + amp=False, + logger=None): """Training for one epoch Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion epoch: int, current epoch - total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info - accum_iter: int, num of iters for accumulating gradients + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + model_ema: ModelEma, model moving average instance + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + logger: logger for logging, default: None Returns: - train_loss_meter.avg - train_acc_meter.avg - train_time + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + train_time: float, training time """ model.train() train_loss_meter = AverageMeter() train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) time_st = time.time() for batch_id, data in enumerate(dataloader): image = data[0] label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is 
set to 'mean' + #loss = loss / accum_iter + loss.backward() - output = model(image) - loss = criterion(output, label) - - #NOTE: division may be needed depending on the loss function - # Here no division is needed: - # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' - #loss = loss / accum_iter - - loss.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() - if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): - optimizer.step() - optimizer.clear_grad() + if model_ema is not None: + model_ema.update(model) pred = F.softmax(output) - acc = paddle.metric.accuracy(pred, label.unsqueeze(1)) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) batch_size = image.shape[0] train_loss_meter.update(loss.numpy()[0], batch_size) train_acc_meter.update(acc.numpy()[0], batch_size) - if batch_id % debug_steps == 0: + if logger and batch_id % debug_steps == 0: logger.info( - f"Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + f"Step[{batch_id:04d}/{total_batch:04d}], " + f"Avg Loss: {train_loss_meter.avg:.4f}, " + f"Avg Acc: {train_acc_meter.avg:.4f}") @@ -139,18 +172,20 @@ def train(dataloader, return train_loss_meter.avg, train_acc_meter.avg, train_time -def validate(dataloader, model, criterion, total_batch, debug_steps=100): +def validate(dataloader, model, criterion, total_batch, debug_steps=100, logger=None): """Validation for whole dataset Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion - total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + logger: logger for logging, default: None Returns: - val_loss_meter.avg - val_acc_meter.avg - val_time + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + val_time: float, valitaion time """ model.eval() val_loss_meter = AverageMeter() @@ -175,7 +210,7 @@ def validate(dataloader, model, criterion, total_batch, debug_steps=100): val_acc1_meter.update(acc1.numpy()[0], batch_size) val_acc5_meter.update(acc5.numpy()[0], batch_size) - if batch_id % debug_steps == 0: + if logger and batch_id % debug_steps == 0: logger.info( f"Val Step[{batch_id:04d}/{total_batch:04d}], " + f"Avg Loss: {val_loss_meter.avg:.4f}, " + @@ -187,24 +222,81 @@ def validate(dataloader, model, criterion, total_batch, debug_steps=100): def main(): - # 0. Preparation + # STEP 0: Preparation + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) last_epoch = config.TRAIN.LAST_EPOCH seed = config.SEED paddle.seed(seed) np.random.seed(seed) random.seed(seed) - #paddle.set_device('gpu:0') - # 1. 
Create model + logger = get_logger(filename=os.path.join(config.SAVE, 'log.txt')) + logger.info(f'\n{config}') + + # STEP 1: Create model model = build_model(config) - #model = paddle.DataParallel(model) - # 2. Create train and val dataloader - dataset_train = get_dataset(config, mode='train') + # define model ema + model_ema = None + if not config.EVAL and config.TRAIN.MODEL_EMA: + model_ema = ModelEma(model, decay=config.TRAIN.MODEL_EMA_DECAY) + + # STEP 2: Create train and val dataloader + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + dataloader_train = get_dataloader(config, dataset_train, 'train', False) dataset_val = get_dataset(config, mode='val') - dataloader_train = get_dataloader(config, dataset_train, 'train', False) dataloader_val = get_dataloader(config, dataset_val, 'val', False) - # 3. Define criterion - criterion = nn.CrossEntropyLoss() - # 4. Define lr_scheduler + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + scheduler = None if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, @@ -213,8 +305,7 @@ def main(): end_lr=config.TRAIN.END_LR, warmup_epochs=config.TRAIN.WARMUP_EPOCHS, total_epochs=config.TRAIN.NUM_EPOCHS, - last_epoch=config.TRAIN.LAST_EPOCH, - ) + last_epoch=config.TRAIN.LAST_EPOCH) elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, T_max=config.TRAIN.NUM_EPOCHS, @@ -226,9 +317,9 @@ def main(): gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, last_epoch=last_epoch) else: - logging.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") - # 5. 
Define optimizer + if config.TRAIN.OPTIMIZER.NAME == "SGD": if config.TRAIN.GRAD_CLIP: clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) @@ -248,63 +339,76 @@ def main(): optimizer = paddle.optimizer.AdamW( parameters=model.parameters(), learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, - weight_decay=config.TRAIN.WEIGHT_DECAY, beta1=config.TRAIN.OPTIMIZER.BETAS[0], beta2=config.TRAIN.OPTIMIZER.BETAS[1], weight_decay=config.TRAIN.WEIGHT_DECAY, epsilon=config.TRAIN.OPTIMIZER.EPS, grad_clip=clip, - apply_decay_param_fun=get_exclude_from_weight_decay_fn([ - 'absolute_pos_embed', 'relative_position_bias_table']), + #apply_decay_param_fun=get_exclude_from_weight_decay_fn([ + # 'absolute_pos_embed', 'relative_position_bias_table']), ) else: - logging.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") - # 6. Load pretrained model or load resume model and optimizer states + # STEP 6: Load pretrained model or load resume model and optimizer states if config.MODEL.PRETRAINED: - assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') model.set_dict(model_state) logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") if config.MODEL.RESUME: - assert os.path.isfile(config.MODEL.RESUME+'.pdparams') is True - assert os.path.isfile(config.MODEL.RESUME+'.pdopt') is True - model_state = paddle.load(config.MODEL.RESUME+'.pdparams') + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') model.set_dict(model_state) opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') - optimizer.set_dict(opt_state) + optimizer.set_state_dict(opt_state) logger.info( - f"----- Resume: Load model and optmizer from {config.MODEL.RESUME}") - # 7. Validation + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + # load ema model + if model_ema is not None and os.path.isfile(config.MODEL.RESUME + '-EMA.pdparams'): + model_ema_state = paddle.load(config.MODEL.RESUME + '-EMA.pdparams') + model_ema.module.set_state_dict(model_ema_state) + logger.info(f'----- Load model ema from {config.MODEL.RESUME}-EMA.pdparams') + + # STEP 7: Validation (eval mode) if config.EVAL: logger.info('----- Start Validating') val_loss, val_acc1, val_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=criterion, + criterion=criterion_val, total_batch=len(dataloader_val), - debug_steps=config.REPORT_FREQ) + debug_steps=config.REPORT_FREQ, + logger=logger) logger.info(f"Validation Loss: {val_loss:.4f}, " + f"Validation Acc@1: {val_acc1:.4f}, " + f"Validation Acc@5: {val_acc5:.4f}, " + f"time: {val_time:.2f}") return - # 8. Start training and validation - logging.info(f"Start training from epoch {last_epoch+1}.") + + # STEP 8: Start training and validation (train mode) + logger.info(f"Start training from epoch {last_epoch+1}.") for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): # train - logging.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") + logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") train_loss, train_acc, train_time = train(dataloader=dataloader_train, model=model, criterion=criterion, optimizer=optimizer, epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, total_batch=len(dataloader_train), debug_steps=config.REPORT_FREQ, accum_iter=config.TRAIN.ACCUM_ITER, - ) + model_ema=model_ema, + mixup_fn=mixup_fn, + amp=config.AMP, + logger=logger) scheduler.step() logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + f"Train Loss: {train_loss:.4f}, " + @@ -316,9 +420,10 @@ def main(): val_loss, val_acc1, val_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=criterion, + criterion=criterion_val, total_batch=len(dataloader_val), - debug_steps=config.REPORT_FREQ) + debug_steps=config.REPORT_FREQ, + logger=logger) logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + f"Validation Loss: {val_loss:.4f}, " + f"Validation Acc@1: {val_acc1:.4f}, " + @@ -332,6 +437,11 @@ def main(): paddle.save(optimizer.state_dict(), model_path + '.pdopt') logger.info(f"----- Save model: {model_path}.pdparams") logger.info(f"----- Save optim: {model_path}.pdopt") + if model_ema is not None: + model_ema_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}-EMA") + paddle.save(model_ema.state_dict(), model_ema_path + '.pdparams') + logger.info(f"----- Save ema model: {model_ema_path}.pdparams") if __name__ == "__main__": diff --git a/image_classification/PVTv2/mixup.py b/image_classification/PVTv2/mixup.py new file mode 100644 index 00000000..1d2db493 --- /dev/null +++ b/image_classification/PVTv2/mixup.py @@ -0,0 +1,225 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""mixup and cutmix for batch data""" +import numpy as np +import paddle + + +def rand_bbox(image_shape, lam, count=None): + """ CutMix bbox by lam value + Generate 1 random bbox by value lam. lam is the cut size rate. + The cut_size is computed by sqrt(1-lam) * image_size. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + count: int, number of bbox to generate + """ + image_h, image_w = image_shape[-2:] + cut_rate = np.sqrt(1. 
- lam) + cut_h = int(cut_rate * image_h) + cut_w = int(cut_rate * image_w) + + # get random bbox center + cy = np.random.randint(0, image_h, size=count) + cx = np.random.randint(0, image_w, size=count) + + # get bbox coords + bbox_x1 = np.clip(cx - cut_w // 2, 0, image_w) + bbox_y1 = np.clip(cy - cut_h // 2, 0, image_h) + bbox_x2 = np.clip(cx + cut_w // 2, 0, image_w) + bbox_y2 = np.clip(cy + cut_h // 2, 0, image_h) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # while in pytorch, it will return [] tensor + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def rand_bbox_minmax(image_shape, minmax, count=None): + """ CutMix bbox by min and max value + Generate 1 random bbox by min and max percentage values. + Minmax is a tuple/list of min and max percentage vlaues + applied to the image width and height. + + Args: + image_shape: tuple/list, image height and width + minmax: tuple/list, min and max percentage values of image size + count: int, number of bbox to generate + """ + assert len(minmax) == 2 + image_h, image_w = image_shape[-2:] + min_ratio = minmax[0] + max_ratio = minmax[1] + cut_h = np.random.randint(int(image_h * min_ratio), int(image_h * max_ratio), size=count) + cut_w = np.random.randint(int(image_w * min_ratio), int(image_w * max_ratio), size=count) + + bbox_x1 = np.random.randint(0, image_w - cut_w, size=count) + bbox_y1 = np.random.randint(0, image_h - cut_h, size=count) + bbox_x2 = bbox_x1 + cut_w + bbox_y2 = bbox_y1 + cut_h + + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def cutmix_generate_bbox_adjust_lam(image_shape, lam, minmax=None, correct_lam=True, count=None): + """Generate bbox and apply correction for lambda + If the mimmax is None, apply the standard cutmix by lam value, + If the minmax is set, apply the cutmix by min and max percentage values. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + minmax: tuple/list, min and max percentage values of image size + correct_lam: bool, if True, correct the lam value by the generated bbox + count: int, number of bbox to generate + """ + if minmax is not None: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox_minmax(image_shape, minmax, count) + else: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox(image_shape, lam, count) + + if correct_lam or minmax is not None: + image_h, image_w = image_shape[-2:] + bbox_area = (bbox_y2 - bbox_y1) * (bbox_x2 - bbox_x1) + lam = 1. - bbox_area / float(image_h * image_w) + return (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam + + +def one_hot(x, num_classes, on_value=1., off_value=0.): + """ Generate one-hot vector for label smoothing + Args: + x: tensor, contains label/class indices + num_classes: int, num of classes (len of the one-hot vector) + on_value: float, the vector value at label index, default=1. + off_value: float, the vector value at non-label indices, default=0. + Returns: + one_hot: tensor, tensor with on value at label index and off value + at non-label indices. + """ + x = x.reshape_([-1, 1]) + x_smoothed = paddle.full((x.shape[0], num_classes), fill_value=off_value) + for i in range(x.shape[0]): + x_smoothed[i, x[i]] = on_value + return x_smoothed + + +def mixup_one_hot(label, num_classes, lam=1., smoothing=0.): + """ mixup and label smoothing in batch + label smoothing is firstly applied, then + mixup is applied by mixing the bacth and its flip, + with a mixup rate. 
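As a concrete check of the smoothing/mixing arithmetic (illustrative numbers, not from the source): with 4 classes, `smoothing=0.1` and `lam=0.7`, the off value is 0.1 / 4 = 0.025 and the on value is 1 - 0.1 + 0.025 = 0.925.

```python
import paddle

# hypothetical mini-batch of two labels; uses the one_hot() defined above
label = paddle.to_tensor([0, 2])
off_value = 0.1 / 4                       # smoothing / num_classes
on_value = 1.0 - 0.1 + off_value          # 0.925
y1 = one_hot(label, 4, on_value, off_value)                   # targets for the batch
y2 = one_hot(label.flip(axis=[0]), 4, on_value, off_value)    # targets for the flipped batch
mixed = 0.7 * y1 + 0.3 * y2               # lam * y1 + (1 - lam) * y2, fed to SoftTargetCrossEntropyLoss
```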
+ + Args: + label: tensor, label tensor with shape [N], contains the class indices + num_classes: int, num of all classes + lam: float, mixup rate, default=1.0 + smoothing: float, label smoothing rate + """ + off_value = smoothing / num_classes + on_value = 1. - smoothing + off_value + y1 = one_hot(label, num_classes, on_value, off_value) + y2 = one_hot(label.flip(axis=[0]), num_classes, on_value, off_value) + return y2 * (1 - lam) + y1 * lam + + +class Mixup: + """Mixup class + Args: + mixup_alpha: float, mixup alpha for beta distribution, default=1.0, + cutmix_alpha: float, cutmix alpha for beta distribution, default=0.0, + cutmix_minmax: list/tuple, min and max value for cutmix ratio, default=None, + prob: float, if random prob < prob, do not use mixup, default=1.0, + switch_prob: float, prob of switching mixup and cutmix, default=0.5, + mode: string, mixup up, now only 'batch' is supported, default='batch', + correct_lam: bool, if True, apply correction of lam, default=True, + label_smoothing: float, label smoothing rate, default=0.1, + num_classes: int, num of classes, default=1000 + """ + def __init__(self, + mixup_alpha=1.0, + cutmix_alpha=0.0, + cutmix_minmax=None, + prob=1.0, + switch_prob=0.5, + mode='batch', + correct_lam=True, + label_smoothing=0.1, + num_classes=1000): + self.mixup_alpha = mixup_alpha + self.cutmix_alpha = cutmix_alpha + self.cutmix_minmax = cutmix_minmax + if cutmix_minmax is not None: + assert len(cutmix_minmax) == 2 + self.cutmix_alpha = 1.0 + self.mix_prob = prob + self.switch_prob = switch_prob + self.label_smoothing = label_smoothing + self.num_classes = num_classes + self.mode = mode + self.correct_lam = correct_lam + assert mode == 'batch', 'Now only batch mode is supported!' + + def __call__(self, x, target): + assert x.shape[0] % 2 == 0, "Batch size should be even" + lam = self._mix_batch(x) + target = mixup_one_hot(target, self.num_classes, lam, self.label_smoothing) + return x, target + + def get_params(self): + """Decide to use cutmix or regular mixup by sampling and + sample lambda for mixup + """ + lam = 1. + use_cutmix = False + use_mixup = np.random.rand() < self.mix_prob + if use_mixup: + if self.mixup_alpha > 0. and self.cutmix_alpha > 0.: + use_cutmix = np.random.rand() < self.switch_prob + alpha = self.cutmix_alpha if use_cutmix else self.mixup_alpha + lam_mix = np.random.beta(alpha, alpha) + elif self.mixup_alpha == 0. and self.cutmix_alpha > 0.: + use_cutmix=True + lam_mix = np.random.beta(self.cutmix_alpha, self.cutmix_alpha) + elif self.mixup_alpha > 0. 
and self.cutmix_alpha == 0.: + lam_mix = np.random.beta(self.mixup_alpha, self.mixup_alpha) + else: + raise ValueError('mixup_alpha and cutmix_alpha cannot be all 0') + lam = float(lam_mix) + return lam, use_cutmix + + def _mix_batch(self, x): + """mixup/cutmix by adding batch data and its flipped version""" + lam, use_cutmix = self.get_params() + if lam == 1.: + return lam + if use_cutmix: + (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam = cutmix_generate_bbox_adjust_lam( + x.shape, + lam, + minmax=self.cutmix_minmax, + correct_lam=self.correct_lam) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # but in pytorch, it will return [] tensor without errors + if int(bbox_x1) != int(bbox_x2) and int(bbox_y1) != int(bbox_y2): + x[:, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] = x.flip(axis=[0])[ + :, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] + else: + x_flipped = x.flip(axis=[0]) + x_flipped = x_flipped * (1 - lam) + x.set_value(x * (lam) + x_flipped) + return lam diff --git a/image_classification/PVTv2/model_ema.py b/image_classification/PVTv2/model_ema.py new file mode 100644 index 00000000..d12383b2 --- /dev/null +++ b/image_classification/PVTv2/model_ema.py @@ -0,0 +1,62 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" Implement the Exponential Model Averaging +This is paddle hack from: +https://github.com/rwightman/pytorch-image-models/blob/master/timm/utils/model_ema.py +""" + +import copy +from collections import OrderedDict +import paddle +import paddle.nn as nn + + +class ModelEma: + """Model Ema + A moving average is kept of model weights and buffers. + Note that for multiple gpu, ema must be defined after mode init, + but before DataParallel. 
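The class keeps a detached, eval-mode CPU copy of the weights and folds each update in with `ema = decay * ema + (1 - decay) * param`. A hedged usage sketch following the single-GPU training loop (`model`, `criterion`, `optimizer`, `dataloader` assumed from the surrounding code):

```python
model_ema = ModelEma(model, decay=0.999)     # create after model init, before DataParallel
for image, label in dataloader:
    loss = criterion(model(image), label)
    loss.backward()
    optimizer.step()
    optimizer.clear_grad()
    model_ema.update(model)                  # ema_param <- 0.999 * ema_param + 0.001 * param
# evaluate or checkpoint the averaged weights via model_ema.module / model_ema.state_dict()
```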
+ + Args: + model: nn.Layer, original modela with learnable params + decay: float, decay rate for each update, default: 0.999 + """ + def __init__(self, model, decay=0.999): + self.module = copy.deepcopy(model) + self.module.eval() + self.module.to('cpu') + self.decay = decay + + @paddle.no_grad() + def _update(self, model, update_fn): + # update ema model parameters by model parameters + for (_, ema_param), (_, model_param) in zip( + self.module.named_parameters(), model.named_parameters()): + ema_param.set_value(copy.deepcopy(update_fn(ema_param, model_param))) + + # update ema model buffers by model buffers + for (_, ema_buf), (_, model_buf) in zip( + self.module.named_buffers(), model.named_buffers()): + ema_buf.set_value(copy.deepcopy(update_fn(ema_buf, model_buf))) + + def update(self, model): + self._update(model, update_fn=lambda e, m: self.decay * e + (1 - self.decay) * m) + + def set(self, model): + self._update(model, update_fn=lambda e, m: m) + + def state_dict(self): + return self.module.state_dict() + diff --git a/image_classification/PVTv2/port_weights/__init__.py b/image_classification/PVTv2/port_weights/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/image_classification/PVTv2/pvtv2.py b/image_classification/PVTv2/pvtv2.py index b1bdf12f..3896978d 100644 --- a/image_classification/PVTv2/pvtv2.py +++ b/image_classification/PVTv2/pvtv2.py @@ -44,7 +44,20 @@ class DWConv(nn.Layer): """ def __init__(self, dim=768): super(DWConv, self).__init__() - self.dwconv = nn.Conv2D(dim, dim, 3, 1, 1, bias_attr=True, groups=dim) + w_attr_1, b_attr_1 = self._init_weights_conv() # init for conv + self.dwconv = nn.Conv2D(in_channels=dim, + out_channels=dim, + kernel_size=3, + stride=1, + padding=1, + groups=dim, + weight_attr=w_attr_1, + bias_attr=b_attr_1) + + def _init_weights_conv(self): + weight_attr = paddle.ParamAttr(initializer=nn.initializer.XavierNormal(fan_in=0)) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(0)) + return weight_attr, bias_attr def forward(self, x, H, W): B, _, C = x.shape @@ -84,11 +97,13 @@ def __init__(self, image_size=224, patch_size=7, stride=4, in_channels=3, embed_ kernel_size=patch_size, stride=stride, padding=(patch_size[0] // 2, patch_size[1] // 2)) - self.norm = nn.LayerNorm(embed_dim, epsilon=1e-6) + + w_attr_1, b_attr_1 = self._init_weights() + self.norm = nn.LayerNorm(embed_dim, weight_attr=w_attr_1, bias_attr=b_attr_1, epsilon=1e-6) def _init_weights(self): - weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform()) - bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform()) + weight_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(1.0)) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(0.0)) return weight_attr, bias_attr def forward(self, x): @@ -138,8 +153,8 @@ def __init__(self, in_features, hidden_features, dropout=0.0, linear=False): self.relu = nn.ReLU() def _init_weights(self): - weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.XavierUniform()) - bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Normal(std=1e-6)) + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0)) return weight_attr, bias_attr def forward(self, x, H, W): @@ -190,30 +205,70 @@ def __init__(self, self.dim_head = dim // num_heads self.scale = qk_scale or self.dim_head ** -0.5 - self.q = nn.Linear(dim, dim, bias_attr=qkv_bias) - 
self.kv = nn.Linear(dim, dim * 2, bias_attr=qkv_bias) + w_attr_1, b_attr_1 = self._init_weights() + self.q = nn.Linear(dim, + dim, + weight_attr=w_attr_1, + bias_attr=b_attr_1 if qkv_bias else False) + w_attr_2, b_attr_2 = self._init_weights() + self.kv = nn.Linear(dim, + dim * 2, + weight_attr=w_attr_2, + bias_attr=b_attr_2 if qkv_bias else False) self.attn_dropout = nn.Dropout(attention_dropout) - self.proj = nn.Linear(dim, dim) + w_attr_3, b_attr_3 = self._init_weights() + self.proj = nn.Linear(dim, + dim, + weight_attr=w_attr_3, + bias_attr=b_attr_3) self.proj_dropout = nn.Dropout(dropout) self.softmax = nn.Softmax(axis=-1) self.linear = linear self.sr_ratio = sr_ratio + w_attr_4, b_attr_4 = self._init_weights_conv() # init for conv + w_attr_5, b_attr_5 = self._init_weights_layernorm() # init for layernorm if not linear: if sr_ratio > 1: - self.sr = nn.Conv2D(dim, dim, kernel_size=sr_ratio, stride=sr_ratio) - self.norm = nn.LayerNorm(dim, epsilon=1e-5) + self.sr = nn.Conv2D(dim, + dim, + kernel_size=sr_ratio, + stride=sr_ratio, + weight_attr=w_attr_4, + bias_attr=b_attr_4) + self.norm = nn.LayerNorm(dim, + epsilon=1e-5, + weight_attr=w_attr_5, + bias_attr=b_attr_5) else: self.pool = nn.AdaptiveAvgPool2D(7) - self.sr = nn.Conv2D(dim, dim, kernel_size=1, stride=1) - self.norm = nn.LayerNorm(dim, epsilon=1e-5) + self.sr = nn.Conv2D(dim, + dim, + kernel_size=1, + stride=1, + weight_attr=w_attr_4, + bias_attr=b_attr_4) + self.norm = nn.LayerNorm(dim, + epsilon=1e-5, + weight_attr=w_attr_5, + bias_attr=b_attr_5) self.act = nn.GELU() def _init_weights(self): - weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform()) - bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform()) + weight_attr = paddle.ParamAttr(initializer=nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(0.0)) return weight_attr, bias_attr + def _init_weights_layernorm(self): + weight_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(1.0)) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + + def _init_weights_conv(self): + weight_attr = paddle.ParamAttr(initializer=nn.initializer.XavierNormal(fan_in=0)) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(0)) + return weight_attr, bias_attr + def forward(self, x, H, W): B, N, C = x.shape q = self.q(x).reshape([B, N, self.num_heads, C // self.num_heads]).transpose([0, 2, 1, 3]) @@ -269,7 +324,8 @@ class PvTv2Block(nn.Layer): def __init__(self, dim, num_heads, mlp_ratio=4., qkv_bias=False, qk_scale=None, dropout=0., attention_dropout=0., drop_path=0., sr_ratio=1, linear=False): super(PvTv2Block, self).__init__() - self.norm1 = nn.LayerNorm(dim, epsilon=1e-6) + w_attr_1, b_attr_1 = self._init_weights_layernorm() # init for layernorm + self.norm1 = nn.LayerNorm(dim, epsilon=1e-6, weight_attr=w_attr_1, bias_attr=b_attr_1) self.attn = Attention(dim, num_heads=num_heads, qkv_bias=qkv_bias, @@ -280,15 +336,16 @@ def __init__(self, dim, num_heads, mlp_ratio=4., qkv_bias=False, qk_scale=None, linear=linear) self.drop_path = DropPath(drop_path) if drop_path > 0. 
else Identity() - self.norm2 = nn.LayerNorm(dim, epsilon=1e-6) + w_attr_2, b_attr_2 = self._init_weights_layernorm() # init for layernorm + self.norm2 = nn.LayerNorm(dim, epsilon=1e-6, weight_attr=w_attr_2, bias_attr=b_attr_2) self.mlp = Mlp(in_features=dim, hidden_features=int(dim*mlp_ratio), dropout=dropout, linear=linear) - def _init_weights(self): - weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform()) - bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform()) + def _init_weights_layernorm(self): + weight_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(1.0)) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(0.0)) return weight_attr, bias_attr def forward(self, x, H, W): @@ -362,18 +419,30 @@ def __init__(self, cur = 0 for i in range(self.num_stages): - patch_embedding = OverlapPatchEmbedding(image_size=self.image_size if i == 0 else self.image_size // (2 ** (i + 1)), - patch_size=7 if i == 0 else 3, - stride=4 if i == 0 else 2, - in_channels=self.in_channels if i == 0 else self.embed_dims[i - 1], - embed_dim=self.embed_dims[i]) + patch_embedding = OverlapPatchEmbedding( + image_size=self.image_size if i == 0 else self.image_size // (2 ** (i + 1)), + patch_size=7 if i == 0 else 3, + stride=4 if i == 0 else 2, + in_channels=self.in_channels if i == 0 else self.embed_dims[i - 1], + embed_dim=self.embed_dims[i]) block = nn.LayerList([copy.deepcopy(PvTv2Block( - dim=self.embed_dims[i], num_heads=self.num_heads[i], mlp_ratio=self.mlp_ratio[i], qkv_bias=self.qkv_bias, - qk_scale=self.qk_scale, dropout=self.dropout, attention_dropout=self.attention_dropout, - drop_path=depth_decay[cur + j], sr_ratio=self.sr_ratio[i], linear=self.linear)) - for j in range(self.depths[i])]) - norm = nn.LayerNorm(self.embed_dims[i], epsilon=1e-6) + dim=self.embed_dims[i], + num_heads=self.num_heads[i], + mlp_ratio=self.mlp_ratio[i], + qkv_bias=self.qkv_bias, + qk_scale=self.qk_scale, + dropout=self.dropout, + attention_dropout=self.attention_dropout, + drop_path=depth_decay[cur + j], + sr_ratio=self.sr_ratio[i], + linear=self.linear)) for j in range(self.depths[i])]) + + w_attr_1, b_attr_1 = self._init_weights_layernorm() # init for layernorm + norm = nn.LayerNorm(self.embed_dims[i], + epsilon=1e-6, + weight_attr=w_attr_1, + bias_attr=b_attr_1) cur += self.depths[i] setattr(self, f"patch_embedding{i + 1}", patch_embedding) @@ -381,11 +450,20 @@ def __init__(self, setattr(self, f"norm{i + 1}", norm) # classification head - self.head = nn.Linear(self.embed_dims[3], self.num_classes) if self.num_classes > 0 else Identity() + w_attr_2, b_attr_2 = self._init_weights() # init for linear + self.head = nn.Linear(self.embed_dims[3], + self.num_classes, + weight_attr=w_attr_2, + bias_attr=b_attr_2) if self.num_classes > 0 else Identity() def _init_weights(self): - weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform()) - bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform()) + weight_attr = paddle.ParamAttr(initializer=nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + + def _init_weights_layernorm(self): + weight_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(1.0)) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(0.0)) return weight_attr, bias_attr def freeze_patch_embedding(self): @@ -430,6 +508,6 @@ def build_pvtv2(config): qk_scale=config.MODEL.TRANS.QK_SCALE, 
 dropout=config.MODEL.DROPOUT,
+ attention_dropout=config.MODEL.ATTENTION_DROPOUT,
- drop_path=config.MODEL.DROP_PATH,
+ drop_path=config.MODEL.DROPPATH,
+ linear=config.MODEL.TRANS.LINEAR)
+ return model
diff --git a/image_classification/PVTv2/random_erasing.py b/image_classification/PVTv2/random_erasing.py
new file mode 100644
index 00000000..31eea465
--- /dev/null
+++ b/image_classification/PVTv2/random_erasing.py
@@ -0,0 +1,118 @@
+# Copyright (c) 2021 PPViT Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Random Erasing for image tensor"""
+
+import random
+import math
+import paddle
+
+
+def _get_pixels(per_pixel, rand_color, patch_size, dtype="float32"):
+    if per_pixel:
+        return paddle.normal(shape=patch_size).astype(dtype)
+    if rand_color:
+        return paddle.normal(shape=(patch_size[0], 1, 1)).astype(dtype)
+    return paddle.zeros((patch_size[0], 1, 1)).astype(dtype)
+
+
+class RandomErasing(object):
+    """
+    Args:
+        prob: probability of performing random erasing
+        min_area: Minimum percentage of erased area wrt input image area
+        max_area: Maximum percentage of erased area wrt input image area
+        min_aspect: Minimum aspect ratio of erased area
+        max_aspect: Maximum aspect ratio of erased area
+        mode: pixel color mode, in ['const', 'rand', 'pixel']
+            'const' - erase block is constant valued 0 for all channels
+            'rand' - erase block is valued random color (same per-channel)
+            'pixel' - erase block is valued random color per pixel
+        min_count: Minimum # of erasing blocks per image.
+        max_count: Maximum # of erasing blocks per image.
Area per box is scaled by count + per-image count is randomly chosen between min_count to max_count + """ + def __init__(self, prob=0.5, min_area=0.02, max_area=1/3, min_aspect=0.3, max_aspect=None, + mode='const', min_count=1, max_count=None, num_splits=0): + self.prob = prob + self.min_area = min_area + self.max_area = max_area + max_aspect = max_aspect or 1 / min_aspect + self.log_aspect_ratio = (math.log(min_aspect), math.log(max_aspect)) + self.min_count = min_count + self.max_count = max_count or min_count + self.num_splits = num_splits + mode = mode.lower() + self.rand_color = False + self.per_pixel = False + if mode == "rand": + self.rand_color = True + elif mode == "pixel": + self.per_pixel = True + else: + assert not mode or mode == "const" + + def _erase(self, img, chan, img_h, img_w, dtype): + if random.random() > self.prob: + return + area = img_h * img_w + count = self.min_count if self.min_count == self.max_count else \ + random.randint(self.min_count, self.max_count) + for _ in range(count): + for attempt in range(10): + target_area = random.uniform(self.min_area, self.max_area) * area / count + aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio)) + h = int(round(math.sqrt(target_area * aspect_ratio))) + w = int(round(math.sqrt(target_area / aspect_ratio))) + if w < img_w and h < img_h: + top = random.randint(0, img_h - h) + left = random.randint(0, img_w - w) + img[:, top:top+h, left:left+w] = _get_pixels( + self.per_pixel, self.rand_color, (chan, h, w), + dtype=dtype) + break + + def __call__(self, input): + if len(input.shape) == 3: + self._erase(input, *input.shape, input.dtype) + else: + batch_size, chan, img_h, img_w = input.shape + batch_start = batch_size // self.num_splits if self.num_splits > 1 else 0 + for i in range(batch_start, batch_size): + self._erase(input[i], chan, img_h, img_w, input.dtype) + return input + + + +#def main(): +# re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='rand') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='const') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='pixel') +# import PIL.Image as Image +# import numpy as np +# paddle.set_device('cpu') +# img = paddle.to_tensor(np.asarray(Image.open('./lenna.png'))).astype('float32') +# img = img / 255.0 +# img = paddle.transpose(img, [2, 0, 1]) +# new_img = re(img) +# new_img = new_img * 255.0 +# new_img = paddle.transpose(new_img, [1, 2, 0]) +# new_img = new_img.cpu().numpy() +# new_img = Image.fromarray(new_img.astype('uint8')) +# new_img.save('./res.png') +# +# +# +#if __name__ == "__main__": +# main() diff --git a/image_classification/PVTv2/run_train.sh b/image_classification/PVTv2/run_train.sh index c9616488..30f31347 100644 --- a/image_classification/PVTv2/run_train.sh +++ b/image_classification/PVTv2/run_train.sh @@ -2,5 +2,6 @@ CUDA_VISIBLE_DEVICES=0 \ python main_single_gpu.py \ -cfg='./configs/pvtv2_b0.yaml' \ -dataset='imagenet2012' \ --batch_size=16 \ +-batch_size=8 \ -data_path='/dataset/imagenet' \ +-amp diff --git a/image_classification/PVTv2/run_train_multi.sh b/image_classification/PVTv2/run_train_multi.sh index e780da73..72fc7e2b 100644 --- a/image_classification/PVTv2/run_train_multi.sh +++ b/image_classification/PVTv2/run_train_multi.sh @@ -1,7 +1,7 @@ -CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ python main_multi_gpu.py \ -cfg='./configs/pvtv2_b0.yaml' \ -dataset='imagenet2012' \ --batch_size=32 \ +-batch_size=8 \ -data_path='/dataset/imagenet' \ --ngpus=8 +-amp diff --git 
a/image_classification/PVTv2/stat.py b/image_classification/PVTv2/stat.py
new file mode 100644
index 00000000..c67dd3f8
--- /dev/null
+++ b/image_classification/PVTv2/stat.py
@@ -0,0 +1,61 @@
+import os
+import glob
+import paddle
+from config import get_config
+from pvtv2 import build_pvtv2 as build_model
+
+def count_gelu(layer, input, output):
+    activation_flops = 8
+    x = input[0]
+    num = x.numel()
+    layer.total_ops += num * activation_flops
+
+
+def count_softmax(layer, input, output):
+    softmax_flops = 5  # max/subtract, exp, sum, divide
+    x = input[0]
+    num = x.numel()
+    layer.total_ops += num * softmax_flops
+
+
+def count_layernorm(layer, input, output):
+    layer_norm_flops = 5  # get mean (sum), get variance (square and sum), scale (multiply)
+    x = input[0]
+    num = x.numel()
+    layer.total_ops += num * layer_norm_flops
+
+
+for cfg in glob.glob('./configs/*.yaml'):
+    #cfg = './configs/pvtv2_b0.yaml'
+    input_size = (1, 3, 224, 224)
+    config = get_config(cfg)
+    model = build_model(config)
+
+    custom_ops = {paddle.nn.GELU: count_gelu,
+                  paddle.nn.LayerNorm: count_layernorm,
+                  paddle.nn.Softmax: count_softmax,
+                 }
+    print(os.path.basename(cfg))
+    paddle.flops(model,
+                 input_size=input_size,
+                 custom_ops=custom_ops,
+                 print_detail=False)
+
+
+#for cfg in glob.glob('./configs/*.yaml'):
+#    #cfg = './configs/swin_base_patch4_window7_224.yaml'
+#    input_size = (1, 3, int(cfg[-8:-5]), int(cfg[-8:-5]))
+#    config = get_config(cfg)
+#    model = build_model(config)
+#
+#
+#    custom_ops = {paddle.nn.GELU: count_gelu,
+#                  paddle.nn.LayerNorm: count_layernorm,
+#                  paddle.nn.Softmax: count_softmax,
+#                 }
+#    print(os.path.basename(cfg))
+#    paddle.flops(model,
+#                 input_size=input_size,
+#                 custom_ops=custom_ops,
+#                 print_detail=False)
+#    print('-----------')
diff --git a/image_classification/PVTv2/transforms.py b/image_classification/PVTv2/transforms.py
new file mode 100644
index 00000000..676fe1ff
--- /dev/null
+++ b/image_classification/PVTv2/transforms.py
@@ -0,0 +1,14 @@
+import random
+import paddle
+import paddle.nn
+import paddle.vision.transforms as T
+
+
+class RandomHorizontalFlip():
+    def __init__(self, p=0.5):
+        self.p = p
+
+    def __call__(self, image):
+        if random.random() < self.p:
+            return T.hflip(image)
+        return image
diff --git a/image_classification/PiT/README.md b/image_classification/PiT/README.md
new file mode 100644
index 00000000..6af9e3f5
--- /dev/null
+++ b/image_classification/PiT/README.md
@@ -0,0 +1,173 @@
+# Rethinking Spatial Dimensions of Vision Transformers, [arxiv](https://arxiv.org/abs/2103.16302)
+
+PaddlePaddle training/validation code and pretrained models for **PiT**.
+
+The official pytorch implementation is [here](https://github.com/naver-ai/pit).
+
+This implementation is developed by [PaddleViT](https://github.com/BR-IDL/PaddleViT.git).
+
+

+drawing +

PiT Model Overview

+

+ + +### Update +* Update (2021-12-08): Code is updated and ported weights are uploaded. +* Update (2021-11-13): Code is released. + +## Models Zoo +| Model | Acc@1 | Acc@5 | #Params | FLOPs | Image Size | Crop_pct | Interpolation | Link | +|----------------|----------|-------|---------|--------|------------|----------|---------------|--------------| +| pit_ti | 72.91 | 91.40 | 4.8M | 0.5G | 224 | 0.9 | bicubic |[google](https://drive.google.com/file/d/1bbeqzlR_CFB8CAyTUN52p2q6ii8rt0AW/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1Yrq5Q16MolPYHQsT_9P1mw)(ydmi) | +| pit_ti_distill | 74.54 | 92.10 | 5.1M | 0.5G | 224 | 0.9 | bicubic |[google](https://drive.google.com/file/d/1m4L0OVI0sYh8vCv37WhqCumRSHJaizqX/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1RIM9NGq6pwfNN7GJ5WZg2w)(7k4s) | +| pit_xs | 78.18 | 94.16 | 10.5M | 1.1G | 224 | 0.9 | bicubic |[google](https://drive.google.com/file/d/1qoMQ-pmqLRQmvAwZurIbpvgMK8MOEgqJ/view?usp=sharing)/[baidu](https://pan.baidu.com/s/15d7ep05vI2UoKvL09Zf_wg)(gytu) | +| pit_xs_distill | 79.31 | 94.36 | 10.9M | 1.1G | 224 | 0.9 | bicubic |[google](https://drive.google.com/file/d/1EfHOIiTJOR-nRWE5AsnJMsPCncPHEgl8/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1DqlgVF7U5qHfGD3QJAad4A)(ie7s) | +| pit_s | 81.08 | 95.33 | 23.4M | 2.4G | 224 | 0.9 | bicubic |[google](https://drive.google.com/file/d/1TDSybTrwQpcFf9PgCIhGX1t-f_oak66W/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1Vk-W1INskQq7J5Qs4yphCg)(kt1n) | +| pit_s_distill | 81.99 | 95.79 | 24.0M | 2.5G | 224 | 0.9 | bicubic |[google](https://drive.google.com/file/d/1U3VPP6We1vIaX-M3sZuHmFhCQBI9g_dL/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1L7rdWmMW8tiGkduqmak9Fw)(hhyc) | +| pit_b | 82.44 | 95.71 | 73.5M | 10.6G | 224 | 0.9 | bicubic |[google](https://drive.google.com/file/d/1-NBZ9-83nZ52jQ4DNZAIj8Xv6oh54nx-/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1XRDPY4OxFlDfl8RMQ56rEg)(uh2v) | +| pit_b_distill | 84.14 | 96.86 | 74.5M | 10.7G | 224 | 0.9 | bicubic |[google](https://drive.google.com/file/d/12Yi4eWDQxArhgQb96RXkNWjRoCsDyNo9/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1vJOUGXPtvC0abg-jnS4Krw)(3e6g) | +> *The results are evaluated on ImageNet2012 validation set. + +| Teacher Model | Link | +| -- | -- | +| RegNet_Y_160 | [google](https://drive.google.com/file/d/1_nEYFnQqlGGqboLq_VmdRvV9mLGSrbyG/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1NZNhiO4xDfqHiRiIbk9BCA)(gjsm) | + +## Notebooks +We provide a few notebooks in aistudio to help you get started: + +**\*(coming soon)\*** + + +## Requirements +- Python>=3.6 +- yaml>=0.2.5 +- [PaddlePaddle](https://www.paddlepaddle.org.cn/documentation/docs/en/install/index_en.html)>=2.1.0 +- [yacs](https://github.com/rbgirshick/yacs)>=0.1.8 + +## Data +ImageNet2012 dataset is used in the following folder structure: +``` +│imagenet/ +├──train/ +│ ├── n01440764 +│ │ ├── n01440764_10026.JPEG +│ │ ├── n01440764_10027.JPEG +│ │ ├── ...... +│ ├── ...... +├──val/ +│ ├── n01440764 +│ │ ├── ILSVRC2012_val_00000293.JPEG +│ │ ├── ILSVRC2012_val_00002138.JPEG +│ │ ├── ...... +│ ├── ...... +``` + +## Usage +To use the model with pretrained weights, download the `.pdparam` weight file and change related file paths in the following python scripts. The model config files are located in `./configs/`. 
+ +For example, assume the downloaded weight file is stored in `./swin_base_patch4_window7_224.pdparams`, to use the `swin_base_patch4_window7_224` model in python: +```python +from config import get_config +from pit import build_pit as build_model +# config files in ./configs/ +config = get_config('./configs/pit_ti.yaml') +# build model +model = build_model(config) +# load pretrained weights, .pdparams is NOT needed +model_state_dict = paddle.load('./pit_ti') +model.set_dict(model_state_dict) +``` + +## Evaluation +To evaluate PiT model performance on ImageNet2012 with a single GPU, run the following script using command line: +```shell +sh run_eval.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg='./configs/pit_ti.yaml' \ + -dataset='imagenet2012' \ + -batch_size=16 \ + -data_path='/dataset/imagenet' \ + -eval \ + -pretrained='./pit_ti' +``` + +
+ + +Run evaluation using multi-GPUs: + + + +```shell +sh run_eval_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg='./configs/pit_ti.yaml' \ + -dataset='imagenet2012' \ + -batch_size=16 \ + -data_path='/dataset/imagenet' \ + -eval \ + -pretrained='./pit_ti' +``` + +
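+The distilled variants (`pit_*_distill`) produce two predictions, one from the class token and one from the
+distillation token, and the evaluation code averages them (see the `F.softmax((output[0] + output[1]) / 2)`
+step in `main_multi_gpu.py`). Below is a minimal, non-official sketch of that averaging for a single forward
+pass; the checkpoint file name `./pit_ti_distill.pdparams` and the dummy input are placeholders:
+
+```python
+import paddle
+import paddle.nn.functional as F
+from config import get_config
+from pit import build_pit as build_model
+
+config = get_config('./configs/pit_ti_distill.yaml')
+model = build_model(config)
+model.set_dict(paddle.load('./pit_ti_distill.pdparams'))  # placeholder weight file
+model.eval()
+
+x = paddle.randn([1, 3, 224, 224])  # dummy NCHW input
+with paddle.no_grad():
+    out = model(x)                            # (class_logits, distill_logits)
+    prob = F.softmax((out[0] + out[1]) / 2)   # average the two heads, as in the eval code
+print(prob.argmax(axis=-1).numpy())
+```
+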
+ + +## Training +To train the PiT model on ImageNet2012 with single GPU, run the following script using command line: +```shell +sh run_train.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0 \ +python main_singel_gpu.py \ + -cfg='./configs/pit_ti.yaml' \ + -dataset='imagenet2012' \ + -batch_size=32 \ + -data_path='/dataset/imagenet' \ +``` + +
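+The training entry points also accept an optional `-amp` flag (defined in `config.py` and the argument
+parsers of the main scripts) to enable automatic mixed precision. The snippet below is only a condensed
+sketch of what the training loop in `main_multi_gpu.py` does when AMP is on, not a separate runnable
+trainer:
+
+```python
+# condensed AMP path of the train loop (mirrors main_multi_gpu.py)
+scaler = paddle.amp.GradScaler(init_loss_scaling=1024)
+for image, label in dataloader:
+    with paddle.amp.auto_cast():
+        output = model(image)
+        loss = criterion(image, output, label)
+    scaled = scaler.scale(loss)        # scale the loss to avoid fp16 underflow
+    scaled.backward()
+    scaler.minimize(optimizer, scaled)
+    optimizer.clear_grad()
+```
+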
+ + +Run training using multi-GPUs: + + + +```shell +sh run_train_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg='./configs/pit_ti.yaml' \ + -dataset='imagenet2012' \ + -batch_size=16 \ + -data_path='/dataset/imagenet' \ +``` + +
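+Once weights are trained or downloaded, a quick single-image sanity check can be run from python. This is
+a minimal sketch, not part of the official scripts: it mirrors the eval-time preprocessing in `datasets.py`
+(resize the short side to `IMAGE_SIZE / CROP_PCT`, i.e. 224 / 0.9 = 248, then center-crop 224), and
+`./pit_ti.pdparams` and `./demo.jpg` are placeholder paths. Depending on the variant, the forward pass may
+return a single logits tensor or a (class, distill) pair, so both cases are handled:
+
+```python
+import paddle
+from PIL import Image
+from paddle.vision import transforms
+from config import get_config
+from pit import build_pit as build_model
+
+config = get_config('./configs/pit_ti.yaml')
+model = build_model(config)
+model.set_dict(paddle.load('./pit_ti.pdparams'))  # placeholder weight file
+model.eval()
+
+val_transform = transforms.Compose([
+    transforms.Resize(248, interpolation='bicubic'),  # int(224 / 0.9)
+    transforms.CenterCrop((224, 224)),
+    transforms.ToTensor(),
+    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
+])
+
+img = val_transform(Image.open('./demo.jpg').convert('RGB')).unsqueeze(0)
+with paddle.no_grad():
+    out = model(img)
+logits = out[0] if isinstance(out, (tuple, list)) else out
+print(logits.argmax(axis=-1).numpy())
+```
+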
+ + +## Visualization Attention Map +**(coming soon)** + +## Reference +``` +@inproceedings{heo2021pit, + title={Rethinking Spatial Dimensions of Vision Transformers}, + author={Byeongho Heo and Sangdoo Yun and Dongyoon Han and Sanghyuk Chun and Junsuk Choe and Seong Joon Oh}, + booktitle = {International Conference on Computer Vision (ICCV)}, + year={2021}, +} +``` diff --git a/image_classification/PiT/__init__.py b/image_classification/PiT/__init__.py new file mode 100644 index 00000000..e2cbd538 --- /dev/null +++ b/image_classification/PiT/__init__.py @@ -0,0 +1 @@ +#init diff --git a/image_classification/PiT/augment.py b/image_classification/PiT/augment.py new file mode 100644 index 00000000..7a7f081c --- /dev/null +++ b/image_classification/PiT/augment.py @@ -0,0 +1,285 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Augmentation""" +""" Rand Augmentation """ +# reference: RandAugment: Practical automated data augmentation with a reduced search space +# https://arxiv.org/abs/1909.13719 + +""" Auto Augmentation """ +# reference: AutoAugment: Learning Augmentation Policies from Data +# https://arxiv.org/abs/1805.09501 + +import random +import numpy as np +from PIL import Image, ImageEnhance, ImageOps + + +def auto_augment_policy_original(): + """25 types of augment policies in original paper""" + policy = [ + [('Posterize', 0.4, 8), ('Rotate', 0.6, 9)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + [('Posterize', 0.6, 7), ('Posterize', 0.6, 6)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Equalize', 0.4, 4), ('Rotate', 0.8, 8)], + [('Solarize', 0.6, 3), ('Equalize', 0.6, 7)], + [('Posterize', 0.8, 5), ('Equalize', 1.0, 2)], + [('Rotate', 0.2, 3), ('Solarize', 0.6, 8)], + [('Equalize', 0.6, 8), ('Posterize', 0.4, 6)], + [('Rotate', 0.8, 8), ('Color', 0.4, 0)], + [('Rotate', 0.4, 9), ('Equalize', 0.6, 2)], + [('Equalize', 0.0, 7), ('Equalize', 0.8, 8)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Rotate', 0.8, 8), ('Color', 1.0, 2)], + [('Color', 0.8, 8), ('Solarize', 0.8, 7)], + [('Sharpness', 0.4, 7), ('Invert', 0.6, 8)], + [('ShearX', 0.6, 5), ('Equalize', 1.0, 9)], + [('Color', 0.4, 0), ('Equalize', 0.6, 3)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + ] + policy = [[SubPolicy(*args) for args in subpolicy] for subpolicy in policy] + return policy + + +def rand_augment_policy_original(magnitude_idx=9): + """ + 14 types of augment policies in original paper + Args: + magnitude_idx: M + """ + policy = [ + ('Posterize', 1, magnitude_idx), ('Rotate', 1, magnitude_idx), + ('Solarize', 1, magnitude_idx), ('AutoContrast', 1, magnitude_idx), + ('Equalize', 1, magnitude_idx), ('Contrast', 1, magnitude_idx), + ('Color', 1, 
magnitude_idx), ('Invert', 1, magnitude_idx), + ('Sharpness', 1, magnitude_idx), ('Brightness', 1, magnitude_idx), + ('ShearX', 1, magnitude_idx), ('ShearY', 1, magnitude_idx), + ('TranslateX', 1, magnitude_idx), ('TranslateY', 1, magnitude_idx), + ] + policy = [SubPolicy(*args) for args in policy] + return policy + + +class AutoAugment(): + """Auto Augment + Randomly choose a tuple of augment ops from a list of policy + Then apply the tuple of augment ops to input image + + Examples: + policy = auto_augment_policy_original() + augment = AutoAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy): + self.policy = policy + + def __call__(self, image, policy_idx=None): + if policy_idx is None: + policy_idx = random.randint(0, len(self.policy) - 1) + + sub_policy = self.policy[policy_idx] + for op in sub_policy: + image = op(image) + return image + + +class RandAugment(): + """Rand Augment + Randomly choose N augment ops from a list of K policies + Then apply the N ops to input image + + Examples: + policy = rand_augment_policy_original(magnitude_idx) + augment = RandAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy, num_layers=2): + """ + Args: + policy: list of SubPolicy + num_layers: int + """ + self.policy = policy + self.num_layers = num_layers + + def __call__(self, image): + selected_idx = np.random.choice(len(self.policy), self.num_layers) + + for policy_idx in selected_idx: + sub_policy = self.policy[policy_idx] + image = sub_policy(image) + return image + + +class SubPolicy: + """Subpolicy + Read augment name and magnitude, apply augment with probability + Args: + op_name: str, augment operation name + prob: float, if prob > random prob, apply augment + magnitude_idx: int, index of magnitude in preset magnitude ranges + """ + + def __init__(self, op_name, prob, magnitude_idx): + # ranges of operations' magnitude + ranges = { + 'ShearX': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'ShearY': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'TranslateX': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'TranslateY': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'Rotate': np.linspace(0, 30, 10), # [-30, 30] (by random negative) + 'Color': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Posterize': np.round(np.linspace(8, 4, 10), 0).astype(np.int), # [0, 4] + 'Solarize': np.linspace(256, 0, 10), # [0, 256] + 'Contrast': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Sharpness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Brightness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'AutoContrast': [0] * 10, # no range + 'Equalize': [0] * 10, # no range + 'Invert': [0] * 10, # no range + } + + # augmentation operations + # Lambda is not pickleable for DDP + # image_ops = { + # 'ShearX': lambda image, magnitude: shear_x(image, magnitude), + # 'ShearY': lambda image, magnitude: shear_y(image, magnitude), + # 'TranslateX': lambda image, magnitude: translate_x(image, magnitude), + # 'TranslateY': lambda image, magnitude: translate_y(image, magnitude), + # 'Rotate': lambda image, magnitude: rotate(image, magnitude), + # 'AutoContrast': lambda image, magnitude: auto_contrast(image, magnitude), + # 'Invert': lambda image, magnitude: invert(image, magnitude), + # 'Equalize': lambda image, magnitude: equalize(image, magnitude), + # 'Solarize': lambda image, magnitude: solarize(image, 
magnitude), + # 'Posterize': lambda image, magnitude: posterize(image, magnitude), + # 'Contrast': lambda image, magnitude: contrast(image, magnitude), + # 'Color': lambda image, magnitude: color(image, magnitude), + # 'Brightness': lambda image, magnitude: brightness(image, magnitude), + # 'Sharpness': lambda image, magnitude: sharpness(image, magnitude), + # } + image_ops = { + 'ShearX': shear_x, + 'ShearY': shear_y, + 'TranslateX': translate_x_relative, + 'TranslateY': translate_y_relative, + 'Rotate': rotate, + 'AutoContrast': auto_contrast, + 'Invert': invert, + 'Equalize': equalize, + 'Solarize': solarize, + 'Posterize': posterize, + 'Contrast': contrast, + 'Color': color, + 'Brightness': brightness, + 'Sharpness': sharpness, + } + + self.prob = prob + self.magnitude = ranges[op_name][magnitude_idx] + self.op = image_ops[op_name] + + def __call__(self, image): + if self.prob > random.random(): + image = self.op(image, self.magnitude) + return image + + +# PIL Image transforms +# https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.transform +def shear_x(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, factor, 0, 0, 1, 0), fillcolor=fillcolor) + + +def shear_y(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, factor, 1, 0), fillcolor=fillcolor) + + +def translate_x_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, pixels, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, pixels), fillcolor=fillcolor) + + +def translate_x_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, magnitude, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, magnitude), fillcolor=fillcolor) + + +def rotate(image, magnitude): + rot = image.convert("RGBA").rotate(magnitude) + return Image.composite(rot, + Image.new('RGBA', rot.size, (128,) * 4), + rot).convert(image.mode) + + +def auto_contrast(image, magnitude=None): + return ImageOps.autocontrast(image) + + +def invert(image, magnitude=None): + return ImageOps.invert(image) + + +def equalize(image, magnitude=None): + return ImageOps.equalize(image) + + +def solarize(image, magnitude): + return ImageOps.solarize(image, magnitude) + + +def posterize(image, magnitude): + return ImageOps.posterize(image, magnitude) + + +def contrast(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Contrast(image).enhance(1 + magnitude) + + +def color(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Color(image).enhance(1 + magnitude) + + +def brightness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # 
random negative + return ImageEnhance.Brightness(image).enhance(1 + magnitude) + + +def sharpness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Sharpness(image).enhance(1 + magnitude) + diff --git a/image_classification/PiT/config.py b/image_classification/PiT/config.py new file mode 100644 index 00000000..4aa9674e --- /dev/null +++ b/image_classification/PiT/config.py @@ -0,0 +1,188 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Configuration + +Configuration for data, model archtecture, and training, etc. +Config can be set by .yaml file or by argparser(limited usage) + + +""" + +import os +from yacs.config import CfgNode as CN +import yaml + +_C = CN() +_C.BASE = [''] + +# data settings +_C.DATA = CN() +_C.DATA.BATCH_SIZE = 8 #1024 batch_size for single GPU +_C.DATA.BATCH_SIZE_EVAL = 8 #1024 batch_size for single GPU +_C.DATA.DATA_PATH = '/dataset/imagenet/' # path to dataset +_C.DATA.DATASET = 'imagenet2012' # dataset name +_C.DATA.IMAGE_SIZE = 224 # input image size +_C.DATA.CROP_PCT = 0.9 # input image scale ratio, scale is applied before centercrop in eval mode +_C.DATA.NUM_WORKERS = 4 # number of data loading threads +_C.DATA.IMAGENET_MEAN = [0.485, 0.456, 0.406] # [0.5, 0.5, 0.5] +_C.DATA.IMAGENET_STD = [0.229, 0.224, 0.225] # [0.5, 0.5, 0.5] + +# model settings +_C.MODEL = CN() +_C.MODEL.TYPE = 'PiT' +_C.MODEL.NAME = 'PiT' +_C.MODEL.RESUME = None +_C.MODEL.PRETRAINED = None +_C.MODEL.NUM_CLASSES = 1000 +_C.MODEL.DISTILL = True +_C.MODEL.DROPOUT = 0.0 +_C.MODEL.ATTENTION_DROPOUT = 0.0 +_C.MODEL.DROP_PATH = 0.1 + + +# transformer settings +_C.MODEL.TRANS = CN() +_C.MODEL.TRANS.PATCH_SIZE = 14 +_C.MODEL.TRANS.STRIDE = 7 +_C.MODEL.TRANS.BASE_DIMS = [64, 64, 64] +_C.MODEL.TRANS.DEPTH = [3, 6, 4] +_C.MODEL.TRANS.HEADS = [4, 8, 16] + +# training settings +_C.TRAIN = CN() +_C.TRAIN.LAST_EPOCH = 0 +_C.TRAIN.NUM_EPOCHS = 300 +_C.TRAIN.WARMUP_EPOCHS = 20 +_C.TRAIN.WEIGHT_DECAY = 0.05 +_C.TRAIN.BASE_LR = 5e-4 +_C.TRAIN.WARMUP_START_LR = 5e-7 +_C.TRAIN.END_LR = 5e-6 +_C.TRAIN.GRAD_CLIP = 5.0 +_C.TRAIN.ACCUM_ITER = 1 +_C.TRAIN.MODEL_EMA = True +_C.TRAIN.MODEL_EMA_DECAY = 0.99996 +_C.TRAIN.LINEAR_SCALED_LR = None + +_C.TRAIN.LR_SCHEDULER = CN() +_C.TRAIN.LR_SCHEDULER.NAME = 'warmupcosine' +_C.TRAIN.LR_SCHEDULER.MILESTONES = "30, 60, 90" # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_EPOCHS = 30 # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_RATE = 0.1 # only used in StepLRScheduler + +_C.TRAIN.OPTIMIZER = CN() +_C.TRAIN.OPTIMIZER.NAME = 'AdamW' +_C.TRAIN.OPTIMIZER.EPS = 1e-8 +_C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999) # for adamW +_C.TRAIN.OPTIMIZER.MOMENTUM = 0.9 + +# train augmentation +_C.TRAIN.MIXUP_ALPHA = 0.8 +_C.TRAIN.CUTMIX_ALPHA = 1.0 +_C.TRAIN.CUTMIX_MINMAX = None +_C.TRAIN.MIXUP_PROB = 1.0 +_C.TRAIN.MIXUP_SWITCH_PROB = 0.5 +_C.TRAIN.MIXUP_MODE = 'batch' + +_C.TRAIN.SMOOTHING = 0.1 +_C.TRAIN.COLOR_JITTER = 0.4 # color jitter factor 
+_C.TRAIN.AUTO_AUGMENT = False #'rand-m9-mstd0.5-inc1' +_C.TRAIN.RAND_AUGMENT = True + +_C.TRAIN.RANDOM_ERASE_PROB = 0.25 # random erase prob +_C.TRAIN.RANDOM_ERASE_MODE = 'pixel' # random erase mode +_C.TRAIN.RANDOM_ERASE_COUNT = 1 # random erase count +_C.TRAIN.RANDOM_ERASE_SPLIT = False + +_C.TRAIN.DISTILLATION_TYPE = 'hard' # hard, soft, none +_C.TRAIN.DISTILLATION_ALPHA = 0.5 +_C.TRAIN.DISTILLATION_TAU = 1.0 +_C.TRAIN.TEACHER_MODEL = './regnety_160' # no ext is needed + +# misc +_C.SAVE = "./output" +_C.TAG = "default" +_C.SAVE_FREQ = 1 # freq to save chpt +_C.REPORT_FREQ = 50 # freq to logging info +_C.VALIDATE_FREQ = 10 # freq to do validation +_C.SEED = 0 +_C.EVAL = False # run evaluation only +_C.AMP = False +_C.LOCAL_RANK = 0 +_C.NGPUS = -1 + + +def _update_config_from_file(config, cfg_file): + config.defrost() + with open(cfg_file, 'r') as infile: + yaml_cfg = yaml.load(infile, Loader=yaml.FullLoader) + for cfg in yaml_cfg.setdefault('BASE', ['']): + if cfg: + _update_config_from_file( + config, os.path.join(os.path.dirname(cfg_file), cfg) + ) + print('merging config from {}'.format(cfg_file)) + config.merge_from_file(cfg_file) + config.freeze() + +def update_config(config, args): + """Update config by ArgumentParser + Args: + args: ArgumentParser contains options + Return: + config: updated config + """ + if args.cfg: + _update_config_from_file(config, args.cfg) + config.defrost() + if args.dataset: + config.DATA.DATASET = args.dataset + if args.batch_size: + config.DATA.BATCH_SIZE = args.batch_size + if args.image_size: + config.DATA.IMAGE_SIZE = args.image_size + if args.num_classes: + config.MODEL.NUM_CLASSES = args.num_classes + if args.data_path: + config.DATA.DATA_PATH = args.data_path + if args.output is not None: + config.SAVE = args.output + if args.ngpus: + config.NGPUS = args.ngpus + if args.eval: + config.EVAL = True + config.DATA.BATCH_SIZE_EVAL = args.batch_size + if args.pretrained: + config.MODEL.PRETRAINED = args.pretrained + if args.resume: + config.MODEL.RESUME = args.resume + if args.last_epoch: + config.TRAIN.LAST_EPOCH = args.last_epoch + if args.amp: # only during training + if config.EVAL is True: + config.AMP = False + else: + config.AMP = True + if args.teacher_model: + config.TRAIN.TEACHER_MODEL = args.teacher_model + #config.freeze() + return config + + +def get_config(cfg_file=None): + """Return a clone of config or load from yaml file""" + config = _C.clone() + if cfg_file: + _update_config_from_file(config, cfg_file) + return config diff --git a/image_classification/PiT/configs/pit_b.yaml b/image_classification/PiT/configs/pit_b.yaml new file mode 100644 index 00000000..32dc3fb8 --- /dev/null +++ b/image_classification/PiT/configs/pit_b.yaml @@ -0,0 +1,13 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.90 +MODEL: + TYPE: PiT + NAME: pit_b + DISTILL: False + TRANS: + PATCH_SIZE: 14 + STRIDE: 7 + BASE_DIMS: [64, 64, 64] + DEPTH: [3, 6, 4] + HEADS: [4, 8, 16] diff --git a/image_classification/PiT/configs/pit_b_distill.yaml b/image_classification/PiT/configs/pit_b_distill.yaml new file mode 100644 index 00000000..9273e238 --- /dev/null +++ b/image_classification/PiT/configs/pit_b_distill.yaml @@ -0,0 +1,13 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.90 +MODEL: + TYPE: PiT + NAME: pit_b_distilled + DISTILL: True + TRANS: + PATCH_SIZE: 14 + STRIDE: 7 + BASE_DIMS: [64, 64, 64] + DEPTH: [3, 6, 4] + HEADS: [4, 8, 16] diff --git a/image_classification/PiT/configs/pit_s.yaml b/image_classification/PiT/configs/pit_s.yaml new file mode 100644 index 00000000..aa01415e 
--- /dev/null +++ b/image_classification/PiT/configs/pit_s.yaml @@ -0,0 +1,13 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.90 +MODEL: + TYPE: PiT + NAME: pit_s + DISTILL: False + TRANS: + PATCH_SIZE: 16 + STRIDE: 8 + BASE_DIMS: [48, 48, 48] + DEPTH: [2, 6, 4] + HEADS: [3, 6, 12] diff --git a/image_classification/PiT/configs/pit_s_distill.yaml b/image_classification/PiT/configs/pit_s_distill.yaml new file mode 100644 index 00000000..66de42d2 --- /dev/null +++ b/image_classification/PiT/configs/pit_s_distill.yaml @@ -0,0 +1,13 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.90 +MODEL: + TYPE: PiT + NAME: pit_s_distilled + DISTILL: True + TRANS: + PATCH_SIZE: 16 + STRIDE: 8 + BASE_DIMS: [48, 48, 48] + DEPTH: [2, 6, 4] + HEADS: [3, 6, 12] diff --git a/image_classification/PiT/configs/pit_ti.yaml b/image_classification/PiT/configs/pit_ti.yaml new file mode 100644 index 00000000..e6a35199 --- /dev/null +++ b/image_classification/PiT/configs/pit_ti.yaml @@ -0,0 +1,13 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.90 +MODEL: + TYPE: PiT + NAME: pit_ti + DISTILL: False + TRANS: + PATCH_SIZE: 16 + STRIDE: 8 + BASE_DIMS: [32, 32, 32] + DEPTH: [2, 6, 4] + HEADS: [2, 4, 8] diff --git a/image_classification/PiT/configs/pit_ti_distill.yaml b/image_classification/PiT/configs/pit_ti_distill.yaml new file mode 100644 index 00000000..7ad976d9 --- /dev/null +++ b/image_classification/PiT/configs/pit_ti_distill.yaml @@ -0,0 +1,13 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.90 +MODEL: + TYPE: PiT + NAME: pit_ti_distilled + DISTILL: True + TRANS: + PATCH_SIZE: 16 + STRIDE: 8 + BASE_DIMS: [32, 32, 32] + DEPTH: [2, 6, 4] + HEADS: [2, 4, 8] diff --git a/image_classification/PiT/configs/pit_xs.yaml b/image_classification/PiT/configs/pit_xs.yaml new file mode 100644 index 00000000..1ba0da59 --- /dev/null +++ b/image_classification/PiT/configs/pit_xs.yaml @@ -0,0 +1,13 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.90 +MODEL: + TYPE: PiT + NAME: pit_xs + DISTILL: False + TRANS: + PATCH_SIZE: 16 + STRIDE: 8 + BASE_DIMS: [48, 48, 48] + DEPTH: [2, 6, 4] + HEADS: [2, 4, 8] diff --git a/image_classification/PiT/configs/pit_xs_distill.yaml b/image_classification/PiT/configs/pit_xs_distill.yaml new file mode 100644 index 00000000..bb9ca2b7 --- /dev/null +++ b/image_classification/PiT/configs/pit_xs_distill.yaml @@ -0,0 +1,13 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.90 +MODEL: + TYPE: PiT + NAME: pit_xs_distilled + DISTILL: True + TRANS: + PATCH_SIZE: 16 + STRIDE: 8 + BASE_DIMS: [48, 48, 48] + DEPTH: [2, 6, 4] + HEADS: [2, 4, 8] diff --git a/image_classification/PiT/datasets.py b/image_classification/PiT/datasets.py new file mode 100644 index 00000000..7e178b57 --- /dev/null +++ b/image_classification/PiT/datasets.py @@ -0,0 +1,222 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +""" +Dataset related classes and methods for ViT training and validation +Cifar10, Cifar100 and ImageNet2012 are supported +""" + +import os +import math +from PIL import Image +from paddle.io import Dataset +from paddle.io import DataLoader +from paddle.io import DistributedBatchSampler +from paddle.vision import transforms +from paddle.vision import datasets +from paddle.vision import image_load +from augment import auto_augment_policy_original +from augment import AutoAugment +from augment import rand_augment_policy_original +from augment import RandAugment +from transforms import RandomHorizontalFlip +from random_erasing import RandomErasing + + +class ImageNet2012Dataset(Dataset): + """Build ImageNet2012 dataset + + This class gets train/val imagenet datasets, which loads transfomed data and labels. + + Attributes: + file_folder: path where imagenet images are stored + transform: preprocessing ops to apply on image + img_path_list: list of full path of images in whole dataset + label_list: list of labels of whole dataset + """ + + def __init__(self, file_folder, mode="train", transform=None): + """Init ImageNet2012 Dataset with dataset file path, mode(train/val), and transform""" + super(ImageNet2012Dataset, self).__init__() + assert mode in ["train", "val"] + self.file_folder = file_folder + self.transform = transform + self.img_path_list = [] + self.label_list = [] + + if mode == "train": + self.list_file = os.path.join(self.file_folder, "train_list.txt") + else: + self.list_file = os.path.join(self.file_folder, "val_list.txt") + + with open(self.list_file, 'r') as infile: + for line in infile: + img_path = line.strip().split()[0] + img_label = int(line.strip().split()[1]) + self.img_path_list.append(os.path.join(self.file_folder, img_path)) + self.label_list.append(img_label) + print(f'----- Imagenet2012 image {mode} list len = {len(self.label_list)}') + + def __len__(self): + return len(self.label_list) + + def __getitem__(self, index): + data = Image.open(self.img_path_list[index]).convert('RGB') + data = self.transform(data) + label = self.label_list[index] + + return data, label + + +def get_train_transforms(config): + """ Get training transforms + + For training, a RandomResizedCrop is applied, then normalization is applied with + [0.5, 0.5, 0.5] mean and std. The input pixel values must be rescaled to [0, 1.] 
+ Outputs is converted to tensor + + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + + aug_op_list = [] + # STEP1: random crop and resize + aug_op_list.append( + transforms.RandomResizedCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE), + scale=(0.05, 1.0), interpolation='bicubic')) + # STEP2: auto_augment or color jitter + if config.TRAIN.AUTO_AUGMENT: + policy = auto_augment_policy_original() + auto_augment = AutoAugment(policy) + aug_op_list.append(auto_augment) + elif config.TRAIN.RAND_AUGMENT: + policy = rand_augment_policy_original() + rand_augment = RandAugment(policy) + aug_op_list.append(rand_augment) + else: + jitter = (float(config.TRAIN.COLOR_JITTER), ) * 3 + aug_op_list.append(transforms.ColorJitter(*jitter)) + # STEP3: other ops + aug_op_list.append(transforms.ToTensor()) + aug_op_list.append(transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, + std=config.DATA.IMAGENET_STD)) + # STEP4: random erasing + if config.TRAIN.RANDOM_ERASE_PROB > 0.: + random_erasing = RandomErasing(prob=config.TRAIN.RANDOM_ERASE_PROB, + mode=config.TRAIN.RANDOM_ERASE_MODE, + max_count=config.TRAIN.RANDOM_ERASE_COUNT, + num_splits=config.TRAIN.RANDOM_ERASE_SPLIT) + aug_op_list.append(random_erasing) + # Final: compose transforms and return + transforms_train = transforms.Compose(aug_op_list) + return transforms_train + + +def get_val_transforms(config): + """ Get training transforms + + For validation, image is first Resize then CenterCrop to image_size. + Then normalization is applied with [0.5, 0.5, 0.5] mean and std. + The input pixel values must be rescaled to [0, 1.] + Outputs is converted to tensor + + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + + scale_size = int(math.floor(config.DATA.IMAGE_SIZE / config.DATA.CROP_PCT)) + transforms_val = transforms.Compose([ + transforms.Resize(scale_size, interpolation='bicubic'), + transforms.CenterCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE)), + transforms.ToTensor(), + transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD), + ]) + return transforms_val + + +def get_dataset(config, mode='train'): + """ Get dataset from config and mode (train/val) + + Returns the related dataset object according to configs and mode(train/val) + + Args: + config: configs contains dataset related settings. 
see config.py for details
+    Returns:
+        dataset: dataset object
+    """
+
+    assert mode in ['train', 'val']
+    if config.DATA.DATASET == "cifar10":
+        if mode == 'train':
+            dataset = datasets.Cifar10(mode=mode, transform=get_train_transforms(config))
+        else:
+            mode = 'test'
+            dataset = datasets.Cifar10(mode=mode, transform=get_val_transforms(config))
+    elif config.DATA.DATASET == "cifar100":
+        if mode == 'train':
+            dataset = datasets.Cifar100(mode=mode, transform=get_train_transforms(config))
+        else:
+            mode = 'test'
+            dataset = datasets.Cifar100(mode=mode, transform=get_val_transforms(config))
+    elif config.DATA.DATASET == "imagenet2012":
+        if mode == 'train':
+            dataset = ImageNet2012Dataset(config.DATA.DATA_PATH,
+                                          mode=mode,
+                                          transform=get_train_transforms(config))
+        else:
+            dataset = ImageNet2012Dataset(config.DATA.DATA_PATH,
+                                          mode=mode,
+                                          transform=get_val_transforms(config))
+    else:
+        raise NotImplementedError(
+            f"[{config.DATA.DATASET}] Only cifar10, cifar100, imagenet2012 are supported now")
+    return dataset
+
+
+def get_dataloader(config, dataset, mode='train', multi_process=False):
+    """Get dataloader with config, dataset, mode as input, allows multi-GPU settings.
+
+    Multi-GPU loader is implemented as DistributedBatchSampler.
+
+    Args:
+        config: see config.py for details
+        dataset: paddle.io.dataset object
+        mode: train/val
+        multi_process: if True, use DistributedBatchSampler to support multi-processing
+    Returns:
+        dataloader: paddle.io.DataLoader object.
+    """
+
+    if mode == 'train':
+        batch_size = config.DATA.BATCH_SIZE
+    else:
+        batch_size = config.DATA.BATCH_SIZE_EVAL
+
+    if multi_process is True:
+        sampler = DistributedBatchSampler(dataset,
+                                          batch_size=batch_size,
+                                          shuffle=(mode == 'train'))
+        dataloader = DataLoader(dataset,
+                                batch_sampler=sampler,
+                                num_workers=config.DATA.NUM_WORKERS)
+    else:
+        dataloader = DataLoader(dataset,
+                                batch_size=batch_size,
+                                num_workers=config.DATA.NUM_WORKERS,
+                                shuffle=(mode == 'train'))
+    return dataloader
diff --git a/image_classification/PiT/droppath.py b/image_classification/PiT/droppath.py
new file mode 100644
index 00000000..d7ecf00c
--- /dev/null
+++ b/image_classification/PiT/droppath.py
@@ -0,0 +1,61 @@
+# Copyright (c) 2021 PPViT Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+DropPath, reimplemented from https://github.com/yueatsprograms/Stochastic_Depth
+"""
+
+import numpy as np
+import paddle
+import paddle.nn as nn
+
+
+class DropPath(nn.Layer):
+    """DropPath class"""
+    def __init__(self, drop_prob=None):
+        super(DropPath, self).__init__()
+        self.drop_prob = drop_prob
+
+    def drop_path(self, inputs):
+        """drop path op
+        Args:
+            inputs: tensor with arbitrary shape
+            drop_prob: float number of drop path probability, default: 0.0
+            training: bool, if current mode is training, default: False
+        Returns:
+            output: output tensor after drop path
+        """
+        # if prob is 0 or eval mode, return original input
+        if self.drop_prob == 0.
or not self.training: + return inputs + keep_prob = 1 - self.drop_prob + keep_prob = paddle.to_tensor(keep_prob, dtype='float32') + shape = (inputs.shape[0], ) + (1, ) * (inputs.ndim - 1) # shape=(N, 1, 1, 1) + random_tensor = keep_prob + paddle.rand(shape, dtype=inputs.dtype) + random_tensor = random_tensor.floor() # mask + output = inputs.divide(keep_prob) * random_tensor # divide is to keep same output expectation + return output + + def forward(self, inputs): + return self.drop_path(inputs) + + +#def main(): +# tmp = paddle.to_tensor(np.random.rand(8, 16, 8, 8), dtype='float32') +# dp = DropPath(0.5) +# out = dp(tmp) +# print(out) +# +#if __name__ == "__main__": +# main() diff --git a/image_classification/PiT/losses.py b/image_classification/PiT/losses.py new file mode 100644 index 00000000..f67780a2 --- /dev/null +++ b/image_classification/PiT/losses.py @@ -0,0 +1,121 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" Implement Loss functions """ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class LabelSmoothingCrossEntropyLoss(nn.Layer): + """ cross entropy loss for label smoothing + Args: + smoothing: float, smoothing rate + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, target label with shape [N] + Return: + loss: float, cross entropy loss value + """ + def __init__(self, smoothing=0.1): + super().__init__() + assert 0 <= smoothing < 1.0 + self.smoothing = smoothing + self.confidence = 1 - smoothing + + def forward(self, x, target): + log_probs = F.log_softmax(x) # [N, num_classes] + # target_index is used to get prob for each of the N samples + target_index = paddle.zeros([x.shape[0], 2], dtype='int64') # [N, 2] + target_index[:, 0] = paddle.arange(x.shape[0]) + target_index[:, 1] = target + + nll_loss = -log_probs.gather_nd(index=target_index) # index: [N] + smooth_loss = -log_probs.mean(axis=-1) + loss = self.confidence * nll_loss + self.smoothing * smooth_loss + return loss.mean() + + +class SoftTargetCrossEntropyLoss(nn.Layer): + """ cross entropy loss for soft target + Args: + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, soft target with shape [N, num_classes] + Returns: + loss: float, the mean loss value + """ + def __init__(self): + super().__init__() + + def forward(self, x, target): + loss = paddle.sum(-target * F.log_softmax(x, axis=-1), axis=-1) + return loss.mean() + + +class DistillationLoss(nn.Layer): + """Distillation loss function + This layer includes the orginal loss (criterion) and a extra + distillation loss (criterion), which computes the loss with + different type options, between current model and + a teacher model as its supervision. 
+
+    Args:
+        base_criterion: nn.Layer, the original criterion
+        teacher_model: nn.Layer, the teacher model as supervision
+        distillation_type: str, one of ['none', 'soft', 'hard']
+        alpha: float, ratio of base loss (* (1-alpha))
+            and distillation loss (* alpha)
+        tau: float, temperature in distillation
+    """
+    def __init__(self,
+                 base_criterion,
+                 teacher_model,
+                 distillation_type,
+                 alpha,
+                 tau):
+        super().__init__()
+        assert distillation_type in ['none', 'soft', 'hard']
+        self.base_criterion = base_criterion
+        self.teacher_model = teacher_model
+        self.type = distillation_type
+        self.alpha = alpha
+        self.tau = tau
+
+    def forward(self, inputs, outputs, targets):
+        """
+        Args:
+            inputs: tensor, the original model inputs
+            outputs: tensor, the outputs of the model
+            outputs_kd: tensor, the distillation outputs of the model,
+                this is usually obtained by a separate branch
+                in the last layer of the model
+            targets: tensor, the labels for the base criterion
+        """
+        outputs, outputs_kd = outputs[0], outputs[1]
+        base_loss = self.base_criterion(outputs, targets)
+        if self.type == 'none':
+            return base_loss
+
+        with paddle.no_grad():
+            teacher_outputs = self.teacher_model(inputs)
+
+        if self.type == 'soft':
+            distillation_loss = F.kl_div(
+                F.log_softmax(outputs_kd / self.tau, axis=1),
+                F.log_softmax(teacher_outputs / self.tau, axis=1),
+                reduction='sum') * (self.tau * self.tau) / outputs_kd.numel()
+        elif self.type == 'hard':
+            distillation_loss = F.cross_entropy(outputs_kd, teacher_outputs.argmax(axis=1))
+
+        loss = base_loss * (1 - self.alpha) + distillation_loss * self.alpha
+        return loss
diff --git a/image_classification/PiT/main_multi_gpu.py b/image_classification/PiT/main_multi_gpu.py
new file mode 100644
index 00000000..1da841b2
--- /dev/null
+++ b/image_classification/PiT/main_multi_gpu.py
@@ -0,0 +1,633 @@
+# Copyright (c) 2021 PPViT Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+ +"""PiT training/validation using multiple GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +import paddle.distributed as dist +from datasets import get_dataloader +from datasets import get_dataset +from utils import AverageMeter +from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn +from config import get_config +from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from losses import DistillationLoss +from model_ema import ModelEma +from pit import build_pit as build_model +from regnet import build_regnet as build_teacher_model + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('PiT') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-teacher_model', type=str, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + model_ema=None, + mixup_fn=None, + amp=False, + local_logger=None, + master_logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + model_ema: ModelEma, model moving average instance + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on 
current process/gpu + master_train_loss_meter.avg: float, average loss on all processes/gpus + master_train_acc_meter.avg: float, average top1 accuracy on all processes/gpus + train_time: float, training time + """ + model.train() + train_loss_meter = AverageMeter() + train_acc_meter = AverageMeter() + master_train_loss_meter = AverageMeter() + master_train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) # output[0]: class_token, output[1]: distill_token + loss = criterion(image, output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) # output[0]: class_token, output[1]: distill_token + loss = criterion(image, output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() + + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + if model_ema is not None and dist.get_rank() == 0: + model_ema.update(model) + + # average of output and kd_output, like model eval mode + pred = F.softmax((output[0] + output[1]) / 2) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = paddle.to_tensor(image.shape[0]) + + # sync from other gpus for overall loss and acc + master_loss = loss.clone() + master_acc = acc.clone() + master_batch_size = batch_size.clone() + dist.all_reduce(master_loss) + dist.all_reduce(master_acc) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc = master_acc / dist.get_world_size() + master_train_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_train_acc_meter.update(master_acc.numpy()[0], master_batch_size.numpy()[0]) + + train_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + train_acc_meter.update(acc.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_train_loss_meter.avg:.4f}, " + + f"Avg Acc: {master_train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return (train_loss_meter.avg, + train_acc_meter.avg, + master_train_loss_meter.avg, + master_train_acc_meter.avg, + train_time) + + +def validate(dataloader, + model, + criterion, + total_batch, + debug_steps=100, + local_logger=None, + master_logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + 
total_epoch: int, total num of epoch, for logging + debug_steps: int, num of iters to log info, default: 100 + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + master_val_loss_meter.avg: float, average loss on all processes/gpus + master_val_acc1_meter.avg: float, average top1 accuracy on all processes/gpus + master_val_acc5_meter.avg: float, average top5 accuracy on all processes/gpus + val_time: float, validation time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = AverageMeter() + master_val_loss_meter = AverageMeter() + master_val_acc1_meter = AverageMeter() + master_val_acc5_meter = AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = paddle.to_tensor(image.shape[0]) + + master_loss = loss.clone() + master_acc1 = acc1.clone() + master_acc5 = acc5.clone() + master_batch_size = batch_size.clone() + + dist.all_reduce(master_loss) + dist.all_reduce(master_acc1) + dist.all_reduce(master_acc5) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc1 = master_acc1 / dist.get_world_size() + master_acc5 = master_acc5 / dist.get_world_size() + + master_val_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc1_meter.update(master_acc1.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc5_meter.update(master_acc5.numpy()[0], master_batch_size.numpy()[0]) + + val_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + val_acc1_meter.update(acc1.numpy()[0], batch_size.numpy()[0]) + val_acc5_meter.update(acc5.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {master_val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {master_val_acc5_meter.avg:.4f}") + val_time = time.time() - time_st + return (val_loss_meter.avg, + val_acc1_meter.avg, + val_acc5_meter.avg, + master_val_loss_meter.avg, + master_val_acc1_meter.avg, + master_val_acc5_meter.avg, + val_time) + + +def main_worker(*args): + # STEP 0: Preparation + config = args[0] + dist.init_parallel_env() + last_epoch = config.TRAIN.LAST_EPOCH + world_size = dist.get_world_size() + local_rank = dist.get_rank() + seed = config.SEED + local_rank + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + # logger for each process/gpu + local_logger = get_logger( + filename=os.path.join(config.SAVE, 'log_{}.txt'.format(local_rank)), + logger_name='local_logger') + # overall logger + if local_rank == 0: + master_logger = get_logger( + filename=os.path.join(config.SAVE, 'log.txt'), + 
logger_name='master_logger') + master_logger.info(f'\n{config}') + else: + master_logger = None + local_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + if local_rank == 0: + master_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + + # STEP 1: Create model + model = build_model(config) + # define model ema + model_ema = None + if not config.EVAL and config.TRAIN.MODEL_EMA and local_rank == 0: + model_ema = ModelEma(model, decay=config.TRAIN.MODEL_EMA_DECAY) + model = paddle.DataParallel(model) + + # STEP 2: Create train and val dataloader + dataset_train, dataset_val = args[1], args[2] + # Create training dataloader + if not config.EVAL: + dataloader_train = get_dataloader(config, dataset_train, 'train', True) + total_batch_train = len(dataloader_train) + local_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + if local_rank == 0: + master_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + # Create validation dataloader + dataloader_val = get_dataloader(config, dataset_val, 'test', True) + total_batch_val = len(dataloader_val) + local_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + if local_rank == 0: + master_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + + # 5. 
Create Teacher model + teacher_model = None + if not config.EVAL: + if config.TRAIN.DISTILLATION_TYPE != 'none': + local_logger.info(f'Creating teacher model: {config.TRAIN.TEACHER_MODEL}') + teacher_model = build_teacher_model() + assert os.path.isfile(config.TRAIN.TEACHER_MODEL + '.pdparams') + teacher_model_state = paddle.load(config.TRAIN.TEACHER_MODEL + '.pdparams') + teacher_model.set_dict(teacher_model_state) + teacher_model.eval() + teacher_model = paddle.DataParallel(teacher_model) + local_logger.info(f"----- Load teacher model state from {config.TRAIN.TEACHER_MODEL}") + # wrap the criterion: + criterion = DistillationLoss(criterion, + teacher_model, + config.TRAIN.DISTILLATION_TYPE, + config.TRAIN.DISTILLATION_ALPHA, + config.TRAIN.DISTILLATION_TAU) + else: + raise ValueError('Distillation type cannot be None') + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH, + ) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + local_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + 
clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip, + apply_decay_param_fun=get_exclude_from_weight_decay_fn([ + 'absolute_pos_embed', 'relative_position_bias_table']), + ) + else: + local_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 6: Load pretrained model / load resumt model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + model.set_dict(model_state) + local_logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + if local_rank == 0: + master_logger.info( + f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') + optimizer.set_state_dict(opt_state) + local_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + if local_rank == 0: + master_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + # load ema model + if model_ema is not None and os.path.isfile(config.MODEL.RESUME + '-EMA.pdparams'): + model_ema_state = paddle.load(config.MODEL.RESUME + '-EMA.pdparams') + model_ema.module.set_state_dict(model_ema_state) + local_logger.info(f'----- Load model ema from {config.MODEL.RESUME}-EMA.pdparams') + if local_rank == 0: + master_logger.info(f'----- Load model ema from {config.MODEL.RESUME}-EMA.pdparams') + + # STEP 7: Validation (eval mode) + if config.EVAL: + local_logger.info('----- Start Validating') + if local_rank == 0: + master_logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 8: Start training and validation (train mode) + local_logger.info(f"Start training from epoch {last_epoch+1}.") + if local_rank == 0: + master_logger.info(f"Start training from epoch {last_epoch+1}.") + for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + local_logger.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") + if local_rank == 0: + master_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + train_loss, train_acc, avg_loss, avg_acc, train_time = train( + dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=total_batch_train, + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + model_ema=model_ema, + mixup_fn=mixup_fn, + amp=config.AMP, + local_logger=local_logger, + master_logger=master_logger) + + scheduler.step() + + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {avg_loss:.4f}, " + + f"Train Acc: {avg_acc:.4f}, " + + f"time: {train_time:.2f}") + + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + local_logger.info(f'----- Validation after Epoch: {epoch}') + if local_rank == 0: + master_logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if local_rank == 0: + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + master_logger.info(f"----- Save model: {model_path}.pdparams") + master_logger.info(f"----- Save optim: {model_path}.pdopt") + if model_ema is not None: + model_ema_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}-EMA") + paddle.save(model_ema.state_dict(), model_ema_path + '.pdparams') + master_logger.info(f"----- Save ema model: {model_ema_path}.pdparams") + + +def main(): + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + + # get dataset and start DDP + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + else: + dataset_train = None + dataset_val = get_dataset(config, mode='val') + config.NGPUS = len(paddle.static.cuda_places()) if config.NGPUS == -1 else config.NGPUS + dist.spawn(main_worker, args=(config, dataset_train, 
dataset_val, ), nprocs=config.NGPUS) + + +if __name__ == "__main__": + main() diff --git a/image_classification/PiT/main_single_gpu.py b/image_classification/PiT/main_single_gpu.py new file mode 100644 index 00000000..3126bac7 --- /dev/null +++ b/image_classification/PiT/main_single_gpu.py @@ -0,0 +1,475 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""PiT training/validation using single GPU """ + +import sys +import os +import time +import logging +import copy +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from datasets import get_dataloader +from datasets import get_dataset +from utils import AverageMeter +from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn +from config import get_config +from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from losses import DistillationLoss +from model_ema import ModelEma +from pit import build_pit as build_model +from regnet import build_regnet as build_teacher_model + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('PiT') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-teacher_model', type=str, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + model_ema=None, 
+ mixup_fn=None, + amp=False, + logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + model_ema: ModelEma, model moving average instance + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + logger: logger for logging, default: None + Returns: + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + train_time: float, training time + """ + model.train() + train_loss_meter = AverageMeter() + train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(image, output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(image, output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() + + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + if model_ema is not None: + model_ema.update(model) + + # average of output and kd_output, like model eval mode + pred = F.softmax((output[0] + output[1]) / 2) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = image.shape[0] + train_loss_meter.update(loss.numpy()[0], batch_size) + train_acc_meter.update(acc.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return train_loss_meter.avg, train_acc_meter.avg, train_time + + +def validate(dataloader, model, criterion, total_batch, debug_steps=100, logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + logger: logger for logging, default: None + Returns: + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + val_time: float, valitaion time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = 
AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = image.shape[0] + val_loss_meter.update(loss.numpy()[0], batch_size) + val_acc1_meter.update(acc1.numpy()[0], batch_size) + val_acc5_meter.update(acc5.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + + val_time = time.time() - time_st + return val_loss_meter.avg, val_acc1_meter.avg, val_acc5_meter.avg, val_time + + +def main(): + # STEP 0: Preparation + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + last_epoch = config.TRAIN.LAST_EPOCH + seed = config.SEED + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + logger = get_logger(filename=os.path.join(config.SAVE, 'log.txt')) + logger.info(f'\n{config}') + + # STEP 1: Create model + model = build_model(config) + # define model ema + model_ema = None + if not config.EVAL and config.TRAIN.MODEL_EMA: + model_ema = ModelEma(model, decay=config.TRAIN.MODEL_EMA_DECAY) + + # STEP 2: Create train and val dataloader + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + dataloader_train = get_dataloader(config, dataset_train, 'train', False) + dataset_val = get_dataset(config, mode='val') + dataloader_val = get_dataloader(config, dataset_val, 'val', False) + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Create Teacher model + teacher_model = None + if not config.EVAL: + if config.TRAIN.DISTILLATION_TYPE != 'none': + logger.info(f'Creating teacher model: {config.TRAIN.TEACHER_MODEL}') + teacher_model = build_teacher_model() + assert os.path.isfile(config.TRAIN.TEACHER_MODEL + '.pdparams') + teacher_model_state = paddle.load(config.TRAIN.TEACHER_MODEL + '.pdparams') + teacher_model.set_dict(teacher_model_state) + teacher_model.eval() + logger.info(f"----- Load teacher model state from {config.TRAIN.TEACHER_MODEL}") + # wrap the criterion: + criterion = 
DistillationLoss(criterion, + teacher_model, + config.TRAIN.DISTILLATION_TYPE, + config.TRAIN.DISTILLATION_ALPHA, + config.TRAIN.DISTILLATION_TAU) + else: + logger.fatal('Distillation type cannot be None') + raise ValueError('Distillation type cannot be None') + + # STEP 6: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from official code) + + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH, + ) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip, + apply_decay_param_fun=get_exclude_from_weight_decay_fn([ + 'absolute_pos_embed', 'relative_position_bias_table']), + ) + else: + logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 7: Load pretrained model or load resume model and optimizer states + if config.MODEL.PRETRAINED: + if 
(config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + model.set_dict(model_state) + logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME + '.pdopt') + optimizer.set_state_dict(opt_state) + logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + # load ema model + if model_ema is not None and os.path.isfile(config.MODEL.RESUME + '-EMA.pdparams'): + model_ema_state = paddle.load(config.MODEL.RESUME + '-EMA.pdparams') + model_ema.module.set_state_dict(model_ema_state) + logger.info(f'----- Load model ema from {config.MODEL.RESUME}-EMA.pdparams') + + # STEP 8: Validation (eval mode) + if config.EVAL: + logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 9: Start training and validation (train mode) + logger.info(f"Start training from epoch {last_epoch+1}.") + for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + logger.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") + train_loss, train_acc, train_time = train(dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=len(dataloader_train), + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + model_ema=model_ema, + mixup_fn=mixup_fn, + amp=config.AMP, + logger=logger) + scheduler.step() + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + logger.info(f"----- Save model: {model_path}.pdparams") + logger.info(f"----- Save optim: {model_path}.pdopt") + if model_ema is not None: + model_ema_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}-EMA") + paddle.save(model_ema.state_dict(), model_ema_path + '.pdparams') + logger.info(f"----- Save ema model: {model_ema_path}.pdparams") + + +if __name__ == "__main__": + main() diff --git a/image_classification/PiT/mixup.py b/image_classification/PiT/mixup.py new file mode 100644 index 00000000..1d2db493 --- /dev/null +++ b/image_classification/PiT/mixup.py @@ -0,0 +1,225 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""mixup and cutmix for batch data""" +import numpy as np +import paddle + + +def rand_bbox(image_shape, lam, count=None): + """ CutMix bbox by lam value + Generate 1 random bbox by value lam. lam is the cut size rate. + The cut_size is computed by sqrt(1-lam) * image_size. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + count: int, number of bbox to generate + """ + image_h, image_w = image_shape[-2:] + cut_rate = np.sqrt(1. 
- lam) + cut_h = int(cut_rate * image_h) + cut_w = int(cut_rate * image_w) + + # get random bbox center + cy = np.random.randint(0, image_h, size=count) + cx = np.random.randint(0, image_w, size=count) + + # get bbox coords + bbox_x1 = np.clip(cx - cut_w // 2, 0, image_w) + bbox_y1 = np.clip(cy - cut_h // 2, 0, image_h) + bbox_x2 = np.clip(cx + cut_w // 2, 0, image_w) + bbox_y2 = np.clip(cy + cut_h // 2, 0, image_h) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # while in pytorch, it will return [] tensor + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def rand_bbox_minmax(image_shape, minmax, count=None): + """ CutMix bbox by min and max value + Generate 1 random bbox by min and max percentage values. + Minmax is a tuple/list of min and max percentage vlaues + applied to the image width and height. + + Args: + image_shape: tuple/list, image height and width + minmax: tuple/list, min and max percentage values of image size + count: int, number of bbox to generate + """ + assert len(minmax) == 2 + image_h, image_w = image_shape[-2:] + min_ratio = minmax[0] + max_ratio = minmax[1] + cut_h = np.random.randint(int(image_h * min_ratio), int(image_h * max_ratio), size=count) + cut_w = np.random.randint(int(image_w * min_ratio), int(image_w * max_ratio), size=count) + + bbox_x1 = np.random.randint(0, image_w - cut_w, size=count) + bbox_y1 = np.random.randint(0, image_h - cut_h, size=count) + bbox_x2 = bbox_x1 + cut_w + bbox_y2 = bbox_y1 + cut_h + + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def cutmix_generate_bbox_adjust_lam(image_shape, lam, minmax=None, correct_lam=True, count=None): + """Generate bbox and apply correction for lambda + If the mimmax is None, apply the standard cutmix by lam value, + If the minmax is set, apply the cutmix by min and max percentage values. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + minmax: tuple/list, min and max percentage values of image size + correct_lam: bool, if True, correct the lam value by the generated bbox + count: int, number of bbox to generate + """ + if minmax is not None: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox_minmax(image_shape, minmax, count) + else: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox(image_shape, lam, count) + + if correct_lam or minmax is not None: + image_h, image_w = image_shape[-2:] + bbox_area = (bbox_y2 - bbox_y1) * (bbox_x2 - bbox_x1) + lam = 1. - bbox_area / float(image_h * image_w) + return (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam + + +def one_hot(x, num_classes, on_value=1., off_value=0.): + """ Generate one-hot vector for label smoothing + Args: + x: tensor, contains label/class indices + num_classes: int, num of classes (len of the one-hot vector) + on_value: float, the vector value at label index, default=1. + off_value: float, the vector value at non-label indices, default=0. + Returns: + one_hot: tensor, tensor with on value at label index and off value + at non-label indices. + """ + x = x.reshape_([-1, 1]) + x_smoothed = paddle.full((x.shape[0], num_classes), fill_value=off_value) + for i in range(x.shape[0]): + x_smoothed[i, x[i]] = on_value + return x_smoothed + + +def mixup_one_hot(label, num_classes, lam=1., smoothing=0.): + """ mixup and label smoothing in batch + label smoothing is firstly applied, then + mixup is applied by mixing the bacth and its flip, + with a mixup rate. 
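+
+    A sketch of the returned target (matching the return statement below):
+        y1 = smoothed one-hot of label
+        y2 = smoothed one-hot of label flipped along the batch axis
+        target = lam * y1 + (1 - lam) * y2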
+ + Args: + label: tensor, label tensor with shape [N], contains the class indices + num_classes: int, num of all classes + lam: float, mixup rate, default=1.0 + smoothing: float, label smoothing rate + """ + off_value = smoothing / num_classes + on_value = 1. - smoothing + off_value + y1 = one_hot(label, num_classes, on_value, off_value) + y2 = one_hot(label.flip(axis=[0]), num_classes, on_value, off_value) + return y2 * (1 - lam) + y1 * lam + + +class Mixup: + """Mixup class + Args: + mixup_alpha: float, mixup alpha for beta distribution, default=1.0, + cutmix_alpha: float, cutmix alpha for beta distribution, default=0.0, + cutmix_minmax: list/tuple, min and max value for cutmix ratio, default=None, + prob: float, if random prob < prob, do not use mixup, default=1.0, + switch_prob: float, prob of switching mixup and cutmix, default=0.5, + mode: string, mixup up, now only 'batch' is supported, default='batch', + correct_lam: bool, if True, apply correction of lam, default=True, + label_smoothing: float, label smoothing rate, default=0.1, + num_classes: int, num of classes, default=1000 + """ + def __init__(self, + mixup_alpha=1.0, + cutmix_alpha=0.0, + cutmix_minmax=None, + prob=1.0, + switch_prob=0.5, + mode='batch', + correct_lam=True, + label_smoothing=0.1, + num_classes=1000): + self.mixup_alpha = mixup_alpha + self.cutmix_alpha = cutmix_alpha + self.cutmix_minmax = cutmix_minmax + if cutmix_minmax is not None: + assert len(cutmix_minmax) == 2 + self.cutmix_alpha = 1.0 + self.mix_prob = prob + self.switch_prob = switch_prob + self.label_smoothing = label_smoothing + self.num_classes = num_classes + self.mode = mode + self.correct_lam = correct_lam + assert mode == 'batch', 'Now only batch mode is supported!' + + def __call__(self, x, target): + assert x.shape[0] % 2 == 0, "Batch size should be even" + lam = self._mix_batch(x) + target = mixup_one_hot(target, self.num_classes, lam, self.label_smoothing) + return x, target + + def get_params(self): + """Decide to use cutmix or regular mixup by sampling and + sample lambda for mixup + """ + lam = 1. + use_cutmix = False + use_mixup = np.random.rand() < self.mix_prob + if use_mixup: + if self.mixup_alpha > 0. and self.cutmix_alpha > 0.: + use_cutmix = np.random.rand() < self.switch_prob + alpha = self.cutmix_alpha if use_cutmix else self.mixup_alpha + lam_mix = np.random.beta(alpha, alpha) + elif self.mixup_alpha == 0. and self.cutmix_alpha > 0.: + use_cutmix=True + lam_mix = np.random.beta(self.cutmix_alpha, self.cutmix_alpha) + elif self.mixup_alpha > 0. 
and self.cutmix_alpha == 0.:
+                lam_mix = np.random.beta(self.mixup_alpha, self.mixup_alpha)
+            else:
+                raise ValueError('mixup_alpha and cutmix_alpha cannot be all 0')
+            lam = float(lam_mix)
+        return lam, use_cutmix
+
+    def _mix_batch(self, x):
+        """mixup/cutmix by adding batch data and its flipped version"""
+        lam, use_cutmix = self.get_params()
+        if lam == 1.:
+            return lam
+        if use_cutmix:
+            (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam = cutmix_generate_bbox_adjust_lam(
+                x.shape,
+                lam,
+                minmax=self.cutmix_minmax,
+                correct_lam=self.correct_lam)
+
+            # NOTE: in paddle, tensor indexing e.g., a[x1:x2],
+            # if x1 == x2, paddle will raise ValueError,
+            # but in pytorch, it will return an empty [] tensor without errors
+            if int(bbox_x1) != int(bbox_x2) and int(bbox_y1) != int(bbox_y2):
+                x[:, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] = x.flip(axis=[0])[
+                    :, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)]
+        else:
+            x_flipped = x.flip(axis=[0])
+            x_flipped = x_flipped * (1 - lam)
+            x.set_value(x * lam + x_flipped)
+        return lam
diff --git a/image_classification/PiT/model_ema.py b/image_classification/PiT/model_ema.py
new file mode 100644
index 00000000..38fe030c
--- /dev/null
+++ b/image_classification/PiT/model_ema.py
@@ -0,0 +1,60 @@
+# Copyright (c) 2021 PPViT Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+""" Implement Exponential Moving Average (EMA) of model weights
+This is a Paddle adaptation of:
+https://github.com/rwightman/pytorch-image-models/blob/master/timm/utils/model_ema.py
+"""
+
+import copy
+from collections import OrderedDict
+import paddle
+import paddle.nn as nn
+
+
+class ModelEma:
+    """Model Ema
+    A moving average is kept of model weights and buffers.
+    Note that for multi-GPU training, the EMA model must be created after
+    model init but before wrapping the model with DataParallel.
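+
+    Each update(model) call applies the usual EMA rule to every parameter and
+    buffer (see _update below):
+        ema_param = decay * ema_param + (1 - decay) * model_param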
+ + Args: + model: nn.Layer, original modela with learnable params + decay: float, decay rate for each update, default: 0.999 + """ + def __init__(self, model, decay=0.999): + self.module = copy.deepcopy(model) + self.module.eval() + self.decay = decay + + @paddle.no_grad() + def _update(self, model, update_fn): + # update ema model parameters by model parameters + for (_, ema_param), (_, model_param) in zip( + self.module.named_parameters(), model.named_parameters()): + ema_param.set_value(copy.deepcopy(update_fn(ema_param, model_param))) + + # update ema model buffers by model buffers + for (_, ema_buf), (_, model_buf) in zip( + self.module.named_buffers(), model.named_buffers()): + ema_buf.set_value(copy.deepcopy(update_fn(ema_buf, model_buf))) + + def update(self, model): + self._update(model, update_fn=lambda e, m: self.decay * e + (1 - self.decay) * m) + + def set(self, model): + self._update(model, update_fn=lambda e, m: m) + + def state_dict(self): + return self.module.state_dict() diff --git a/image_classification/PiT/pit.png b/image_classification/PiT/pit.png new file mode 100644 index 00000000..7d2e3c09 Binary files /dev/null and b/image_classification/PiT/pit.png differ diff --git a/image_classification/PiT/pit.py b/image_classification/PiT/pit.py new file mode 100644 index 00000000..c1ad2547 --- /dev/null +++ b/image_classification/PiT/pit.py @@ -0,0 +1,404 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Implement Transformer Class for PiT +""" + +import math +from functools import partial +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from droppath import DropPath + + +trunc_normal_ = nn.initializer.TruncatedNormal(std=0.02) +zeros_ = nn.initializer.Constant(value=0.0) +ones_ = nn.initializer.Constant(value=1.0) + + +class Identity(nn.Layer): + """ Identity layer + + The output of this layer is the input without any change. + Use this layer to avoid if condition in some forward methods + + """ + def __init__(self): + super().__init__() + + def forward(self, x): + return x + + +class Mlp(nn.Layer): + """ MLP module + + Impl using nn.Linear and activation is GELU, dropout is applied. 
+ Ops: fc -> act -> dropout -> fc -> dropout + + Attributes: + fc1: nn.Linear + fc2: nn.Linear + act: GELU + dropout1: dropout after fc1 + dropout2: dropout after fc2 + """ + + def __init__(self, + in_features, + hidden_features=None, + out_features=None, + act_layer=nn.GELU, + drop=0.0): + super().__init__() + out_features = out_features or in_features + hidden_features = hidden_features or in_features + + self.fc1 = nn.Linear(in_features, hidden_features) + self.act = act_layer() + self.drop1 = nn.Dropout(drop) + self.fc2 = nn.Linear(hidden_features, out_features) + self.drop2 = nn.Dropout(drop) + + def forward(self, x): + x = self.fc1(x) + x = self.act(x) + x = self.drop1(x) + x = self.fc2(x) + x = self.drop2(x) + return x + + +class Attention(nn.Layer): + def __init__(self, dim, num_heads=8, qkv_bias=False, attn_drop=0.0, proj_drop=0.0): + super().__init__() + self.num_heads = num_heads + head_dim = dim // num_heads + self.scale = head_dim ** -0.5 + + self.qkv = nn.Linear(dim, dim * 3, bias_attr=qkv_bias) + self.attn_drop = nn.Dropout(attn_drop) + self.proj = nn.Linear(dim, dim) + self.proj_drop = nn.Dropout(proj_drop) + + def forward(self, x): + B, N, C = x.shape + qkv = ( + self.qkv(x) + .reshape([B, N, 3, self.num_heads, C // self.num_heads]) + .transpose([2, 0, 3, 1, 4]) + ) + q, k, v = qkv[0], qkv[1], qkv[2] + + attn = (q @ k.transpose([0, 1, 3, 2])) * self.scale + attn = F.softmax(attn, axis=-1) + attn = self.attn_drop(attn) + + x = (attn @ v).transpose([0, 2, 1, 3]).reshape([B, N, C]) + x = self.proj(x) + x = self.proj_drop(x) + return x + + +class TransformerBlock(nn.Layer): + def __init__(self, + dim, + num_heads, + mlp_ratio=4.0, + qkv_bias=False, + drop=0.0, + attn_drop=0.0, + drop_path=0.0, + act_layer=nn.GELU, + norm_layer=nn.LayerNorm): + super().__init__() + self.norm1 = norm_layer(dim) + self.attn = Attention( + dim, + num_heads=num_heads, + qkv_bias=qkv_bias, + attn_drop=attn_drop, + proj_drop=drop, + ) + self.drop_path = DropPath(drop_path) if drop_path > 0.0 else Identity() + self.norm2 = norm_layer(dim) + mlp_hidden_dim = int(dim * mlp_ratio) + self.mlp = Mlp( + in_features=dim, + hidden_features=mlp_hidden_dim, + act_layer=act_layer, + drop=drop, + ) + + def forward(self, x): + x = x + self.drop_path(self.attn(self.norm1(x))) + x = x + self.drop_path(self.mlp(self.norm2(x))) + return x + + +class Transformer(nn.Layer): + def __init__(self, + base_dim, + depth, + heads, + mlp_ratio, + drop_rate=0.0, + attn_drop_rate=0.0, + drop_path_prob=None): + super().__init__() + self.layers = nn.LayerList([]) + embed_dim = base_dim * heads + + if drop_path_prob is None: + drop_path_prob = [0.0 for _ in range(depth)] + + self.blocks = nn.LayerList( + [ + TransformerBlock( + dim=embed_dim, + num_heads=heads, + mlp_ratio=mlp_ratio, + qkv_bias=True, + drop=drop_rate, + attn_drop=attn_drop_rate, + drop_path=drop_path_prob[i], + norm_layer=partial(nn.LayerNorm, epsilon=1e-6), + ) + for i in range(depth) + ] + ) + + def forward(self, x, cls_tokens): + b, c, h, w = x.shape + # x = rearrange(x, 'b c h w -> b (h w) c') + x = x.transpose([0, 2, 3, 1]).reshape([b, h * w, c]) + + token_length = cls_tokens.shape[1] + x = paddle.concat([cls_tokens, x], axis=1) + for blk in self.blocks: + x = blk(x) + + cls_tokens = x[:, :token_length] + x = x[:, token_length:] + x = x.transpose([0, 2, 1]).reshape([b, c, h, w]) + + return x, cls_tokens + + +class ConvHeadPooling(nn.Layer): + def __init__(self, in_feature, out_feature, stride, padding_mode="zeros"): + super().__init__() + + self.conv = 
nn.Conv2D( + in_feature, + out_feature, + kernel_size=stride + 1, + padding=stride // 2, + stride=stride, + padding_mode=padding_mode, + groups=in_feature, + ) + self.fc = nn.Linear(in_feature, out_feature) + + def forward(self, x, cls_token): + x = self.conv(x) + cls_token = self.fc(cls_token) + return x, cls_token + + +class ConvEmbedding(nn.Layer): + def __init__(self, in_channels, out_channels, patch_size, stride, padding): + super().__init__() + self.conv = nn.Conv2D( + in_channels, + out_channels, + kernel_size=patch_size, + stride=stride, + padding=padding, + bias_attr=True, + ) + + def forward(self, x): + x = self.conv(x) + return x + + +class PoolingTransformer(nn.Layer): + def __init__(self, + image_size, + patch_size, + stride, + base_dims, + depth, + heads, + mlp_ratio=4, + num_classes=1000, + in_chans=3, + attn_drop_rate=0.0, + drop_rate=0.0, + drop_path_rate=0.0): + super().__init__() + total_block = sum(depth) + padding = 0 + block_idx = 0 + + width = math.floor((image_size + 2 * padding - patch_size) / stride + 1) + + self.base_dims = base_dims + self.heads = heads + self.num_classes = num_classes + self.depth = depth + + self.patch_size = patch_size + + self.pos_embed = paddle.create_parameter( + shape=[1, base_dims[0] * heads[0], width, width], + dtype="float32", + default_initializer=trunc_normal_, + ) + + self.patch_embed = ConvEmbedding( + in_chans, base_dims[0] * heads[0], patch_size, stride, padding + ) + + self.cls_token = paddle.create_parameter( + shape=[1, 1, base_dims[0] * heads[0]], + dtype="float32", + default_initializer=trunc_normal_, + ) + + self.pos_drop = nn.Dropout(p=drop_rate) + + self.transformers = nn.LayerList([]) + self.pools = nn.LayerList([]) + + for stage, stage_depth in enumerate(self.depth): + drop_path_prob = [ + drop_path_rate * i / total_block + for i in range(block_idx, block_idx + stage_depth) + ] + block_idx += stage_depth + + self.transformers.append( + Transformer( + base_dims[stage], + stage_depth, + heads[stage], + mlp_ratio, + drop_rate, + attn_drop_rate, + drop_path_prob, + ) + ) + if stage < len(heads) - 1: + self.pools.append( + ConvHeadPooling( + base_dims[stage] * heads[stage], + base_dims[stage + 1] * heads[stage + 1], + stride=2, + ) + ) + + self.norm = nn.LayerNorm(base_dims[-1] * heads[-1], epsilon=1e-6) + self.embed_dim = base_dims[-1] * heads[-1] + + # Classifier head + if num_classes > 0: + self.head = nn.Linear(base_dims[-1] * heads[-1], num_classes) + else: + self.head = nn.Identity() + + self.apply(self._init_weights) + + def _init_weights(self, m): + if isinstance(m, nn.LayerNorm): + zeros_(m.bias) + ones_(m.weight) + + def forward_features(self, x): + x = self.patch_embed(x) + + pos_embed = self.pos_embed + x = self.pos_drop(x + pos_embed) + cls_tokens = self.cls_token.expand([x.shape[0], -1, -1]) + + for stage, pool_layer in enumerate(self.pools): + x, cls_tokens = self.transformers[stage](x, cls_tokens) + x, cls_tokens = pool_layer(x, cls_tokens) + x, cls_tokens = self.transformers[-1](x, cls_tokens) + + cls_tokens = self.norm(cls_tokens) + + return cls_tokens + + def forward(self, x): + cls_token = self.forward_features(x) + cls_token = self.head(cls_token[:, 0]) + return cls_token + + +class DistilledPoolingTransformer(PoolingTransformer): + def __init__(self, *args, **kwargs): + super().__init__(*args, **kwargs) + + self.cls_token = paddle.create_parameter( + shape=[1, 2, self.base_dims[0] * self.heads[0]], + dtype="float32", + default_initializer=trunc_normal_, + ) + + if self.num_classes > 0: + self.head_dist 
= nn.Linear( + self.base_dims[-1] * self.heads[-1], self.num_classes + ) + else: + self.head_dist = Identity() + + self.head_dist.apply(self._init_weights) + + def forward(self, x): + cls_token = self.forward_features(x) + x_cls = self.head(cls_token[:, 0]) + x_dist = self.head_dist(cls_token[:, 1]) + if self.training: + return x_cls, x_dist + return (x_cls + x_dist) / 2 + + +def build_pit(config): + if config.MODEL.DISTILL: + model = DistilledPoolingTransformer( + image_size=config.DATA.IMAGE_SIZE, + num_classes=config.MODEL.NUM_CLASSES, + patch_size=config.MODEL.TRANS.PATCH_SIZE, + stride=config.MODEL.TRANS.STRIDE, + base_dims=config.MODEL.TRANS.BASE_DIMS, + depth=config.MODEL.TRANS.DEPTH, + heads=config.MODEL.TRANS.HEADS, + ) + + else: + model = PoolingTransformer( + image_size=config.DATA.IMAGE_SIZE, + num_classes=config.MODEL.NUM_CLASSES, + patch_size=config.MODEL.TRANS.PATCH_SIZE, + stride=config.MODEL.TRANS.STRIDE, + base_dims=config.MODEL.TRANS.BASE_DIMS, + depth=config.MODEL.TRANS.DEPTH, + heads=config.MODEL.TRANS.HEADS, + ) + + return model diff --git a/image_classification/PiT/random_erasing.py b/image_classification/PiT/random_erasing.py new file mode 100644 index 00000000..80d31dd8 --- /dev/null +++ b/image_classification/PiT/random_erasing.py @@ -0,0 +1,118 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Random Erasing for image tensor""" + +import random +import math +import paddle + + +def _get_pixels(per_pixel, rand_color, patch_size, dtype="float32"): + if per_pixel: + return paddle.normal(shape=patch_size).astype(dtype) + if rand_color: + return paddle.normal(shape=(patch_size[0], 1, 1)).astype(dtype) + return paddle.zeros((patch_size[0], 1, 1)).astype(dtype) + + +class RandomErasing(object): + """ + Args: + prob: probability of performing random erasing + min_area: Minimum percentage of erased area wrt input image area + max_area: Maximum percentage of erased area wrt input image area + min_aspect: Minimum aspect ratio of earsed area + max_aspect: Maximum aspect ratio of earsed area + mode: pixel color mode, in ['const', 'rand', 'pixel'] + 'const' - erase block is constant valued 0 for all channels + 'rand' - erase block is valued random color (same per-channel) + 'pixel' - erase block is vauled random color per pixel + min_count: Minimum # of ereasing blocks per image. + max_count: Maximum # of ereasing blocks per image. 
Area per box is scaled by count + per-image count is randomly chosen between min_count to max_count + """ + def __init__(self, prob=0.5, min_area=0.02, max_area=1/3, min_aspect=0.3, max_aspect=None, + mode='const', min_count=1, max_count=None, num_splits=0): + self.prob = prob + self.min_area = min_area + self.max_area = max_area + max_aspect = max_aspect or 1 / min_aspect + self.log_aspect_ratio = (math.log(min_aspect), math.log(max_aspect)) + self.min_count = min_count + self.max_count = max_count or min_count + self.num_splits = num_splits + mode = mode.lower() + self.rand_color = False + self.per_pixel = False + if mode == "rand": + self.rand_color = True + elif mode == "pixel": + self.per_pixel = True + else: + assert not mode or mode == "const" + + def _erase(self, img, chan, img_h, img_w, dtype): + if random.random() > self.prob: + return + area = img_h * img_w + count = self.min_count if self.min_count == self.max_count else \ + random.randint(self.min_count, self.max_count) + for _ in range(count): + for attempt in range(10): + target_area = random.uniform(self.min_area, self.max_area) * area / count + aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio)) + h = int(round(math.sqrt(target_area * aspect_ratio))) + w = int(round(math.sqrt(target_area / aspect_ratio))) + if w < img_w and h < img_h: + top = random.randint(0, img_h - h) + left = random.randint(0, img_w - w) + img[:, top:top+h, left:left+w] = _get_pixels( + self.per_pixel, self.rand_color, (chan, h, w), + dtype=dtype) + break + + def __call__(self, inputs): + if len(inputs.shape) == 3: + self._erase(inputs, *inputs.shape, inputs.dtype) + else: + batch_size, chan, img_h, img_w = inputs.shape + batch_start = batch_size // self.num_splits if self.num_splits > 1 else 0 + for i in range(batch_start, batch_size): + self._erase(inputs[i], chan, img_h, img_w, inputs.dtype) + return inputs + + + +#def main(): +# re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='rand') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='const') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='pixel') +# import PIL.Image as Image +# import numpy as np +# paddle.set_device('cpu') +# img = paddle.to_tensor(np.asarray(Image.open('./lenna.png'))).astype('float32') +# img = img / 255.0 +# img = paddle.transpose(img, [2, 0, 1]) +# new_img = re(img) +# new_img = new_img * 255.0 +# new_img = paddle.transpose(new_img, [1, 2, 0]) +# new_img = new_img.cpu().numpy() +# new_img = Image.fromarray(new_img.astype('uint8')) +# new_img.save('./res.png') +# +# +# +#if __name__ == "__main__": +# main() diff --git a/image_classification/PiT/regnet.py b/image_classification/PiT/regnet.py new file mode 100644 index 00000000..13e1be06 --- /dev/null +++ b/image_classification/PiT/regnet.py @@ -0,0 +1,261 @@ +import numpy as np +import copy +import paddle +import paddle.nn as nn + +#RegNet y-160 +#This is a simple version of regnet which only implements RegNetY-160. +#This model is used as the teacher model for DeiT. 
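+# Example usage as a distillation teacher (illustrative sketch only; the
+# variable name `images` is a placeholder for a batch of input tensors):
+#     teacher = build_regnet()   # build_regnet() is defined at the bottom of this file
+#     teacher.eval()
+#     with paddle.no_grad():
+#         teacher_logits = teacher(images)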
+ +class Identity(nn.Layer): + """ Identity Layer """ + def __init__(self): + super().__init__() + + def forward(self, x): + return x + + +class SE(nn.Layer): + """ Squeeze and Excitation module""" + def __init__(self, in_channels, rd_channels, se_ratio=.25): + super().__init__() + if rd_channels is None: + out_channels = int(in_channels * se_ratio) + else: + out_channels = rd_channels + self.avgpool = nn.AdaptiveAvgPool2D(output_size=1) + self.conv1_1x1 = nn.Conv2D(in_channels, out_channels, kernel_size=1) + self.conv2_1x1 = nn.Conv2D(out_channels, in_channels, kernel_size=1) + self.relu = nn.ReLU() + self.sigmoid = nn.Sigmoid() + + def forward(self, x): + out = self.avgpool(x) + out = self.conv1_1x1(out) + out = self.relu(out) + out = self.conv2_1x1(out) + out = self.sigmoid(out) + out = x * out + return out + + +class Downsample(nn.Layer): + """Downsample for 1st bottleneck block in every layer in RegNet""" + def __init__(self, in_channels, out_channels, stride): + super().__init__() + self.conv1x1 = nn.Conv2D(in_channels, + out_channels, + kernel_size=1, + stride=stride, + bias_attr=False) + self.bn = nn.BatchNorm2D(out_channels) + + def forward(self, x): + out = self.conv1x1(x) + out = self.bn(out) + return out + + +class Bottleneck(nn.Layer): + """Bottleneck residual block in Stage""" + def __init__(self, + in_channels, + out_channels, + bottleneck_ratio=1, + group_width=1, + stride=1, + dilation=1, + se_ratio=0.25): + super().__init__() + # 1x1 bottleneck conv block + bottleneck_channels = int(round(out_channels * bottleneck_ratio)) + self.conv1 = nn.Conv2D(in_channels, bottleneck_channels, 1, bias_attr=False) + self.bn1 = nn.BatchNorm2D(bottleneck_channels) + # 3x3 conv block with group conv + groups = bottleneck_channels // group_width + self.conv2 = nn.Conv2D(bottleneck_channels, + bottleneck_channels, + kernel_size=3, + stride=stride, + dilation=dilation, + padding=1, + groups=groups, + bias_attr=False) + self.bn2 = nn.BatchNorm2D(bottleneck_channels) + # SE modual + if se_ratio: + self.se = SE(bottleneck_channels, rd_channels=int(round(in_channels * se_ratio))) + else: + se_ratio = Identity() + # downsample if stride = 2 + if stride != 1 or in_channels != out_channels: + self.downsample = Downsample(in_channels, out_channels, stride) + else: + self.downsample = Identity() + # 1x1 conv block + self.conv3 = nn.Conv2D(bottleneck_channels, + out_channels, + kernel_size=1) + self.bn3 = nn.BatchNorm2D(out_channels) + self.relu = nn.ReLU() + + def forward(self, x): + h = x + out = self.conv1(x) + out = self.bn1(out) + out = self.relu(out) + + out = self.conv2(out) + out = self.bn2(out) + out = self.relu(out) + + out = self.se(out) + + out = self.conv3(out) + out = self.bn3(out) + + h = self.downsample(h) + + out = out + h + out = self.relu(out) + return out + + +class RegStage(nn.Layer): + """ Sequence of blocks with the same output shape""" + def __init__(self, + in_channels, + out_channels, + depth, + bottleneck_ratio, + group_width, + se_ratio=0.25): + super().__init__() + + self.blocks = nn.LayerList() + for i in range(depth): + block_stride = 2 if i == 0 else 1 + block_in_channels = in_channels if i == 0 else out_channels + self.blocks.append( + copy.deepcopy(Bottleneck(block_in_channels, + out_channels, + bottleneck_ratio, + group_width, + block_stride, + se_ratio=se_ratio))) + + def forward(self, x): + for block in self.blocks: + x = block(x) + return x + + +class RegNet(nn.Layer): + """RegNet Model""" + def __init__(self, cfg): + super().__init__() + num_classes = 
cfg['num_classes'] + stem_width = cfg['stem_width'] + + # Stem layers + self.stem = nn.Sequential( + nn.Conv2D(in_channels=3, + out_channels=stem_width, + kernel_size=3, + stride=2, + padding=1, + bias_attr=False), + nn.BatchNorm2D(stem_width), + nn.ReLU()) + # RegStages + self.stages = nn.LayerList() + prev_width = stem_width + curr_stride = 2 + stage_params = self._get_stage_params(cfg) + for i, stage_param in enumerate(stage_params): + self.stages.append( + copy.deepcopy(RegStage(in_channels=prev_width, + out_channels=stage_param['out_channels'], + depth=stage_param['depth'], + bottleneck_ratio=stage_param['bottle_ratio'], + group_width=stage_param['group_width'], + se_ratio=stage_param['se_ratio']))) + prev_width = stage_param['out_channels'] + # Head + num_features = prev_width + self.head = nn.Sequential(nn.AdaptiveAvgPool2D(output_size=1), + nn.Flatten(), + nn.Linear(num_features, num_classes)) + + def _get_stage_params(self, cfg): + w_init = cfg['w0'] + w_slope = cfg['wa'] + w_mult = cfg['wm'] + depth = cfg['depth'] + se_ratio = cfg['se_ratio'] + group_w = cfg['group_w'] + bottle_ratio = cfg['bottle_ratio'] + + w, d = self._generate_regnet(w_slope, w_init, w_mult, depth, bottle_ratio, group_w) + + num_stages = len(w) + stage_widths = w + stage_depths = d + stage_bottle_ratios = [bottle_ratio for _ in range(num_stages)] + stage_groups = [group_w for _ in range(num_stages)] + se_ratios = [se_ratio for _ in range(num_stages)] + param_names = ['out_channels', 'depth', 'bottle_ratio', 'group_width','se_ratio'] + stage_params = [ + dict(zip(param_names, params)) for params in zip(stage_widths, + stage_depths, + stage_bottle_ratios, + stage_groups, + se_ratios)] + return stage_params + + def _generate_regnet(self, w_slope, w_init, w_mult, depth, b=1, g=8): + """Generate per block widths from RegNet parameters""" + w_count = w_init + w_slope * np.arange(depth) # Equation 1 + w_exps = np.round(np.log(w_count / w_init) / np.log(w_mult)) # Equation 2 + + w = w_init * np.power(w_mult, w_exps) # Equation 3 + w = np.round(np.divide(w, 8)) * 8 # make all width list divisible by 8 + + w, d = np.unique(w.astype(int), return_counts=True) # find depth and width list + + gtemp = np.minimum(g, w//b) + w = (np.round(w // b / gtemp) * gtemp).astype(int) # width + + return w, d + + def forward_features(self, x): + x = self.stem(x) + for stage in self.stages: + x = stage(x) + return x + + def forward(self, x): + x = self.forward_features(x) + x = self.head(x) + return x + + + +def build_regnet(): + """build regnet model using dict as config""" + regnety_160 = { + 'stem_width': 32, + 'bottle_ratio': 1.0, + 'w0': 200, + 'wa': 106.23, + 'wm': 2.48, + 'group_w': 112, + 'depth': 18, + 'se_ratio': 0.25, + 'num_classes': 1000, + 'pool_size': (7, 7), + 'crop_pct': 0.875, + } + model = RegNet(regnety_160) + return model diff --git a/image_classification/PiT/run_eval.sh b/image_classification/PiT/run_eval.sh new file mode 100644 index 00000000..148f121b --- /dev/null +++ b/image_classification/PiT/run_eval.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg='./configs/pit_ti_distill.yaml' \ + -dataset='imagenet2012' \ + -batch_size=16 \ + -data_path='/dataset/imagenet' \ + -eval \ + -pretrained='./pit_ti_distill' diff --git a/image_classification/PiT/run_eval_multi.sh b/image_classification/PiT/run_eval_multi.sh new file mode 100644 index 00000000..0421af4c --- /dev/null +++ b/image_classification/PiT/run_eval_multi.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python 
main_multi_gpu.py \ + -cfg='./configs/pit_xs.yaml' \ + -dataset='imagenet2012' \ + -batch_size=64 \ + -data_path='/dataset/imagenet' \ + -eval \ + -pretrained='./pit_xs' diff --git a/image_classification/PiT/run_train.sh b/image_classification/PiT/run_train.sh new file mode 100644 index 00000000..5e1d2aba --- /dev/null +++ b/image_classification/PiT/run_train.sh @@ -0,0 +1,6 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg='./configs/pit_ti_distill.yaml' \ + -dataset='imagenet2012' \ + -batch_size=16 \ + -data_path='/dataset/imagenet' \ diff --git a/image_classification/PiT/stat_define.py b/image_classification/PiT/stat_define.py new file mode 100644 index 00000000..9f5333a5 --- /dev/null +++ b/image_classification/PiT/stat_define.py @@ -0,0 +1,62 @@ +import os +import glob +import paddle +from config import get_config +from pit import build_pit as build_model + +def count_gelu(layer, inputs, output): + activation_flops = 8 + x = inputs[0] + num = x.numel() + layer.total_ops += num * activation_flops + + +def count_softmax(layer, inputs, output): + softmax_flops = 5 # max/substract, exp, sum, divide + x = inputs[0] + num = x.numel() + layer.total_ops += num * softmax_flops + + +def count_layernorm(layer, inputs, output): + layer_norm_flops = 5 # get mean (sum), get variance (square and sum), scale(multiply) + x = inputs[0] + num = x.numel() + layer.total_ops += num * layer_norm_flops + + +#cfg = './configs/xcit_nano_12_p8_224.yaml' +#input_size = (1, 3, 224, 224) +#cfg = './configs/xcit_large_24_p16_384.yaml' +#input_size = (1, 3, 384, 384) +#config = get_config(cfg) +#model = build_model(config) + +#custom_ops = {paddle.nn.GELU: count_gelu, +# paddle.nn.LayerNorm: count_layernorm, +# paddle.nn.Softmax: count_softmax, +# } +#print(os.path.basename(cfg)) +#paddle.flops(model, +# input_size=input_size, +# custom_ops=custom_ops, +# print_detail=False) + + +for cfg in glob.glob('./configs/*.yaml'): + #cfg = './configs/swin_base_patch4_window7_224.yaml' + input_size = (1, 3, 224, 224) + config = get_config(cfg) + model = build_model(config) + + + custom_ops = {paddle.nn.GELU: count_gelu, + paddle.nn.LayerNorm: count_layernorm, + paddle.nn.Softmax: count_softmax, + } + print(os.path.basename(cfg)) + paddle.flops(model, + input_size=input_size, + custom_ops=custom_ops, + print_detail=False) + print('-----------') diff --git a/image_classification/PiT/transforms.py b/image_classification/PiT/transforms.py new file mode 100644 index 00000000..5a046912 --- /dev/null +++ b/image_classification/PiT/transforms.py @@ -0,0 +1,14 @@ +import random +import paddle +import paddle.nn +import paddle.vision.transforms as T + + +class RandomHorizontalFlip(): + def __init__(self, p=0.5): + self.p = p + + def __call__(self, image): + if random.random() < self.p: + return T.hflip(image) + return image diff --git a/image_classification/PiT/utils.py b/image_classification/PiT/utils.py new file mode 100644 index 00000000..44800527 --- /dev/null +++ b/image_classification/PiT/utils.py @@ -0,0 +1,120 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +"""utils for ViT + +Contains AverageMeter for monitoring, get_exclude_from_decay_fn for training +and WarmupCosineScheduler for training + +""" + +import math +from paddle.optimizer.lr import LRScheduler + + +class AverageMeter(): + """ Meter for monitoring losses""" + def __init__(self): + self.avg = 0 + self.sum = 0 + self.cnt = 0 + self.reset() + + def reset(self): + """reset all values to zeros""" + self.avg = 0 + self.sum = 0 + self.cnt = 0 + + def update(self, val, n=1): + """update avg by val and n, where val is the avg of n values""" + self.sum += val * n + self.cnt += n + self.avg = self.sum / self.cnt + + + +def get_exclude_from_weight_decay_fn(exclude_list=[]): + """ Set params with no weight decay during the training + + For certain params, e.g., positional encoding in ViT, weight decay + may not needed during the learning, this method is used to find + these params. + + Args: + exclude_list: a list of params names which need to exclude + from weight decay. + Returns: + exclude_from_weight_decay_fn: a function returns True if param + will be excluded from weight decay + """ + if len(exclude_list) == 0: + exclude_from_weight_decay_fn = None + else: + def exclude_fn(param): + for name in exclude_list: + if param.endswith(name): + return False + return True + exclude_from_weight_decay_fn = exclude_fn + return exclude_from_weight_decay_fn + + +class WarmupCosineScheduler(LRScheduler): + """Warmup Cosine Scheduler + + First apply linear warmup, then apply cosine decay schedule. + Linearly increase learning rate from "warmup_start_lr" to "start_lr" over "warmup_epochs" + Cosinely decrease learning rate from "start_lr" to "end_lr" over remaining + "total_epochs - warmup_epochs" + + Attributes: + learning_rate: the starting learning rate (without warmup), not used here! + warmup_start_lr: warmup starting learning rate + start_lr: the starting learning rate (without warmup) + end_lr: the ending learning rate after whole loop + warmup_epochs: # of epochs for warmup + total_epochs: # of total epochs (include warmup) + """ + def __init__(self, + learning_rate, + warmup_start_lr, + start_lr, + end_lr, + warmup_epochs, + total_epochs, + cycles=0.5, + last_epoch=-1, + verbose=False): + """init WarmupCosineScheduler """ + self.warmup_epochs = warmup_epochs + self.total_epochs = total_epochs + self.warmup_start_lr = warmup_start_lr + self.start_lr = start_lr + self.end_lr = end_lr + self.cycles = cycles + super(WarmupCosineScheduler, self).__init__(learning_rate, last_epoch, verbose) + + def get_lr(self): + """ return lr value """ + if self.last_epoch < self.warmup_epochs: + val = (self.start_lr - self.warmup_start_lr) * float( + self.last_epoch)/float(self.warmup_epochs) + self.warmup_start_lr + return val + + progress = float(self.last_epoch - self.warmup_epochs) / float( + max(1, self.total_epochs - self.warmup_epochs)) + val = max(0.0, 0.5 * (1. 
+ math.cos(math.pi * float(self.cycles) * 2.0 * progress))) + val = max(0.0, val * (self.start_lr - self.end_lr) + self.end_lr) + return val diff --git a/image_classification/PoolFormer/README.md b/image_classification/PoolFormer/README.md new file mode 100644 index 00000000..8b4676d7 --- /dev/null +++ b/image_classification/PoolFormer/README.md @@ -0,0 +1,171 @@ +# PoolFormer: MetaFormer is Actually What You Need for Vision, [arxiv](https://arxiv.org/abs/2111.11418) + +PaddlePaddle training/validation code and pretrained models for **PoolFormer**. + +The official PyTorch implementation is [here](https://github.com/sail-sg/poolformer). + +This implementation is developed by [PaddleViT](https://github.com/BR-IDL/PaddleViT.git). + +

+<!-- model overview figure -->
+<p align="center">PoolFormer Model Overview</p>
+ + + +### Update +- Update (2021-12-15): Code and weights are updated. +- Update (2021-12-10): Code is released and ported weights are uploaded. + +## Models Zoo +| Model | Acc@1 | Acc@5 | #Params | FLOPs | Image Size | Crop_pct | Interpolation | Link | +|----------------|-------|-------|---------|--------|------------|----------|---------------|--------------| +| poolformer_s12 | 77.24 | 93.51 | 11.9M | 1.8G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/15EBfTTU6coLCsDNiLgAWYiWeMpp3uYH4/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1n6TUxQGlssTu4lyLrBOXEw)(zcv4) | +| poolformer_s24 | 80.33 | 95.05 | 21.3M | 3.4G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1JxqJluDpp1wwe7XtpTi1aWaVvlq0Q3xF/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1d2uyHB5R6ZWPzXWhdtm6fw)(nedr) | +| poolformer_s36 | 81.43 | 95.45 | 30.8M | 5.0G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1ka3VeupDRFBSzzrcw4wHXKGqoKv6sB_Y/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1de6ZJkmYEmVI7zKUCMB_xw)(fvpm) | +| poolformer_m36 | 82.11 | 95.69 | 56.1M | 8.9G | 224 | 0.95 | bicubic | [google](https://drive.google.com/file/d/1LTZ8wNRb_GSrJ9H3qt5-iGiGlwa4dGAK/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1qNTYLw4vyuoH1EKDXEcSvw)(whfp) | +| poolformer_m48 | 82.46 | 95.96 | 73.4M | 11.8G | 224 | 0.95 | bicubic | [google](https://drive.google.com/file/d/1YhXEVjWtI4bZB_Qwama8G4RBanq2K15L/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1VJXANTseTUEA0E6HYf-XyA)(374f) | + +> *The results are evaluated on ImageNet2012 validation set. + +## Notebooks +We provide a few notebooks in aistudio to help you get started: + +**\*(coming soon)\*** + + +## Requirements +- Python>=3.6 +- yaml>=0.2.5 +- [PaddlePaddle](https://www.paddlepaddle.org.cn/documentation/docs/en/install/index_en.html)>=2.1.0 +- [yacs](https://github.com/rbgirshick/yacs)>=0.1.8 + +## Data +ImageNet2012 dataset is used in the following folder structure: +``` +│imagenet/ +├──train/ +│ ├── n01440764 +│ │ ├── n01440764_10026.JPEG +│ │ ├── n01440764_10027.JPEG +│ │ ├── ...... +│ ├── ...... +├──val/ +│ ├── n01440764 +│ │ ├── ILSVRC2012_val_00000293.JPEG +│ │ ├── ILSVRC2012_val_00002138.JPEG +│ │ ├── ...... +│ ├── ...... +``` + +## Usage +To use the model with pretrained weights, download the `.pdparam` weight file and change related file paths in the following python scripts. The model config files are located in `./configs/`. + +For example, assume the downloaded weight file is stored in `./poolformer_s12.pdparams`, to use the `poolformer_s12` model in python: +```python +from config import get_config +from poolformer import build_poolformer as build_model +# config files in ./configs/ +config = get_config('./configs/poolformer_s12.yaml') +# build model +model = build_model(config) +# load pretrained weights, .pdparams is NOT needed +model_state_dict = paddle.load('./poolformer_s12') +model.set_dict(model_state_dict) +``` + +## Evaluation +To evaluate PoolFormer model performance on ImageNet2012 with a single GPU, run the following script using command line: +```shell +sh run_eval.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg='./configs/poolformer_s12.yaml' \ + -dataset='imagenet2012' \ + -batch_size=16 \ + -data_path='/dataset/imagenet' \ + -eval \ + -pretrained='./poolformer_s12' +``` + +
+ + +Run evaluation using multi-GPUs: + + + +```shell +sh run_eval_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg='./configs/poolformer_s12.yaml' \ + -dataset='imagenet2012' \ + -batch_size=16 \ + -data_path='/dataset/imagenet' \ + -eval \ + -pretrained='./poolformer_s12' +``` + +
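+You can also sanity-check a loaded model on a single image before running the full evaluation above. The snippet below is only an illustrative sketch: `demo.jpg` is a hypothetical image path, and the resize/crop/normalization values simply mirror the `poolformer_s12` defaults (`IMAGE_SIZE=224`, `CROP_PCT=0.9`, ImageNet mean/std) from `config.py` and `datasets.py`:
+
+```python
+import paddle
+from PIL import Image
+from paddle.vision import transforms
+from config import get_config
+from poolformer import build_poolformer as build_model
+
+config = get_config('./configs/poolformer_s12.yaml')
+model = build_model(config)
+model.set_dict(paddle.load('./poolformer_s12'))  # same loading convention as the Usage section above
+model.eval()
+
+# preprocessing mirrors get_val_transforms() in datasets.py for IMAGE_SIZE=224, CROP_PCT=0.9
+val_transforms = transforms.Compose([
+    transforms.Resize(int(224 / 0.9), 'bicubic'),
+    transforms.CenterCrop((224, 224)),
+    transforms.ToTensor(),
+    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
+])
+
+image = val_transforms(Image.open('demo.jpg').convert('RGB'))  # hypothetical example image
+with paddle.no_grad():
+    logits = model(image.unsqueeze(0))
+print(paddle.argmax(logits, axis=-1).numpy())  # predicted ImageNet class index
+```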
+ + +## Training +To train the PoolFormer model on ImageNet2012 with single GPU, run the following script using command line: +```shell +sh run_train.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg='./configs/poolformer_s12.yaml' \ + -dataset='imagenet2012' \ + -batch_size=32 \ + -data_path='/dataset/imagenet' \ +``` + + +
+ + +Run training using multi-GPUs: + + + +```shell +sh run_train_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg='./configs/poolformer_s12.yaml' \ + -dataset='imagenet2012' \ + -batch_size=16 \ + -data_path='/dataset/imagenet' \ +``` + +
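+The optimizer defaults for training are defined in `config.py` (AdamW, `BASE_LR=4e-3`, 5 warmup epochs, 300 epochs in total, `LINEAR_SCALED_LR=1024`). Assuming the training scripts apply the usual linear learning-rate scaling implied by `LINEAR_SCALED_LR` (this scaling rule is an assumption here, so treat the numbers as a rough guide), the effective learning rate grows with the total batch size roughly as in this sketch:
+
+```python
+# illustrative sketch of the linear LR scaling implied by TRAIN.LINEAR_SCALED_LR in config.py
+base_lr = 4e-3            # config.TRAIN.BASE_LR
+per_gpu_batch_size = 16   # example value, matching the multi-GPU command above
+num_gpus = 4
+total_batch_size = per_gpu_batch_size * num_gpus
+scaled_lr = base_lr * total_batch_size / 1024   # config.TRAIN.LINEAR_SCALED_LR
+print(scaled_lr)          # 0.00025 for this example
+```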
+ + + +## Visualization Attention Map +**(coming soon)** + +## Reference +``` +@article{yu2021metaformer, + title={MetaFormer is Actually What You Need for Vision}, + author={Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng}, + journal={arXiv preprint arXiv:2111.11418}, + year={2021} +} +``` + diff --git a/image_classification/PoolFormer/__init__.py b/image_classification/PoolFormer/__init__.py new file mode 100644 index 00000000..e2cbd538 --- /dev/null +++ b/image_classification/PoolFormer/__init__.py @@ -0,0 +1 @@ +#init diff --git a/image_classification/PoolFormer/augment.py b/image_classification/PoolFormer/augment.py new file mode 100644 index 00000000..7a7f081c --- /dev/null +++ b/image_classification/PoolFormer/augment.py @@ -0,0 +1,285 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Augmentation""" +""" Rand Augmentation """ +# reference: RandAugment: Practical automated data augmentation with a reduced search space +# https://arxiv.org/abs/1909.13719 + +""" Auto Augmentation """ +# reference: AutoAugment: Learning Augmentation Policies from Data +# https://arxiv.org/abs/1805.09501 + +import random +import numpy as np +from PIL import Image, ImageEnhance, ImageOps + + +def auto_augment_policy_original(): + """25 types of augment policies in original paper""" + policy = [ + [('Posterize', 0.4, 8), ('Rotate', 0.6, 9)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + [('Posterize', 0.6, 7), ('Posterize', 0.6, 6)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Equalize', 0.4, 4), ('Rotate', 0.8, 8)], + [('Solarize', 0.6, 3), ('Equalize', 0.6, 7)], + [('Posterize', 0.8, 5), ('Equalize', 1.0, 2)], + [('Rotate', 0.2, 3), ('Solarize', 0.6, 8)], + [('Equalize', 0.6, 8), ('Posterize', 0.4, 6)], + [('Rotate', 0.8, 8), ('Color', 0.4, 0)], + [('Rotate', 0.4, 9), ('Equalize', 0.6, 2)], + [('Equalize', 0.0, 7), ('Equalize', 0.8, 8)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Rotate', 0.8, 8), ('Color', 1.0, 2)], + [('Color', 0.8, 8), ('Solarize', 0.8, 7)], + [('Sharpness', 0.4, 7), ('Invert', 0.6, 8)], + [('ShearX', 0.6, 5), ('Equalize', 1.0, 9)], + [('Color', 0.4, 0), ('Equalize', 0.6, 3)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + ] + policy = [[SubPolicy(*args) for args in subpolicy] for subpolicy in policy] + return policy + + +def rand_augment_policy_original(magnitude_idx=9): + """ + 14 types of augment policies in original paper + Args: + magnitude_idx: M + """ + policy = [ + ('Posterize', 1, magnitude_idx), ('Rotate', 1, magnitude_idx), + ('Solarize', 1, magnitude_idx), ('AutoContrast', 1, magnitude_idx), + ('Equalize', 1, magnitude_idx), 
('Contrast', 1, magnitude_idx), + ('Color', 1, magnitude_idx), ('Invert', 1, magnitude_idx), + ('Sharpness', 1, magnitude_idx), ('Brightness', 1, magnitude_idx), + ('ShearX', 1, magnitude_idx), ('ShearY', 1, magnitude_idx), + ('TranslateX', 1, magnitude_idx), ('TranslateY', 1, magnitude_idx), + ] + policy = [SubPolicy(*args) for args in policy] + return policy + + +class AutoAugment(): + """Auto Augment + Randomly choose a tuple of augment ops from a list of policy + Then apply the tuple of augment ops to input image + + Examples: + policy = auto_augment_policy_original() + augment = AutoAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy): + self.policy = policy + + def __call__(self, image, policy_idx=None): + if policy_idx is None: + policy_idx = random.randint(0, len(self.policy) - 1) + + sub_policy = self.policy[policy_idx] + for op in sub_policy: + image = op(image) + return image + + +class RandAugment(): + """Rand Augment + Randomly choose N augment ops from a list of K policies + Then apply the N ops to input image + + Examples: + policy = rand_augment_policy_original(magnitude_idx) + augment = RandAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy, num_layers=2): + """ + Args: + policy: list of SubPolicy + num_layers: int + """ + self.policy = policy + self.num_layers = num_layers + + def __call__(self, image): + selected_idx = np.random.choice(len(self.policy), self.num_layers) + + for policy_idx in selected_idx: + sub_policy = self.policy[policy_idx] + image = sub_policy(image) + return image + + +class SubPolicy: + """Subpolicy + Read augment name and magnitude, apply augment with probability + Args: + op_name: str, augment operation name + prob: float, if prob > random prob, apply augment + magnitude_idx: int, index of magnitude in preset magnitude ranges + """ + + def __init__(self, op_name, prob, magnitude_idx): + # ranges of operations' magnitude + ranges = { + 'ShearX': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'ShearY': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'TranslateX': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'TranslateY': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'Rotate': np.linspace(0, 30, 10), # [-30, 30] (by random negative) + 'Color': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Posterize': np.round(np.linspace(8, 4, 10), 0).astype(np.int), # [0, 4] + 'Solarize': np.linspace(256, 0, 10), # [0, 256] + 'Contrast': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Sharpness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Brightness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'AutoContrast': [0] * 10, # no range + 'Equalize': [0] * 10, # no range + 'Invert': [0] * 10, # no range + } + + # augmentation operations + # Lambda is not pickleable for DDP + # image_ops = { + # 'ShearX': lambda image, magnitude: shear_x(image, magnitude), + # 'ShearY': lambda image, magnitude: shear_y(image, magnitude), + # 'TranslateX': lambda image, magnitude: translate_x(image, magnitude), + # 'TranslateY': lambda image, magnitude: translate_y(image, magnitude), + # 'Rotate': lambda image, magnitude: rotate(image, magnitude), + # 'AutoContrast': lambda image, magnitude: auto_contrast(image, magnitude), + # 'Invert': lambda image, magnitude: invert(image, magnitude), + # 'Equalize': lambda image, magnitude: equalize(image, magnitude), + # 'Solarize': 
lambda image, magnitude: solarize(image, magnitude), + # 'Posterize': lambda image, magnitude: posterize(image, magnitude), + # 'Contrast': lambda image, magnitude: contrast(image, magnitude), + # 'Color': lambda image, magnitude: color(image, magnitude), + # 'Brightness': lambda image, magnitude: brightness(image, magnitude), + # 'Sharpness': lambda image, magnitude: sharpness(image, magnitude), + # } + image_ops = { + 'ShearX': shear_x, + 'ShearY': shear_y, + 'TranslateX': translate_x_relative, + 'TranslateY': translate_y_relative, + 'Rotate': rotate, + 'AutoContrast': auto_contrast, + 'Invert': invert, + 'Equalize': equalize, + 'Solarize': solarize, + 'Posterize': posterize, + 'Contrast': contrast, + 'Color': color, + 'Brightness': brightness, + 'Sharpness': sharpness, + } + + self.prob = prob + self.magnitude = ranges[op_name][magnitude_idx] + self.op = image_ops[op_name] + + def __call__(self, image): + if self.prob > random.random(): + image = self.op(image, self.magnitude) + return image + + +# PIL Image transforms +# https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.transform +def shear_x(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, factor, 0, 0, 1, 0), fillcolor=fillcolor) + + +def shear_y(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, factor, 1, 0), fillcolor=fillcolor) + + +def translate_x_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, pixels, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, pixels), fillcolor=fillcolor) + + +def translate_x_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, magnitude, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, magnitude), fillcolor=fillcolor) + + +def rotate(image, magnitude): + rot = image.convert("RGBA").rotate(magnitude) + return Image.composite(rot, + Image.new('RGBA', rot.size, (128,) * 4), + rot).convert(image.mode) + + +def auto_contrast(image, magnitude=None): + return ImageOps.autocontrast(image) + + +def invert(image, magnitude=None): + return ImageOps.invert(image) + + +def equalize(image, magnitude=None): + return ImageOps.equalize(image) + + +def solarize(image, magnitude): + return ImageOps.solarize(image, magnitude) + + +def posterize(image, magnitude): + return ImageOps.posterize(image, magnitude) + + +def contrast(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Contrast(image).enhance(1 + magnitude) + + +def color(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Color(image).enhance(1 + magnitude) + + +def brightness(image, magnitude): + 
magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Brightness(image).enhance(1 + magnitude) + + +def sharpness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Sharpness(image).enhance(1 + magnitude) + diff --git a/image_classification/PoolFormer/config.py b/image_classification/PoolFormer/config.py new file mode 100644 index 00000000..954629c5 --- /dev/null +++ b/image_classification/PoolFormer/config.py @@ -0,0 +1,182 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Configuration + +Configuration for data, model archtecture, and training, etc. +Config can be set by .yaml file or by argparser(limited usage) + + +""" +import os +from yacs.config import CfgNode as CN +import yaml + +_C = CN() +_C.BASE = [''] + +# data settings +_C.DATA = CN() +_C.DATA.BATCH_SIZE = 256 # train batch_size for single GPU +_C.DATA.BATCH_SIZE_EVAL = 8 # val batch_size for single GPU +_C.DATA.DATA_PATH = '/dataset/imagenet/' # path to dataset +_C.DATA.DATASET = 'imagenet2012' # dataset name +_C.DATA.IMAGE_SIZE = 224 # input image size: 224 for pretrain, 384 for finetune +_C.DATA.IMAGE_CHANNELS = 3 # input image channels +_C.DATA.CROP_PCT = 0.875 # input image scale ratio, scale is applied before centercrop in eval mode +_C.DATA.NUM_WORKERS = 2 # number of data loading threads +_C.DATA.IMAGENET_MEAN = [0.485, 0.456, 0.406] # [0.5, 0.5, 0.5] +_C.DATA.IMAGENET_STD = [0.229, 0.224, 0.225] # [0.5, 0.5, 0.5] + +# model settings +_C.MODEL = CN() +_C.MODEL.TYPE = 'PoolFormer' +_C.MODEL.NAME = 'PoolFormer' +_C.MODEL.RESUME = None +_C.MODEL.PRETRAINED = None +_C.MODEL.NUM_CLASSES = 1000 +_C.MODEL.DROPOUT = 0.0 +_C.MODEL.DROPPATH = 0.1 +_C.MODEL.ATTENTION_DROPOUT = 0.0 + +# transformer settings +_C.MODEL.TRANS = CN() +_C.MODEL.TRANS.LAYERS = [2, 2, 6, 2] +_C.MODEL.TRANS.EMBED_DIMS = [64, 128, 320, 512] +_C.MODEL.TRANS.DOWNSAMPLES = [True, True, True, True] +_C.MODEL.TRANS.MLP_RATIOS= [4, 4, 4, 4] +_C.MODEL.TRANS.LAYER_SCALE_INIT_VALUE = 1e-5 + +# training settings +_C.TRAIN = CN() +_C.TRAIN.LAST_EPOCH = 0 +_C.TRAIN.NUM_EPOCHS = 300 +_C.TRAIN.WARMUP_EPOCHS = 5 +_C.TRAIN.WEIGHT_DECAY = 0.05 +_C.TRAIN.BASE_LR = 4e-3 +_C.TRAIN.WARMUP_START_LR = 1e-6 #0.0 +_C.TRAIN.END_LR = 5e-4 +_C.TRAIN.GRAD_CLIP = None +_C.TRAIN.ACCUM_ITER = 1 +_C.TRAIN.MODEL_EMA = False +_C.TRAIN.MODEL_EMA_DECAY = 0.99992 +_C.TRAIN.LINEAR_SCALED_LR = 1024 + +_C.TRAIN.LR_SCHEDULER = CN() +_C.TRAIN.LR_SCHEDULER.NAME = 'warmupcosine' +_C.TRAIN.LR_SCHEDULER.MILESTONES = "30, 60, 90" # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_EPOCHS = 30 # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_RATE = 0.1 # only used in StepLRScheduler + +_C.TRAIN.OPTIMIZER = CN() +_C.TRAIN.OPTIMIZER.NAME = 'AdamW' +_C.TRAIN.OPTIMIZER.EPS = 1e-8 +_C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999) # for adamW +_C.TRAIN.OPTIMIZER.MOMENTUM = 0.9 + +# train augmentation +_C.TRAIN.MIXUP_ALPHA = 0.8 +_C.TRAIN.CUTMIX_ALPHA = 
1.0 +_C.TRAIN.CUTMIX_MINMAX = None +_C.TRAIN.MIXUP_PROB = 1.0 +_C.TRAIN.MIXUP_SWITCH_PROB = 0.5 +_C.TRAIN.MIXUP_MODE = 'batch' + +_C.TRAIN.SMOOTHING = 0.1 +_C.TRAIN.COLOR_JITTER = 0.4 +_C.TRAIN.AUTO_AUGMENT = True #'rand-m9-mstd0.5-inc1' +_C.TRAIN.RAND_AUGMENT = False + +_C.TRAIN.RANDOM_ERASE_PROB = 0.25 +_C.TRAIN.RANDOM_ERASE_MODE = 'pixel' +_C.TRAIN.RANDOM_ERASE_COUNT = 1 +_C.TRAIN.RANDOM_ERASE_SPLIT = False + + +# misc +_C.SAVE = "./output" +_C.TAG = "default" +_C.SAVE_FREQ = 1 # freq to save chpt +_C.REPORT_FREQ = 50 # freq to logging info +_C.VALIDATE_FREQ = 10 # freq to do validation +_C.SEED = 42 +_C.EVAL = False # run evaluation only +_C.AMP = False # mix precision training +_C.LOCAL_RANK = 0 +_C.NGPUS = -1 + + +def _update_config_from_file(config, cfg_file): + config.defrost() + with open(cfg_file, 'r') as infile: + yaml_cfg = yaml.load(infile, Loader=yaml.FullLoader) + for cfg in yaml_cfg.setdefault('BASE', ['']): + if cfg: + _update_config_from_file( + config, os.path.join(os.path.dirname(cfg_file), cfg) + ) + print('merging config from {}'.format(cfg_file)) + config.merge_from_file(cfg_file) + config.freeze() + + +def update_config(config, args): + """Update config by ArgumentParser + Args: + args: ArgumentParser contains options + Return: + config: updated config + """ + if args.cfg: + _update_config_from_file(config, args.cfg) + config.defrost() + if args.dataset: + config.DATA.DATASET = args.dataset + if args.batch_size: + config.DATA.BATCH_SIZE = args.batch_size + if args.image_size: + config.DATA.IMAGE_SIZE = args.image_size + if args.num_classes: + config.MODEL.NUM_CLASSES = args.num_classes + if args.data_path: + config.DATA.DATA_PATH = args.data_path + if args.output is not None: + config.SAVE = args.output + if args.ngpus: + config.NGPUS = args.ngpus + if args.eval: + config.EVAL = True + config.DATA.BATCH_SIZE_EVAL = args.batch_size + if args.pretrained: + config.MODEL.PRETRAINED = args.pretrained + if args.resume: + config.MODEL.RESUME = args.resume + if args.last_epoch: + config.TRAIN.LAST_EPOCH = args.last_epoch + if args.amp: # only during training + if config.EVAL is True: + config.AMP = False + else: + config.AMP = True + + #config.freeze() + return config + + +def get_config(cfg_file=None): + """Return a clone of config or load from yaml file""" + config = _C.clone() + if cfg_file: + _update_config_from_file(config, cfg_file) + return config diff --git a/image_classification/PoolFormer/configs/poolformer_m36.yaml b/image_classification/PoolFormer/configs/poolformer_m36.yaml new file mode 100644 index 00000000..29d6b937 --- /dev/null +++ b/image_classification/PoolFormer/configs/poolformer_m36.yaml @@ -0,0 +1,11 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.95 +MODEL: + TYPE: poolformer + NAME: poolformer_m36 + TRANS: + LAYERS: [6, 6, 18, 6] + EMBED_DIMS: [96, 192, 384, 768] + LAYER_SCALE_INIT_VALUE: 1e-6 + diff --git a/image_classification/PoolFormer/configs/poolformer_m48.yaml b/image_classification/PoolFormer/configs/poolformer_m48.yaml new file mode 100644 index 00000000..967185ba --- /dev/null +++ b/image_classification/PoolFormer/configs/poolformer_m48.yaml @@ -0,0 +1,11 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.95 +MODEL: + TYPE: poolformer + NAME: poolformer_m48 + TRANS: + LAYERS: [8, 8, 24, 8] + EMBED_DIMS: [96, 192, 384, 768] + LAYER_SCALE_INIT_VALUE: 1e-6 + diff --git a/image_classification/PoolFormer/configs/poolformer_s12.yaml b/image_classification/PoolFormer/configs/poolformer_s12.yaml new file mode 100644 index 00000000..fee7de98 --- /dev/null +++ 
b/image_classification/PoolFormer/configs/poolformer_s12.yaml @@ -0,0 +1,10 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.9 +MODEL: + TYPE: poolformer + NAME: poolformer_s12 + TRANS: + LAYERS: [2, 2, 6, 2] + EMBED_DIMS: [64, 128, 320, 512] + diff --git a/image_classification/PoolFormer/configs/poolformer_s24.yaml b/image_classification/PoolFormer/configs/poolformer_s24.yaml new file mode 100644 index 00000000..039b7c4f --- /dev/null +++ b/image_classification/PoolFormer/configs/poolformer_s24.yaml @@ -0,0 +1,10 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.9 +MODEL: + TYPE: poolformer + NAME: poolformer_s24 + TRANS: + LAYERS: [4, 4, 12, 4] + EMBED_DIMS: [64, 128, 320, 512] + diff --git a/image_classification/PoolFormer/configs/poolformer_s36.yaml b/image_classification/PoolFormer/configs/poolformer_s36.yaml new file mode 100644 index 00000000..415cbcb4 --- /dev/null +++ b/image_classification/PoolFormer/configs/poolformer_s36.yaml @@ -0,0 +1,11 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.9 +MODEL: + TYPE: poolformer + NAME: poolformer_s36 + TRANS: + LAYERS: [6, 6, 18, 6] + EMBED_DIMS: [64, 128, 320, 512] + LAYER_SCALE_INIT_VALUE: 1e-6 + diff --git a/image_classification/PoolFormer/datasets.py b/image_classification/PoolFormer/datasets.py new file mode 100644 index 00000000..241f81b7 --- /dev/null +++ b/image_classification/PoolFormer/datasets.py @@ -0,0 +1,219 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Dataset related classes and methods for ViT training and validation +Cifar10, Cifar100 and ImageNet2012 are supported +""" + +import os +import math +from PIL import Image +from paddle.io import Dataset +from paddle.io import DataLoader +from paddle.io import DistributedBatchSampler +from paddle.vision import transforms +from paddle.vision import datasets +from paddle.vision import image_load +from augment import auto_augment_policy_original +from augment import AutoAugment +from augment import rand_augment_policy_original +from augment import RandAugment +from random_erasing import RandomErasing + + +class ImageNet2012Dataset(Dataset): + """Build ImageNet2012 dataset + + This class gets train/val imagenet datasets, which loads transfomed data and labels. 
+ + Attributes: + file_folder: path where imagenet images are stored + transform: preprocessing ops to apply on image + img_path_list: list of full path of images in whole dataset + label_list: list of labels of whole dataset + """ + + def __init__(self, file_folder, mode="train", transform=None): + """Init ImageNet2012 Dataset with dataset file path, mode(train/val), and transform""" + super(ImageNet2012Dataset, self).__init__() + assert mode in ["train", "val"] + self.file_folder = file_folder + self.transform = transform + self.img_path_list = [] + self.label_list = [] + + if mode == "train": + self.list_file = os.path.join(self.file_folder, "train_list.txt") + else: + self.list_file = os.path.join(self.file_folder, "val_list.txt") + + with open(self.list_file, 'r') as infile: + for line in infile: + img_path = line.strip().split()[0] + img_label = int(line.strip().split()[1]) + self.img_path_list.append(os.path.join(self.file_folder, img_path)) + self.label_list.append(img_label) + print(f'----- Imagenet2012 image {mode} list len = {len(self.label_list)}') + + def __len__(self): + return len(self.label_list) + + def __getitem__(self, index): + data = Image.open(self.img_path_list[index]).convert('RGB') + data = self.transform(data) + label = self.label_list[index] + + return data, label + + +def get_train_transforms(config): + """ Get training transforms + + For training, a RandomResizedCrop is applied, then normalization is applied with + [0.5, 0.5, 0.5] mean and std. The input pixel values must be rescaled to [0, 1.] + Outputs is converted to tensor + + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + aug_op_list = [] + # random crop and resize + aug_op_list.append( + transforms.RandomResizedCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE), + scale=(0.05, 1.0))) + # auto_augment / color jitter + if config.TRAIN.AUTO_AUGMENT: + policy = auto_augment_policy_original() + auto_augment = AutoAugment(policy) + aug_op_list.append(auto_augment) + elif config.TRAIN.RAND_AUGMENT: + policy = rand_augment_policy_original() + rand_augment = RandAugment(policy) + aug_op_list.append(rand_augment) + else: + jitter = (float(config.TRAIN.COLOR_JITTER),) * 3 + aug_op_list.append(transforms.ColorJitter(*jitter)) + # other ops + aug_op_list.append(transforms.ToTensor()) + aug_op_list.append(transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, + std=config.DATA.IMAGENET_STD)) + # random erasing + if config.TRAIN.RANDOM_ERASE_PROB > 0.: + random_erasing = RandomErasing(prob=config.TRAIN.RANDOM_ERASE_PROB, + mode=config.TRAIN.RANDOM_ERASE_MODE, + max_count=config.TRAIN.RANDOM_ERASE_COUNT, + num_splits=config.TRAIN.RANDOM_ERASE_SPLIT) + aug_op_list.append(random_erasing) + + transforms_train = transforms.Compose(aug_op_list) + return transforms_train + + +def get_val_transforms(config): + """ Get training transforms + + For validation, image is first Resize then CenterCrop to image_size. + Then normalization is applied with [0.5, 0.5, 0.5] mean and std. + The input pixel values must be rescaled to [0, 1.] 
+ Outputs is converted to tensor + + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + + scale_size = int(math.floor(config.DATA.IMAGE_SIZE / config.DATA.CROP_PCT)) + transforms_val = transforms.Compose([ + transforms.Resize(scale_size, 'bicubic'), # single int for resize shorter side of image + transforms.CenterCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE)), + transforms.ToTensor(), + transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD), + ]) + return transforms_val + + +def get_dataset(config, mode='train'): + """ Get dataset from config and mode (train/val) + + Returns the related dataset object according to configs and mode(train/val) + + Args: + config: configs contains dataset related settings. see config.py for details + Returns: + dataset: dataset object + """ + assert mode in ['train', 'val'] + if config.DATA.DATASET == "cifar10": + if mode == 'train': + dataset = datasets.Cifar10(mode=mode, transform=get_train_transforms(config)) + else: + mode = 'test' + dataset = datasets.Cifar10(mode=mode, transform=get_val_transforms(config)) + elif config.DATA.DATASET == "cifar100": + if mode == 'train': + dataset = datasets.Cifar100(mode=mode, transform=get_train_transforms(config)) + else: + mode = 'test' + dataset = datasets.Cifar100(mode=mode, transform=get_val_transforms(config)) + elif config.DATA.DATASET == "imagenet2012": + if mode == 'train': + dataset = ImageNet2012Dataset(config.DATA.DATA_PATH, + mode=mode, + transform=get_train_transforms(config)) + else: + dataset = ImageNet2012Dataset(config.DATA.DATA_PATH, + mode=mode, + transform=get_val_transforms(config)) + else: + raise NotImplementedError( + "[{config.DATA.DATASET}] Only cifar10, cifar100, imagenet2012 are supported now") + return dataset + + +def get_dataloader(config, dataset, mode='train', multi_process=False): + """Get dataloader with config, dataset, mode as input, allows multiGPU settings. + + Multi-GPU loader is implements as distributedBatchSampler. + + Args: + config: see config.py for details + dataset: paddle.io.dataset object + mode: train/val + multi_process: if True, use DistributedBatchSampler to support multi-processing + Returns: + dataloader: paddle.io.DataLoader object. + """ + + if mode == 'train': + batch_size = config.DATA.BATCH_SIZE + else: + batch_size = config.DATA.BATCH_SIZE_EVAL + + if multi_process is True: + sampler = DistributedBatchSampler(dataset, + batch_size=batch_size, + shuffle=(mode == 'train')) + dataloader = DataLoader(dataset, + batch_sampler=sampler, + num_workers=config.DATA.NUM_WORKERS) + else: + dataloader = DataLoader(dataset, + batch_size=batch_size, + num_workers=config.DATA.NUM_WORKERS, + shuffle=(mode == 'train')) + return dataloader diff --git a/image_classification/PoolFormer/droppath.py b/image_classification/PoolFormer/droppath.py new file mode 100644 index 00000000..f5d3fcaa --- /dev/null +++ b/image_classification/PoolFormer/droppath.py @@ -0,0 +1,60 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Droppath, reimplement from https://github.com/yueatsprograms/Stochastic_Depth +""" + +import paddle +import paddle.nn as nn + + +class DropPath(nn.Layer): + """DropPath class""" + def __init__(self, drop_prob=None): + super(DropPath, self).__init__() + self.drop_prob = drop_prob + + def drop_path(self, inputs): + """drop path op + Args: + input: tensor with arbitrary shape + drop_prob: float number of drop path probability, default: 0.0 + training: bool, if current mode is training, default: False + Returns: + output: output tensor after drop path + """ + # if prob is 0 or eval mode, return original input + if self.drop_prob == 0. or not self.training: + return inputs + keep_prob = 1 - self.drop_prob + keep_prob = paddle.to_tensor(keep_prob, dtype='float32') + shape = (inputs.shape[0], ) + (1, ) * (inputs.ndim - 1) # shape=(N, 1, 1, 1) + random_tensor = keep_prob + paddle.rand(shape, dtype=inputs.dtype) + random_tensor = random_tensor.floor() # mask + output = inputs.divide(keep_prob) * random_tensor #divide is to keep same output expectation + return output + + def forward(self, inputs): + return self.drop_path(inputs) + + +#def main(): +# tmp = paddle.to_tensor(np.random.rand(8, 16, 8, 8), dtype='float32') +# dp = DropPath(0.5) +# out = dp(tmp) +# print(out) +# +#if __name__ == "__main__": +# main() diff --git a/image_classification/PoolFormer/losses.py b/image_classification/PoolFormer/losses.py new file mode 100644 index 00000000..f67780a2 --- /dev/null +++ b/image_classification/PoolFormer/losses.py @@ -0,0 +1,121 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +""" Implement Loss functions """ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class LabelSmoothingCrossEntropyLoss(nn.Layer): + """ cross entropy loss for label smoothing + Args: + smoothing: float, smoothing rate + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, target label with shape [N] + Return: + loss: float, cross entropy loss value + """ + def __init__(self, smoothing=0.1): + super().__init__() + assert 0 <= smoothing < 1.0 + self.smoothing = smoothing + self.confidence = 1 - smoothing + + def forward(self, x, target): + log_probs = F.log_softmax(x) # [N, num_classes] + # target_index is used to get prob for each of the N samples + target_index = paddle.zeros([x.shape[0], 2], dtype='int64') # [N, 2] + target_index[:, 0] = paddle.arange(x.shape[0]) + target_index[:, 1] = target + + nll_loss = -log_probs.gather_nd(index=target_index) # index: [N] + smooth_loss = -log_probs.mean(axis=-1) + loss = self.confidence * nll_loss + self.smoothing * smooth_loss + return loss.mean() + + +class SoftTargetCrossEntropyLoss(nn.Layer): + """ cross entropy loss for soft target + Args: + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, soft target with shape [N, num_classes] + Returns: + loss: float, the mean loss value + """ + def __init__(self): + super().__init__() + + def forward(self, x, target): + loss = paddle.sum(-target * F.log_softmax(x, axis=-1), axis=-1) + return loss.mean() + + +class DistillationLoss(nn.Layer): + """Distillation loss function + This layer includes the orginal loss (criterion) and a extra + distillation loss (criterion), which computes the loss with + different type options, between current model and + a teacher model as its supervision. 
+ + Args: + base_criterion: nn.Layer, the original criterion + teacher_model: nn.Layer, the teacher model as supervision + distillation_type: str, one of ['none', 'soft', 'hard'] + alpha: float, ratio of base loss (* (1-alpha)) + and distillation loss( * alpha) + tao: float, temperature in distillation + """ + def __init__(self, + base_criterion, + teacher_model, + distillation_type, + alpha, + tau): + super().__init__() + assert distillation_type in ['none', 'soft', 'hard'] + self.base_criterion = base_criterion + self.teacher_model = teacher_model + self.type = distillation_type + self.alpha = alpha + self.tau = tau + + def forward(self, inputs, outputs, targets): + """ + Args: + inputs: tensor, the orginal model inputs + outputs: tensor, the outputs of the model + outputds_kd: tensor, the distillation outputs of the model, + this is usually obtained by a separate branch + in the last layer of the model + targets: tensor, the labels for the base criterion + """ + outputs, outputs_kd = outputs[0], outputs[1] + base_loss = self.base_criterion(outputs, targets) + if self.type == 'none': + return base_loss + + with paddle.no_grad(): + teacher_outputs = self.teacher_model(inputs) + + if self.type == 'soft': + distillation_loss = F.kl_div( + F.log_softmax(outputs_kd / self.tau, axis=1), + F.log_softmax(teacher_outputs / self.tau, axis=1), + reduction='sum') * (self.tau * self.tau) / outputs_kd.numel() + elif self.type == 'hard': + distillation_loss = F.cross_entropy(outputs_kd, teacher_outputs.argmax(axis=1)) + + loss = base_loss * (1 - self.alpha) + distillation_loss * self.alpha + return loss diff --git a/image_classification/PoolFormer/main_multi_gpu.py b/image_classification/PoolFormer/main_multi_gpu.py new file mode 100644 index 00000000..3e81aa25 --- /dev/null +++ b/image_classification/PoolFormer/main_multi_gpu.py @@ -0,0 +1,590 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
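Before they are wired into the training scripts below, the criteria defined in `losses.py` can be exercised on their own. The snippet is a usage sketch with made-up shapes (not part of the patch): `LabelSmoothingCrossEntropyLoss` takes hard integer labels, while `SoftTargetCrossEntropyLoss` expects one target distribution per sample (e.g. the soft labels produced by Mixup). `DistillationLoss` additionally blends the base loss with a teacher-supervised term weighted by `alpha`, using temperature `tau` for the 'soft' variant.

```python
import paddle
import paddle.nn.functional as F
from losses import LabelSmoothingCrossEntropyLoss, SoftTargetCrossEntropyLoss

logits = paddle.randn([4, 1000])            # [N, num_classes], pre-softmax scores
hard_labels = paddle.randint(0, 1000, [4])  # [N] integer class ids

# (1 - smoothing) * NLL(true class) + smoothing * mean(-log_softmax over classes)
smooth_loss = LabelSmoothingCrossEntropyLoss(smoothing=0.1)(logits, hard_labels)

# soft targets: one probability distribution per sample
soft_labels = F.softmax(paddle.randn([4, 1000]), axis=-1)
soft_loss = SoftTargetCrossEntropyLoss()(logits, soft_labels)

print(float(smooth_loss), float(soft_loss))
```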
+ +"""PoolFormer training/validation using multiple GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +import paddle.distributed as dist +from datasets import get_dataloader +from datasets import get_dataset +from utils import AverageMeter +from utils import WarmupCosineScheduler +from mixup import Mixup +from config import get_config +from config import update_config +from poolformer import build_poolformer as build_model +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('PoolFormer') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + mixup_fn=None, + amp=False, + local_logger=None, + master_logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + master_train_loss_meter.avg: float, average loss on all processes/gpus + master_train_acc_meter.avg: float, average top1 accuracy on all processes/gpus + train_time: float, training time + """ + model.train() + train_loss_meter = AverageMeter() + train_acc_meter = 
AverageMeter() + master_train_loss_meter = AverageMeter() + master_train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() + + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = paddle.to_tensor(image.shape[0]) + + # sync from other gpus for overall loss and acc + master_loss = loss.clone() + master_acc = acc.clone() + master_batch_size = batch_size.clone() + dist.all_reduce(master_loss) + dist.all_reduce(master_acc) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc = master_acc / dist.get_world_size() + master_train_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_train_acc_meter.update(master_acc.numpy()[0], master_batch_size.numpy()[0]) + + train_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + train_acc_meter.update(acc.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_train_loss_meter.avg:.4f}, " + + f"Avg Acc: {master_train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return (train_loss_meter.avg, + train_acc_meter.avg, + master_train_loss_meter.avg, + master_train_acc_meter.avg, + train_time) + + +def validate(dataloader, + model, + criterion, + total_batch, + debug_steps=100, + local_logger=None, + master_logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + total_epoch: int, total num of epoch, for logging + debug_steps: int, num of iters to log info, default: 100 + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + master_val_loss_meter.avg: float, average loss on all processes/gpus + master_val_acc1_meter.avg: float, average top1 
accuracy on all processes/gpus + master_val_acc5_meter.avg: float, average top5 accuracy on all processes/gpus + val_time: float, validation time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = AverageMeter() + master_val_loss_meter = AverageMeter() + master_val_acc1_meter = AverageMeter() + master_val_acc5_meter = AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = paddle.to_tensor(image.shape[0]) + + master_loss = loss.clone() + master_acc1 = acc1.clone() + master_acc5 = acc5.clone() + master_batch_size = batch_size.clone() + + dist.all_reduce(master_loss) + dist.all_reduce(master_acc1) + dist.all_reduce(master_acc5) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc1 = master_acc1 / dist.get_world_size() + master_acc5 = master_acc5 / dist.get_world_size() + + master_val_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc1_meter.update(master_acc1.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc5_meter.update(master_acc5.numpy()[0], master_batch_size.numpy()[0]) + + val_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + val_acc1_meter.update(acc1.numpy()[0], batch_size.numpy()[0]) + val_acc5_meter.update(acc5.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {master_val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {master_val_acc5_meter.avg:.4f}") + val_time = time.time() - time_st + return (val_loss_meter.avg, + val_acc1_meter.avg, + val_acc5_meter.avg, + master_val_loss_meter.avg, + master_val_acc1_meter.avg, + master_val_acc5_meter.avg, + val_time) + + +def main_worker(*args): + """main method for each process""" + # STEP 0: Preparation + config = args[0] + dist.init_parallel_env() + last_epoch = config.TRAIN.LAST_EPOCH + world_size = dist.get_world_size() + local_rank = dist.get_rank() + seed = config.SEED + local_rank + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + # logger for each process/gpu + local_logger = get_logger( + filename=os.path.join(config.SAVE, 'log_{}.txt'.format(local_rank)), + logger_name='local_logger') + # overall logger + if local_rank == 0: + master_logger = get_logger( + filename=os.path.join(config.SAVE, 'log.txt'), + logger_name='master_logger') + master_logger.info(f'\n{config}') + else: + master_logger = None + local_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + if local_rank == 0: + master_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + + # STEP 1: Create model + model = build_model(config) + model = paddle.DataParallel(model) + + # STEP 2: Create train and val dataloader + dataset_train, dataset_val = args[1], args[2] + # Create training dataloader + if not config.EVAL: + 
dataloader_train = get_dataloader(config, dataset_train, 'train', True) + total_batch_train = len(dataloader_train) + local_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + if local_rank == 0: + master_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + # Create validation dataloader + dataloader_val = get_dataloader(config, dataset_val, 'test', True) + total_batch_val = len(dataloader_val) + local_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + if local_rank == 0: + master_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH, + ) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + local_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise 
NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip, + #apply_decay_param_fun=get_exclude_from_weight_decay_fn(['pos_embed', 'cls_token']), + ) + else: + local_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 6: Load pretrained model / load resumt model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + model.set_dict(model_state) + local_logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + if local_rank == 0: + master_logger.info( + f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') + optimizer.set_state_dict(opt_state) + local_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + if local_rank == 0: + master_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + + # STEP 6: Validation (eval mode) + if config.EVAL: + local_logger.info('----- Start Validating') + if local_rank == 0: + master_logger.info('----- Start Validating') + + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + + local_logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 7: Start training and validation (train mode) + local_logger.info(f"Start training from epoch {last_epoch+1}.") + if local_rank == 0: + master_logger.info(f"Start training from epoch 
{last_epoch+1}.") + for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + local_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + if local_rank == 0: + master_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + + train_loss, train_acc, avg_loss, avg_acc, train_time = train( + dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=total_batch_train, + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + mixup_fn=mixup_fn, + amp=config.AMP, + local_logger=local_logger, + master_logger=master_logger) + + scheduler.step() + + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {avg_loss:.4f}, " + + f"Train Acc: {avg_acc:.4f}, " + + f"time: {train_time:.2f}") + + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + local_logger.info(f'----- Validation after Epoch: {epoch}') + if local_rank == 0: + master_logger.info(f'----- Validation after Epoch: {epoch}') + + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if local_rank == 0: + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + master_logger.info(f"----- Save model: {model_path}.pdparams") + master_logger.info(f"----- Save optim: {model_path}.pdopt") + + +def main(): + """main method for spawning multi process training/validation""" + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + + # get dataset and start DDP + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + else: + dataset_train = None + dataset_val = get_dataset(config, mode='val') + config.NGPUS = len(paddle.static.cuda_places()) if config.NGPUS == -1 else config.NGPUS + dist.spawn(main_worker, args=(config, dataset_train, dataset_val, ), nprocs=config.NGPUS) + + +if __name__ == "__main__": + main() diff --git 
a/image_classification/PoolFormer/main_single_gpu.py b/image_classification/PoolFormer/main_single_gpu.py new file mode 100644 index 00000000..69022755 --- /dev/null +++ b/image_classification/PoolFormer/main_single_gpu.py @@ -0,0 +1,422 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Poolformer training/validation using single GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from datasets import get_dataloader +from datasets import get_dataset +from utils import AverageMeter +from utils import WarmupCosineScheduler +from config import get_config +from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from poolformer import build_poolformer as build_model + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('PoolFormer') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + mixup_fn=None, + amp=False, + logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: int, total num of batches for 
one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + logger: logger for logging, default: None + Returns: + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + train_time: float, training time + """ + model.train() + train_loss_meter = AverageMeter() + train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() + + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = image.shape[0] + train_loss_meter.update(loss.numpy()[0], batch_size) + train_acc_meter.update(acc.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return train_loss_meter.avg, train_acc_meter.avg, train_time + + +def validate(dataloader, model, criterion, total_batch, debug_steps=100, logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + logger: logger for logging, default: None + Returns: + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + val_time: float, valitaion time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = image.shape[0] + val_loss_meter.update(loss.numpy()[0], batch_size) + val_acc1_meter.update(acc1.numpy()[0], batch_size) + 
val_acc5_meter.update(acc5.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + + val_time = time.time() - time_st + return val_loss_meter.avg, val_acc1_meter.avg, val_acc5_meter.avg, val_time + + +def main(): + # STEP 0: Preparation + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + last_epoch = config.TRAIN.LAST_EPOCH + seed = config.SEED + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + logger = get_logger(filename=os.path.join(config.SAVE, 'log.txt')) + logger.info(f'\n{config}') + + # STEP 1: Create model + model = build_model(config) + + # STEP 2: Create train and val dataloader + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + dataloader_train = get_dataloader(config, dataset_train, 'train', False) + dataset_val = get_dataset(config, mode='val') + dataloader_val = get_dataloader(config, dataset_val, 'val', False) + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + 
warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip) + else: + logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 6: Load pretrained model or load resume model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + model.set_dict(model_state) + logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') + optimizer.set_state_dict(opt_state) + logger.info( + f"----- Resume: Load model and optmizer from {config.MODEL.RESUME}") + + # STEP 7: Validation (eval mode) + if config.EVAL: + logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 8: Start training and validation (train mode) + logger.info(f"Start training from epoch {last_epoch+1}.") + for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + logger.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") + train_loss, train_acc, train_time = train(dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=len(dataloader_train), + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + mixup_fn=mixup_fn, + amp=config.AMP, + logger=logger) + scheduler.step() + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + logger.info(f"----- Save model: {model_path}.pdparams") + logger.info(f"----- Save optim: {model_path}.pdopt") + + +if __name__ == "__main__": + main() diff --git a/image_classification/PoolFormer/mixup.py b/image_classification/PoolFormer/mixup.py new file mode 100644 index 00000000..1d2db493 --- /dev/null +++ b/image_classification/PoolFormer/mixup.py @@ -0,0 +1,225 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""mixup and cutmix for batch data""" +import numpy as np +import paddle + + +def rand_bbox(image_shape, lam, count=None): + """ CutMix bbox by lam value + Generate 1 random bbox by value lam. lam is the cut size rate. + The cut_size is computed by sqrt(1-lam) * image_size. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + count: int, number of bbox to generate + """ + image_h, image_w = image_shape[-2:] + cut_rate = np.sqrt(1. 
- lam) + cut_h = int(cut_rate * image_h) + cut_w = int(cut_rate * image_w) + + # get random bbox center + cy = np.random.randint(0, image_h, size=count) + cx = np.random.randint(0, image_w, size=count) + + # get bbox coords + bbox_x1 = np.clip(cx - cut_w // 2, 0, image_w) + bbox_y1 = np.clip(cy - cut_h // 2, 0, image_h) + bbox_x2 = np.clip(cx + cut_w // 2, 0, image_w) + bbox_y2 = np.clip(cy + cut_h // 2, 0, image_h) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # while in pytorch, it will return [] tensor + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def rand_bbox_minmax(image_shape, minmax, count=None): + """ CutMix bbox by min and max value + Generate 1 random bbox by min and max percentage values. + Minmax is a tuple/list of min and max percentage vlaues + applied to the image width and height. + + Args: + image_shape: tuple/list, image height and width + minmax: tuple/list, min and max percentage values of image size + count: int, number of bbox to generate + """ + assert len(minmax) == 2 + image_h, image_w = image_shape[-2:] + min_ratio = minmax[0] + max_ratio = minmax[1] + cut_h = np.random.randint(int(image_h * min_ratio), int(image_h * max_ratio), size=count) + cut_w = np.random.randint(int(image_w * min_ratio), int(image_w * max_ratio), size=count) + + bbox_x1 = np.random.randint(0, image_w - cut_w, size=count) + bbox_y1 = np.random.randint(0, image_h - cut_h, size=count) + bbox_x2 = bbox_x1 + cut_w + bbox_y2 = bbox_y1 + cut_h + + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def cutmix_generate_bbox_adjust_lam(image_shape, lam, minmax=None, correct_lam=True, count=None): + """Generate bbox and apply correction for lambda + If the mimmax is None, apply the standard cutmix by lam value, + If the minmax is set, apply the cutmix by min and max percentage values. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + minmax: tuple/list, min and max percentage values of image size + correct_lam: bool, if True, correct the lam value by the generated bbox + count: int, number of bbox to generate + """ + if minmax is not None: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox_minmax(image_shape, minmax, count) + else: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox(image_shape, lam, count) + + if correct_lam or minmax is not None: + image_h, image_w = image_shape[-2:] + bbox_area = (bbox_y2 - bbox_y1) * (bbox_x2 - bbox_x1) + lam = 1. - bbox_area / float(image_h * image_w) + return (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam + + +def one_hot(x, num_classes, on_value=1., off_value=0.): + """ Generate one-hot vector for label smoothing + Args: + x: tensor, contains label/class indices + num_classes: int, num of classes (len of the one-hot vector) + on_value: float, the vector value at label index, default=1. + off_value: float, the vector value at non-label indices, default=0. + Returns: + one_hot: tensor, tensor with on value at label index and off value + at non-label indices. + """ + x = x.reshape_([-1, 1]) + x_smoothed = paddle.full((x.shape[0], num_classes), fill_value=off_value) + for i in range(x.shape[0]): + x_smoothed[i, x[i]] = on_value + return x_smoothed + + +def mixup_one_hot(label, num_classes, lam=1., smoothing=0.): + """ mixup and label smoothing in batch + label smoothing is firstly applied, then + mixup is applied by mixing the bacth and its flip, + with a mixup rate. 
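+
+    For example, with num_classes=5, smoothing=0.1 and lam=0.7, every one-hot
+    vector is built with on_value=0.92 and off_value=0.02, and the returned
+    soft target is 0.7 * one_hot(label) + 0.3 * one_hot(flipped label).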
+ + Args: + label: tensor, label tensor with shape [N], contains the class indices + num_classes: int, num of all classes + lam: float, mixup rate, default=1.0 + smoothing: float, label smoothing rate + """ + off_value = smoothing / num_classes + on_value = 1. - smoothing + off_value + y1 = one_hot(label, num_classes, on_value, off_value) + y2 = one_hot(label.flip(axis=[0]), num_classes, on_value, off_value) + return y2 * (1 - lam) + y1 * lam + + +class Mixup: + """Mixup class + Args: + mixup_alpha: float, mixup alpha for beta distribution, default=1.0, + cutmix_alpha: float, cutmix alpha for beta distribution, default=0.0, + cutmix_minmax: list/tuple, min and max value for cutmix ratio, default=None, + prob: float, if random prob < prob, do not use mixup, default=1.0, + switch_prob: float, prob of switching mixup and cutmix, default=0.5, + mode: string, mixup up, now only 'batch' is supported, default='batch', + correct_lam: bool, if True, apply correction of lam, default=True, + label_smoothing: float, label smoothing rate, default=0.1, + num_classes: int, num of classes, default=1000 + """ + def __init__(self, + mixup_alpha=1.0, + cutmix_alpha=0.0, + cutmix_minmax=None, + prob=1.0, + switch_prob=0.5, + mode='batch', + correct_lam=True, + label_smoothing=0.1, + num_classes=1000): + self.mixup_alpha = mixup_alpha + self.cutmix_alpha = cutmix_alpha + self.cutmix_minmax = cutmix_minmax + if cutmix_minmax is not None: + assert len(cutmix_minmax) == 2 + self.cutmix_alpha = 1.0 + self.mix_prob = prob + self.switch_prob = switch_prob + self.label_smoothing = label_smoothing + self.num_classes = num_classes + self.mode = mode + self.correct_lam = correct_lam + assert mode == 'batch', 'Now only batch mode is supported!' + + def __call__(self, x, target): + assert x.shape[0] % 2 == 0, "Batch size should be even" + lam = self._mix_batch(x) + target = mixup_one_hot(target, self.num_classes, lam, self.label_smoothing) + return x, target + + def get_params(self): + """Decide to use cutmix or regular mixup by sampling and + sample lambda for mixup + """ + lam = 1. + use_cutmix = False + use_mixup = np.random.rand() < self.mix_prob + if use_mixup: + if self.mixup_alpha > 0. and self.cutmix_alpha > 0.: + use_cutmix = np.random.rand() < self.switch_prob + alpha = self.cutmix_alpha if use_cutmix else self.mixup_alpha + lam_mix = np.random.beta(alpha, alpha) + elif self.mixup_alpha == 0. and self.cutmix_alpha > 0.: + use_cutmix=True + lam_mix = np.random.beta(self.cutmix_alpha, self.cutmix_alpha) + elif self.mixup_alpha > 0. 
and self.cutmix_alpha == 0.: + lam_mix = np.random.beta(self.mixup_alpha, self.mixup_alpha) + else: + raise ValueError('mixup_alpha and cutmix_alpha cannot be all 0') + lam = float(lam_mix) + return lam, use_cutmix + + def _mix_batch(self, x): + """mixup/cutmix by adding batch data and its flipped version""" + lam, use_cutmix = self.get_params() + if lam == 1.: + return lam + if use_cutmix: + (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam = cutmix_generate_bbox_adjust_lam( + x.shape, + lam, + minmax=self.cutmix_minmax, + correct_lam=self.correct_lam) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # but in pytorch, it will return [] tensor without errors + if int(bbox_x1) != int(bbox_x2) and int(bbox_y1) != int(bbox_y2): + x[:, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] = x.flip(axis=[0])[ + :, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] + else: + x_flipped = x.flip(axis=[0]) + x_flipped = x_flipped * (1 - lam) + x.set_value(x * (lam) + x_flipped) + return lam diff --git a/image_classification/PoolFormer/poolformer.png b/image_classification/PoolFormer/poolformer.png new file mode 100644 index 00000000..4a4cd53d Binary files /dev/null and b/image_classification/PoolFormer/poolformer.png differ diff --git a/image_classification/PoolFormer/poolformer.py b/image_classification/PoolFormer/poolformer.py new file mode 100644 index 00000000..9fd0084d --- /dev/null +++ b/image_classification/PoolFormer/poolformer.py @@ -0,0 +1,428 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Implement Transformer Class for PoolFormer +""" + + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + +from droppath import DropPath + +trunc_normal_ = nn.initializer.TruncatedNormal(std=0.02) +zeros_ = nn.initializer.Constant(value=0.0) +ones_ = nn.initializer.Constant(value=1.0) + + +class Identity(nn.Layer): + def __init__(self): + super().__init__() + + def forward(self, x): + return x + + +class PatchEmbed(nn.Layer): + """ + Patch Embedding that is implemented by a layer of conv. + Input: tensor in shape [B, C, H, W] + Output: tensor in shape [B, C, H/stride, W/stride] + """ + + def __init__( + self, + patch_size=16, + stride=16, + padding=0, + in_chans=3, + embed_dim=768, + norm_layer=None, + ): + super().__init__() + patch_size = (patch_size, patch_size) + stride = (stride, stride) + padding = (padding, padding) + self.proj = nn.Conv2D( + in_chans, embed_dim, kernel_size=patch_size, stride=stride, padding=padding + ) + self.norm = norm_layer(embed_dim) if norm_layer else Identity() + + def forward(self, x): + x = self.proj(x) + x = self.norm(x) + return x + + +class LayerNormChannel(nn.Layer): + """ + LayerNorm only for Channel Dimension. 
+    Input: tensor in shape [B, C, H, W]
+    """
+
+    def __init__(self, num_channels, epsilon=1e-05):
+        super().__init__()
+        self.weight = paddle.create_parameter(
+            shape=[num_channels], dtype="float32", default_initializer=ones_
+        )
+        self.bias = paddle.create_parameter(
+            shape=[num_channels], dtype="float32", default_initializer=zeros_
+        )
+        self.epsilon = epsilon
+
+    def forward(self, x):
+        u = x.mean(1, keepdim=True)
+        s = (x - u).pow(2).mean(1, keepdim=True)
+        x = (x - u) / paddle.sqrt(s + self.epsilon)
+        x = self.weight.unsqueeze(-1).unsqueeze(-1) * x + self.bias.unsqueeze(
+            -1
+        ).unsqueeze(-1)
+        return x
+
+
+class GroupNorm(nn.GroupNorm):
+    """
+    Group Normalization with 1 group.
+    Input: tensor in shape [B, C, H, W]
+    """
+
+    def __init__(self, num_channels, **kwargs):
+        super().__init__(1, num_channels, **kwargs)
+
+
+class Pooling(nn.Layer):
+    """
+    Implementation of pooling for PoolFormer
+    --pool_size: pooling size
+    """
+
+    def __init__(self, kernel_size=3):
+        super().__init__()
+        self.pool = nn.AvgPool2D(
+            kernel_size, stride=1, padding=kernel_size // 2, exclusive=True
+        )
+
+    def forward(self, x):
+        return self.pool(x) - x
+
+
+class Mlp(nn.Layer):
+    """
+    Implementation of MLP with 1*1 convolutions.
+    Input: tensor with shape [B, C, H, W]
+    """
+
+    def __init__(
+        self,
+        in_features,
+        hidden_features=None,
+        out_features=None,
+        act_layer=nn.GELU,
+        drop=0.0,
+    ):
+        super().__init__()
+        out_features = out_features or in_features
+        hidden_features = hidden_features or in_features
+        self.fc1 = nn.Conv2D(in_features, hidden_features, 1)
+        self.act = act_layer()
+        self.fc2 = nn.Conv2D(hidden_features, out_features, 1)
+        self.drop = nn.Dropout(drop)
+        self.apply(self._init_weights)
+
+    def _init_weights(self, m):
+        if isinstance(m, nn.Conv2D):
+            trunc_normal_(m.weight)
+            if m.bias is not None:
+                zeros_(m.bias)
+
+    def forward(self, x):
+        x = self.fc1(x)  # (B, C, H, W) --> (B, C, H, W)
+        x = self.act(x)
+        x = self.drop(x)
+        x = self.fc2(x)  # (B, C, H, W) --> (B, C, H, W)
+        x = self.drop(x)
+        return x
+
+
+class PoolFormerBlock(nn.Layer):
+    """
+    Implementation of one PoolFormer block.
+    --dim: embedding dim
+    --pool_size: pooling size
+    --mlp_ratio: mlp expansion ratio
+    --act_layer: activation
+    --norm_layer: normalization
+    --drop: dropout rate
+    --drop_path: stochastic depth rate,
+        refer to https://arxiv.org/abs/1603.09382
+    --use_layer_scale, --layer_scale_init_value: LayerScale,
+        refer to https://arxiv.org/abs/2103.17239
+    """
+
+    def __init__(
+        self,
+        dim,
+        pool_size=3,
+        mlp_ratio=4.0,
+        act_layer=nn.GELU,
+        norm_layer=GroupNorm,
+        drop=0.0,
+        drop_path=0.0,
+        use_layer_scale=True,
+        layer_scale_init_value=1e-5,
+    ):
+
+        super().__init__()
+
+        self.norm1 = norm_layer(dim)
+        self.token_mixer = Pooling(
+            kernel_size=pool_size
+        )  # the token mixer is MSA in ViTs and an MLP in MLP-like models; here plain pooling replaces it
+        self.norm2 = norm_layer(dim)
+        mlp_hidden_dim = int(dim * mlp_ratio)
+        self.mlp = Mlp(
+            in_features=dim,
+            hidden_features=mlp_hidden_dim,
+            act_layer=act_layer,
+            drop=drop,
+        )
+
+        # The following two techniques are useful to train deep PoolFormers.
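+        # 1) stochastic depth (DropPath): randomly skip the whole residual branch
+        #    with probability `drop_path` during training.
+        # 2) LayerScale: multiply each branch output by a learnable per-channel
+        #    scale initialized to `layer_scale_init_value` (1e-5 by default), i.e.
+        #    x = x + drop_path(layer_scale * token_mixer(norm(x))).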
+ self.drop_path = DropPath(drop_path) if drop_path > 0.0 else Identity() + self.use_layer_scale = use_layer_scale + if use_layer_scale: + + self.layer_scale_1 = paddle.create_parameter( + shape=[dim], + dtype="float32", + default_initializer=nn.initializer.Constant( + value=layer_scale_init_value + ), + ) + + self.layer_scale_2 = paddle.create_parameter( + shape=[dim], + dtype="float32", + default_initializer=nn.initializer.Constant( + value=layer_scale_init_value + ), + ) + + def forward(self, x): + if self.use_layer_scale: + x = x + self.drop_path( + self.layer_scale_1.unsqueeze(-1).unsqueeze(-1) + * self.token_mixer(self.norm1(x)) + ) + x = x + self.drop_path( + self.layer_scale_2.unsqueeze(-1).unsqueeze(-1) * self.mlp(self.norm2(x)) + ) + else: + x = x + self.drop_path(self.token_mixer(self.norm1(x))) + x = x + self.drop_path(self.mlp(self.norm2(x))) + return x + + +def basic_blocks( + dim, + index, + layers, + pool_size=3, + mlp_ratio=4.0, + act_layer=nn.GELU, + norm_layer=GroupNorm, + drop_rate=0.0, + drop_path_rate=0.0, + use_layer_scale=True, + layer_scale_init_value=1e-5, +): + """ + generate PoolFormer blocks for a stage + return: PoolFormer blocks + """ + blocks = [] + for block_idx in range(layers[index]): + block_dpr = ( + drop_path_rate * (block_idx + sum(layers[:index])) / (sum(layers) - 1) + ) + blocks.append( + PoolFormerBlock( + dim, + pool_size=pool_size, + mlp_ratio=mlp_ratio, + act_layer=act_layer, + norm_layer=norm_layer, + drop=drop_rate, + drop_path=block_dpr, + use_layer_scale=use_layer_scale, + layer_scale_init_value=layer_scale_init_value, + ) + ) + blocks = nn.Sequential(*blocks) + + return blocks + + +def poolformer_s12(**kwargs): + """ + PoolFormer-S12 model, Params: 12M + --layers: [x,x,x,x], numbers of layers for the four stages + --embed_dims, --mlp_ratios: + embedding dims and mlp ratios for the four stages + --downsamples: flags to apply downsampling or not in four blocks + """ + layers = [2, 2, 6, 2] + embed_dims = [64, 128, 320, 512] + mlp_ratios = [4, 4, 4, 4] + downsamples = [True, True, True, True] + model = PoolFormer( + layers, + embed_dims=embed_dims, + mlp_ratios=mlp_ratios, + downsamples=downsamples, + **kwargs + ) + return model + + +class PoolFormer(nn.Layer): + """ + PoolFormer, the main class of our model + --layers: [x,x,x,x], number of blocks for the 4 stages + --embed_dims, --mlp_ratios, --pool_size: the embedding dims, mlp ratios and + pooling size for the 4 stages + --downsamples: flags to apply downsampling or not + --norm_layer, --act_layer: define the types of normalizaiotn and activation + --num_classes: number of classes for the image classification + --in_patch_size, --in_stride, --in_pad: specify the patch embedding + for the input image + --down_patch_size --down_stride --down_pad: + specify the downsample (patch embed.) 
+ """ + + def __init__( + self, + layers, + embed_dims=None, + mlp_ratios=None, + downsamples=None, + pool_size=3, + norm_layer=GroupNorm, + act_layer=nn.GELU, + num_classes=1000, + in_patch_size=7, + in_stride=4, + in_pad=2, + down_patch_size=3, + down_stride=2, + down_pad=1, + drop_rate=0.0, + drop_path_rate=0.0, + use_layer_scale=True, + layer_scale_init_value=1e-5, + **kwargs + ): + + super().__init__() + + self.patch_embed = PatchEmbed( + patch_size=in_patch_size, + stride=in_stride, + padding=in_pad, + in_chans=3, + embed_dim=embed_dims[0], + ) + + # set the main block in network + network = [] + for i in range(len(layers)): + stage = basic_blocks( + embed_dims[i], + i, + layers, + pool_size=pool_size, + mlp_ratio=mlp_ratios[i], + act_layer=act_layer, + norm_layer=norm_layer, + drop_rate=drop_rate, + drop_path_rate=drop_path_rate, + use_layer_scale=use_layer_scale, + layer_scale_init_value=layer_scale_init_value, + ) + network.append(stage) + if i >= len(layers) - 1: + break + if downsamples[i] or embed_dims[i] != embed_dims[i + 1]: + # downsampling between two stages + network.append( + PatchEmbed( + patch_size=down_patch_size, + stride=down_stride, + padding=down_pad, + in_chans=embed_dims[i], + embed_dim=embed_dims[i + 1], + ) + ) + + self.network = nn.LayerList(network) + + # Classifier head + self.norm = norm_layer(embed_dims[-1]) + self.head = ( + nn.Linear(embed_dims[-1], num_classes) if num_classes > 0 else Identity() + ) + + self.apply(self.cls_init_weights) + + # init for classification + def cls_init_weights(self, m): + if isinstance(m, nn.Linear): + trunc_normal_(m.weight) + if isinstance(m, nn.Linear) and m.bias is not None: + zeros_(m.bias) + + def forward_embeddings(self, x): + x = self.patch_embed(x) + return x + + def forward_tokens(self, x): + outs = [] + for idx, block in enumerate(self.network): + x = block(x) + return x + + def forward(self, x): + # input embedding + x = self.forward_embeddings(x) + # through backbone + x = self.forward_tokens(x) + x = self.norm(x) + cls_out = self.head(x.mean([-2, -1])) + # for image classification + return cls_out + + +def build_poolformer(config): + """build poolformer model from config""" + model = PoolFormer( + num_classes=config.MODEL.NUM_CLASSES, + layers=config.MODEL.TRANS.LAYERS, + embed_dims=config.MODEL.TRANS.EMBED_DIMS, + downsamples=config.MODEL.TRANS.DOWNSAMPLES, + mlp_ratios=config.MODEL.TRANS.MLP_RATIOS, + layer_scale_init_value=config.MODEL.TRANS.LAYER_SCALE_INIT_VALUE + ) + return model diff --git a/image_classification/PoolFormer/random_erasing.py b/image_classification/PoolFormer/random_erasing.py new file mode 100644 index 00000000..80d31dd8 --- /dev/null +++ b/image_classification/PoolFormer/random_erasing.py @@ -0,0 +1,118 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
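
A quick way to sanity-check the PoolFormer definitions above is to build the smallest variant and run a dummy forward pass. The sketch below assumes the classes and factory functions live in `poolformer.py` together with the `PatchEmbed`, `DropPath`, `Identity` and `trunc_normal_` helpers they reference; the module name and usage are illustrative, not part of this diff:

```python
# minimal sanity-check sketch, assuming poolformer.py from this diff is importable
import paddle
from poolformer import poolformer_s12

model = poolformer_s12(num_classes=1000)   # layers [2, 2, 6, 2], embed dims [64, 128, 320, 512]
model.eval()

x = paddle.randn([1, 3, 224, 224])         # input tensor in [B, C, H, W]
with paddle.no_grad():
    logits = model(x)                      # patch embed -> 4 stages -> GAP -> classifier head
print(logits.shape)                        # expected: [1, 1000]
```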
+ +"""Random Erasing for image tensor""" + +import random +import math +import paddle + + +def _get_pixels(per_pixel, rand_color, patch_size, dtype="float32"): + if per_pixel: + return paddle.normal(shape=patch_size).astype(dtype) + if rand_color: + return paddle.normal(shape=(patch_size[0], 1, 1)).astype(dtype) + return paddle.zeros((patch_size[0], 1, 1)).astype(dtype) + + +class RandomErasing(object): + """ + Args: + prob: probability of performing random erasing + min_area: Minimum percentage of erased area wrt input image area + max_area: Maximum percentage of erased area wrt input image area + min_aspect: Minimum aspect ratio of earsed area + max_aspect: Maximum aspect ratio of earsed area + mode: pixel color mode, in ['const', 'rand', 'pixel'] + 'const' - erase block is constant valued 0 for all channels + 'rand' - erase block is valued random color (same per-channel) + 'pixel' - erase block is vauled random color per pixel + min_count: Minimum # of ereasing blocks per image. + max_count: Maximum # of ereasing blocks per image. Area per box is scaled by count + per-image count is randomly chosen between min_count to max_count + """ + def __init__(self, prob=0.5, min_area=0.02, max_area=1/3, min_aspect=0.3, max_aspect=None, + mode='const', min_count=1, max_count=None, num_splits=0): + self.prob = prob + self.min_area = min_area + self.max_area = max_area + max_aspect = max_aspect or 1 / min_aspect + self.log_aspect_ratio = (math.log(min_aspect), math.log(max_aspect)) + self.min_count = min_count + self.max_count = max_count or min_count + self.num_splits = num_splits + mode = mode.lower() + self.rand_color = False + self.per_pixel = False + if mode == "rand": + self.rand_color = True + elif mode == "pixel": + self.per_pixel = True + else: + assert not mode or mode == "const" + + def _erase(self, img, chan, img_h, img_w, dtype): + if random.random() > self.prob: + return + area = img_h * img_w + count = self.min_count if self.min_count == self.max_count else \ + random.randint(self.min_count, self.max_count) + for _ in range(count): + for attempt in range(10): + target_area = random.uniform(self.min_area, self.max_area) * area / count + aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio)) + h = int(round(math.sqrt(target_area * aspect_ratio))) + w = int(round(math.sqrt(target_area / aspect_ratio))) + if w < img_w and h < img_h: + top = random.randint(0, img_h - h) + left = random.randint(0, img_w - w) + img[:, top:top+h, left:left+w] = _get_pixels( + self.per_pixel, self.rand_color, (chan, h, w), + dtype=dtype) + break + + def __call__(self, inputs): + if len(inputs.shape) == 3: + self._erase(inputs, *inputs.shape, inputs.dtype) + else: + batch_size, chan, img_h, img_w = inputs.shape + batch_start = batch_size // self.num_splits if self.num_splits > 1 else 0 + for i in range(batch_start, batch_size): + self._erase(inputs[i], chan, img_h, img_w, inputs.dtype) + return inputs + + + +#def main(): +# re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='rand') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='const') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='pixel') +# import PIL.Image as Image +# import numpy as np +# paddle.set_device('cpu') +# img = paddle.to_tensor(np.asarray(Image.open('./lenna.png'))).astype('float32') +# img = img / 255.0 +# img = paddle.transpose(img, [2, 0, 1]) +# new_img = re(img) +# new_img = new_img * 255.0 +# new_img = paddle.transpose(new_img, [1, 2, 0]) +# new_img = new_img.cpu().numpy() +# 
new_img = Image.fromarray(new_img.astype('uint8')) +# new_img.save('./res.png') +# +# +# +#if __name__ == "__main__": +# main() diff --git a/image_classification/PoolFormer/run_eval.sh b/image_classification/PoolFormer/run_eval.sh new file mode 100644 index 00000000..49b46765 --- /dev/null +++ b/image_classification/PoolFormer/run_eval.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg='./configs/poolformer_s12.yaml' \ + -dataset='imagenet2012' \ + -batch_size=16 \ + -data_path='/dataset/imagenet' \ + -eval \ + -pretrained='./poolformer_s12' \ No newline at end of file diff --git a/image_classification/PoolFormer/run_eval_multi.sh b/image_classification/PoolFormer/run_eval_multi.sh new file mode 100644 index 00000000..d75a098a --- /dev/null +++ b/image_classification/PoolFormer/run_eval_multi.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg='./configs/poolformer_m48.yaml' \ + -dataset='imagenet2012' \ + -batch_size=128 \ + -data_path='/dataset/imagenet' \ + -eval \ + -pretrained='./poolformer_m48' diff --git a/image_classification/PoolFormer/run_train.sh b/image_classification/PoolFormer/run_train.sh new file mode 100644 index 00000000..92e0ed39 --- /dev/null +++ b/image_classification/PoolFormer/run_train.sh @@ -0,0 +1,7 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg='./configs/poolformer_s12.yaml' \ + -dataset='imagenet2012' \ + -batch_size=16 \ + -data_path='/dataset/imagenet' \ + -amp diff --git a/image_classification/PoolFormer/run_train_multi.sh b/image_classification/PoolFormer/run_train_multi.sh new file mode 100644 index 00000000..06cbfa13 --- /dev/null +++ b/image_classification/PoolFormer/run_train_multi.sh @@ -0,0 +1,7 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg='./configs/poolformer_s12.yaml' \ + -dataset='imagenet2012' \ + -batch_size=16 \ + -data_path='/dataset/imagenet' \ +# -amp diff --git a/image_classification/PoolFormer/utils.py b/image_classification/PoolFormer/utils.py new file mode 100644 index 00000000..ab0345aa --- /dev/null +++ b/image_classification/PoolFormer/utils.py @@ -0,0 +1,120 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +"""utils for ViT + +Contains AverageMeter for monitoring, get_exclude_from_decay_fn for training +and WarmupCosineScheduler for training + +""" + +import math +from paddle.optimizer.lr import LRScheduler + + +class AverageMeter(): + """ Meter for monitoring losses""" + def __init__(self): + self.avg = 0 + self.sum = 0 + self.cnt = 0 + self.reset() + + def reset(self): + """reset all values to zeros""" + self.avg = 0 + self.sum = 0 + self.cnt = 0 + + def update(self, val, n=1): + """update avg by val and n, where val is the avg of n values""" + self.sum += val * n + self.cnt += n + self.avg = self.sum / self.cnt + + + +def get_exclude_from_weight_decay_fn(exclude_list=[]): + """ Set params with no weight decay during the training + + For certain params, e.g., positional encoding in ViT, weight decay + may not needed during the learning, this method is used to find + these params. + + Args: + exclude_list: a list of params names which need to exclude + from weight decay. + Returns: + exclude_from_weight_decay_fn: a function returns True if param + will be excluded from weight decay + """ + if len(exclude_list) == 0: + exclude_from_weight_decay_fn = None + else: + def exclude_fn(param): + for name in exclude_list: + if param.endswith(name): + return False + return True + exclude_from_weight_decay_fn = exclude_fn + return exclude_from_weight_decay_fn + + +class WarmupCosineScheduler(LRScheduler): + """Warmup Cosine Scheduler + + First apply linear warmup, then apply cosine decay schedule. + Linearly increase learning rate from "warmup_start_lr" to "start_lr" over "warmup_epochs" + Cosinely decrease learning rate from "start_lr" to "end_lr" over remaining + "total_epochs - warmup_epochs" + + Attributes: + learning_rate: the starting learning rate (without warmup), not used here! + warmup_start_lr: warmup starting learning rate + start_lr: the starting learning rate (without warmup) + end_lr: the ending learning rate after whole loop + warmup_epochs: # of epochs for warmup + total_epochs: # of total epochs (include warmup) + """ + def __init__(self, + learning_rate, + warmup_start_lr, + start_lr, + end_lr, + warmup_epochs, + total_epochs, + cycles=0.5, + last_epoch=-1, + verbose=False): + """init WarmupCosineScheduler """ + self.warmup_epochs = warmup_epochs + self.total_epochs = total_epochs + self.warmup_start_lr = warmup_start_lr + self.start_lr = start_lr + self.end_lr = end_lr + self.cycles = cycles + super(WarmupCosineScheduler, self).__init__(learning_rate, last_epoch, verbose) + + def get_lr(self): + """ return lr value """ + if self.last_epoch < self.warmup_epochs: + val = (self.start_lr - self.warmup_start_lr) * float( + self.last_epoch)/float(self.warmup_epochs) + self.warmup_start_lr + return val + + progress = float(self.last_epoch - self.warmup_epochs) / float( + max(1, self.total_epochs - self.warmup_epochs)) + val = max(0.0, 0.5 * (1. + math.cos(math.pi * float(self.cycles) * 2.0 * progress))) + val = max(0.0, val * (self.start_lr - self.end_lr) + self.end_lr) + return val diff --git a/image_classification/README.md b/image_classification/README.md index 52fa227a..025a21f3 100644 --- a/image_classification/README.md +++ b/image_classification/README.md @@ -1,10 +1,31 @@ +English | [简体中文](./README_cn.md) + # PaddleViT-Classification: Visual Transformer and MLP Models for Image Classification PaddlePaddle training/validation code and pretrained models for **Image Classification**. This implementation is part of [PaddleViT](https://github.com/BR-IDL/PaddleViT.git) project. 
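
The `WarmupCosineScheduler` and `get_exclude_from_weight_decay_fn` helpers added in `utils.py` above are meant to be handed to a Paddle optimizer. A minimal sketch of that wiring (the stand-in model and the hyperparameter values are illustrative, not taken from the repo):

```python
import paddle
from utils import WarmupCosineScheduler, get_exclude_from_weight_decay_fn

model = paddle.nn.Linear(10, 10)       # stand-in model for illustration

scheduler = WarmupCosineScheduler(
    learning_rate=0.001,               # kept for the LRScheduler API, not used directly
    warmup_start_lr=5e-7,
    start_lr=0.001,
    end_lr=5e-6,
    warmup_epochs=20,
    total_epochs=300,
)
optimizer = paddle.optimizer.AdamW(
    learning_rate=scheduler,
    parameters=model.parameters(),
    weight_decay=0.05,
    # params whose names end with these suffixes are excluded from weight decay
    apply_decay_param_fun=get_exclude_from_weight_decay_fn(['bias']),
)

for epoch in range(300):
    # ... run one training epoch with `optimizer` ...
    scheduler.step()                   # advance the warmup/cosine schedule once per epoch
```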
## Update -Update (2021-08-25): Init readme uploaded. +* Update (2021-12-30): Add MobileViT model and multi scale sampler. +* Update (2021-12-28): Add HvT model. +* Update (2021-12-24): Add CvT model. +* Update (2021-12-23): Add BoTNet model. +* Update (2021-12-15): Add PoolFormer model. +* Update (2021-12-09): Add HaloNet model. +* Update (2021-12-08): Add PiT model. +* Update (2021-12-08): Add XCiT model. +* Update (2021-11-05): Update ConvMLP models. +* Update (2021-11-04): Update ConvMixer models. +* Update (2021-11-03): Update ViP models. +* Update (2021-10-28): Add MobileViT model. +* Update (2021-10-28): Add FocalTransformer model. +* Update (2021-10-28): Add CycleMLP model. +* Update (2021-10-19): Add BEiT model. +* Update (2021-10-12): Update code for training from scratch in Swin Transformer. +* Update (2021-09-28): Add AMP training. +* Update (2021-09-27): Add more ported model weights. +* Update (2021-09-09): Add FF-Only, RepMLP models. +* Update (2021-08-25): Init readme uploaded. ## Quick Start @@ -18,21 +39,39 @@ Update (2021-08-25): Init readme uploaded. 7. **[PVTv2](./PVTv2)** 8. **[Shuffle Transformer](./Shuffle_Transformer)** 9. **[T2T-ViT](./T2T_ViT)** -10. **[MLP-Mixer](./MLP-Mixer)** -11. **[ResMLP](./ResMLP)** -12. **[gMLP](./gMLP)** +10. **[CrossViT](./CrossViT)** +10. **[Focal Transformer](./Focal_Transformer)** +11. **[BEiT](./BEiT)** +11. **[MobileViT](./MobileViT)** +11. **[ViP](./ViP)** +11. **[XCiT](./XCiT)** +11. **[PiT](./PiT)** +11. **[HaloNet](./HaloNet)** +11. **[PoolFormer](./PoolFormer)** +12. **[BoTNet](./BoTNet)** +12. **[CvT](./Cvt)** +12. **[HvT](./HVT)** +13. **[MLP-Mixer](./MLP-Mixer)** +14. **[ResMLP](./ResMLP)** +15. **[gMLP](./gMLP)** +16. **[FF_Only](./FF_Only)** +17. **[RepMLP](./RepMLP)** +17. **[CycleMLP](./CycleMLP)** +17. **[ConvMixer](./ConvMixer)** +17. **[ConvMLP](./ConvMLP)** ## Installation This module is tested on Python3.6+, and PaddlePaddle 2.1.0+. Most dependencies are installed by PaddlePaddle installation. You only need to install the following packages: ```shell -pip install yacs yaml +pip install yacs pyyaml ``` Then download the github repo: ```shell git clone https://github.com/BR-IDL/PaddleViT.git cd PaddleViT/image_classification ``` +> Note: It is recommended to install the latest version of PaddlePaddle to avoid some CUDA errors for PaddleViT training. For PaddlePaddle, please refer to this [link](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html) for stable version installation and this [link](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/develop/install/pip/linux-pip.html#gpu) for develop version installation. ## Basic Usage ### Data Preparation @@ -63,8 +102,8 @@ from visual_transformer import build_vit as build_model config = get_config('./configs/vit_base_patch16_224.yaml') # build model model = build_model(config) -# load pretrained weights, .pdparams is NOT needed -model_state_dict = paddle.load('./vit_base_patch16_224') +# load pretrained weights +model_state_dict = paddle.load('./vit_base_patch16_224.pdparams') model.set_dict(model_state_dict) ``` > :robot: See the README file in each model folder for detailed usages. @@ -96,16 +135,32 @@ PaddleViT now provides the following **transfomer based models**: 8. 
**[Shuffle Transformer](./Shuffle_Transformer)** (from Tencent), released with paper [Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer](https://arxiv.org/abs/2106.03650), by Zilong Huang, Youcheng Ben, Guozhong Luo, Pei Cheng, Gang Yu, Bin Fu. 9. **[T2T-ViT](./T2T_ViT)** (from NUS and YITU), released with paper [Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet ](https://arxiv.org/abs/2101.11986), by Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zihang Jiang, Francis EH Tay, Jiashi Feng, Shuicheng Yan. +10. **[CrossViT](./CrossViT)** (from IBM), released with paper [CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification](https://arxiv.org/abs/2103.14899), by Chun-Fu Chen, Quanfu Fan, Rameswar Panda. +11. **[BEiT](./BEiT)** (from Microsoft Research), released with paper [BEiT: BERT Pre-Training of Image Transformers](https://arxiv.org/abs/2106.08254), by Hangbo Bao, Li Dong, Furu Wei. +12. **[Focal Transformer](./Focal_Transformer)** (from Microsoft), released with paper [Focal Self-attention for Local-Global Interactions in Vision Transformers](https://arxiv.org/abs/2107.00641), by Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan and Jianfeng Gao. +13. **[Mobile-ViT](./MobileViT)** (from Apple), released with paper [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178), by Sachin Mehta, Mohammad Rastegari. +14. **[ViP](./ViP)** (from Oxford/ByteDance), released with [Visual Parser: Representing Part-whole Hierarchies with Transformers](https://arxiv.org/abs/2107.05790), by Shuyang Sun, Xiaoyu Yue, Song Bai, Philip Torr. +15. **[XCiT](./XCiT)** (from Facebook/Inria/Sorbonne), released with paper [XCiT: Cross-Covariance Image Transformers](https://arxiv.org/abs/2106.09681), by Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, Hervé Jegou. +16. **[PiT](./PiT)** (from NAVER/Sogan University), released with paper [Rethinking Spatial Dimensions of Vision Transformers](https://arxiv.org/abs/2103.16302), by Byeongho Heo, Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Junsuk Choe, Seong Joon Oh. +17. **[HaloNet](./HaloNet)**, (from Google), released with paper [Scaling Local Self-Attention for Parameter Efficient Visual Backbones](https://arxiv.org/abs/2103.12731), by Ashish Vaswani, Prajit Ramachandran, Aravind Srinivas, Niki Parmar, Blake Hechtman, Jonathon Shlens.11. +18. **[PoolFormer](./PoolFormer)**, (from Sea AI Lab/NUS), released with paper [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418), by Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, Shuicheng Yan. +19. **[BoTNet](./BoTNet)**, (from UC Berkeley/Google), released with paper [Bottleneck Transformers for Visual Recognition](https://arxiv.org/abs/2101.11605), by Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, Ashish Vaswani. +20. **[CvT](./Cvt)** (from McGill/Microsoft), released with paper [CvT: Introducing Convolutions to Vision Transformers](https://arxiv.org/abs/2103.15808), by Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, Lei Zhang +21. 
**[HvT](./HVT)** (from Monash University), released with paper [Scalable Vision Transformers with Hierarchical Pooling](https://arxiv.org/abs/2103.10619), by Zizheng Pan, Bohan Zhuang, Jing Liu, Haoyu He, Jianfei Cai. + PaddleViT now provides the following **MLP based models**: 1. **[MLP-Mixer](./MLP-Mixer)** (from Google), released with paper [MLP-Mixer: An all-MLP Architecture for Vision](https://arxiv.org/abs/2105.01601), by Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy 2. **[ResMLP](./ResMLP)** (from Facebook/Sorbonne/Inria/Valeo), released with paper [ResMLP: Feedforward networks for image classification with data-efficient training](https://arxiv.org/abs/2105.03404), by Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Gautier Izacard, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, Hervé Jégou. 3. **[gMLP](./gMLP)** (from Google), released with paper [Pay Attention to MLPs](https://arxiv.org/abs/2105.08050), by Hanxiao Liu, Zihang Dai, David R. So, Quoc V. Le. +4. **[FF Only](./FF_Only)** (from Oxford), released with paper [Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet](https://arxiv.org/abs/2105.02723), by Luke Melas-Kyriazi. +5. **[RepMLP](./RepMLP)** (from BNRist/Tsinghua/MEGVII/Aberystwyth), released with paper [RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition](https://arxiv.org/abs/2105.01883), by Xiaohan Ding, Chunlong Xia, Xiangyu Zhang, Xiaojie Chu, Jungong Han, Guiguang Ding. +6. **[CycleMLP](./CycleMLP)** (from HKU/SenseTime), released with paper [CycleMLP: A MLP-like Architecture for Dense Prediction](https://arxiv.org/abs/2107.10224), by Shoufa Chen, Enze Xie, Chongjian Ge, Ding Liang, Ping Luo. +7. **[ConvMixer](./ConvMixer)** (from Anonymous), released with [Patches Are All You Need?](https://openreview.net/forum?id=TVHS5Y4dNvM), by Anonymous. +8. **[ConvMLP](./ConvMLP)** (from UO/UIUC/PAIR), released with [ConvMLP: Hierarchical Convolutional MLPs for Vision](https://arxiv.org/abs/2109.04454), by Jiachen Li, Ali Hassani, Steven Walton, Humphrey Shi. #### Coming Soon: #### -1. **[CrossViT]()** (from IBM), released with paper [CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification](https://arxiv.org/abs/2103.14899), by Chun-Fu Chen, Quanfu Fan, Rameswar Panda. -2. **[Focal Transformer]()** (from Microsoft), released with paper [Focal Self-attention for Local-Global Interactions in Vision Transformers](https://arxiv.org/abs/2107.00641), by Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan and Jianfeng Gao. -3. **[HaloNet]()**, (from Google), released with paper [Scaling Local Self-Attention for Parameter Efficient Visual Backbones](https://arxiv.org/abs/2103.12731), by Ashish Vaswani, Prajit Ramachandran, Aravind Srinivas, Niki Parmar, Blake Hechtman, Jonathon Shlens. +1. **[DynamicViT]()** (from Tsinghua/UCLA/UW), released with paper [DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification](https://arxiv.org/abs/2106.02034), by Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, Cho-Jui Hsieh. 
## Contact diff --git a/image_classification/README_cn.md b/image_classification/README_cn.md new file mode 100644 index 00000000..4bf06982 --- /dev/null +++ b/image_classification/README_cn.md @@ -0,0 +1,168 @@ +简体中文 | [English](./README.md) + +# PaddleViT-Classification:图像分类领域的Visual Transformer 和 MLP 模型 +PaddlePaddle用于图像分类的训练/评估代码和预训练模型。 + +此实现是 [PaddleViT](https://github.com/BR-IDL/PaddleViT.git) 项目的一部分. + +## 更新 +* 更新 (2021-12-30): 添加 MobileViT 模型和 multi scale sampler. +* 更新 (2021-12-28): 添加 HvT 模型. +* 更新 (2021-12-24): 添加 CvT 模型. +* 更新 (2021-12-23): 添加 BoTNet 模型. +* 更新 (2021-12-15): 添加 PoolFormer 模型. +* 更新 (2021-12-09): 添加 HaloNet 模型. +* 更新 (2021-12-08): 添加 PiT 模型. +* 更新 (2021-12-08): 添加 XCiT 模型. +* 更新 (2021-11-05): 更新 ConvMLP 模型. +* 更新 (2021-11-04): 更新 ConvMixer 模型. +* 更新 (2021-11-03): 更新 ViP 模型. +* 更新 (2021-10-28): 添加 MobileViT 模型. +* 更新 (2021-10-28): 添加 FocalTransformer 模型. +* 更新 (2021-10-28): 添加 CycleMLP 模型. +* 更新 (2021-10-19): 添加 BEiT model. +* 更新 (2021-10-12): 更新 Swin Transformer中从头开始训练的代码. +* 更新 (2021-09-28): 增加 AMP 训练. +* 更新 (2021-09-27): 添加更多ported model 权重. +* 更新 (2021-09-09): 添加 FF-Only, RepMLP 模型. +* 更新 (2021-08-25): 上传初始化readme. + +## Quick Start +以下链接提供了每个模型架构的代码以及详细用法: +1. **[ViT](./ViT)** +2. **[DeiT](./DeiT)** +3. **[Swin](./SwinTransformer)** +4. **[VOLO](./VOLO)** +5. **[CSwin](./CSwin)** +6. **[CaiT](./CaiT)** +7. **[PVTv2](./PVTv2)** +8. **[Shuffle Transformer](./Shuffle_Transformer)** +9. **[T2T-ViT](./T2T_ViT)** +10. **[CrossViT](./CrossViT)** +10. **[Focal Transformer](./Focal_Transformer)** +11. **[BEiT](./BEiT)** +11. **[MobileViT](./MobileViT)** +11. **[ViP](./ViP)** +11. **[XCiT](./XCiT)** +11. **[PiT](./PiT)** +11. **[HaloNet](./HaloNet)** +12. **[PoolFormer](./PoolFormer)** +12. **[BoTNet](./BoTNet)** +12. **[CvT](./Cvt)** +12. **[HvT](./HVT)** +13. **[MLP-Mixer](./MLP-Mixer)** +14. **[ResMLP](./ResMLP)** +15. **[gMLP](./gMLP)** +16. **[FF_Only](./FF_Only)** +17. **[RepMLP](./RepMLP)** +17. **[CycleMLP](./CycleMLP)** +17. **[ConvMixer](./ConvMixer)** +17. **[ConvMLP](./ConvMLP)** + + +## 安装 +该模块在 Python3.6+ 和 PaddlePaddle 2.1.0+ 上进行了测试,多数依赖项通过PaddlePaddle安装。 您只需要安装以下包: +```shell +pip install yacs pyyaml +``` +然后,下载github repo: +```shell +git clone https://github.com/BR-IDL/PaddleViT.git +cd PaddleViT/image_classification +``` +> 注意:建议安装最新版本的PaddlePaddle以避免PaddleViT训练时出现一些CUDA错误。PaddlePaddle 稳定版本安装请参考 [link](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html) 和 [link](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/develop/install/pip/linux-pip.html#gpu) 用于开发版本安装. + +## 基本用法 +### 数据准备 +ImageNet2012 数据集用于以下文件结构: +``` +│imagenet/ +├──train/ +│ ├── n01440764 +│ │ ├── n01440764_10026.JPEG +│ │ ├── n01440764_10027.JPEG +│ │ ├── ...... +│ ├── ...... +├──val/ +│ ├── n01440764 +│ │ ├── ILSVRC2012_val_00000293.JPEG +│ │ ├── ILSVRC2012_val_00002138.JPEG +│ │ ├── ...... +│ ├── ...... +``` +### Demo 示例 +如果需要使用具有预训练权重的模型,请转到特定子文件夹,然后下载 `.pdparam` 权重文件,并在以下python脚本中更改相关文件路径,模型配置文件位于 `./configs/`. 
+ +假设下载的权重文件存储在`./vit_base_patch16_224.pdparams`中,在python中使用`vit_base_patch16_224`模型: + +```python +from config import get_config +from visual_transformer import build_vit as build_model +# config files in ./configs/ +config = get_config('./configs/vit_base_patch16_224.yaml') +# build model +model = build_model(config) +# load pretrained weights +model_state_dict = paddle.load('./vit_base_patch16_224.pdparams') +model.set_dict(model_state_dict) +``` +> 详细用法详见各模型文件夹中的README文件。 + +## 基本概念 +PaddleViT图像分类模块是以相似结构在单独的文件夹中为每一个模型开发的,每个实现大约有3种类型的类和2种类型的脚本: +1. **Model classes** 例如 **[transformer.py](./ViT/transformer.py)**, 其中定义了核心的 *transformer model* 和相关方法. + +2. **Dataset classes** 例如 **[dataset.py](./ViT/datasets.py)**, 其中定义了 dataset, dataloader, data transforms. 我们提供了自定义数据加载的实现方式,并且支持单GPU和多GPU加载。 + +3. **Config classes** 例如 **[config.py](./ViT/config.py)**, 其中定义了模型训练/验证的配置. 通常不需要更改配置中的项目,我们通过python `arguments` 或者 `.yaml` 配置文件来更新配置。 您可以在 [here](../docs/ppvit-config.md) 查看配置设计和使用的详细信息. + +4. **main scripts** 例如 **[main_single_gpu.py](./ViT/main_single_gpu.py)**, 其中定义了整个训练/验证程序,提供了训练或者验证的主要步骤,例如日志记录、加载/保存模型、微调等. 多GPU在单独的python 脚本 `main_multi_gpu.py`中实现. + +5. **run scripts** 例如 **[run_eval_base_224.sh](./ViT/run_eval_base_224.sh)**, 其中定义了使用特定配置和参数运行python脚本的shell命令. + + +## 模型架构 + +PaddleViT 目前支持以下 **transfomer based models**: +1. **[ViT](./ViT)** (from Google), released with paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929), by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. +2. **[DeiT](./DeiT)** (from Facebook and Sorbonne), released with paper [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877), by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou. +3. **[Swin Transformer](./SwinTransformer)** (from Microsoft), released with paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030), by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo. +4. **[VOLO](./VOLO)** (from Sea AI Lab and NUS), released with paper [VOLO: Vision Outlooker for Visual Recognition](https://arxiv.org/abs/2106.13112), by Li Yuan, Qibin Hou, Zihang Jiang, Jiashi Feng, Shuicheng Yan. +5. **[CSwin Transformer](./CSwin)** (from USTC and Microsoft), released with paper [CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows +](https://arxiv.org/abs/2107.00652), by Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, Baining Guo. +6. **[CaiT](./CaiT)** (from Facebook and Sorbonne), released with paper [Going deeper with Image Transformers](https://arxiv.org/abs/2103.17239), by Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, Hervé Jégou. +7. **[PVTv2](./PVTv2)** (from NJU/HKU/NJUST/IIAI/SenseTime), released with paper [PVTv2: Improved Baselines with Pyramid Vision Transformer](https://arxiv.org/abs/2106.13797), by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao. +8. 
**[Shuffle Transformer](./Shuffle_Transformer)** (from Tencent), released with paper [Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer](https://arxiv.org/abs/2106.03650), by Zilong Huang, Youcheng Ben, Guozhong Luo, Pei Cheng, Gang Yu, Bin Fu. +9. **[T2T-ViT](./T2T_ViT)** (from NUS and YITU), released with paper [Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet +](https://arxiv.org/abs/2101.11986), by Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zihang Jiang, Francis EH Tay, Jiashi Feng, Shuicheng Yan. +10. **[CrossViT](./CrossViT)** (from IBM), released with paper [CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification](https://arxiv.org/abs/2103.14899), by Chun-Fu Chen, Quanfu Fan, Rameswar Panda. +11. **[BEiT](./BEiT)** (from Microsoft Research), released with paper [BEiT: BERT Pre-Training of Image Transformers](https://arxiv.org/abs/2106.08254), by Hangbo Bao, Li Dong, Furu Wei. +12. **[Focal Transformer](./Focal_Transformer)** (from Microsoft), released with paper [Focal Self-attention for Local-Global Interactions in Vision Transformers](https://arxiv.org/abs/2107.00641), by Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan and Jianfeng Gao. +13. **[Mobile-ViT](./MobileViT)** (from Apple), released with paper [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178), by Sachin Mehta, Mohammad Rastegari. +14. **[ViP](./ViP)** (from Oxford/ByteDance), released with [Visual Parser: Representing Part-whole Hierarchies with Transformers](https://arxiv.org/abs/2107.05790), by Shuyang Sun, Xiaoyu Yue, Song Bai, Philip Torr. +15. **[XCiT](./XCiT)** (from Facebook/Inria/Sorbonne), released with paper [XCiT: Cross-Covariance Image Transformers](https://arxiv.org/abs/2106.09681), by Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, Hervé Jegou. +16. **[PiT](./PiT)** (from NAVER/Sogan University), released with paper [Rethinking Spatial Dimensions of Vision Transformers](https://arxiv.org/abs/2103.16302), by Byeongho Heo, Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Junsuk Choe, Seong Joon Oh. +17. **[HaloNet](./HaloNet)**, (from Google), released with paper [Scaling Local Self-Attention for Parameter Efficient Visual Backbones](https://arxiv.org/abs/2103.12731), by Ashish Vaswani, Prajit Ramachandran, Aravind Srinivas, Niki Parmar, Blake Hechtman, Jonathon Shlens. +18. **[PoolFormer](./PoolFormer)**, (from Sea AI Lab/NUS), released with paper [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418), by Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, Shuicheng Yan. +19. **[BoTNet](./BoTNet)**, (from UC Berkeley/Google), released with paper [Bottleneck Transformers for Visual Recognition](https://arxiv.org/abs/2101.11605), by Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, Ashish Vaswani. +20. **[CvT](./Cvt)** (from McGill/Microsoft), released with paper [CvT: Introducing Convolutions to Vision Transformers](https://arxiv.org/abs/2103.15808), by Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, Lei Zhang +21. 
**[HvT](./HVT)** (from Monash University), released with paper [Scalable Vision Transformers with Hierarchical Pooling](https://arxiv.org/abs/2103.10619), by Zizheng Pan, Bohan Zhuang, Jing Liu, Haoyu He, Jianfei Cai. + + +PaddleViT 目前支持以下 **MLP based models**: +1. **[MLP-Mixer](./MLP-Mixer)** (from Google), released with paper [MLP-Mixer: An all-MLP Architecture for Vision](https://arxiv.org/abs/2105.01601), by Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy +2. **[ResMLP](./ResMLP)** (from Facebook/Sorbonne/Inria/Valeo), released with paper [ResMLP: Feedforward networks for image classification with data-efficient training](https://arxiv.org/abs/2105.03404), by Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Gautier Izacard, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, Hervé Jégou. +3. **[gMLP](./gMLP)** (from Google), released with paper [Pay Attention to MLPs](https://arxiv.org/abs/2105.08050), by Hanxiao Liu, Zihang Dai, David R. So, Quoc V. Le. +4. **[FF Only](./FF_Only)** (from Oxford), released with paper [Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet](https://arxiv.org/abs/2105.02723), by Luke Melas-Kyriazi. +5. **[RepMLP](./RepMLP)** (from BNRist/Tsinghua/MEGVII/Aberystwyth), released with paper [RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition](https://arxiv.org/abs/2105.01883), by Xiaohan Ding, Chunlong Xia, Xiangyu Zhang, Xiaojie Chu, Jungong Han, Guiguang Ding. +6. **[CycleMLP](./CycleMLP)** (from HKU/SenseTime), released with paper [CycleMLP: A MLP-like Architecture for Dense Prediction](https://arxiv.org/abs/2107.10224), by Shoufa Chen, Enze Xie, Chongjian Ge, Ding Liang, Ping Luo. +7. **[ConvMixer](./ConvMixer)** (from Anonymous), released with [Patches Are All You Need?](https://openreview.net/forum?id=TVHS5Y4dNvM), by Anonymous. +8. **[ConvMLP](./ConvMLP)** (from UO/UIUC/PAIR), released with [ConvMLP: Hierarchical Convolutional MLPs fo + + +#### 即将更新: #### +1. **[DynamicViT]()** (from Tsinghua/UCLA/UW), released with paper [DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification](https://arxiv.org/abs/2106.02034), by Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, Cho-Jui Hsieh. + + +## Contact +如果您有任何问题, 请在我们的Github上创建一个[issue](https://github.com/BR-IDL/PaddleViT/issues). diff --git a/image_classification/RepMLP/README.md b/image_classification/RepMLP/README.md new file mode 100644 index 00000000..ff113e8e --- /dev/null +++ b/image_classification/RepMLP/README.md @@ -0,0 +1,176 @@ +# RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition, [arxiv](https://arxiv.org/abs/2105.01883) + +PaddlePaddle training/validation code and pretrained models for **RepMLP**. + +The official pytorch implementation is [here](https://github.com/DingXiaoH/RepMLP). + + +This implementation is developed by [PaddleViT](https://github.com/BR-IDL/PaddleViT.git). + +

+drawing +

RepMLP Model Overview

+

+
+
+
+
+
+### Update
+- Update (2021-09-27): Model FLOPs and # params are uploaded.
+- Update (2021-09-14): Code is released and ported weights are uploaded.
+
+## Models Zoo
+
+| Model                         | Acc@1 | Acc@5 | #Params | FLOPs  | Image Size | Crop_pct | Interpolation | Link         |
+|-------------------------------|-------|-------|---------|--------|------------|----------|---------------|--------------|
+| repmlp_res50_light_224        | 77.01 | 93.46 | 87.1M   | 3.3G   | 224        | 0.875    | bicubic       | [google](https://drive.google.com/file/d/16bCFa-nc_-tPVol-UCczrrDO_bCFf2uM/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1bzmpS6qJJTsOq3SQE7IOyg)(b4fg) |
+
+
+> *The results are evaluated on ImageNet2012 validation set.
+>
+> Note: RepMLP weights are ported from [here](https://github.com/DingXiaoH/RepMLP).
+
+
+
+## Notebooks
+We provide a few notebooks in aistudio to help you get started:
+
+**\*(coming soon)\***
+
+
+## Requirements
+- Python>=3.6
+- yaml>=0.2.5
+- [PaddlePaddle](https://www.paddlepaddle.org.cn/documentation/docs/en/install/index_en.html)>=2.1.0
+- [yacs](https://github.com/rbgirshick/yacs)>=0.1.8
+
+## Data
+ImageNet2012 dataset is used in the following folder structure:
+```
+│imagenet/
+├──train/
+│  ├── n01440764
+│  │   ├── n01440764_10026.JPEG
+│  │   ├── n01440764_10027.JPEG
+│  │   ├── ......
+│  ├── ......
+├──val/
+│  ├── n01440764
+│  │   ├── ILSVRC2012_val_00000293.JPEG
+│  │   ├── ILSVRC2012_val_00002138.JPEG
+│  │   ├── ......
+│  ├── ......
+```
+
+## Usage
+To use the model with pretrained weights, download the `.pdparams` weight file and change related file paths in the following python scripts. The model config files are located in `./configs/`.
+
+For example, assume the downloaded weight file is stored in `./RepMLP-Res50-light-224_train.pdparams`, to use the `RepMLP-Res50-light-224_train` model in python:
+```python
+import paddle
+from config import get_config
+from repmlp_resnet import build_repmlp_resnet as build_model
+# config files in ./configs/
+config = get_config('./configs/repmlpres50_light_224_train.yaml')
+# build model
+model = build_model(config)
+# load pretrained weights
+model_state_dict = paddle.load('./RepMLP-Res50-light-224_train.pdparams')
+model.set_dict(model_state_dict)
+```
+
+## Evaluation
+To evaluate RepMLP model performance on ImageNet2012 with a single GPU, run the following script using command line:
+```shell
+sh run_eval.sh
+```
+or
+```shell
+CUDA_VISIBLE_DEVICES=0 \
+python main_single_gpu.py \
+    -cfg=./configs/repmlpres50_light_224_train.yaml \
+    -dataset=imagenet2012 \
+    -batch_size=128 \
+    -data_path=/path/to/dataset/imagenet/val \
+    -eval \
+    -pretrained=/path/to/pretrained/model/RepMLP-Res50-light-224_train  # .pdparams is NOT needed
+```
+
+ + +Run evaluation using multi-GPUs: + + + +```shell +sh run_eval_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=4,5,6,7 \ +python main_multi_gpu.py \ + -cfg=./configs/repmlpres50_light_224_train.yaml \ + -dataset=imagenet2012 \ + -batch_size=16 \ + -data_path=/path/to/dataset/imagenet/val \ + -eval \ + -pretrained=/path/to/pretrained/model/RepMLP-Res50-light-224_train # .pdparams is NOT needed +``` + +
+ +## Training +To train the ResMLP Transformer model on ImageNet2012 with single GPUs, run the following script using command line: +```shell +sh run_train.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg=./configs/repmlpres50_light_224_train.yaml \ + -dataset=imagenet2012 \ + -batch_size=32 \ + -data_path=/path/to/dataset/imagenet/train +``` + +
+ +Run training using multi-GPUs: + + + +```shell +sh run_train_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg=./configs/repmlpres50_light_224_train.yaml \ + -dataset=imagenet2012 \ + -batch_size=16 \ + -data_path=/path/to/dataset/imagenet/train +``` + +
+ + +## Visualization Attention Map +**(coming soon)** + +## Reference +``` +@article{ding2021repmlp, +title={RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition}, +author={Ding, Xiaohan and Xia, Chunlong and Zhang, Xiangyu and Chu, Xiaojie and Han, Jungong and Ding, Guiguang}, +journal={arXiv preprint arXiv:2105.01883}, +year={2021} +}@article{melaskyriazi2021doyoueven, + title={Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet}, + author={Luke Melas-Kyriazi}, + journal=arxiv, + year=2021 +} +``` diff --git a/image_classification/RepMLP/__init__.py b/image_classification/RepMLP/__init__.py new file mode 100644 index 00000000..e2cbd538 --- /dev/null +++ b/image_classification/RepMLP/__init__.py @@ -0,0 +1 @@ +#init diff --git a/image_classification/RepMLP/augment.py b/image_classification/RepMLP/augment.py new file mode 100644 index 00000000..7a7f081c --- /dev/null +++ b/image_classification/RepMLP/augment.py @@ -0,0 +1,285 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Augmentation""" +""" Rand Augmentation """ +# reference: RandAugment: Practical automated data augmentation with a reduced search space +# https://arxiv.org/abs/1909.13719 + +""" Auto Augmentation """ +# reference: AutoAugment: Learning Augmentation Policies from Data +# https://arxiv.org/abs/1805.09501 + +import random +import numpy as np +from PIL import Image, ImageEnhance, ImageOps + + +def auto_augment_policy_original(): + """25 types of augment policies in original paper""" + policy = [ + [('Posterize', 0.4, 8), ('Rotate', 0.6, 9)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + [('Posterize', 0.6, 7), ('Posterize', 0.6, 6)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Equalize', 0.4, 4), ('Rotate', 0.8, 8)], + [('Solarize', 0.6, 3), ('Equalize', 0.6, 7)], + [('Posterize', 0.8, 5), ('Equalize', 1.0, 2)], + [('Rotate', 0.2, 3), ('Solarize', 0.6, 8)], + [('Equalize', 0.6, 8), ('Posterize', 0.4, 6)], + [('Rotate', 0.8, 8), ('Color', 0.4, 0)], + [('Rotate', 0.4, 9), ('Equalize', 0.6, 2)], + [('Equalize', 0.0, 7), ('Equalize', 0.8, 8)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Rotate', 0.8, 8), ('Color', 1.0, 2)], + [('Color', 0.8, 8), ('Solarize', 0.8, 7)], + [('Sharpness', 0.4, 7), ('Invert', 0.6, 8)], + [('ShearX', 0.6, 5), ('Equalize', 1.0, 9)], + [('Color', 0.4, 0), ('Equalize', 0.6, 3)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + ] + policy = [[SubPolicy(*args) for args in subpolicy] for subpolicy in policy] + return policy + + +def rand_augment_policy_original(magnitude_idx=9): + """ + 14 types of augment policies in original paper + Args: + magnitude_idx: M + """ + 
policy = [ + ('Posterize', 1, magnitude_idx), ('Rotate', 1, magnitude_idx), + ('Solarize', 1, magnitude_idx), ('AutoContrast', 1, magnitude_idx), + ('Equalize', 1, magnitude_idx), ('Contrast', 1, magnitude_idx), + ('Color', 1, magnitude_idx), ('Invert', 1, magnitude_idx), + ('Sharpness', 1, magnitude_idx), ('Brightness', 1, magnitude_idx), + ('ShearX', 1, magnitude_idx), ('ShearY', 1, magnitude_idx), + ('TranslateX', 1, magnitude_idx), ('TranslateY', 1, magnitude_idx), + ] + policy = [SubPolicy(*args) for args in policy] + return policy + + +class AutoAugment(): + """Auto Augment + Randomly choose a tuple of augment ops from a list of policy + Then apply the tuple of augment ops to input image + + Examples: + policy = auto_augment_policy_original() + augment = AutoAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy): + self.policy = policy + + def __call__(self, image, policy_idx=None): + if policy_idx is None: + policy_idx = random.randint(0, len(self.policy) - 1) + + sub_policy = self.policy[policy_idx] + for op in sub_policy: + image = op(image) + return image + + +class RandAugment(): + """Rand Augment + Randomly choose N augment ops from a list of K policies + Then apply the N ops to input image + + Examples: + policy = rand_augment_policy_original(magnitude_idx) + augment = RandAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy, num_layers=2): + """ + Args: + policy: list of SubPolicy + num_layers: int + """ + self.policy = policy + self.num_layers = num_layers + + def __call__(self, image): + selected_idx = np.random.choice(len(self.policy), self.num_layers) + + for policy_idx in selected_idx: + sub_policy = self.policy[policy_idx] + image = sub_policy(image) + return image + + +class SubPolicy: + """Subpolicy + Read augment name and magnitude, apply augment with probability + Args: + op_name: str, augment operation name + prob: float, if prob > random prob, apply augment + magnitude_idx: int, index of magnitude in preset magnitude ranges + """ + + def __init__(self, op_name, prob, magnitude_idx): + # ranges of operations' magnitude + ranges = { + 'ShearX': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'ShearY': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'TranslateX': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'TranslateY': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'Rotate': np.linspace(0, 30, 10), # [-30, 30] (by random negative) + 'Color': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Posterize': np.round(np.linspace(8, 4, 10), 0).astype(np.int), # [0, 4] + 'Solarize': np.linspace(256, 0, 10), # [0, 256] + 'Contrast': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Sharpness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Brightness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'AutoContrast': [0] * 10, # no range + 'Equalize': [0] * 10, # no range + 'Invert': [0] * 10, # no range + } + + # augmentation operations + # Lambda is not pickleable for DDP + # image_ops = { + # 'ShearX': lambda image, magnitude: shear_x(image, magnitude), + # 'ShearY': lambda image, magnitude: shear_y(image, magnitude), + # 'TranslateX': lambda image, magnitude: translate_x(image, magnitude), + # 'TranslateY': lambda image, magnitude: translate_y(image, magnitude), + # 'Rotate': lambda image, magnitude: rotate(image, magnitude), + # 'AutoContrast': lambda image, magnitude: 
auto_contrast(image, magnitude), + # 'Invert': lambda image, magnitude: invert(image, magnitude), + # 'Equalize': lambda image, magnitude: equalize(image, magnitude), + # 'Solarize': lambda image, magnitude: solarize(image, magnitude), + # 'Posterize': lambda image, magnitude: posterize(image, magnitude), + # 'Contrast': lambda image, magnitude: contrast(image, magnitude), + # 'Color': lambda image, magnitude: color(image, magnitude), + # 'Brightness': lambda image, magnitude: brightness(image, magnitude), + # 'Sharpness': lambda image, magnitude: sharpness(image, magnitude), + # } + image_ops = { + 'ShearX': shear_x, + 'ShearY': shear_y, + 'TranslateX': translate_x_relative, + 'TranslateY': translate_y_relative, + 'Rotate': rotate, + 'AutoContrast': auto_contrast, + 'Invert': invert, + 'Equalize': equalize, + 'Solarize': solarize, + 'Posterize': posterize, + 'Contrast': contrast, + 'Color': color, + 'Brightness': brightness, + 'Sharpness': sharpness, + } + + self.prob = prob + self.magnitude = ranges[op_name][magnitude_idx] + self.op = image_ops[op_name] + + def __call__(self, image): + if self.prob > random.random(): + image = self.op(image, self.magnitude) + return image + + +# PIL Image transforms +# https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.transform +def shear_x(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, factor, 0, 0, 1, 0), fillcolor=fillcolor) + + +def shear_y(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, factor, 1, 0), fillcolor=fillcolor) + + +def translate_x_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, pixels, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, pixels), fillcolor=fillcolor) + + +def translate_x_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, magnitude, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, magnitude), fillcolor=fillcolor) + + +def rotate(image, magnitude): + rot = image.convert("RGBA").rotate(magnitude) + return Image.composite(rot, + Image.new('RGBA', rot.size, (128,) * 4), + rot).convert(image.mode) + + +def auto_contrast(image, magnitude=None): + return ImageOps.autocontrast(image) + + +def invert(image, magnitude=None): + return ImageOps.invert(image) + + +def equalize(image, magnitude=None): + return ImageOps.equalize(image) + + +def solarize(image, magnitude): + return ImageOps.solarize(image, magnitude) + + +def posterize(image, magnitude): + return ImageOps.posterize(image, magnitude) + + +def contrast(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Contrast(image).enhance(1 + magnitude) + + +def color(image, 
magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Color(image).enhance(1 + magnitude) + + +def brightness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Brightness(image).enhance(1 + magnitude) + + +def sharpness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Sharpness(image).enhance(1 + magnitude) + diff --git a/image_classification/RepMLP/config.py b/image_classification/RepMLP/config.py new file mode 100644 index 00000000..cf2e580d --- /dev/null +++ b/image_classification/RepMLP/config.py @@ -0,0 +1,183 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Configuration + +Configuration for data, model archtecture, and training, etc. +Config can be set by .yaml file or by argparser(limited usage) + + +""" + +import os +from yacs.config import CfgNode as CN +import yaml + +_C = CN() +_C.BASE = [''] + +# data settings +_C.DATA = CN() +_C.DATA.BATCH_SIZE = 8 #1024 batch_size for single GPU +_C.DATA.BATCH_SIZE_EVAL = 8 #1024 batch_size for single GPU +_C.DATA.DATA_PATH = '/dataset/imagenet/' # path to dataset +_C.DATA.DATASET = 'imagenet2012' # dataset name +_C.DATA.IMAGE_SIZE = 224 # input image size +_C.DATA.CROP_PCT = 0.9 # input image scale ratio, scale is applied before centercrop in eval mode +_C.DATA.NUM_WORKERS = 4 # number of data loading threads +_C.DATA.IMAGENET_MEAN = [0.485, 0.456, 0.406] # [0.5, 0.5, 0.5] +_C.DATA.IMAGENET_STD = [0.229, 0.224, 0.225] # [0.5, 0.5, 0.5] + +# model settings +_C.MODEL = CN() +_C.MODEL.TYPE = 'RepMLP_ResNet' +_C.MODEL.NAME = 'repmlpres50_light_224_train' +_C.MODEL.RESUME = None +_C.MODEL.PRETRAINED = None +_C.MODEL.NUM_CLASSES = 1000 +_C.MODEL.DROPOUT = 0.0 +_C.MODEL.ATTENTION_DROPOUT = 0.0 +_C.MODEL.DROP_PATH = 0.1 + +# transformer settings +_C.MODEL.MIXER = CN() +_C.MODEL.MIXER.NUM_BLOCKS=[3,4,6,3] +_C.MODEL.MIXER.BLOCK_TYPE='light' +_C.MODEL.MIXER.IMG_H=224 +_C.MODEL.MIXER.IMG_W=224 +_C.MODEL.MIXER.H=7 +_C.MODEL.MIXER.W=7 +_C.MODEL.MIXER.REPARAM_CONV_K=(1,3,5) +_C.MODEL.MIXER.FC1_FC2_REDUCTION=1 +_C.MODEL.MIXER.FC3_GROUPS=4 +_C.MODEL.MIXER.DEPLOY=False + +# training settings +_C.TRAIN = CN() +_C.TRAIN.LAST_EPOCH = 0 +_C.TRAIN.NUM_EPOCHS = 300 +_C.TRAIN.WARMUP_EPOCHS = 20 +_C.TRAIN.WEIGHT_DECAY = 0.05 +_C.TRAIN.BASE_LR = 0.001 +_C.TRAIN.WARMUP_START_LR = 5e-7 +_C.TRAIN.END_LR = 5e-6 +_C.TRAIN.GRAD_CLIP = 5.0 +_C.TRAIN.ACCUM_ITER = 1 +_C.TRAIN.LINEAR_SCALED_LR = None + +_C.TRAIN.LR_SCHEDULER = CN() +_C.TRAIN.LR_SCHEDULER.NAME = 'warmupcosine' +_C.TRAIN.LR_SCHEDULER.MILESTONES = "30, 60, 90" # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_EPOCHS = 30 # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_RATE = 0.1 # only used in StepLRScheduler + +_C.TRAIN.OPTIMIZER = CN() +_C.TRAIN.OPTIMIZER.NAME = 'AdamW' +_C.TRAIN.OPTIMIZER.EPS = 1e-8 +_C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999) # for adamW 
+_C.TRAIN.OPTIMIZER.MOMENTUM = 0.9 + +# train augmentation +_C.TRAIN.MIXUP_ALPHA = 0.8 +_C.TRAIN.CUTMIX_ALPHA = 1.0 +_C.TRAIN.CUTMIX_MINMAX = None +_C.TRAIN.MIXUP_PROB = 1.0 +_C.TRAIN.MIXUP_SWITCH_PROB = 0.5 +_C.TRAIN.MIXUP_MODE = 'batch' + +_C.TRAIN.SMOOTHING = 0.1 +_C.TRAIN.COLOR_JITTER = 0.4 +_C.TRAIN.AUTO_AUGMENT = False #'rand-m9-mstd0.5-inc1' +_C.TRAIN.RAND_AUGMENT = False + +_C.TRAIN.RANDOM_ERASE_PROB = 0.25 +_C.TRAIN.RANDOM_ERASE_MODE = 'pixel' +_C.TRAIN.RANDOM_ERASE_COUNT = 1 +_C.TRAIN.RANDOM_ERASE_SPLIT = False + +# misc +_C.SAVE = "./output" +_C.TAG = "default" +_C.SAVE_FREQ = 20 # freq to save chpt +_C.REPORT_FREQ = 50 # freq to logging info +_C.VALIDATE_FREQ = 20 # freq to do validation +_C.SEED = 0 +_C.EVAL = False # run evaluation only +_C.AMP = False +_C.LOCAL_RANK = 0 +_C.NGPUS = -1 + + +def _update_config_from_file(config, cfg_file): + config.defrost() + with open(cfg_file, 'r') as infile: + yaml_cfg = yaml.load(infile, Loader=yaml.FullLoader) + for cfg in yaml_cfg.setdefault('BASE', ['']): + if cfg: + _update_config_from_file( + config, os.path.join(os.path.dirname(cfg_file), cfg) + ) + print('merging config from {}'.format(cfg_file)) + config.merge_from_file(cfg_file) + config.freeze() + +def update_config(config, args): + """Update config by ArgumentParser + Args: + args: ArgumentParser contains options + Return: + config: updated config + """ + if args.cfg: + _update_config_from_file(config, args.cfg) + config.defrost() + if args.dataset: + config.DATA.DATASET = args.dataset + if args.batch_size: + config.DATA.BATCH_SIZE = args.batch_size + if args.image_size: + config.DATA.IMAGE_SIZE = args.image_size + if args.num_classes: + config.MODEL.NUM_CLASSES = args.num_classes + if args.data_path: + config.DATA.DATA_PATH = args.data_path + if args.output is not None: + config.SAVE = args.output + if args.ngpus: + config.NGPUS = args.ngpus + if args.eval: + config.EVAL = True + config.DATA.BATCH_SIZE_EVAL = args.batch_size + if args.pretrained: + config.MODEL.PRETRAINED = args.pretrained + if args.resume: + config.MODEL.RESUME = args.resume + if args.last_epoch: + config.TRAIN.LAST_EPOCH = args.last_epoch + if args.amp: # only during training + if config.EVAL is True: + config.AMP = False + else: + config.AMP = True + + #config.freeze() + return config + + +def get_config(cfg_file=None): + """Return a clone of config or load from yaml file""" + config = _C.clone() + if cfg_file: + _update_config_from_file(config, cfg_file) + return config diff --git a/image_classification/RepMLP/configs/repmlpres50_light_224_train.yaml b/image_classification/RepMLP/configs/repmlpres50_light_224_train.yaml new file mode 100644 index 00000000..e837967c --- /dev/null +++ b/image_classification/RepMLP/configs/repmlpres50_light_224_train.yaml @@ -0,0 +1,12 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: RepMLP_ResNet + NAME: repmlpres50_light_224_train + MIXER: + NUM_BLOCKS: [3,4,6,3] + BLOCK_TYPE: 'light' + DEPLOY: False + + diff --git a/image_classification/RepMLP/convert.py b/image_classification/RepMLP/convert.py new file mode 100644 index 00000000..cb790094 --- /dev/null +++ b/image_classification/RepMLP/convert.py @@ -0,0 +1,33 @@ +import argparse +import paddle +import os +from repmlp import repmlp_model_convert +from config import get_config +from repmlp_resnet import build_repmlp_resnet as build_model + +parser = argparse.ArgumentParser(description='RepMLP_ResNet Conversion') +parser.add_argument('--load_path', help='path to the weights file') 
+parser.add_argument('--save_path', help='path to the weights file')
+parser.add_argument('--arch', default='RepMLP-Res50-light-224', help='convert architecture')
+
+def convert():
+    args = parser.parse_args()
+    if args.arch == 'RepMLP-Res50-light-224':
+        config = get_config('./configs/repmlpres50_light_224_train.yaml')
+        train_model = build_model(config)
+    else:
+        raise ValueError('TODO')
+
+    if os.path.isfile(args.load_path):
+        print("=> loading checkpoint '{}'".format(args.load_path))
+        train_model.set_state_dict(paddle.load(args.load_path))
+        print("=> loading done")
+    else:
+        print("=> no checkpoint found at '{}'".format(args.load_path))
+
+    print("=> convert training to deploy ...")
+    repmlp_model_convert(train_model, save_path=args.save_path)
+
+
+if __name__ == '__main__':
+    convert()
\ No newline at end of file
diff --git a/image_classification/RepMLP/datasets.py b/image_classification/RepMLP/datasets.py
new file mode 100644
index 00000000..304df9a3
--- /dev/null
+++ b/image_classification/RepMLP/datasets.py
@@ -0,0 +1,222 @@
+# Copyright (c) 2021 PPViT Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+Dataset related classes and methods for ViT training and validation
+Cifar10, Cifar100 and ImageNet2012 are supported
+"""
+
+import os
+import math
+from PIL import Image
+from paddle.io import Dataset
+from paddle.io import DataLoader
+from paddle.io import DistributedBatchSampler
+from paddle.vision import transforms
+from paddle.vision import datasets
+from paddle.vision import image_load
+from augment import auto_augment_policy_original
+from augment import AutoAugment
+from augment import rand_augment_policy_original
+from augment import RandAugment
+from transforms import RandomHorizontalFlip
+from random_erasing import RandomErasing
+
+
+class ImageNet2012Dataset(Dataset):
+    """Build ImageNet2012 dataset
+
+    This class gets train/val imagenet datasets, which loads transformed data and labels.
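The conversion script above loads a training-mode checkpoint and re-parameterizes it into the deploy form via `repmlp_model_convert`. A typical invocation might look like the following; the checkpoint paths are placeholders:

```bash
# Hypothetical paths: convert a trained checkpoint into its deploy-mode equivalent.
python convert.py \
    --load_path ./output/repmlpres50_light_224_train.pdparams \
    --save_path ./output/repmlpres50_light_224_deploy.pdparams \
    --arch RepMLP-Res50-light-224
```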
+ + Attributes: + file_folder: path where imagenet images are stored + transform: preprocessing ops to apply on image + img_path_list: list of full path of images in whole dataset + label_list: list of labels of whole dataset + """ + + def __init__(self, file_folder, mode="train", transform=None): + """Init ImageNet2012 Dataset with dataset file path, mode(train/val), and transform""" + super(ImageNet2012Dataset, self).__init__() + assert mode in ["train", "val"] + self.file_folder = file_folder + self.transform = transform + self.img_path_list = [] + self.label_list = [] + + if mode == "train": + self.list_file = os.path.join(self.file_folder, "train_list.txt") + else: + self.list_file = os.path.join(self.file_folder, "val_list.txt") + + with open(self.list_file, 'r') as infile: + for line in infile: + img_path = line.strip().split()[0] + img_label = int(line.strip().split()[1]) + self.img_path_list.append(os.path.join(self.file_folder, img_path)) + self.label_list.append(img_label) + print(f'----- Imagenet2012 image {mode} list len = {len(self.label_list)}') + + def __len__(self): + return len(self.label_list) + + def __getitem__(self, index): + data = image_load(self.img_path_list[index]).convert('RGB') + data = self.transform(data) + label = self.label_list[index] + + return data, label + + +def get_train_transforms(config): + """ Get training transforms + + For training, a RandomResizedCrop is applied, then normalization is applied with + [0.5, 0.5, 0.5] mean and std. The input pixel values must be rescaled to [0, 1.] + Outputs is converted to tensor + + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + + aug_op_list = [] + # STEP1: random crop and resize + aug_op_list.append( + transforms.RandomResizedCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE), + scale=(0.05, 1.0), interpolation='bicubic')) + # STEP2: auto_augment or color jitter + if config.TRAIN.AUTO_AUGMENT: + policy = auto_augment_policy_original() + auto_augment = AutoAugment(policy) + aug_op_list.append(auto_augment) + elif config.TRAIN.RAND_AUGMENT: + policy = rand_augment_policy_original() + rand_augment = RandAugment(policy) + aug_op_list.append(rand_augment) + else: + jitter = (float(config.TRAIN.COLOR_JITTER), ) * 3 + aug_op_list.append(transforms.ColorJitter(*jitter)) + # STEP3: other ops + aug_op_list.append(transforms.ToTensor()) + aug_op_list.append(transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, + std=config.DATA.IMAGENET_STD)) + # STEP4: random erasing + if config.TRAIN.RANDOM_ERASE_PROB > 0.: + random_erasing = RandomErasing(prob=config.TRAIN.RANDOM_ERASE_PROB, + mode=config.TRAIN.RANDOM_ERASE_MODE, + max_count=config.TRAIN.RANDOM_ERASE_COUNT, + num_splits=config.TRAIN.RANDOM_ERASE_SPLIT) + aug_op_list.append(random_erasing) + # Final: compose transforms and return + transforms_train = transforms.Compose(aug_op_list) + return transforms_train + + +def get_val_transforms(config): + """ Get training transforms + + For validation, image is first Resize then CenterCrop to image_size. + Then normalization is applied with [0.5, 0.5, 0.5] mean and std. + The input pixel values must be rescaled to [0, 1.] 
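ImageNet2012Dataset expects a `train_list.txt` / `val_list.txt` inside the data folder, one `relative/path<space>integer-label` pair per line. A minimal wiring sketch (the folder layout and file names below are illustrative):

```python
# Illustrative layout under config.DATA.DATA_PATH:
#   train/n01440764/n01440764_10026.JPEG ...   (images)
#   train_list.txt  -> lines such as "train/n01440764/n01440764_10026.JPEG 0"
#   val_list.txt    -> lines such as "val/ILSVRC2012_val_00000001.JPEG 65"
from config import get_config
from datasets import ImageNet2012Dataset, get_train_transforms

config = get_config()
dataset = ImageNet2012Dataset(file_folder=config.DATA.DATA_PATH,
                              mode='train',
                              transform=get_train_transforms(config))
image, label = dataset[0]   # normalized CHW tensor and an integer class index
```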
+ Outputs is converted to tensor + + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + + scale_size = int(math.floor(config.DATA.IMAGE_SIZE / config.DATA.CROP_PCT)) + transforms_val = transforms.Compose([ + transforms.Resize(scale_size, interpolation='bicubic'), + transforms.CenterCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE)), + transforms.ToTensor(), + transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD), + ]) + return transforms_val + + +def get_dataset(config, mode='train'): + """ Get dataset from config and mode (train/val) + + Returns the related dataset object according to configs and mode(train/val) + + Args: + config: configs contains dataset related settings. see config.py for details + Returns: + dataset: dataset object + """ + + assert mode in ['train', 'val'] + if config.DATA.DATASET == "cifar10": + if mode == 'train': + dataset = datasets.Cifar10(mode=mode, transform=get_train_transforms(config)) + else: + mode = 'test' + dataset = datasets.Cifar10(mode=mode, transform=get_val_transforms(config)) + elif config.DATA.DATASET == "cifar100": + if mode == 'train': + dataset = datasets.Cifar100(mode=mode, transform=get_train_transforms(config)) + else: + mode = 'test' + dataset = datasets.Cifar100(mode=mode, transform=get_val_transforms(config)) + elif config.DATA.DATASET == "imagenet2012": + if mode == 'train': + dataset = ImageNet2012Dataset(config.DATA.DATA_PATH, + mode=mode, + transform=get_train_transforms(config)) + else: + dataset = ImageNet2012Dataset(config.DATA.DATA_PATH, + mode=mode, + transform=get_val_transforms(config)) + else: + raise NotImplementedError( + "[{config.DATA.DATASET}] Only cifar10, cifar100, imagenet2012 are supported now") + return dataset + + +def get_dataloader(config, dataset, mode='train', multi_process=False): + """Get dataloader with config, dataset, mode as input, allows multiGPU settings. + + Multi-GPU loader is implements as distributedBatchSampler. + + Args: + config: see config.py for details + dataset: paddle.io.dataset object + mode: train/val + multi_process: if True, use DistributedBatchSampler to support multi-processing + Returns: + dataloader: paddle.io.DataLoader object. + """ + + if mode == 'train': + batch_size = config.DATA.BATCH_SIZE + else: + batch_size = config.DATA.BATCH_SIZE_EVAL + + if multi_process is True: + sampler = DistributedBatchSampler(dataset, + batch_size=batch_size, + shuffle=(mode == 'train')) + dataloader = DataLoader(dataset, + batch_sampler=sampler, + num_workers=config.DATA.NUM_WORKERS) + else: + dataloader = DataLoader(dataset, + batch_size=batch_size, + num_workers=config.DATA.NUM_WORKERS, + shuffle=(mode == 'train')) + return dataloader diff --git a/image_classification/RepMLP/droppath.py b/image_classification/RepMLP/droppath.py new file mode 100644 index 00000000..c8fe8048 --- /dev/null +++ b/image_classification/RepMLP/droppath.py @@ -0,0 +1,50 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
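Putting `get_dataset` and `get_dataloader` together, a single-process validation loader can be built roughly as follows (batch shapes reflect the defaults above):

```python
# Sketch: build the validation pipeline without distributed sampling.
from config import get_config
from datasets import get_dataset, get_dataloader

config = get_config()
dataset_val = get_dataset(config, mode='val')
dataloader_val = get_dataloader(config, dataset_val, mode='val', multi_process=False)

for images, labels in dataloader_val:
    print(images.shape, labels.shape)   # e.g. [8, 3, 224, 224] and [8]
    break
```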
+# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Droppath, reimplement from https://github.com/yueatsprograms/Stochastic_Depth +""" + +import paddle +import paddle.nn as nn + +def drop_path(inputs, drop_prob=0., training=False): + """drop path op + Args: + input: tensor with arbitrary shape + drop_prob: float number of drop path probability, default: 0.0 + training: bool, if current mode is training, default: False + Returns: + output: output tensor after drop path + """ + # if prob is 0 or eval mode, return original input + if drop_prob == 0. or not training: + return inputs + keep_prob = 1 - drop_prob + keep_prob = paddle.to_tensor(keep_prob) + shape = (inputs.shape[0], ) + (1, ) * (inputs.ndim - 1) # shape=(N, 1, 1, 1) + random_tensor = keep_prob + paddle.rand(shape, dtype=inputs.dtype) + random_tensor = random_tensor.floor() # mask + output = inputs.divide(keep_prob) * random_tensor # divide is to keep same output expectation + return output + + +class DropPath(nn.Layer): + """DropPath class""" + def __init__(self, drop_prob=None): + super(DropPath, self).__init__() + self.drop_prob = drop_prob + + def forward(self, inputs): + return drop_path(inputs, self.drop_prob, self.training) diff --git a/image_classification/RepMLP/losses.py b/image_classification/RepMLP/losses.py new file mode 100644 index 00000000..082467a3 --- /dev/null +++ b/image_classification/RepMLP/losses.py @@ -0,0 +1,123 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
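The `drop_path` op above implements stochastic depth: with probability `drop_prob` the whole residual branch of a sample is zeroed, and surviving samples are rescaled by `1 / keep_prob` so the expected output is unchanged; at eval time it is an identity. A sketch of the usual placement inside a residual block (the block itself is illustrative, not part of RepMLP):

```python
import paddle
import paddle.nn as nn
from droppath import DropPath

class ResidualBlock(nn.Layer):
    # illustrative block showing where DropPath normally sits
    def __init__(self, dim, drop_path_rate=0.1):
        super().__init__()
        self.fc = nn.Linear(dim, dim)
        self.drop_path = DropPath(drop_path_rate)

    def forward(self, x):
        # drop the residual branch per-sample during training only
        return x + self.drop_path(self.fc(x))

block = ResidualBlock(16)
block.train()
out = block(paddle.randn([4, 16]))
```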
+ +""" Implement Loss functions """ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class LabelSmoothingCrossEntropyLoss(nn.Layer): + """ cross entropy loss for label smoothing + Args: + smoothing: float, smoothing rate + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, target label with shape [N] + Return: + loss: float, cross entropy loss value + """ + def __init__(self, smoothing=0.1): + super().__init__() + assert 0 <= smoothing < 1.0 + self.smoothing = smoothing + self.confidence = 1 - smoothing + + def forward(self, x, target): + log_probs = F.log_softmax(x) # [N, num_classes] + # target_index is used to get prob for each of the N samples + target_index = paddle.zeros([x.shape[0], 2], dtype='int64') # [N, 2] + target_index[:, 0] = paddle.arange(x.shape[0]) + target_index[:, 1] = target + + nll_loss = -log_probs.gather_nd(index=target_index) # index: [N] + smooth_loss = -log_probs.mean(axis=-1) + loss = self.confidence * nll_loss + self.smoothing * smooth_loss + return loss.mean() + + +class SoftTargetCrossEntropyLoss(nn.Layer): + """ cross entropy loss for soft target + Args: + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, soft target with shape [N, num_classes] + Returns: + loss: float, the mean loss value + """ + def __init__(self): + super().__init__() + + def forward(self, x, target): + loss = paddle.sum(-target * F.log_softmax(x, axis=-1), axis=-1) + return loss.mean() + + +class DistillationLoss(nn.Layer): + """Distillation loss function + This layer includes the orginal loss (criterion) and a extra + distillation loss (criterion), which computes the loss with + different type options, between current model and + a teacher model as its supervision. 
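Both criteria above take raw logits. `LabelSmoothingCrossEntropyLoss` mixes the negative log-likelihood of the true class (weight `1 - smoothing`) with the mean negative log-probability over all classes (weight `smoothing`), while `SoftTargetCrossEntropyLoss` expects a full probability vector per sample (e.g. mixup targets). A purely illustrative shape-level check:

```python
import paddle
import paddle.nn.functional as F
from losses import LabelSmoothingCrossEntropyLoss, SoftTargetCrossEntropyLoss

logits = paddle.randn([4, 1000])                 # [N, num_classes]
hard_labels = paddle.randint(0, 1000, [4])       # [N] class indices
soft_labels = F.softmax(paddle.randn([4, 1000]), axis=-1)   # [N, num_classes]

loss_smooth = LabelSmoothingCrossEntropyLoss(smoothing=0.1)(logits, hard_labels)
loss_soft = SoftTargetCrossEntropyLoss()(logits, soft_labels)
print(float(loss_smooth), float(loss_soft))      # two scalar losses
```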
+ + Args: + base_criterion: nn.Layer, the original criterion + teacher_model: nn.Layer, the teacher model as supervision + distillation_type: str, one of ['none', 'soft', 'hard'] + alpha: float, ratio of base loss (* (1-alpha)) + and distillation loss( * alpha) + tao: float, temperature in distillation + """ + def __init__(self, + base_criterion, + teacher_model, + distillation_type, + alpha, + tau): + super().__init__() + assert distillation_type in ['none', 'soft', 'hard'] + self.base_criterion = base_criterion + self.teacher_model = teacher_model + self.type = distillation_type + self.alpha = alpha + self.tau = tau + + def forward(self, inputs, outputs, targets): + """ + Args: + inputs: tensor, the orginal model inputs + outputs: tensor, the outputs of the model + outputds_kd: tensor, the distillation outputs of the model, + this is usually obtained by a separate branch + in the last layer of the model + targets: tensor, the labels for the base criterion + """ + outputs, outputs_kd = outputs[0], outputs[1] + base_loss = self.base_criterion(outputs, targets) + if self.type == 'none': + return base_loss + + with paddle.no_grad(): + teacher_outputs = self.teacher_model(inputs) + + if self.type == 'soft': + distillation_loss = F.kl_div( + F.log_softmax(outputs_kd / self.tau, axis=1), + F.log_softmax(teacher_outputs / self.tau, axis=1), + reduction='sum') * (self.tau * self.tau) / outputs_kd.numel() + elif self.type == 'hard': + distillation_loss = F.cross_entropy(outputs_kd, teacher_outputs.argmax(axis=1)) + + loss = base_loss * (1 - self.alpha) + distillation_loss * self.alpha + return loss + + diff --git a/image_classification/RepMLP/main_multi_gpu.py b/image_classification/RepMLP/main_multi_gpu.py new file mode 100644 index 00000000..09ca1426 --- /dev/null +++ b/image_classification/RepMLP/main_multi_gpu.py @@ -0,0 +1,581 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
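`DistillationLoss` assumes the student returns a pair `(class_output, distill_output)` and adds either a soft KL term at temperature `tau` or a hard cross-entropy term against the teacher's argmax. The RepMLP scripts in this PR do not wire it up, so the sketch below uses a stand-in teacher and random tensors purely to show the calling convention:

```python
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
from losses import DistillationLoss, SoftTargetCrossEntropyLoss

teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1000))  # stand-in teacher
teacher.eval()

criterion = DistillationLoss(base_criterion=SoftTargetCrossEntropyLoss(),
                             teacher_model=teacher,
                             distillation_type='hard',
                             alpha=0.5,
                             tau=1.0)

images = paddle.randn([2, 3, 224, 224])
outputs = (paddle.randn([2, 1000]), paddle.randn([2, 1000]))  # (cls head, distill head)
targets = F.one_hot(paddle.randint(0, 1000, [2]), 1000)       # soft targets for the base loss
loss = criterion(images, outputs, targets)
```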
+ +"""RepMLP training/validation using multiple GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +import paddle.distributed as dist +from datasets import get_dataloader +from datasets import get_dataset +from utils import AverageMeter +from utils import WarmupCosineScheduler +from config import get_config +from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from repmlp_resnet import build_repmlp_resnet as build_model + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('RepMLP') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + mixup_fn=None, + amp=False, + local_logger=None, + master_logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + train_loss_meter.avg + train_acc_meter.avg + train_time + """ + model.train() + train_loss_meter = AverageMeter() + train_acc_meter = AverageMeter() + master_train_loss_meter = AverageMeter() + master_train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + for batch_id, data in enumerate(dataloader): + image = data[0] + label 
= data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() + + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = paddle.to_tensor(image.shape[0]) + + # sync from other gpus for overall loss and acc + master_loss = loss.clone() + master_acc = acc.clone() + master_batch_size = batch_size.clone() + dist.all_reduce(master_loss) + dist.all_reduce(master_acc) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc = master_acc / dist.get_world_size() + master_train_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_train_acc_meter.update(master_acc.numpy()[0], master_batch_size.numpy()[0]) + + train_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + train_acc_meter.update(acc.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_train_loss_meter.avg:.4f}, " + + f"Avg Acc: {master_train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return (train_loss_meter.avg, + train_acc_meter.avg, + master_train_loss_meter.avg, + master_train_acc_meter.avg, + train_time) + + +def validate(dataloader, + model, + criterion, + total_batch, + debug_steps=100, + local_logger=None, + master_logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + total_epoch: int, total num of epoch, for logging + debug_steps: int, num of iters to log info, default: 100 + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + master_val_loss_meter.avg: float, average loss on all processes/gpus + master_val_acc1_meter.avg: float, average top1 accuracy on all processes/gpus + master_val_acc5_meter.avg: float, average top5 accuracy on all processes/gpus + val_time: float, validation time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = AverageMeter() + 
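Two details of the training loop above are easy to miss: `optimizer.step()` only fires every `accum_iter` batches, and the "master" metrics are obtained by all-reducing the per-GPU values and dividing by the world size. The effective global batch size per optimizer step therefore works out as below (the GPU count is just an example):

```python
# Illustrative arithmetic, not values from an actual run.
batch_size_per_gpu = 8     # config.DATA.BATCH_SIZE
accum_iter = 1             # config.TRAIN.ACCUM_ITER
world_size = 4             # GPUs launched by dist.spawn

effective_batch = batch_size_per_gpu * accum_iter * world_size
print(effective_batch)     # 32 samples contribute to each optimizer.step()
```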
master_val_loss_meter = AverageMeter() + master_val_acc1_meter = AverageMeter() + master_val_acc5_meter = AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = paddle.to_tensor(image.shape[0]) + + master_loss = loss.clone() + master_acc1 = acc1.clone() + master_acc5 = acc5.clone() + master_batch_size = batch_size.clone() + + dist.all_reduce(master_loss) + dist.all_reduce(master_acc1) + dist.all_reduce(master_acc5) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc1 = master_acc1 / dist.get_world_size() + master_acc5 = master_acc5 / dist.get_world_size() + + master_val_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc1_meter.update(master_acc1.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc5_meter.update(master_acc5.numpy()[0], master_batch_size.numpy()[0]) + + val_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + val_acc1_meter.update(acc1.numpy()[0], batch_size.numpy()[0]) + val_acc5_meter.update(acc5.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {master_val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {master_val_acc5_meter.avg:.4f}") + val_time = time.time() - time_st + return (val_loss_meter.avg, + val_acc1_meter.avg, + val_acc5_meter.avg, + master_val_loss_meter.avg, + master_val_acc1_meter.avg, + master_val_acc5_meter.avg, + val_time) + + +def main_worker(*args): + # STEP 0: Preparation + config = args[0] + dist.init_parallel_env() + last_epoch = config.TRAIN.LAST_EPOCH + world_size = dist.get_world_size() + local_rank = dist.get_rank() + seed = config.SEED + local_rank + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + # logger for each process/gpu + local_logger = get_logger( + filename=os.path.join(config.SAVE, 'log_{}.txt'.format(local_rank)), + logger_name='local_logger') + # overall logger + if local_rank == 0: + master_logger = get_logger( + filename=os.path.join(config.SAVE, 'log.txt'), + logger_name='master_logger') + master_logger.info(f'\n{config}') + else: + master_logger = None + local_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + if local_rank == 0: + master_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + + # STEP 1: Create model + model = build_model(config) + model = paddle.DataParallel(model) + + # STEP 2: Create train and val dataloader + dataset_train, dataset_val = args[1], args[2] + # Create training dataloader + if not config.EVAL: + dataloader_train = get_dataloader(config, dataset_train, 'train', True) + total_batch_train = len(dataloader_train) + local_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + if local_rank == 0: + master_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + # 
Create validation dataloader + dataloader_val = get_dataloader(config, dataset_val, 'test', True) + total_batch_val = len(dataloader_val) + local_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + if local_rank == 0: + master_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH, + ) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + local_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if 
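When `TRAIN.LINEAR_SCALED_LR` is set, the base, warmup-start, and end learning rates are all multiplied by `batch_size * world_size / LINEAR_SCALED_LR` (and again by `ACCUM_ITER`). The default config leaves it unset, so the numbers below are hypothetical and only illustrate the rule:

```python
# Hypothetical: reference batch 512, 8 GPUs with 128 images each, no accumulation.
base_lr = 0.001
batch_size = 128
world_size = 8
linear_scaled_lr_ref = 512
accum_iter = 1

scaled_lr = base_lr * batch_size * world_size / linear_scaled_lr_ref * accum_iter
print(scaled_lr)   # 0.002
```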
scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip, + #apply_decay_param_fun=get_exclude_from_weight_decay_fn([ + # 'absolute_pos_embed', 'relative_position_bias_table']), + ) + else: + local_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 6: Load pretrained model / load resumt model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + model.set_dict(model_state) + local_logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + if local_rank == 0: + master_logger.info( + f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') + optimizer.set_state_dict(opt_state) + local_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + if local_rank == 0: + master_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + + # STEP 7: Validation (eval mode) + if config.EVAL: + local_logger.info('----- Start Validating') + if local_rank == 0: + master_logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 8: Start training and validation (train mode) + local_logger.info(f"Start training from epoch {last_epoch+1}.") + if local_rank == 0: + master_logger.info(f"Start training from epoch {last_epoch+1}.") + for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + local_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + if local_rank == 0: + master_logger.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") + train_loss, train_acc, avg_loss, avg_acc, train_time = train( + dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=total_batch_train, + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + mixup_fn=mixup_fn, + amp=config.AMP, + local_logger=local_logger, + master_logger=master_logger) + + scheduler.step() + + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {avg_loss:.4f}, " + + f"Train Acc: {avg_acc:.4f}, " + + f"time: {train_time:.2f}") + + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + local_logger.info(f'----- Validation after Epoch: {epoch}') + if local_rank == 0: + master_logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if local_rank == 0: + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + master_logger.info(f"----- Save model: {model_path}.pdparams") + master_logger.info(f"----- Save optim: {model_path}.pdopt") + + +def main(): + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + + # get dataset and start DDP + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + else: + dataset_train = None + dataset_val = get_dataset(config, mode='val') + config.NGPUS = len(paddle.static.cuda_places()) if config.NGPUS == -1 else config.NGPUS + dist.spawn(main_worker, args=(config, dataset_train, dataset_val, ), nprocs=config.NGPUS) + + +if __name__ == "__main__": + main() diff --git a/image_classification/RepMLP/main_single_gpu.py b/image_classification/RepMLP/main_single_gpu.py new file mode 100644 index 00000000..2e919da1 --- /dev/null +++ b/image_classification/RepMLP/main_single_gpu.py @@ -0,0 +1,423 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. 
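Since `main()` dispatches the worker through `dist.spawn`, multi-GPU training and evaluation are launched directly from the command line. The invocations below are only an example; the dataset path, GPU count, and checkpoint prefix are placeholders (note that `-pretrained` and `-resume` take the checkpoint path without the `.pdparams` suffix):

```bash
# Train on 4 GPUs with a hypothetical ImageNet folder.
python main_multi_gpu.py \
    -cfg ./configs/repmlpres50_light_224_train.yaml \
    -dataset imagenet2012 \
    -data_path /dataset/imagenet \
    -batch_size 128 \
    -ngpus 4

# Evaluate a saved checkpoint (path without the .pdparams extension).
python main_multi_gpu.py \
    -cfg ./configs/repmlpres50_light_224_train.yaml \
    -dataset imagenet2012 \
    -data_path /dataset/imagenet \
    -batch_size 128 \
    -eval \
    -pretrained ./output/RepMLP_ResNet-Epoch-300-Loss-1.2
```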
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""RepMLP training/validation using single GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from datasets import get_dataloader +from datasets import get_dataset +from utils import AverageMeter +from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn +from config import get_config +from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from repmlp_resnet import build_repmlp_resnet as build_model + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('RepMLP') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + mixup_fn=None, + amp=False, + logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: 
False + logger: logger for logging, default: None + Returns: + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + train_time: float, training time + """ + model.train() + train_loss_meter = AverageMeter() + train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() + + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = image.shape[0] + train_loss_meter.update(loss.numpy()[0], batch_size) + train_acc_meter.update(acc.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return train_loss_meter.avg, train_acc_meter.avg, train_time + + +def validate(dataloader, model, criterion, total_batch, debug_steps=100, logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + logger: logger for logging, default: None + Returns: + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + val_time: float, valitaion time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = image.shape[0] + val_loss_meter.update(loss.numpy()[0], batch_size) + val_acc1_meter.update(acc1.numpy()[0], batch_size) + val_acc5_meter.update(acc5.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg 
Acc@5: {val_acc5_meter.avg:.4f}") + + val_time = time.time() - time_st + return val_loss_meter.avg, val_acc1_meter.avg, val_acc5_meter.avg, val_time + + +def main(): + # STEP 0: Preparation + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + last_epoch = config.TRAIN.LAST_EPOCH + seed = config.SEED + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + logger = get_logger(filename=os.path.join(config.SAVE, 'log.txt')) + logger.info(f'\n{config}') + + # STEP 1: Create model + model = build_model(config) + + # STEP 2: Create train and val dataloader + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + dataloader_train = get_dataloader(config, dataset_train, 'train', False) + dataset_val = get_dataset(config, mode='val') + dataloader_val = get_dataloader(config, dataset_val, 'val', False) + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + 
T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip) + else: + logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 6: Load pretrained model or load resume model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + model.set_dict(model_state) + logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME + '.pdopt') + optimizer.set_state_dict(opt_state) + logger.info( + f"----- Resume: Load model and optmizer from {config.MODEL.RESUME}") + + # STEP 7: Validation (eval mode) + if config.EVAL: + logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 8: Start training and validation (train mode) + logger.info(f"Start training from epoch {last_epoch+1}.") + for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + logger.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") + train_loss, train_acc, train_time = train(dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=len(dataloader_train), + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + mixup_fn=mixup_fn, + amp=config.AMP, + logger=logger) + scheduler.step() + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + logger.info(f"----- Save model: {model_path}.pdparams") + logger.info(f"----- Save optim: {model_path}.pdopt") + + +if __name__ == "__main__": + main() diff --git a/image_classification/RepMLP/mixup.py b/image_classification/RepMLP/mixup.py new file mode 100644 index 00000000..1d2db493 --- /dev/null +++ b/image_classification/RepMLP/mixup.py @@ -0,0 +1,225 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""mixup and cutmix for batch data""" +import numpy as np +import paddle + + +def rand_bbox(image_shape, lam, count=None): + """ CutMix bbox by lam value + Generate 1 random bbox by value lam. lam is the cut size rate. + The cut_size is computed by sqrt(1-lam) * image_size. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + count: int, number of bbox to generate + """ + image_h, image_w = image_shape[-2:] + cut_rate = np.sqrt(1. 
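The single-GPU script mirrors the multi-GPU flow without `dist.spawn`, and supports resuming through `-resume`/`-last_epoch`. A hypothetical resume command (the checkpoint prefix is a placeholder and is passed without the `.pdparams`/`.pdopt` suffix):

```bash
python main_single_gpu.py \
    -cfg ./configs/repmlpres50_light_224_train.yaml \
    -dataset imagenet2012 \
    -data_path /dataset/imagenet \
    -batch_size 128 \
    -resume ./output/train-20211001-10-00-00/RepMLP_ResNet-Epoch-20-Loss-2.5 \
    -last_epoch 20
```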
- lam) + cut_h = int(cut_rate * image_h) + cut_w = int(cut_rate * image_w) + + # get random bbox center + cy = np.random.randint(0, image_h, size=count) + cx = np.random.randint(0, image_w, size=count) + + # get bbox coords + bbox_x1 = np.clip(cx - cut_w // 2, 0, image_w) + bbox_y1 = np.clip(cy - cut_h // 2, 0, image_h) + bbox_x2 = np.clip(cx + cut_w // 2, 0, image_w) + bbox_y2 = np.clip(cy + cut_h // 2, 0, image_h) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # while in pytorch, it will return [] tensor + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def rand_bbox_minmax(image_shape, minmax, count=None): + """ CutMix bbox by min and max value + Generate 1 random bbox by min and max percentage values. + Minmax is a tuple/list of min and max percentage vlaues + applied to the image width and height. + + Args: + image_shape: tuple/list, image height and width + minmax: tuple/list, min and max percentage values of image size + count: int, number of bbox to generate + """ + assert len(minmax) == 2 + image_h, image_w = image_shape[-2:] + min_ratio = minmax[0] + max_ratio = minmax[1] + cut_h = np.random.randint(int(image_h * min_ratio), int(image_h * max_ratio), size=count) + cut_w = np.random.randint(int(image_w * min_ratio), int(image_w * max_ratio), size=count) + + bbox_x1 = np.random.randint(0, image_w - cut_w, size=count) + bbox_y1 = np.random.randint(0, image_h - cut_h, size=count) + bbox_x2 = bbox_x1 + cut_w + bbox_y2 = bbox_y1 + cut_h + + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def cutmix_generate_bbox_adjust_lam(image_shape, lam, minmax=None, correct_lam=True, count=None): + """Generate bbox and apply correction for lambda + If the mimmax is None, apply the standard cutmix by lam value, + If the minmax is set, apply the cutmix by min and max percentage values. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + minmax: tuple/list, min and max percentage values of image size + correct_lam: bool, if True, correct the lam value by the generated bbox + count: int, number of bbox to generate + """ + if minmax is not None: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox_minmax(image_shape, minmax, count) + else: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox(image_shape, lam, count) + + if correct_lam or minmax is not None: + image_h, image_w = image_shape[-2:] + bbox_area = (bbox_y2 - bbox_y1) * (bbox_x2 - bbox_x1) + lam = 1. - bbox_area / float(image_h * image_w) + return (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam + + +def one_hot(x, num_classes, on_value=1., off_value=0.): + """ Generate one-hot vector for label smoothing + Args: + x: tensor, contains label/class indices + num_classes: int, num of classes (len of the one-hot vector) + on_value: float, the vector value at label index, default=1. + off_value: float, the vector value at non-label indices, default=0. + Returns: + one_hot: tensor, tensor with on value at label index and off value + at non-label indices. + """ + x = x.reshape_([-1, 1]) + x_smoothed = paddle.full((x.shape[0], num_classes), fill_value=off_value) + for i in range(x.shape[0]): + x_smoothed[i, x[i]] = on_value + return x_smoothed + + +def mixup_one_hot(label, num_classes, lam=1., smoothing=0.): + """ mixup and label smoothing in batch + label smoothing is firstly applied, then + mixup is applied by mixing the bacth and its flip, + with a mixup rate. 
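When `correct_lam` is enabled (or a min/max range is used), `cutmix_generate_bbox_adjust_lam` recomputes lambda from the clipped box, `lam = 1 - box_area / (H * W)`, so the label mixing weight matches the pixels that were actually replaced. A small worked example with hypothetical numbers:

```python
# Hypothetical 224x224 image with a 100x80 pasted region.
image_h, image_w = 224, 224
box_h, box_w = 100, 80

lam = 1.0 - (box_h * box_w) / float(image_h * image_w)
print(round(lam, 4))   # 0.8406 -> the original image keeps ~84% of the target weight
```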
+ + Args: + label: tensor, label tensor with shape [N], contains the class indices + num_classes: int, num of all classes + lam: float, mixup rate, default=1.0 + smoothing: float, label smoothing rate + """ + off_value = smoothing / num_classes + on_value = 1. - smoothing + off_value + y1 = one_hot(label, num_classes, on_value, off_value) + y2 = one_hot(label.flip(axis=[0]), num_classes, on_value, off_value) + return y2 * (1 - lam) + y1 * lam + + +class Mixup: + """Mixup class + Args: + mixup_alpha: float, mixup alpha for beta distribution, default=1.0, + cutmix_alpha: float, cutmix alpha for beta distribution, default=0.0, + cutmix_minmax: list/tuple, min and max value for cutmix ratio, default=None, + prob: float, prob of applying mixup or cutmix to a batch, default=1.0, + switch_prob: float, prob of switching mixup and cutmix, default=0.5, + mode: string, mixup mode, now only 'batch' is supported, default='batch', + correct_lam: bool, if True, apply correction of lam, default=True, + label_smoothing: float, label smoothing rate, default=0.1, + num_classes: int, num of classes, default=1000 + """ + def __init__(self, + mixup_alpha=1.0, + cutmix_alpha=0.0, + cutmix_minmax=None, + prob=1.0, + switch_prob=0.5, + mode='batch', + correct_lam=True, + label_smoothing=0.1, + num_classes=1000): + self.mixup_alpha = mixup_alpha + self.cutmix_alpha = cutmix_alpha + self.cutmix_minmax = cutmix_minmax + if cutmix_minmax is not None: + assert len(cutmix_minmax) == 2 + self.cutmix_alpha = 1.0 + self.mix_prob = prob + self.switch_prob = switch_prob + self.label_smoothing = label_smoothing + self.num_classes = num_classes + self.mode = mode + self.correct_lam = correct_lam + assert mode == 'batch', 'Now only batch mode is supported!' + + def __call__(self, x, target): + assert x.shape[0] % 2 == 0, "Batch size should be even" + lam = self._mix_batch(x) + target = mixup_one_hot(target, self.num_classes, lam, self.label_smoothing) + return x, target + + def get_params(self): + """Decide to use cutmix or regular mixup by sampling and + sample lambda for mixup + """ + lam = 1. + use_cutmix = False + use_mixup = np.random.rand() < self.mix_prob + if use_mixup: + if self.mixup_alpha > 0. and self.cutmix_alpha > 0.: + use_cutmix = np.random.rand() < self.switch_prob + alpha = self.cutmix_alpha if use_cutmix else self.mixup_alpha + lam_mix = np.random.beta(alpha, alpha) + elif self.mixup_alpha == 0. and self.cutmix_alpha > 0.: + use_cutmix = True + lam_mix = np.random.beta(self.cutmix_alpha, self.cutmix_alpha) + elif self.mixup_alpha > 0.
and self.cutmix_alpha == 0.: + lam_mix = np.random.beta(self.mixup_alpha, self.mixup_alpha) + else: + raise ValueError('mixup_alpha and cutmix_alpha cannot be all 0') + lam = float(lam_mix) + return lam, use_cutmix + + def _mix_batch(self, x): + """mixup/cutmix by adding batch data and its flipped version""" + lam, use_cutmix = self.get_params() + if lam == 1.: + return lam + if use_cutmix: + (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam = cutmix_generate_bbox_adjust_lam( + x.shape, + lam, + minmax=self.cutmix_minmax, + correct_lam=self.correct_lam) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueError, + # but in pytorch, it will return [] tensor without errors + if int(bbox_x1) != int(bbox_x2) and int(bbox_y1) != int(bbox_y2): + x[:, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] = x.flip(axis=[0])[ + :, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] + else: + x_flipped = x.flip(axis=[0]) + x_flipped = x_flipped * (1 - lam) + x.set_value(x * (lam) + x_flipped) + return lam diff --git a/image_classification/RepMLP/random_erasing.py b/image_classification/RepMLP/random_erasing.py new file mode 100644 index 00000000..31eea465 --- /dev/null +++ b/image_classification/RepMLP/random_erasing.py @@ -0,0 +1,118 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Random Erasing for image tensor""" + +import random +import math +import paddle + + +def _get_pixels(per_pixel, rand_color, patch_size, dtype="float32"): + if per_pixel: + return paddle.normal(shape=patch_size).astype(dtype) + if rand_color: + return paddle.normal(shape=(patch_size[0], 1, 1)).astype(dtype) + return paddle.zeros((patch_size[0], 1, 1)).astype(dtype) + + +class RandomErasing(object): + """ + Args: + prob: probability of performing random erasing + min_area: Minimum percentage of erased area wrt input image area + max_area: Maximum percentage of erased area wrt input image area + min_aspect: Minimum aspect ratio of erased area + max_aspect: Maximum aspect ratio of erased area + mode: pixel color mode, in ['const', 'rand', 'pixel'] + 'const' - erase block is constant valued 0 for all channels + 'rand' - erase block is valued random color (same per-channel) + 'pixel' - erase block is valued random color per pixel + min_count: Minimum # of erasing blocks per image. + max_count: Maximum # of erasing blocks per image.
Area per box is scaled by count + per-image count is randomly chosen between min_count to max_count + """ + def __init__(self, prob=0.5, min_area=0.02, max_area=1/3, min_aspect=0.3, max_aspect=None, + mode='const', min_count=1, max_count=None, num_splits=0): + self.prob = prob + self.min_area = min_area + self.max_area = max_area + max_aspect = max_aspect or 1 / min_aspect + self.log_aspect_ratio = (math.log(min_aspect), math.log(max_aspect)) + self.min_count = min_count + self.max_count = max_count or min_count + self.num_splits = num_splits + mode = mode.lower() + self.rand_color = False + self.per_pixel = False + if mode == "rand": + self.rand_color = True + elif mode == "pixel": + self.per_pixel = True + else: + assert not mode or mode == "const" + + def _erase(self, img, chan, img_h, img_w, dtype): + if random.random() > self.prob: + return + area = img_h * img_w + count = self.min_count if self.min_count == self.max_count else \ + random.randint(self.min_count, self.max_count) + for _ in range(count): + for attempt in range(10): + target_area = random.uniform(self.min_area, self.max_area) * area / count + aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio)) + h = int(round(math.sqrt(target_area * aspect_ratio))) + w = int(round(math.sqrt(target_area / aspect_ratio))) + if w < img_w and h < img_h: + top = random.randint(0, img_h - h) + left = random.randint(0, img_w - w) + img[:, top:top+h, left:left+w] = _get_pixels( + self.per_pixel, self.rand_color, (chan, h, w), + dtype=dtype) + break + + def __call__(self, input): + if len(input.shape) == 3: + self._erase(input, *input.shape, input.dtype) + else: + batch_size, chan, img_h, img_w = input.shape + batch_start = batch_size // self.num_splits if self.num_splits > 1 else 0 + for i in range(batch_start, batch_size): + self._erase(input[i], chan, img_h, img_w, input.dtype) + return input + + + +#def main(): +# re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='rand') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='const') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='pixel') +# import PIL.Image as Image +# import numpy as np +# paddle.set_device('cpu') +# img = paddle.to_tensor(np.asarray(Image.open('./lenna.png'))).astype('float32') +# img = img / 255.0 +# img = paddle.transpose(img, [2, 0, 1]) +# new_img = re(img) +# new_img = new_img * 255.0 +# new_img = paddle.transpose(new_img, [1, 2, 0]) +# new_img = new_img.cpu().numpy() +# new_img = Image.fromarray(new_img.astype('uint8')) +# new_img.save('./res.png') +# +# +# +#if __name__ == "__main__": +# main() diff --git a/image_classification/RepMLP/repmlp.png b/image_classification/RepMLP/repmlp.png new file mode 100644 index 00000000..ad947493 Binary files /dev/null and b/image_classification/RepMLP/repmlp.png differ diff --git a/image_classification/RepMLP/repmlp.py b/image_classification/RepMLP/repmlp.py new file mode 100644 index 00000000..8fcf5ed1 --- /dev/null +++ b/image_classification/RepMLP/repmlp.py @@ -0,0 +1,350 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Implement MLP Class for RepMLP +""" + +import copy + +import paddle +import paddle.nn.functional as F +from paddle import nn + + +def repeat_interleave(x, arg): + """Use numpy to implement repeat operations""" + return paddle.to_tensor(x.numpy().repeat(arg)) + + +class Identity(nn.Layer): + """Identity layer + + The output of this layer is the input without any change. + Use this layer to avoid if condition in some forward methods. + """ + + def __init__(self): + super().__init__() + + def forward(self, x): + return x + + +def fuse_bn(conv_or_fc, bn): + """Fusion of BN weights""" + std = (bn._variance + bn._epsilon).sqrt() + t = bn.weight / std + if conv_or_fc.weight.ndim == 4: + t = t.reshape([-1, 1, 1, 1]) + else: + t = t.reshape([-1, 1]) + return conv_or_fc.weight * t, bn.bias - bn._mean * bn.weight / std + + +class RepMLP(nn.Layer): + """RepMLP Layer + + The RepMLP consists of three parts: Global Perceptron, Partition Perceptron, Local Perceptron. + When deploy is True, the training weight of Local Perceptron is integrated into the full connection + layer of part of Partition Perceptron, In order to improve the ability of representation. + """ + + def __init__( + self, + in_channels, + out_channels, + H, + W, + h, + w, + reparam_conv_k=None, + fc1_fc2_reduction=1, + fc3_groups=1, + deploy=False, + ): + super().__init__() + + self.C = in_channels + self.O = out_channels + self.fc3_groups = fc3_groups + + self.H, self.W, self.h, self.w = H, W, h, w + + self.h_parts = self.H // self.h + self.w_parts = self.W // self.w + + assert self.H % self.h == 0 + assert self.W % self.w == 0 + self.target_shape = (-1, self.O, self.H, self.W) + + self.deploy = deploy + + self.need_global_perceptron = (H != h) or (W != w) + if self.need_global_perceptron: + internal_neurons = int( + self.C * self.h_parts * self.w_parts // fc1_fc2_reduction + ) + self.fc1_fc2 = nn.Sequential() + self.fc1_fc2.add_sublayer( + "fc1", nn.Linear(self.C * self.h_parts * self.w_parts, internal_neurons) + ) + self.fc1_fc2.add_sublayer("relu", nn.ReLU()) + self.fc1_fc2.add_sublayer( + "fc2", nn.Linear(internal_neurons, self.C * self.h_parts * self.w_parts) + ) + if deploy: + self.avg = nn.AvgPool2D(kernel_size=(self.h, self.w)) + else: + self.avg = nn.Sequential() + self.avg.add_sublayer("avg", nn.AvgPool2D(kernel_size=(self.h, self.w))) + self.avg.add_sublayer("bn", nn.BatchNorm2D(num_features=self.C)) + + self.fc3 = nn.Conv2D( + self.C * self.h * self.w, + self.O * self.h * self.w, + 1, + 1, + 0, + bias_attr=deploy, + groups=fc3_groups, + ) + self.fc3_bn = Identity() if deploy else nn.BatchNorm1D(self.O * self.h * self.w) + + self.reparam_conv_k = reparam_conv_k + if not deploy and reparam_conv_k is not None: + for k in reparam_conv_k: + conv_branch = nn.Sequential() + conv_branch.add_sublayer( + "conv", + nn.Conv2D( + in_channels=self.C, + out_channels=self.O, + kernel_size=k, + padding=k // 2, + bias_attr=False, + groups=fc3_groups, + ), + ) + conv_branch.add_sublayer("bn", nn.BatchNorm2D(self.O)) + self.__setattr__("repconv{}".format(k), conv_branch) + + def forward(self, inputs): + + if self.need_global_perceptron: + v = self.avg(inputs) + v = v.reshape([-1, self.C * self.h_parts * self.w_parts]) + v = self.fc1_fc2(v) + v = v.reshape([-1, self.C, self.h_parts, 1, self.w_parts, 1]) + inputs = inputs.reshape( + [-1, self.C, self.h_parts, self.h, self.w_parts, self.w] + ) + inputs = inputs + v + else: + inputs = 
inputs.reshape( + [-1, self.C, self.h_parts, self.h, self.w_parts, self.w] + ) + + # N, h_parts, w_parts, C, in_h, in_w + partitions = inputs.transpose([0, 2, 4, 1, 3, 5]) + + # Feed partition map into Partition Perceptron + fc3_inputs = partitions.reshape([-1, self.C * self.h * self.w, 1, 1]) + fc3_out = self.fc3(fc3_inputs) + fc3_out = fc3_out.reshape([-1, self.O * self.h * self.w]) + fc3_out = self.fc3_bn(fc3_out) + fc3_out = fc3_out.reshape( + [-1, self.h_parts, self.w_parts, self.O, self.h, self.w] + ) + + # Feed partition map into Local Perceptron + if self.reparam_conv_k is not None and not self.deploy: + conv_inputs = partitions.reshape([-1, self.C, self.h, self.w]) + conv_out = 0 + for k in self.reparam_conv_k: + conv_branch = self.__getattr__("repconv{}".format(k)) + conv_out += conv_branch(conv_inputs) + conv_out = conv_out.reshape( + [-1, self.h_parts, self.w_parts, self.O, self.h, self.w] + ) + fc3_out += conv_out + + # N, O, h_parts, out_h, w_parts, out_w + fc3_out = fc3_out.transpose([0, 3, 1, 4, 2, 5]) + out = fc3_out.reshape([*self.target_shape]) + return out + + def _convert_conv_to_fc(self, conv_kernel, conv_bias): + I = ( + paddle.eye(self.C * self.h * self.w // self.fc3_groups) + .tile(repeat_times=[1, self.fc3_groups]) + .reshape( + [self.C * self.h * self.w // self.fc3_groups, self.C, self.h, self.w] + ) + ) + fc_k = F.conv2d( + I, conv_kernel, padding=conv_kernel.shape[2] // 2, groups=self.fc3_groups + ) + fc_k = fc_k.reshape( + [self.O * self.h * self.w // self.fc3_groups, self.C * self.h * self.w] + ).t() + fc_bias = repeat_interleave(conv_bias, self.h * self.w) + return fc_k, fc_bias + + def get_equivalent_fc1_fc3_params(self): + fc_weight, fc_bias = fuse_bn(self.fc3, self.fc3_bn) + + if self.reparam_conv_k is not None: + largest_k = max(self.reparam_conv_k) + largest_branch = self.__getattr__("repconv{}".format(largest_k)) + total_kernel, total_bias = fuse_bn(largest_branch.conv, largest_branch.bn) + for k in self.reparam_conv_k: + if k != largest_k: + k_branch = self.__getattr__("repconv{}".format(k)) + kernel, bias = fuse_bn(k_branch.conv, k_branch.bn) + total_kernel += F.pad(kernel, [(largest_k - k) // 2] * 4) + total_bias += bias + + rep_weight, rep_bias = self._convert_conv_to_fc(total_kernel, total_bias) + final_fc3_weight = rep_weight.reshape(fc_weight.shape) + fc_weight + final_fc3_bias = rep_bias + fc_bias + + else: + final_fc3_weight = fc_weight + final_fc3_bias = fc_bias + + # ------------------------------- remove BN after avg + if self.need_global_perceptron: + avgbn = self.avg.bn + std = (avgbn._variance + avgbn._epsilon).sqrt() + scale = avgbn.weight / std + avgbias = avgbn.bias - avgbn._mean * scale + fc1 = self.fc1_fc2.fc1 + replicate_times = fc1.weight.shape[0] // len(avgbias) + replicated_avgbias = repeat_interleave(avgbias, replicate_times).reshape( + [-1, 1] + ) + bias_diff = fc1.weight.matmul(replicated_avgbias).squeeze() + fc1_bias_new = fc1.bias + bias_diff + fc1_weight_new = fc1.weight * repeat_interleave( + scale, replicate_times + ).reshape([1, -1]) + else: + fc1_bias_new = None + fc1_weight_new = None + + return fc1_weight_new, fc1_bias_new, final_fc3_weight, final_fc3_bias + + def switch_to_deploy(self): + self.deploy = True + ( + fc1_weight, + fc1_bias, + fc3_weight, + fc3_bias, + ) = self.get_equivalent_fc1_fc3_params() + # Remove Local Perceptron + if self.reparam_conv_k is not None: + for k in self.reparam_conv_k: + self.__delattr__("repconv{}".format(k)) + # Remove the BN after FC3 + self.__delattr__("fc3") + 
self.__delattr__("fc3_bn") + self.fc3 = nn.Conv2D( + self.C * self.h * self.w, + self.O * self.h * self.w, + 1, + 1, + 0, + bias_attr=True, + groups=self.fc3_groups, + ) + self.fc3_bn = Identity() + # Remove the BN after AVG + if self.need_global_perceptron: + self.__delattr__("avg") + self.avg = nn.AvgPool2D(kernel_size=(self.h, self.w)) + # Set values + if fc1_weight is not None: + self.fc1_fc2.fc1.weight.set_value(fc1_weight) + self.fc1_fc2.fc1.bias.set_value(fc1_bias) + self.fc3.weight.set_value(fc3_weight) + self.fc3.bias.set_value(fc3_bias) + + +def repmlp_model_convert(model, save_path=None, do_copy=True): + """reparameterizing model + + Args: + model (nn.Layer): origin model + save_path (str): save the model . Defaults to None. + do_copy (bool): copy origin model. Defaults to True. + + Returns: + nn.Layer: The reparameterized model + """ + if do_copy: + model = copy.deepcopy(model) + for module in model.sublayers(): + if hasattr(module, "switch_to_deploy"): + module.switch_to_deploy() + if save_path is not None: + paddle.save(model.state_dict(), save_path) + return model + + +def TestRepMLP(): + # print('=== Test training_to_deploy for RepMLP ===') + uniform_ = paddle.nn.initializer.Uniform(low=0, high=0.1, name=None) + N = 1 + C = 8 + H = 14 + W = 14 + h = 7 + w = 7 + O = 8 + groups = 4 + + x = paddle.randn([N, C, H, W]) + # print("input shape:", x.shape) + repmlp = RepMLP( + C, + O, + H=H, + W=W, + h=h, + w=w, + reparam_conv_k=(1, 3, 5), + fc1_fc2_reduction=1, + fc3_groups=groups, + deploy=False, + ) + repmlp.eval() + + for module in repmlp.sublayers(): + if isinstance(module, nn.BatchNorm2D) or isinstance(module, nn.BatchNorm1D): + uniform_(module._mean) + uniform_(module._variance) + uniform_(module.weight) + uniform_(module.bias) + + out = repmlp(x) + repmlp.switch_to_deploy() + deployout = repmlp(x) + print("difference between the outputs of the training-time and converted RepMLP is") + print(((deployout - out) ** 2).sum().numpy().item()) + + +if __name__ == "__main__": + TestRepMLP() diff --git a/image_classification/RepMLP/repmlp_resnet.py b/image_classification/RepMLP/repmlp_resnet.py new file mode 100644 index 00000000..7ee74184 --- /dev/null +++ b/image_classification/RepMLP/repmlp_resnet.py @@ -0,0 +1,484 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
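The conversion above (`fuse_bn` plus `switch_to_deploy`) rests on the standard conv-BN folding identity: with frozen statistics, `bn(conv(x))` equals a single convolution whose weight is scaled by `gamma / sqrt(var + eps)` and whose bias absorbs `beta - mean * gamma / sqrt(var + eps)`. A minimal standalone sketch of that identity follows; the layer sizes and the random BN statistics are illustrative only, in the spirit of `TestRepMLP` above, and are not part of the patch.

```python
# Standalone check of the conv-BN folding identity used by fuse_bn() above.
# All shapes and values here are illustrative, not taken from the patch.
import paddle
from paddle import nn

conv = nn.Conv2D(8, 8, kernel_size=3, padding=1, bias_attr=False)
bn = nn.BatchNorm2D(8)

# give BN non-trivial running statistics so the check is meaningful
uniform_ = nn.initializer.Uniform(low=0.1, high=0.9)
for t in (bn._mean, bn._variance, bn.weight, bn.bias):
    uniform_(t)
conv.eval()
bn.eval()

x = paddle.randn([2, 8, 14, 14])
y_ref = bn(conv(x))

# fold: W' = W * gamma / std, b' = beta - mean * gamma / std
std = (bn._variance + bn._epsilon).sqrt()
scale = bn.weight / std
fused = nn.Conv2D(8, 8, kernel_size=3, padding=1, bias_attr=True)
fused.weight.set_value(conv.weight * scale.reshape([-1, 1, 1, 1]))
fused.bias.set_value(bn.bias - bn._mean * scale)
fused.eval()

print(float(((fused(x) - y_ref) ** 2).sum()))  # ~0 up to float32 rounding
```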
+ +""" +Implement MLP Class for RepMLP +""" + +import paddle +import paddle.nn.functional as F +from paddle import nn + +from repmlp import Identity, RepMLP, fuse_bn, repmlp_model_convert + + +class ConvBN(nn.Layer): + """Conv + BN""" + + def __init__( + self, + in_channels, + out_channels, + kernel_size, + stride=1, + padding=0, + groups=1, + deploy=False, + nonlinear=None, + ): + super().__init__() + + if nonlinear is None: + self.nonlinear = Identity() + else: + self.nonlinear = nonlinear + if deploy: + self.conv = nn.Conv2D( + in_channels=in_channels, + out_channels=out_channels, + kernel_size=kernel_size, + stride=stride, + padding=padding, + groups=groups, + bias_attr=True, + ) + else: + self.conv = nn.Conv2D( + in_channels=in_channels, + out_channels=out_channels, + kernel_size=kernel_size, + stride=stride, + padding=padding, + groups=groups, + bias_attr=False, + ) + self.bn = nn.BatchNorm2D(num_features=out_channels) + + def forward(self, x): + if hasattr(self, "bn"): + return self.nonlinear(self.bn(self.conv(x))) + else: + return self.nonlinear(self.conv(x)) + + def switch_to_deploy(self): + kernel, bias = fuse_bn(self.conv, self.bn) + conv = nn.Conv2D( + in_channels=self.conv._in_channels, + out_channels=self.conv._out_channels, + kernel_size=self.conv._kernel_size, + stride=self.conv._stride, + padding=self.conv._padding, + groups=self.conv._groups, + bias_attr=True, + ) + conv.weight.set_value(kernel) + conv.bias.set_value(bias) + self.__delattr__("conv") + self.__delattr__("bn") + self.conv = conv + + +class ConvBNReLU(ConvBN): + """Conv + BN + ReLU""" + + def __init__( + self, + in_channels, + out_channels, + kernel_size, + stride=1, + padding=0, + groups=1, + deploy=False, + ): + super().__init__( + in_channels=in_channels, + out_channels=out_channels, + kernel_size=kernel_size, + stride=stride, + padding=padding, + groups=groups, + deploy=deploy, + nonlinear=nn.ReLU(), + ) + + +class RepMLPLightBlock(nn.Layer): + """RepMLPLightBlock Layer + + The base module of the Light structure RepMLPResNet network + """ + + def __init__( + self, + in_channels, + mid_channels, + out_channels, + H, + W, + h, + w, + reparam_conv_k, + fc1_fc2_reduction, + fc3_groups, + deploy=False, + ): + super().__init__() + if in_channels != out_channels: + self.shortcut = ConvBN( + in_channels, out_channels, kernel_size=1, deploy=deploy + ) + else: + self.shortcut = Identity() + self.light_conv1 = ConvBNReLU( + in_channels, mid_channels, kernel_size=1, deploy=deploy + ) + self.light_repmlp = RepMLP( + in_channels=mid_channels, + out_channels=mid_channels, + H=H, + W=W, + h=h, + w=w, + reparam_conv_k=reparam_conv_k, + fc1_fc2_reduction=fc1_fc2_reduction, + fc3_groups=fc3_groups, + deploy=deploy, + ) + self.repmlp_nonlinear = nn.ReLU() + self.light_conv3 = ConvBN( + mid_channels, out_channels, kernel_size=1, deploy=deploy + ) + self.relu = nn.ReLU() + + def forward(self, x): + out = self.light_conv1(x) + out = self.light_repmlp(out) + out = self.repmlp_nonlinear(out) + out = self.light_conv3(out) + out += self.shortcut(x) + out = self.relu(out) + return out + + +# The input_ and output_channels of RepMLP are both mid_channels // r +class RepMLPBottleneckBlock(nn.Layer): + """RepMLPBottleneckBlock Layer + + The base module of the bottleneck structure RepMLPResNet network + """ + + def __init__( + self, + in_channels, + mid_channels, + out_channels, + r, + H, + W, + h, + w, + reparam_conv_k, + fc1_fc2_reduction, + fc3_groups, + deploy=False, + ): + super().__init__() + if in_channels != out_channels: + 
self.shortcut = ConvBN( + in_channels, out_channels, kernel_size=1, deploy=deploy + ) + else: + self.shortcut = Identity() + repmlp_channels = mid_channels // r + self.btnk_conv1 = ConvBNReLU( + in_channels, mid_channels, kernel_size=1, deploy=deploy + ) + self.btnk_conv2 = ConvBNReLU( + mid_channels, repmlp_channels, kernel_size=3, padding=1, deploy=deploy + ) + self.btnk_repmlp = RepMLP( + in_channels=repmlp_channels, + out_channels=repmlp_channels, + H=H, + W=W, + h=h, + w=w, + reparam_conv_k=reparam_conv_k, + fc1_fc2_reduction=fc1_fc2_reduction, + fc3_groups=fc3_groups, + deploy=deploy, + ) + self.repmlp_nonlinear = nn.ReLU() + self.btnk_conv4 = ConvBNReLU( + repmlp_channels, mid_channels, kernel_size=3, padding=1, deploy=deploy + ) + self.btnk_conv5 = ConvBN( + mid_channels, out_channels, kernel_size=1, deploy=deploy + ) + self.relu = nn.ReLU() + + def forward(self, x): + out = self.btnk_conv1(x) + out = self.btnk_conv2(out) + out = self.btnk_repmlp(out) + out = self.repmlp_nonlinear(out) + out = self.btnk_conv4(out) + out = self.btnk_conv5(out) + out += self.shortcut(x) + out = self.relu(out) + return out + + +# Original block of ResNet-50 + + +class BaseBlock(nn.Layer): + """BaseBlock Layer + + Constitute the basic building blocks of a RepMLPResNet network + """ + + def __init__(self, in_channels, mid_channels, out_channels, stride=1, deploy=False): + super().__init__() + if stride != 1 or in_channels != out_channels: + self.shortcut = ConvBN( + in_channels, out_channels, kernel_size=1, stride=stride, deploy=deploy + ) + else: + self.shortcut = Identity() + self.conv1 = ConvBNReLU(in_channels, mid_channels, kernel_size=1, deploy=deploy) + self.conv2 = ConvBNReLU( + mid_channels, + mid_channels, + kernel_size=3, + stride=stride, + padding=1, + deploy=deploy, + ) + self.conv3 = ConvBN(mid_channels, out_channels, kernel_size=1, deploy=deploy) + + def forward(self, x): + out = self.conv1(x) + out = self.conv2(out) + out = self.conv3(out) + out += self.shortcut(x) + out = F.relu(out) + return out + + +class RepMLPResNet(nn.Layer): + """RepMLPResNet-50 Layer + + RepMLPResNet-50 has three structures: + base: original ResNet-50 + light: RepMLP Light Block (55% faster, comparable accuracy) + bottleneck: RepMLP Bottleneck Block (much higher accuracy, comparable speed) + + Args: + block_type(str): "base", "light", "bottleneck" + """ + + def __init__( + self, + num_blocks, + num_classes, + block_type, + img_H, + img_W, + h, + w, + reparam_conv_k, + fc1_fc2_reduction, + fc3_groups, + deploy=False, + # r=2 for stage2 and r=4 for stage3 + bottleneck_r=(2, 4), + ): + super().__init__() + assert block_type in ["base", "light", "bottleneck"] + self.block_type = block_type + self.deploy = deploy + + self.img_H = img_H + self.img_W = img_W + self.h = h + self.w = w + self.reparam_conv_k = reparam_conv_k + self.fc1_fc2_reduction = fc1_fc2_reduction + self.fc3_groups = fc3_groups + self.bottleneck_r = bottleneck_r + + self.in_channels = 64 + channels = [256, 512, 1024, 2048] + + self.stage0 = nn.Sequential( + ConvBNReLU( + in_channels=3, + out_channels=self.in_channels, + kernel_size=7, + stride=2, + padding=3, + deploy=deploy, + ), + nn.MaxPool2D(kernel_size=3, stride=2, padding=1), + ) + self.stage1 = self._make_stage( + channels[0], num_blocks[0], stride=1, total_downsample_ratio=4 + ) + self.stage2 = self._make_stage( + channels[1], num_blocks[1], stride=2, total_downsample_ratio=8 + ) + self.stage3 = self._make_stage( + channels[2], num_blocks[2], stride=2, total_downsample_ratio=16 + ) + 
self.stage4 = self._make_stage( + channels[3], num_blocks[3], stride=2, total_downsample_ratio=32 + ) + self.gap = nn.AdaptiveAvgPool2D(output_size=1) + self.linear = nn.Linear(channels[3], num_classes) + + def forward(self, x): + out = self.stage0(x) + out = self.stage1(out) + out = self.stage2(out) + out = self.stage3(out) + out = self.stage4(out) + out = self.gap(out) + out = out.reshape([out.shape[0], -1]) + out = self.linear(out) + return out + + def _make_stage(self, channels, num_blocks, stride, total_downsample_ratio): + strides = [stride] + [1] * (num_blocks - 1) + blocks = [] + for _, stride in enumerate(strides): + # Only use RepMLP in stage2 and stage3, as described in the paper + if ( + self.block_type == "base" + or stride == 2 + or (total_downsample_ratio not in [8, 16]) + ): + cur_block = BaseBlock( + in_channels=self.in_channels, + mid_channels=channels // 4, + out_channels=channels, + stride=stride, + deploy=self.deploy, + ) + elif self.block_type == "light": + cur_block = RepMLPLightBlock( + in_channels=self.in_channels, + mid_channels=channels // 8, + out_channels=channels, + H=self.img_H // total_downsample_ratio, + W=self.img_W // total_downsample_ratio, + h=self.h, + w=self.w, + reparam_conv_k=self.reparam_conv_k, + fc1_fc2_reduction=self.fc1_fc2_reduction, + fc3_groups=self.fc3_groups, + deploy=self.deploy, + ) + elif self.block_type == "bottleneck": + cur_block = RepMLPBottleneckBlock( + in_channels=self.in_channels, + mid_channels=channels // 4, + out_channels=channels, + r=self.bottleneck_r[0] + if total_downsample_ratio == 8 + else self.bottleneck_r[1], + H=self.img_H // total_downsample_ratio, + W=self.img_W // total_downsample_ratio, + h=self.h, + w=self.w, + reparam_conv_k=self.reparam_conv_k, + fc1_fc2_reduction=self.fc1_fc2_reduction, + fc3_groups=self.fc3_groups, + deploy=self.deploy, + ) + else: + raise ValueError("Not supported.") + + blocks.append(cur_block) + self.in_channels = channels + + return nn.Sequential(*blocks) + + +def build_repmlp_resnet(config): + model = RepMLPResNet( + num_blocks=config.MODEL.MIXER.NUM_BLOCKS, + num_classes=config.MODEL.NUM_CLASSES, + block_type=config.MODEL.MIXER.BLOCK_TYPE, + img_H=config.MODEL.MIXER.IMG_H, + img_W=config.MODEL.MIXER.IMG_W, + h=config.MODEL.MIXER.H, + w=config.MODEL.MIXER.W, + reparam_conv_k=config.MODEL.MIXER.REPARAM_CONV_K, + fc1_fc2_reduction=config.MODEL.MIXER.FC1_FC2_REDUCTION, + fc3_groups=config.MODEL.MIXER.FC3_GROUPS, + deploy=config.MODEL.MIXER.DEPLOY, + ) + return model + + +def TestConvBN(): + print("=== Test training_to_deploy for ConvBN ===") + x = paddle.randn([1, 5, 22, 22]) + print("input shape:", x.shape) + m = ConvBN(5, 10, 3) + m.eval() + + out = m(x) + m.switch_to_deploy() + deployout = m(x) + print("difference between the outputs of the training-time and converted ConvBN") + print(((deployout - out) ** 2).sum().numpy().item()) + + +def TestModel(): + print("=== Test training_to_deploy for RepMLP_ResNet ===") + + x = paddle.randn([1, 3, 224, 224]) + print("input shape:", x.shape) + + model = RepMLPResNet( + num_blocks=[3, 4, 6, 3], + num_classes=1000, + block_type="light", + img_H=224, + img_W=224, + h=7, + w=7, + reparam_conv_k=(1, 3, 5), + fc1_fc2_reduction=1, + fc3_groups=4, + deploy=False, + ) + model.eval() + + out = model(x) + deploy_model = repmlp_model_convert(model) + deployout = deploy_model(x) + print( + "difference between the outputs of the training-time and converted RepMLP_ResNet" + ) + print(((deployout - out) ** 2).sum().numpy().item()) + print("Done!") + + +if 
__name__ == "__main__": + TestConvBN() + TestModel() diff --git a/image_classification/RepMLP/run_eval.sh b/image_classification/RepMLP/run_eval.sh new file mode 100644 index 00000000..a0f7ebb4 --- /dev/null +++ b/image_classification/RepMLP/run_eval.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ +-cfg='./configs/repmlpres50_light_224_train.yaml' \ +-dataset='imagenet2012' \ +-batch_size=128 \ +-data_path='/dataset/imagenet' \ +-eval \ +-pretrained='./RepMLP-Res50-light-224_train' diff --git a/image_classification/RepMLP/run_eval_multi.sh b/image_classification/RepMLP/run_eval_multi.sh new file mode 100644 index 00000000..aff88a92 --- /dev/null +++ b/image_classification/RepMLP/run_eval_multi.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=4,5,6,7 \ +python main_multi_gpu.py \ +-cfg='./configs/repmlpres50_light_224.yaml' \ +-dataset='imagenet2012' \ +-batch_size=128 \ +-data_path='/dataset/imagenet' \ +-eval \ +-pretrained='./RepMLP-Res50-light-224_train' diff --git a/image_classification/RepMLP/run_train.sh b/image_classification/RepMLP/run_train.sh new file mode 100644 index 00000000..926c0868 --- /dev/null +++ b/image_classification/RepMLP/run_train.sh @@ -0,0 +1,7 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ +-cfg='./configs/repmlpres50_light_224_train.yaml' \ +-dataset='imagenet2012' \ +-batch_size=4 \ +-data_path='/dataset/imagenet' \ +-amp diff --git a/image_classification/RepMLP/run_train_multi.sh b/image_classification/RepMLP/run_train_multi.sh new file mode 100644 index 00000000..309d17bb --- /dev/null +++ b/image_classification/RepMLP/run_train_multi.sh @@ -0,0 +1,7 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ +-cfg='./configs/repmlpres50_light_224_train.yaml' \ +-dataset='imagenet2012' \ +-batch_size=8 \ +-data_path='/dataset/imagenet' \ +-amp diff --git a/image_classification/RepMLP/transforms.py b/image_classification/RepMLP/transforms.py new file mode 100644 index 00000000..5a046912 --- /dev/null +++ b/image_classification/RepMLP/transforms.py @@ -0,0 +1,14 @@ +import random +import paddle +import paddle.nn +import paddle.vision.transforms as T + + +class RandomHorizontalFlip(): + def __init__(self, p=0.5): + self.p = p + + def __call__(self, image): + if random.random() < self.p: + return T.hflip(image) + return image diff --git a/image_classification/RepMLP/utils.py b/image_classification/RepMLP/utils.py new file mode 100644 index 00000000..44800527 --- /dev/null +++ b/image_classification/RepMLP/utils.py @@ -0,0 +1,120 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
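As the training entry point at the top of this patch shows, `train()` receives a `mixup_fn`, which is built from the `Mixup` helper added in `mixup.py`. A minimal sketch of how it transforms one batch is given below; the batch shape and hyperparameters are illustrative and not taken from a specific config.

```python
# Sketch of applying the Mixup helper from mixup.py to one batch.
# Shapes and hyperparameters below are examples only.
import paddle
from mixup import Mixup

mixup_fn = Mixup(mixup_alpha=0.8,
                 cutmix_alpha=1.0,
                 prob=1.0,
                 switch_prob=0.5,
                 label_smoothing=0.1,
                 num_classes=1000)

images = paddle.randn([8, 3, 224, 224])          # batch size must be even
labels = paddle.randint(0, 1000, [8])

images, soft_targets = mixup_fn(images, labels)  # batch mixed with its flipped copy (or cutmix boxes)
print(soft_targets.shape)                        # [8, 1000]: smoothed one-hot targets blended by lam
```

The soft targets produced this way are what a soft-target cross entropy criterion consumes during training, instead of the raw integer labels.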
+ +"""utils for ViT + +Contains AverageMeter for monitoring, get_exclude_from_decay_fn for training +and WarmupCosineScheduler for training + +""" + +import math +from paddle.optimizer.lr import LRScheduler + + +class AverageMeter(): + """ Meter for monitoring losses""" + def __init__(self): + self.avg = 0 + self.sum = 0 + self.cnt = 0 + self.reset() + + def reset(self): + """reset all values to zeros""" + self.avg = 0 + self.sum = 0 + self.cnt = 0 + + def update(self, val, n=1): + """update avg by val and n, where val is the avg of n values""" + self.sum += val * n + self.cnt += n + self.avg = self.sum / self.cnt + + + +def get_exclude_from_weight_decay_fn(exclude_list=[]): + """ Set params with no weight decay during the training + + For certain params, e.g., positional encoding in ViT, weight decay + may not needed during the learning, this method is used to find + these params. + + Args: + exclude_list: a list of params names which need to exclude + from weight decay. + Returns: + exclude_from_weight_decay_fn: a function returns True if param + will be excluded from weight decay + """ + if len(exclude_list) == 0: + exclude_from_weight_decay_fn = None + else: + def exclude_fn(param): + for name in exclude_list: + if param.endswith(name): + return False + return True + exclude_from_weight_decay_fn = exclude_fn + return exclude_from_weight_decay_fn + + +class WarmupCosineScheduler(LRScheduler): + """Warmup Cosine Scheduler + + First apply linear warmup, then apply cosine decay schedule. + Linearly increase learning rate from "warmup_start_lr" to "start_lr" over "warmup_epochs" + Cosinely decrease learning rate from "start_lr" to "end_lr" over remaining + "total_epochs - warmup_epochs" + + Attributes: + learning_rate: the starting learning rate (without warmup), not used here! + warmup_start_lr: warmup starting learning rate + start_lr: the starting learning rate (without warmup) + end_lr: the ending learning rate after whole loop + warmup_epochs: # of epochs for warmup + total_epochs: # of total epochs (include warmup) + """ + def __init__(self, + learning_rate, + warmup_start_lr, + start_lr, + end_lr, + warmup_epochs, + total_epochs, + cycles=0.5, + last_epoch=-1, + verbose=False): + """init WarmupCosineScheduler """ + self.warmup_epochs = warmup_epochs + self.total_epochs = total_epochs + self.warmup_start_lr = warmup_start_lr + self.start_lr = start_lr + self.end_lr = end_lr + self.cycles = cycles + super(WarmupCosineScheduler, self).__init__(learning_rate, last_epoch, verbose) + + def get_lr(self): + """ return lr value """ + if self.last_epoch < self.warmup_epochs: + val = (self.start_lr - self.warmup_start_lr) * float( + self.last_epoch)/float(self.warmup_epochs) + self.warmup_start_lr + return val + + progress = float(self.last_epoch - self.warmup_epochs) / float( + max(1, self.total_epochs - self.warmup_epochs)) + val = max(0.0, 0.5 * (1. + math.cos(math.pi * float(self.cycles) * 2.0 * progress))) + val = max(0.0, val * (self.start_lr - self.end_lr) + self.end_lr) + return val diff --git a/image_classification/ResMLP/README.md b/image_classification/ResMLP/README.md index b52ea5db..f8b4e22a 100644 --- a/image_classification/ResMLP/README.md +++ b/image_classification/ResMLP/README.md @@ -5,29 +5,41 @@ PaddlePaddle training/validation code and pretrained models for **ResMLP**. The official and 3rd party pytorch implementation are [here](https://github.com/facebookresearch/deit) and [here](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/mlp_mixer.py). 
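The `WarmupCosineScheduler` defined in `utils.py` above ramps the learning rate linearly from `warmup_start_lr` to `start_lr` over `warmup_epochs`, then follows a half-cosine curve down to `end_lr` over the remaining epochs. A small usage sketch, stepping once per epoch as the training loops do; the hyperparameter values are examples only:

```python
# Illustrative use of WarmupCosineScheduler from utils.py; values are examples only.
from utils import WarmupCosineScheduler

scheduler = WarmupCosineScheduler(learning_rate=0.001,
                                  warmup_start_lr=5e-7,
                                  start_lr=0.001,
                                  end_lr=5e-6,
                                  warmup_epochs=20,
                                  total_epochs=300)
for epoch in range(300):
    lr = scheduler.get_lr()  # linear warmup for the first 20 epochs, cosine decay afterwards
    # ... run one training epoch with this lr ...
    scheduler.step()         # advance the schedule once per epoch, as in the training loops
```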
-This implementation is developed by [PaddleViT](https://github.com/BR-IDL/PaddleViT.git). - +This implementation is developed by [PPViT](https://github.com/xperzy/PPViT/tree/master).

-drawing -

ResMLP Model Overview

+drawing +

ResMLP Model Overview

- ### Update -Update (2021-08-11): Code is released and ported weights are uploaded. + +- Update (2020-09-27): Model FLOPs and # params are uploaded. +- Update (2020-09-24): Update new ResMLP weights. + +- Update (2020-09-23): Add new ResMLP weights. + +- Update (2020-08-11): Code is released and ported weights are uploaded. ## Models Zoo -| Model | Acc@1 | Acc@5 | Image Size | Crop_pct | Interpolation | Link | -|--------------------------------|-------|-------|------------|----------|---------------|--------------| -| resmlp_24_224 | 79.38 | 94.55 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/15A5q1XSXBz-y1AcXhy_XaDymLLj2s2Tn/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1nLAvyG53REdwYNCLmp4yBA)(jdcx) | -| resmlp_36_224 | 79.77 | 94.89 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1WrhVm-7EKnLmPU18Xm0C7uIqrg-RwqZL/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1QD4EWmM9b2u1r8LsnV6rUA)(33w3) | -| resmlp_big_24_224 | 81.04 | 95.02 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1KLlFuzYb17tC5Mmue3dfyr2L_q4xHTZi/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1oXU6CR0z7O0XNwu_UdZv_w)(r9kb) | -| resmlp_big_24_distilled_224 | 83.59 | 96.65 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/199q0MN_BlQh9-HbB28RdxHj1ApMTHow-/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1yUrfbqW8vLODDiRV5WWkhQ)(4jk5) | + +**Original**: + +| Model | Acc@1 | Acc@5 | #Params | FLOPs | Image Size | Crop_pct | Interpolation | Link | +|-------------------------------|-------|-------|---------|--------|------------|----------|---------------|--------------| +| resmlp_24_224 | 79.38 | 94.55 | 30.0M | 6.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/15A5q1XSXBz-y1AcXhy_XaDymLLj2s2Tn/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1nLAvyG53REdwYNCLmp4yBA)(jdcx) | +| resmlp_36_224 | 79.77 | 94.89 | 44.7M | 9.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1WrhVm-7EKnLmPU18Xm0C7uIqrg-RwqZL/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1QD4EWmM9b2u1r8LsnV6rUA)(33w3) | +| resmlp_big_24_224 | 81.04 | 95.02 | 129.1M | 100.7G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1KLlFuzYb17tC5Mmue3dfyr2L_q4xHTZi/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1oXU6CR0z7O0XNwu_UdZv_w)(r9kb) | +| resmlp_12_distilled_224 | 77.95 | 93.56 | 15.3M | 3.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1cDMpAtCB0pPv6F-VUwvgwAaYtmP8IfRw/view?usp=sharing)/[baidu](https://pan.baidu.com/s/15kJeZ_V1MMjTX9f1DBCgnw)(ghyp) | +| resmlp_24_distilled_224 | 80.76 | 95.22 | 30.0M | 6.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/15d892ExqR1sIAjEn-cWGlljX54C3vihA/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1NgQtSwuAwsVVOB8U6N4Aqw)(sxnx) | +| resmlp_36_distilled_224 | 81.15 | 95.48 | 44.7M | 9.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1Laqz1oDg-kPh6eb6bekQqnE0m-JXeiep/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1p1xGOJbMzH_RWEj36ruQiw)(vt85) | +| resmlp_big_24_distilled_224 | 83.59 | 96.65 | 129.1M | 100.7G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/199q0MN_BlQh9-HbB28RdxHj1ApMTHow-/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1yUrfbqW8vLODDiRV5WWkhQ)(4jk5) | +| resmlp_big_24_22k_224 | 84.40 | 97.11 | 129.1M | 100.7G | 224 | 0.875 | bicubic | 
[google](https://drive.google.com/file/d/1zATKq1ruAI_kX49iqJOl-qomjm9il1LC/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1VrnRMbzzZBmLiR45YwICmA)(ve7i) | + > *The results are evaluated on ImageNet2012 validation set. -> -> Note: ResMLP weights are ported from [timm](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/mlp_mixer.py) +> +> Note: ResMLP weights are ported from [timm](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/mlp_mixer.py) and [facebookresearch](https://github.com/facebookresearch/deit/blob/main/README_resmlp.md) @@ -72,8 +84,8 @@ from resmlp import build_res_mlp as build_model config = get_config('./configs/resmlp_24_224.yaml') # build model model = build_model(config) -# load pretrained weights, .pdparams is NOT needed -model_state_dict = paddle.load('./resmlp_24_224') +# load pretrained weights +model_state_dict = paddle.load('./resmlp_24_224.pdparams') model.set_dict(model_state_dict) ``` @@ -86,12 +98,12 @@ or ```shell CUDA_VISIBLE_DEVICES=0 \ python main_single_gpu.py \ - -cfg='./configs/resmlp_24_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/resmlp_24_224.yaml \ + -dataset=imagenet2012 \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/val \ -eval \ - -pretrained='./resmlp_24_224' + -pretrained=./path/to/pretrained/model/resmlp_24_224 # .pdparams is NOT needed ```
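The preprocessing that `get_val_transforms` in `datasets.py` applies during these evaluation runs (bicubic resize to `image_size / crop_pct`, center crop, ImageNet normalization) can also be reproduced on a single image as a quick sanity check. The sketch below assumes the weights have already been downloaded; the image path is a placeholder.

```python
# Minimal single-image sanity check mirroring the val transforms in datasets.py.
# './lenna.png' and the checkpoint path are placeholders.
import math
import paddle
from PIL import Image
from paddle.vision import transforms
from config import get_config
from resmlp import build_res_mlp as build_model

config = get_config('./configs/resmlp_24_224.yaml')
model = build_model(config)
model.set_dict(paddle.load('./resmlp_24_224.pdparams'))
model.eval()

scale_size = int(math.floor(config.DATA.IMAGE_SIZE / config.DATA.CROP_PCT))  # 256 when crop_pct=0.875
val_tf = transforms.Compose([
    transforms.Resize(scale_size, interpolation='bicubic'),
    transforms.CenterCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE)),
    transforms.ToTensor(),
    transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD),
])

img = val_tf(Image.open('./lenna.png').convert('RGB')).unsqueeze(0)
with paddle.no_grad():
    logits = model(img)
print(int(logits.argmax(axis=-1)))  # predicted ImageNet class index
```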
@@ -108,12 +120,12 @@ or ```shell CUDA_VISIBLE_DEVICES=0,1,2,3 \ python main_multi_gpu.py \ - -cfg='./configs/resmlp_24_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/resmlp_24_224.yaml \ + -dataset=imagenet2012 \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/val \ -eval \ - -pretrained='./resmlp_24_224' + -pretrained=/path/to/pretrained/model/resmlp_24_224 # .pdparams is NOT needed ```
@@ -127,10 +139,10 @@ or ```shell CUDA_VISIBLE_DEVICES=0 \ python main_single_gpu.py \ - -cfg='./configs/resmlp_24_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/resmlp_24_224.yaml \ + -dataset=imagenet2012 \ -batch_size=32 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/train ```
@@ -147,10 +159,10 @@ or ```shell CUDA_VISIBLE_DEVICES=0,1,2,3 \ python main_multi_gpu.py \ - -cfg='./configs/resmlp_24_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/resmlp_24_224.yaml \ + -dataset=imagenet2012 \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/train ```
diff --git a/image_classification/ResMLP/__init__.py b/image_classification/ResMLP/__init__.py new file mode 100644 index 00000000..e2cbd538 --- /dev/null +++ b/image_classification/ResMLP/__init__.py @@ -0,0 +1 @@ +#init diff --git a/image_classification/ResMLP/augment.py b/image_classification/ResMLP/augment.py new file mode 100644 index 00000000..7a7f081c --- /dev/null +++ b/image_classification/ResMLP/augment.py @@ -0,0 +1,285 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Augmentation""" +""" Rand Augmentation """ +# reference: RandAugment: Practical automated data augmentation with a reduced search space +# https://arxiv.org/abs/1909.13719 + +""" Auto Augmentation """ +# reference: AutoAugment: Learning Augmentation Policies from Data +# https://arxiv.org/abs/1805.09501 + +import random +import numpy as np +from PIL import Image, ImageEnhance, ImageOps + + +def auto_augment_policy_original(): + """25 types of augment policies in original paper""" + policy = [ + [('Posterize', 0.4, 8), ('Rotate', 0.6, 9)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + [('Posterize', 0.6, 7), ('Posterize', 0.6, 6)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Equalize', 0.4, 4), ('Rotate', 0.8, 8)], + [('Solarize', 0.6, 3), ('Equalize', 0.6, 7)], + [('Posterize', 0.8, 5), ('Equalize', 1.0, 2)], + [('Rotate', 0.2, 3), ('Solarize', 0.6, 8)], + [('Equalize', 0.6, 8), ('Posterize', 0.4, 6)], + [('Rotate', 0.8, 8), ('Color', 0.4, 0)], + [('Rotate', 0.4, 9), ('Equalize', 0.6, 2)], + [('Equalize', 0.0, 7), ('Equalize', 0.8, 8)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Rotate', 0.8, 8), ('Color', 1.0, 2)], + [('Color', 0.8, 8), ('Solarize', 0.8, 7)], + [('Sharpness', 0.4, 7), ('Invert', 0.6, 8)], + [('ShearX', 0.6, 5), ('Equalize', 1.0, 9)], + [('Color', 0.4, 0), ('Equalize', 0.6, 3)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + ] + policy = [[SubPolicy(*args) for args in subpolicy] for subpolicy in policy] + return policy + + +def rand_augment_policy_original(magnitude_idx=9): + """ + 14 types of augment policies in original paper + Args: + magnitude_idx: M + """ + policy = [ + ('Posterize', 1, magnitude_idx), ('Rotate', 1, magnitude_idx), + ('Solarize', 1, magnitude_idx), ('AutoContrast', 1, magnitude_idx), + ('Equalize', 1, magnitude_idx), ('Contrast', 1, magnitude_idx), + ('Color', 1, magnitude_idx), ('Invert', 1, magnitude_idx), + ('Sharpness', 1, magnitude_idx), ('Brightness', 1, magnitude_idx), + ('ShearX', 1, magnitude_idx), ('ShearY', 1, magnitude_idx), + ('TranslateX', 1, magnitude_idx), ('TranslateY', 1, magnitude_idx), + ] + policy = [SubPolicy(*args) for args in policy] + return policy + + +class AutoAugment(): + """Auto 
Augment + Randomly choose a tuple of augment ops from a list of policy + Then apply the tuple of augment ops to input image + + Examples: + policy = auto_augment_policy_original() + augment = AutoAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy): + self.policy = policy + + def __call__(self, image, policy_idx=None): + if policy_idx is None: + policy_idx = random.randint(0, len(self.policy) - 1) + + sub_policy = self.policy[policy_idx] + for op in sub_policy: + image = op(image) + return image + + +class RandAugment(): + """Rand Augment + Randomly choose N augment ops from a list of K policies + Then apply the N ops to input image + + Examples: + policy = rand_augment_policy_original(magnitude_idx) + augment = RandAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy, num_layers=2): + """ + Args: + policy: list of SubPolicy + num_layers: int + """ + self.policy = policy + self.num_layers = num_layers + + def __call__(self, image): + selected_idx = np.random.choice(len(self.policy), self.num_layers) + + for policy_idx in selected_idx: + sub_policy = self.policy[policy_idx] + image = sub_policy(image) + return image + + +class SubPolicy: + """Subpolicy + Read augment name and magnitude, apply augment with probability + Args: + op_name: str, augment operation name + prob: float, if prob > random prob, apply augment + magnitude_idx: int, index of magnitude in preset magnitude ranges + """ + + def __init__(self, op_name, prob, magnitude_idx): + # ranges of operations' magnitude + ranges = { + 'ShearX': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'ShearY': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'TranslateX': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'TranslateY': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'Rotate': np.linspace(0, 30, 10), # [-30, 30] (by random negative) + 'Color': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Posterize': np.round(np.linspace(8, 4, 10), 0).astype(np.int), # [0, 4] + 'Solarize': np.linspace(256, 0, 10), # [0, 256] + 'Contrast': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Sharpness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Brightness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'AutoContrast': [0] * 10, # no range + 'Equalize': [0] * 10, # no range + 'Invert': [0] * 10, # no range + } + + # augmentation operations + # Lambda is not pickleable for DDP + # image_ops = { + # 'ShearX': lambda image, magnitude: shear_x(image, magnitude), + # 'ShearY': lambda image, magnitude: shear_y(image, magnitude), + # 'TranslateX': lambda image, magnitude: translate_x(image, magnitude), + # 'TranslateY': lambda image, magnitude: translate_y(image, magnitude), + # 'Rotate': lambda image, magnitude: rotate(image, magnitude), + # 'AutoContrast': lambda image, magnitude: auto_contrast(image, magnitude), + # 'Invert': lambda image, magnitude: invert(image, magnitude), + # 'Equalize': lambda image, magnitude: equalize(image, magnitude), + # 'Solarize': lambda image, magnitude: solarize(image, magnitude), + # 'Posterize': lambda image, magnitude: posterize(image, magnitude), + # 'Contrast': lambda image, magnitude: contrast(image, magnitude), + # 'Color': lambda image, magnitude: color(image, magnitude), + # 'Brightness': lambda image, magnitude: brightness(image, magnitude), + # 'Sharpness': lambda image, magnitude: sharpness(image, 
magnitude), + # } + image_ops = { + 'ShearX': shear_x, + 'ShearY': shear_y, + 'TranslateX': translate_x_relative, + 'TranslateY': translate_y_relative, + 'Rotate': rotate, + 'AutoContrast': auto_contrast, + 'Invert': invert, + 'Equalize': equalize, + 'Solarize': solarize, + 'Posterize': posterize, + 'Contrast': contrast, + 'Color': color, + 'Brightness': brightness, + 'Sharpness': sharpness, + } + + self.prob = prob + self.magnitude = ranges[op_name][magnitude_idx] + self.op = image_ops[op_name] + + def __call__(self, image): + if self.prob > random.random(): + image = self.op(image, self.magnitude) + return image + + +# PIL Image transforms +# https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.transform +def shear_x(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, factor, 0, 0, 1, 0), fillcolor=fillcolor) + + +def shear_y(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, factor, 1, 0), fillcolor=fillcolor) + + +def translate_x_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, pixels, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, pixels), fillcolor=fillcolor) + + +def translate_x_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, magnitude, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, magnitude), fillcolor=fillcolor) + + +def rotate(image, magnitude): + rot = image.convert("RGBA").rotate(magnitude) + return Image.composite(rot, + Image.new('RGBA', rot.size, (128,) * 4), + rot).convert(image.mode) + + +def auto_contrast(image, magnitude=None): + return ImageOps.autocontrast(image) + + +def invert(image, magnitude=None): + return ImageOps.invert(image) + + +def equalize(image, magnitude=None): + return ImageOps.equalize(image) + + +def solarize(image, magnitude): + return ImageOps.solarize(image, magnitude) + + +def posterize(image, magnitude): + return ImageOps.posterize(image, magnitude) + + +def contrast(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Contrast(image).enhance(1 + magnitude) + + +def color(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Color(image).enhance(1 + magnitude) + + +def brightness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Brightness(image).enhance(1 + magnitude) + + +def sharpness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Sharpness(image).enhance(1 + magnitude) + diff --git a/image_classification/ResMLP/config.py b/image_classification/ResMLP/config.py index 
3ab6abf0..3643d233 100644 --- a/image_classification/ResMLP/config.py +++ b/image_classification/ResMLP/config.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -35,6 +35,8 @@ _C.DATA.IMAGE_SIZE = 224 # input image size: 224 for pretrain, 384 for finetune _C.DATA.CROP_PCT = 1.0 # input image scale ratio, scale is applied before centercrop in eval mode _C.DATA.NUM_WORKERS = 2 # number of data loading threads +_C.DATA.IMAGENET_MEAN = [0.485, 0.456, 0.406] # [0.5, 0.5, 0.5] +_C.DATA.IMAGENET_STD = [0.229, 0.224, 0.225] # [0.5, 0.5, 0.5] # model settings _C.MODEL = CN() @@ -43,8 +45,9 @@ _C.MODEL.RESUME = None _C.MODEL.PRETRAINED = None _C.MODEL.NUM_CLASSES = 1000 -_C.MODEL.DROPOUT = 0.1 -_C.MODEL.DROPPATH = 0.1 +_C.MODEL.DROPOUT = 0.0 +_C.MODEL.ATTENTION_DROPOUT = 0.0 +_C.MODEL.DROP_PATH = 0.1 # transformer settings _C.MODEL.MIXER = CN() @@ -56,13 +59,14 @@ _C.TRAIN = CN() _C.TRAIN.LAST_EPOCH = 0 _C.TRAIN.NUM_EPOCHS = 300 -_C.TRAIN.WARMUP_EPOCHS = 3 #34 # ~ 10k steps for 4096 batch size -_C.TRAIN.WEIGHT_DECAY = 0.01 #0.3 # 0.0 for finetune -_C.TRAIN.BASE_LR = 0.001 #0.003 for pretrain # 0.03 for finetune -_C.TRAIN.WARMUP_START_LR = 1e-6 #0.0 -_C.TRAIN.END_LR = 1e-5 -_C.TRAIN.GRAD_CLIP = 1.0 -_C.TRAIN.ACCUM_ITER = 2 #1 +_C.TRAIN.WARMUP_EPOCHS = 20 +_C.TRAIN.WEIGHT_DECAY = 0.05 +_C.TRAIN.BASE_LR = 0.001 +_C.TRAIN.WARMUP_START_LR = 5e-7 +_C.TRAIN.END_LR = 5e-6 +_C.TRAIN.GRAD_CLIP = 5.0 +_C.TRAIN.ACCUM_ITER = 1 +_C.TRAIN.LINEAR_SCALED_LR = None _C.TRAIN.LR_SCHEDULER = CN() _C.TRAIN.LR_SCHEDULER.NAME = 'warmupcosine' @@ -76,6 +80,24 @@ _C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999) # for adamW _C.TRAIN.OPTIMIZER.MOMENTUM = 0.9 +# train augmentation +_C.TRAIN.MIXUP_ALPHA = 0.8 +_C.TRAIN.CUTMIX_ALPHA = 1.0 +_C.TRAIN.CUTMIX_MINMAX = None +_C.TRAIN.MIXUP_PROB = 1.0 +_C.TRAIN.MIXUP_SWITCH_PROB = 0.5 +_C.TRAIN.MIXUP_MODE = 'batch' + +_C.TRAIN.SMOOTHING = 0.1 +_C.TRAIN.COLOR_JITTER = 0.4 +_C.TRAIN.AUTO_AUGMENT = False #'rand-m9-mstd0.5-inc1' +_C.TRAIN.RAND_AUGMENT = False + +_C.TRAIN.RANDOM_ERASE_PROB = 0.25 +_C.TRAIN.RANDOM_ERASE_MODE = 'pixel' +_C.TRAIN.RANDOM_ERASE_COUNT = 1 +_C.TRAIN.RANDOM_ERASE_SPLIT = False + # misc _C.SAVE = "./output" _C.TAG = "default" @@ -84,8 +106,9 @@ _C.VALIDATE_FREQ = 20 # freq to do validation _C.SEED = 0 _C.EVAL = False # run evaluation only +_C.AMP = False # mix precision training _C.LOCAL_RANK = 0 -_C.NGPUS = 1 +_C.NGPUS = -1 def _update_config_from_file(config, cfg_file): @@ -117,8 +140,12 @@ def update_config(config, args): config.DATA.BATCH_SIZE = args.batch_size if args.image_size: config.DATA.IMAGE_SIZE = args.image_size + if args.num_classes: + config.MODEL.NUM_CLASSES = args.num_classes if args.data_path: config.DATA.DATA_PATH = args.data_path + if args.output is not None: + config.SAVE = args.output if args.ngpus: config.NGPUS = args.ngpus if args.eval: @@ -130,6 +157,11 @@ def update_config(config, args): config.MODEL.RESUME = args.resume if args.last_epoch: config.TRAIN.LAST_EPOCH = args.last_epoch + if args.amp: # only during training + if config.EVAL is True: + config.AMP = False + else: + config.AMP = True #config.freeze() return config diff --git a/image_classification/ResMLP/configs/resmlp_12_224.yaml b/image_classification/ResMLP/configs/resmlp_12_224.yaml new file mode 100644 index 00000000..44cee0aa --- /dev/null +++ 
b/image_classification/ResMLP/configs/resmlp_12_224.yaml @@ -0,0 +1,11 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: ResMLP + NAME: resmlp_12_224 + MIXER: + PATCH_SIZE: 16 + HIDDEN_SIZE: 384 + NUM_LAYERS: 12 + diff --git a/image_classification/ResMLP/datasets.py b/image_classification/ResMLP/datasets.py index a52d9fe3..304df9a3 100644 --- a/image_classification/ResMLP/datasets.py +++ b/image_classification/ResMLP/datasets.py @@ -19,8 +19,20 @@ import os import math -from paddle.io import Dataset, DataLoader, DistributedBatchSampler -from paddle.vision import transforms, datasets, image_load +from PIL import Image +from paddle.io import Dataset +from paddle.io import DataLoader +from paddle.io import DistributedBatchSampler +from paddle.vision import transforms +from paddle.vision import datasets +from paddle.vision import image_load +from augment import auto_augment_policy_original +from augment import AutoAugment +from augment import rand_augment_policy_original +from augment import RandAugment +from transforms import RandomHorizontalFlip +from random_erasing import RandomErasing + class ImageNet2012Dataset(Dataset): """Build ImageNet2012 dataset @@ -80,13 +92,36 @@ def get_train_transforms(config): transforms_train: training transforms """ - transforms_train = transforms.Compose([ + aug_op_list = [] + # STEP1: random crop and resize + aug_op_list.append( transforms.RandomResizedCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE), - scale=(0.05, 1.0)), - transforms.ToTensor(), - #transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]), - transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]), - ]) + scale=(0.05, 1.0), interpolation='bicubic')) + # STEP2: auto_augment or color jitter + if config.TRAIN.AUTO_AUGMENT: + policy = auto_augment_policy_original() + auto_augment = AutoAugment(policy) + aug_op_list.append(auto_augment) + elif config.TRAIN.RAND_AUGMENT: + policy = rand_augment_policy_original() + rand_augment = RandAugment(policy) + aug_op_list.append(rand_augment) + else: + jitter = (float(config.TRAIN.COLOR_JITTER), ) * 3 + aug_op_list.append(transforms.ColorJitter(*jitter)) + # STEP3: other ops + aug_op_list.append(transforms.ToTensor()) + aug_op_list.append(transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, + std=config.DATA.IMAGENET_STD)) + # STEP4: random erasing + if config.TRAIN.RANDOM_ERASE_PROB > 0.: + random_erasing = RandomErasing(prob=config.TRAIN.RANDOM_ERASE_PROB, + mode=config.TRAIN.RANDOM_ERASE_MODE, + max_count=config.TRAIN.RANDOM_ERASE_COUNT, + num_splits=config.TRAIN.RANDOM_ERASE_SPLIT) + aug_op_list.append(random_erasing) + # Final: compose transforms and return + transforms_train = transforms.Compose(aug_op_list) return transforms_train @@ -106,11 +141,10 @@ def get_val_transforms(config): scale_size = int(math.floor(config.DATA.IMAGE_SIZE / config.DATA.CROP_PCT)) transforms_val = transforms.Compose([ - transforms.Resize(scale_size, 'bicubic'), # single int for resize shorter side of image + transforms.Resize(scale_size, interpolation='bicubic'), transforms.CenterCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE)), transforms.ToTensor(), - #transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]), - transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]), + transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD), ]) return transforms_val @@ -125,6 +159,7 @@ def get_dataset(config, mode='train'): Returns: dataset: dataset object """ + assert mode in 
['train', 'val'] if config.DATA.DATASET == "cifar10": if mode == 'train': diff --git a/image_classification/ResMLP/droppath.py b/image_classification/ResMLP/droppath.py index fcff05e9..c8fe8048 100644 --- a/image_classification/ResMLP/droppath.py +++ b/image_classification/ResMLP/droppath.py @@ -32,6 +32,7 @@ def drop_path(inputs, drop_prob=0., training=False): if drop_prob == 0. or not training: return inputs keep_prob = 1 - drop_prob + keep_prob = paddle.to_tensor(keep_prob) shape = (inputs.shape[0], ) + (1, ) * (inputs.ndim - 1) # shape=(N, 1, 1, 1) random_tensor = keep_prob + paddle.rand(shape, dtype=inputs.dtype) random_tensor = random_tensor.floor() # mask diff --git a/image_classification/ResMLP/losses.py b/image_classification/ResMLP/losses.py new file mode 100644 index 00000000..082467a3 --- /dev/null +++ b/image_classification/ResMLP/losses.py @@ -0,0 +1,123 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" Implement Loss functions """ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class LabelSmoothingCrossEntropyLoss(nn.Layer): + """ cross entropy loss for label smoothing + Args: + smoothing: float, smoothing rate + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, target label with shape [N] + Return: + loss: float, cross entropy loss value + """ + def __init__(self, smoothing=0.1): + super().__init__() + assert 0 <= smoothing < 1.0 + self.smoothing = smoothing + self.confidence = 1 - smoothing + + def forward(self, x, target): + log_probs = F.log_softmax(x) # [N, num_classes] + # target_index is used to get prob for each of the N samples + target_index = paddle.zeros([x.shape[0], 2], dtype='int64') # [N, 2] + target_index[:, 0] = paddle.arange(x.shape[0]) + target_index[:, 1] = target + + nll_loss = -log_probs.gather_nd(index=target_index) # index: [N] + smooth_loss = -log_probs.mean(axis=-1) + loss = self.confidence * nll_loss + self.smoothing * smooth_loss + return loss.mean() + + +class SoftTargetCrossEntropyLoss(nn.Layer): + """ cross entropy loss for soft target + Args: + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, soft target with shape [N, num_classes] + Returns: + loss: float, the mean loss value + """ + def __init__(self): + super().__init__() + + def forward(self, x, target): + loss = paddle.sum(-target * F.log_softmax(x, axis=-1), axis=-1) + return loss.mean() + + +class DistillationLoss(nn.Layer): + """Distillation loss function + This layer includes the orginal loss (criterion) and a extra + distillation loss (criterion), which computes the loss with + different type options, between current model and + a teacher model as its supervision. 
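
As a quick sanity check on the new `LabelSmoothingCrossEntropyLoss` added in `losses.py` above, the sketch below shows the intended call pattern; the tensor values are made up for illustration, and with `smoothing=0.0` the result should reduce to plain cross entropy.

```python
import paddle
from losses import LabelSmoothingCrossEntropyLoss  # the class defined above

# toy logits for N=2 samples and 3 classes (values are illustrative only)
logits = paddle.to_tensor([[2.0, 0.5, -1.0],
                           [0.1, 1.5,  0.3]])
labels = paddle.to_tensor([0, 1])  # class indices, shape [N]

criterion = LabelSmoothingCrossEntropyLoss(smoothing=0.1)
loss = criterion(logits, labels)  # scalar: 0.9 * nll_loss + 0.1 * smooth_loss
print(float(loss))
```
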
+ + Args: + base_criterion: nn.Layer, the original criterion + teacher_model: nn.Layer, the teacher model as supervision + distillation_type: str, one of ['none', 'soft', 'hard'] + alpha: float, ratio of base loss (* (1-alpha)) + and distillation loss( * alpha) + tao: float, temperature in distillation + """ + def __init__(self, + base_criterion, + teacher_model, + distillation_type, + alpha, + tau): + super().__init__() + assert distillation_type in ['none', 'soft', 'hard'] + self.base_criterion = base_criterion + self.teacher_model = teacher_model + self.type = distillation_type + self.alpha = alpha + self.tau = tau + + def forward(self, inputs, outputs, targets): + """ + Args: + inputs: tensor, the orginal model inputs + outputs: tensor, the outputs of the model + outputds_kd: tensor, the distillation outputs of the model, + this is usually obtained by a separate branch + in the last layer of the model + targets: tensor, the labels for the base criterion + """ + outputs, outputs_kd = outputs[0], outputs[1] + base_loss = self.base_criterion(outputs, targets) + if self.type == 'none': + return base_loss + + with paddle.no_grad(): + teacher_outputs = self.teacher_model(inputs) + + if self.type == 'soft': + distillation_loss = F.kl_div( + F.log_softmax(outputs_kd / self.tau, axis=1), + F.log_softmax(teacher_outputs / self.tau, axis=1), + reduction='sum') * (self.tau * self.tau) / outputs_kd.numel() + elif self.type == 'hard': + distillation_loss = F.cross_entropy(outputs_kd, teacher_outputs.argmax(axis=1)) + + loss = base_loss * (1 - self.alpha) + distillation_loss * self.alpha + return loss + + diff --git a/image_classification/ResMLP/main_multi_gpu.py b/image_classification/ResMLP/main_multi_gpu.py index 6dd1b915..4f83a949 100644 --- a/image_classification/ResMLP/main_multi_gpu.py +++ b/image_classification/ResMLP/main_multi_gpu.py @@ -25,54 +25,55 @@ import paddle.nn as nn import paddle.nn.functional as F import paddle.distributed as dist -from datasets import get_dataloader, get_dataset -from resmlp import build_res_mlp as build_model +from datasets import get_dataloader +from datasets import get_dataset from utils import AverageMeter from utils import WarmupCosineScheduler from config import get_config from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from resmlp import build_res_mlp as build_model -parser = argparse.ArgumentParser('ResMLP') -parser.add_argument('-cfg', type=str, default=None) -parser.add_argument('-dataset', type=str, default=None) -parser.add_argument('-batch_size', type=int, default=None) -parser.add_argument('-image_size', type=int, default=None) -parser.add_argument('-data_path', type=str, default=None) -parser.add_argument('-ngpus', type=int, default=None) -parser.add_argument('-pretrained', type=str, default=None) -parser.add_argument('-resume', type=str, default=None) -parser.add_argument('-last_epoch', type=int, default=None) -parser.add_argument('-eval', action='store_true') -arguments = parser.parse_args() - - -log_format = "%(asctime)s %(message)s" -logging.basicConfig(stream=sys.stdout, level=logging.INFO, - format=log_format, datefmt="%m%d %I:%M:%S %p") - -# get default config -config = get_config() -# update config by arguments -config = update_config(config, arguments) - -# set output folder -if not config.EVAL: - config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) -else: - config.SAVE = 
'{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) - -#config.freeze() - -if not os.path.exists(config.SAVE): - os.makedirs(config.SAVE, exist_ok=True) - -# set logging format -logger = logging.getLogger() -fh = logging.FileHandler(os.path.join(config.SAVE, 'log.txt')) -fh.setFormatter(logging.Formatter(log_format)) -logger.addHandler(fh) -logger.info(f'config= {config}') +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('ResMLP') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger def train(dataloader, @@ -80,18 +81,28 @@ def train(dataloader, criterion, optimizer, epoch, + total_epochs, total_batch, debug_steps=100, - accum_iter=1): + accum_iter=1, + mixup_fn=None, + amp=False, + local_logger=None, + master_logger=None): """Training for one epoch Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion epoch: int, current epoch - total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info - accum_iter: int, num of iters for accumulating gradients + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None Returns: train_loss_meter.avg train_acc_meter.avg @@ -100,63 +111,120 @@ def train(dataloader, model.train() train_loss_meter = AverageMeter() train_acc_meter = AverageMeter() + master_train_loss_meter = AverageMeter() + master_train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) time_st = time.time() for batch_id, data in enumerate(dataloader): image = data[0] label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = 
mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() - output = model(image) - loss = criterion(output, label) - - #NOTE: division may be needed depending on the loss function - # Here no division is needed: - # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' - # - #loss = loss / accum_iter + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() - loss.backward() + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) - if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): - optimizer.step() - optimizer.clear_grad() + batch_size = paddle.to_tensor(image.shape[0]) - pred = F.softmax(output) - acc = paddle.metric.accuracy(pred, label.unsqueeze(1)) + # sync from other gpus for overall loss and acc + master_loss = loss.clone() + master_acc = acc.clone() + master_batch_size = batch_size.clone() + dist.all_reduce(master_loss) + dist.all_reduce(master_acc) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc = master_acc / dist.get_world_size() + master_train_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_train_acc_meter.update(master_acc.numpy()[0], master_batch_size.numpy()[0]) - batch_size = image.shape[0] - train_loss_meter.update(loss.numpy()[0], batch_size) - train_acc_meter.update(acc.numpy()[0], batch_size) + train_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + train_acc_meter.update(acc.numpy()[0], batch_size.numpy()[0]) if batch_id % debug_steps == 0: - logger.info( - f"Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + - f"Step[{batch_id:04d}/{total_batch:04d}], " + - f"Avg Loss: {train_loss_meter.avg:.4f}, " + - f"Avg Acc: {train_acc_meter.avg:.4f}") + if local_logger: + local_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_train_loss_meter.avg:.4f}, " + + f"Avg Acc: {master_train_acc_meter.avg:.4f}") train_time = time.time() - time_st - return train_loss_meter.avg, train_acc_meter.avg, train_time - - -def validate(dataloader, model, criterion, total_batch, debug_steps=100): + return (train_loss_meter.avg, + train_acc_meter.avg, + master_train_loss_meter.avg, + master_train_acc_meter.avg, + train_time) + + +def validate(dataloader, + model, + criterion, + total_batch, + debug_steps=100, + local_logger=None, + master_logger=None): """Validation for whole dataset Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: 
nn.criterion total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info + debug_steps: int, num of iters to log info, default: 100 + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None Returns: - val_loss_meter.avg - val_acc1_meter.avg - val_acc5_meter.avg - val_time + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + master_val_loss_meter.avg: float, average loss on all processes/gpus + master_val_acc1_meter.avg: float, average top1 accuracy on all processes/gpus + master_val_acc5_meter.avg: float, average top5 accuracy on all processes/gpus + val_time: float, validation time """ model.eval() val_loss_meter = AverageMeter() val_acc1_meter = AverageMeter() val_acc5_meter = AverageMeter() + master_val_loss_meter = AverageMeter() + master_val_acc1_meter = AverageMeter() + master_val_acc5_meter = AverageMeter() time_st = time.time() with paddle.no_grad(): @@ -171,56 +239,140 @@ def validate(dataloader, model, criterion, total_batch, debug_steps=100): acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) - dist.all_reduce(loss) - dist.all_reduce(acc1) - dist.all_reduce(acc5) - loss = loss / dist.get_world_size() - acc1 = acc1 / dist.get_world_size() - acc5 = acc5 / dist.get_world_size() - batch_size = paddle.to_tensor(image.shape[0]) - dist.all_reduce(batch_size) + + master_loss = loss.clone() + master_acc1 = acc1.clone() + master_acc5 = acc5.clone() + master_batch_size = batch_size.clone() + + dist.all_reduce(master_loss) + dist.all_reduce(master_acc1) + dist.all_reduce(master_acc5) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc1 = master_acc1 / dist.get_world_size() + master_acc5 = master_acc5 / dist.get_world_size() + + master_val_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc1_meter.update(master_acc1.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc5_meter.update(master_acc5.numpy()[0], master_batch_size.numpy()[0]) val_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) val_acc1_meter.update(acc1.numpy()[0], batch_size.numpy()[0]) val_acc5_meter.update(acc5.numpy()[0], batch_size.numpy()[0]) if batch_id % debug_steps == 0: - logger.info( - f"Val Step[{batch_id:04d}/{total_batch:04d}], " + - f"Avg Loss: {val_loss_meter.avg:.4f}, " + - f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + - f"Avg Acc@5: {val_acc5_meter.avg:.4f}") - + if local_logger: + local_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {master_val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {master_val_acc5_meter.avg:.4f}") val_time = time.time() - time_st - return val_loss_meter.avg, val_acc1_meter.avg, val_acc5_meter.avg, val_time + return (val_loss_meter.avg, + val_acc1_meter.avg, + val_acc5_meter.avg, + master_val_loss_meter.avg, + master_val_acc1_meter.avg, + master_val_acc5_meter.avg, + val_time) def main_worker(*args): - # 0. 
Preparation + # STEP 0: Preparation + config = args[0] dist.init_parallel_env() last_epoch = config.TRAIN.LAST_EPOCH - world_size = paddle.distributed.get_world_size() - local_rank = paddle.distributed.get_rank() - logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + world_size = dist.get_world_size() + local_rank = dist.get_rank() seed = config.SEED + local_rank paddle.seed(seed) np.random.seed(seed) random.seed(seed) - # 1. Create model + # logger for each process/gpu + local_logger = get_logger( + filename=os.path.join(config.SAVE, 'log_{}.txt'.format(local_rank)), + logger_name='local_logger') + # overall logger + if local_rank == 0: + master_logger = get_logger( + filename=os.path.join(config.SAVE, 'log.txt'), + logger_name='master_logger') + master_logger.info(f'\n{config}') + else: + master_logger = None + local_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + if local_rank == 0: + master_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + + # STEP 1: Create model model = build_model(config) model = paddle.DataParallel(model) - # 2. Create train and val dataloader - dataset_train, dataset_val = args[0], args[1] - dataloader_train = get_dataloader(config, dataset_train, 'train', True) + + # STEP 2: Create train and val dataloader + dataset_train, dataset_val = args[1], args[2] + # Create training dataloader + if not config.EVAL: + dataloader_train = get_dataloader(config, dataset_train, 'train', True) + total_batch_train = len(dataloader_train) + local_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + if local_rank == 0: + master_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + # Create validation dataloader dataloader_val = get_dataloader(config, dataset_val, 'test', True) - total_batch_train = len(dataloader_train) total_batch_val = len(dataloader_val) - logging.info(f'----- Total # of train batch (single gpu): {total_batch_train}') - logging.info(f'----- Total # of val batch (single gpu): {total_batch_val}') - # 3. Define criterion - criterion = nn.CrossEntropyLoss() - # 4. 
Define optimizer and lr_scheduler + local_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + if local_rank == 0: + master_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + scheduler = None if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, @@ -242,7 +394,9 @@ def main_worker(*args): gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, last_epoch=last_epoch) else: - logging.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + local_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") if config.TRAIN.OPTIMIZER.NAME == "SGD": @@ -273,76 +427,120 @@ def main_worker(*args): # 'absolute_pos_embed', 'relative_position_bias_table']), ) else: - logging.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + local_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") - # 5. 
Load pretrained model / load resumt model and optimizer states + # STEP 6: Load pretrained model / load resumt model and optimizer states if config.MODEL.PRETRAINED: if (config.MODEL.PRETRAINED).endswith('.pdparams'): raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') model.set_dict(model_state) - logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + local_logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + if local_rank == 0: + master_logger.info( + f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") if config.MODEL.RESUME: - assert os.path.isfile(config.MODEL.RESUME+'.pdparams') is True - assert os.path.isfile(config.MODEL.RESUME+'.pdopt') is True - model_state = paddle.load(config.MODEL.RESUME+'.pdparams') + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') model.set_dict(model_state) opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') optimizer.set_state_dict(opt_state) - logger.info( - f"----- Resume Training: Load model and optmizer states from {config.MODEL.RESUME}") + local_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + if local_rank == 0: + master_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") - # 6. Validation + # STEP 7: Validation (eval mode) if config.EVAL: - logger.info('----- Start Validating') - val_loss, val_acc1, val_acc5, val_time = validate( + local_logger.info('----- Start Validating') + if local_rank == 0: + master_logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=criterion, + criterion=criterion_val, total_batch=total_batch_val, - debug_steps=config.REPORT_FREQ) - logger.info(f"Validation Loss: {val_loss:.4f}, " + - f"Validation Acc@1: {val_acc1:.4f}, " + - f"Validation Acc@5: {val_acc5:.4f}, " + - f"time: {val_time:.2f}") + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") return - # 6. Start training and validation - logging.info(f"Start training from epoch {last_epoch+1}.") + # STEP 8: Start training and validation (train mode) + local_logger.info(f"Start training from epoch {last_epoch+1}.") + if local_rank == 0: + master_logger.info(f"Start training from epoch {last_epoch+1}.") for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): # train - logging.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") - train_loss, train_acc, train_time = train(dataloader=dataloader_train, - model=model, - criterion=criterion, - optimizer=optimizer, - epoch=epoch, - total_batch=total_batch_train, - debug_steps=config.REPORT_FREQ, - accum_iter=config.TRAIN.ACCUM_ITER) + local_logger.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") + if local_rank == 0: + master_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + train_loss, train_acc, avg_loss, avg_acc, train_time = train( + dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=total_batch_train, + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + mixup_fn=mixup_fn, + amp=config.AMP, + local_logger=local_logger, + master_logger=master_logger) + scheduler.step() - logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + - f"Train Loss: {train_loss:.4f}, " + - f"Train Acc: {train_acc:.4f}, " + - f"time: {train_time:.2f}") + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {avg_loss:.4f}, " + + f"Train Acc: {avg_acc:.4f}, " + + f"time: {train_time:.2f}") + # validation if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: - logger.info(f'----- Validation after Epoch: {epoch}') - val_loss, val_acc1, val_acc5, val_time = validate( + local_logger.info(f'----- Validation after Epoch: {epoch}') + if local_rank == 0: + master_logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=criterion, + criterion=criterion_val, total_batch=total_batch_val, - debug_steps=config.REPORT_FREQ) - logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + - f"Validation Loss: {val_loss:.4f}, " + - f"Validation Acc@1: {val_acc1:.4f}, " + - f"Validation Acc@5: {val_acc5:.4f}, " + - f"time: {val_time:.2f}") + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") # model save if local_rank == 0: if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: @@ -350,15 +548,33 @@ def main_worker(*args): config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") paddle.save(model.state_dict(), model_path + '.pdparams') paddle.save(optimizer.state_dict(), model_path + '.pdopt') - logger.info(f"----- Save model: {model_path}.pdparams") - logger.info(f"----- Save optim: {model_path}.pdopt") + master_logger.info(f"----- Save model: {model_path}.pdparams") + master_logger.info(f"----- Save optim: {model_path}.pdopt") def main(): - dataset_train = get_dataset(config, mode='train') + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, 
time.strftime('%Y%m%d-%H-%M-%S')) + + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + + # get dataset and start DDP + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + else: + dataset_train = None dataset_val = get_dataset(config, mode='val') config.NGPUS = len(paddle.static.cuda_places()) if config.NGPUS == -1 else config.NGPUS - dist.spawn(main_worker, args=(dataset_train, dataset_val, ), nprocs=config.NGPUS) + dist.spawn(main_worker, args=(config, dataset_train, dataset_val, ), nprocs=config.NGPUS) if __name__ == "__main__": diff --git a/image_classification/ResMLP/main_single_gpu.py b/image_classification/ResMLP/main_single_gpu.py index f50ed7b6..ded94338 100644 --- a/image_classification/ResMLP/main_single_gpu.py +++ b/image_classification/ResMLP/main_single_gpu.py @@ -1,5 +1,4 @@ - -# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -27,53 +26,54 @@ import paddle.nn.functional as F from datasets import get_dataloader from datasets import get_dataset -from resmlp import build_res_mlp as build_model from utils import AverageMeter from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn from config import get_config from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from resmlp import build_res_mlp as build_model -parser = argparse.ArgumentParser('ResMLP') -parser.add_argument('-cfg', type=str, default=None) -parser.add_argument('-dataset', type=str, default=None) -parser.add_argument('-batch_size', type=int, default=None) -parser.add_argument('-image_size', type=int, default=None) -parser.add_argument('-data_path', type=str, default=None) -parser.add_argument('-ngpus', type=int, default=None) -parser.add_argument('-pretrained', type=str, default=None) -parser.add_argument('-resume', type=str, default=None) -parser.add_argument('-last_epoch', type=int, default=None) -parser.add_argument('-eval', action='store_true') -args = parser.parse_args() - - -log_format = "%(asctime)s %(message)s" -logging.basicConfig(stream=sys.stdout, level=logging.INFO, - format=log_format, datefmt="%m%d %I:%M:%S %p") - -# get default config -config = get_config() -# update config by arguments -config = update_config(config, args) - -# set output folder -if not config.EVAL: - config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) -else: - config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) - -#config.freeze() - -if not os.path.exists(config.SAVE): - os.makedirs(config.SAVE, exist_ok=True) - -# set logging format -logger = logging.getLogger() -fh = logging.FileHandler(os.path.join(config.SAVE, 'log.txt')) -fh.setFormatter(logging.Formatter(log_format)) -logger.addHandler(fh) -logger.info(f'config= {config}') +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('ResMLP') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + 
parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger def train(dataloader, @@ -81,56 +81,82 @@ def train(dataloader, criterion, optimizer, epoch, + total_epochs, total_batch, debug_steps=100, - accum_iter=1): + accum_iter=1, + mixup_fn=None, + amp=False, + logger=None): """Training for one epoch Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion epoch: int, current epoch - total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info - accum_iter: int, num of iters for accumulating gradients + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + logger: logger for logging, default: None Returns: - train_loss_meter.avg - train_acc_meter.avg - train_time + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + train_time: float, training time """ model.train() train_loss_meter = AverageMeter() train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) time_st = time.time() for batch_id, data in enumerate(dataloader): image = data[0] label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() - output = model(image) - loss = criterion(output, label) - - #NOTE: division may be needed depending on the loss function - # Here no division is needed: - # default 'reduction' param in nn.CrossEntropyLoss is set to 
'mean' - #loss = loss / accum_iter - - loss.backward() - - if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): - optimizer.step() - optimizer.clear_grad() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() pred = F.softmax(output) - acc = paddle.metric.accuracy(pred, label.unsqueeze(1)) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) batch_size = image.shape[0] train_loss_meter.update(loss.numpy()[0], batch_size) train_acc_meter.update(acc.numpy()[0], batch_size) - if batch_id % debug_steps == 0: + if logger and batch_id % debug_steps == 0: logger.info( - f"Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + f"Step[{batch_id:04d}/{total_batch:04d}], " + f"Avg Loss: {train_loss_meter.avg:.4f}, " + f"Avg Acc: {train_acc_meter.avg:.4f}") @@ -139,19 +165,20 @@ def train(dataloader, return train_loss_meter.avg, train_acc_meter.avg, train_time -def validate(dataloader, model, criterion, total_batch, debug_steps=100): +def validate(dataloader, model, criterion, total_batch, debug_steps=100, logger=None): """Validation for whole dataset Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion - total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + logger: logger for logging, default: None Returns: - val_loss_meter.avg - val_acc1_meter.avg - val_acc5_meter.avg - val_time + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + val_time: float, valitaion time """ model.eval() val_loss_meter = AverageMeter() @@ -176,7 +203,7 @@ def validate(dataloader, model, criterion, total_batch, debug_steps=100): val_acc1_meter.update(acc1.numpy()[0], batch_size) val_acc5_meter.update(acc5.numpy()[0], batch_size) - if batch_id % debug_steps == 0: + if logger and batch_id % debug_steps == 0: logger.info( f"Val Step[{batch_id:04d}/{total_batch:04d}], " + f"Avg Loss: {val_loss_meter.avg:.4f}, " + @@ -188,24 +215,77 @@ def validate(dataloader, model, criterion, total_batch, debug_steps=100): def main(): - # 0. Preparation + # STEP 0: Preparation + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) last_epoch = config.TRAIN.LAST_EPOCH seed = config.SEED paddle.seed(seed) np.random.seed(seed) random.seed(seed) - #paddle.set_device('gpu:0') - # 1. Create model + logger = get_logger(filename=os.path.join(config.SAVE, 'log.txt')) + logger.info(f'\n{config}') + + # STEP 1: Create model model = build_model(config) - #model = paddle.DataParallel(model) - # 2. 
Create train and val dataloader - dataset_train = get_dataset(config, mode='train') + + # STEP 2: Create train and val dataloader + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + dataloader_train = get_dataloader(config, dataset_train, 'train', False) dataset_val = get_dataset(config, mode='val') - dataloader_train = get_dataloader(config, dataset_train, 'train', False) dataloader_val = get_dataloader(config, dataset_val, 'val', False) - # 3. Define criterion - criterion = nn.CrossEntropyLoss() - # 4. Define lr_scheduler + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + scheduler = None if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, @@ -214,8 +294,7 @@ def main(): end_lr=config.TRAIN.END_LR, warmup_epochs=config.TRAIN.WARMUP_EPOCHS, total_epochs=config.TRAIN.NUM_EPOCHS, - last_epoch=config.TRAIN.LAST_EPOCH, - ) + last_epoch=config.TRAIN.LAST_EPOCH) elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, T_max=config.TRAIN.NUM_EPOCHS, @@ -227,9 +306,9 @@ def main(): gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, last_epoch=last_epoch) else: - logging.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") - # 5. 
Define optimizer + if config.TRAIN.OPTIMIZER.NAME == "SGD": if config.TRAIN.GRAD_CLIP: clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) @@ -249,58 +328,67 @@ def main(): optimizer = paddle.optimizer.AdamW( parameters=model.parameters(), learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, - weight_decay=config.TRAIN.WEIGHT_DECAY, beta1=config.TRAIN.OPTIMIZER.BETAS[0], beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, epsilon=config.TRAIN.OPTIMIZER.EPS, grad_clip=clip) else: - logging.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") - # 6. Load pretrained model or load resume model and optimizer states + + # STEP 6: Load pretrained model or load resume model and optimizer states if config.MODEL.PRETRAINED: - assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') model.set_dict(model_state) logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") if config.MODEL.RESUME: - assert os.path.isfile(config.MODEL.RESUME+'.pdparams') is True - assert os.path.isfile(config.MODEL.RESUME+'.pdopt') is True - model_state = paddle.load(config.MODEL.RESUME+'.pdparams') + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') model.set_dict(model_state) - opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') + opt_state = paddle.load(config.MODEL.RESUME + '.pdopt') optimizer.set_state_dict(opt_state) logger.info( f"----- Resume: Load model and optmizer from {config.MODEL.RESUME}") - # 7. Validation + + # STEP 7: Validation (eval mode) if config.EVAL: logger.info('----- Start Validating') val_loss, val_acc1, val_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=criterion, + criterion=criterion_val, total_batch=len(dataloader_val), - debug_steps=config.REPORT_FREQ) + debug_steps=config.REPORT_FREQ, + logger=logger) logger.info(f"Validation Loss: {val_loss:.4f}, " + f"Validation Acc@1: {val_acc1:.4f}, " + f"Validation Acc@5: {val_acc5:.4f}, " + f"time: {val_time:.2f}") return - # 8. Start training and validation - logging.info(f"Start training from epoch {last_epoch+1}.") + + # STEP 8: Start training and validation (train mode) + logger.info(f"Start training from epoch {last_epoch+1}.") for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): # train - logging.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + logger.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") train_loss, train_acc, train_time = train(dataloader=dataloader_train, model=model, criterion=criterion, optimizer=optimizer, epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, total_batch=len(dataloader_train), debug_steps=config.REPORT_FREQ, accum_iter=config.TRAIN.ACCUM_ITER, - ) + mixup_fn=mixup_fn, + amp=config.AMP, + logger=logger) scheduler.step() logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + f"Train Loss: {train_loss:.4f}, " + @@ -312,9 +400,10 @@ def main(): val_loss, val_acc1, val_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=criterion, + criterion=criterion_val, total_batch=len(dataloader_val), - debug_steps=config.REPORT_FREQ) + debug_steps=config.REPORT_FREQ, + logger=logger) logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + f"Validation Loss: {val_loss:.4f}, " + f"Validation Acc@1: {val_acc1:.4f}, " + diff --git a/image_classification/ResMLP/mixup.py b/image_classification/ResMLP/mixup.py new file mode 100644 index 00000000..1d2db493 --- /dev/null +++ b/image_classification/ResMLP/mixup.py @@ -0,0 +1,225 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""mixup and cutmix for batch data""" +import numpy as np +import paddle + + +def rand_bbox(image_shape, lam, count=None): + """ CutMix bbox by lam value + Generate 1 random bbox by value lam. lam is the cut size rate. + The cut_size is computed by sqrt(1-lam) * image_size. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + count: int, number of bbox to generate + """ + image_h, image_w = image_shape[-2:] + cut_rate = np.sqrt(1. - lam) + cut_h = int(cut_rate * image_h) + cut_w = int(cut_rate * image_w) + + # get random bbox center + cy = np.random.randint(0, image_h, size=count) + cx = np.random.randint(0, image_w, size=count) + + # get bbox coords + bbox_x1 = np.clip(cx - cut_w // 2, 0, image_w) + bbox_y1 = np.clip(cy - cut_h // 2, 0, image_h) + bbox_x2 = np.clip(cx + cut_w // 2, 0, image_w) + bbox_y2 = np.clip(cy + cut_h // 2, 0, image_h) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # while in pytorch, it will return [] tensor + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def rand_bbox_minmax(image_shape, minmax, count=None): + """ CutMix bbox by min and max value + Generate 1 random bbox by min and max percentage values. + Minmax is a tuple/list of min and max percentage vlaues + applied to the image width and height. 
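
The cut size used by `rand_bbox` above follows the usual CutMix rule: the erased box has side `sqrt(1 - lam) * image_side`, so its area covers roughly `1 - lam` of the image. A small numeric check, with values chosen only for illustration:

```python
import numpy as np

lam = 0.75
image_h = image_w = 224
cut_rate = np.sqrt(1.0 - lam)                                     # 0.5
cut_h, cut_w = int(cut_rate * image_h), int(cut_rate * image_w)   # 112, 112
print(cut_h * cut_w / (image_h * image_w))                        # ~0.25, i.e. 1 - lam
```
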
+ + Args: + image_shape: tuple/list, image height and width + minmax: tuple/list, min and max percentage values of image size + count: int, number of bbox to generate + """ + assert len(minmax) == 2 + image_h, image_w = image_shape[-2:] + min_ratio = minmax[0] + max_ratio = minmax[1] + cut_h = np.random.randint(int(image_h * min_ratio), int(image_h * max_ratio), size=count) + cut_w = np.random.randint(int(image_w * min_ratio), int(image_w * max_ratio), size=count) + + bbox_x1 = np.random.randint(0, image_w - cut_w, size=count) + bbox_y1 = np.random.randint(0, image_h - cut_h, size=count) + bbox_x2 = bbox_x1 + cut_w + bbox_y2 = bbox_y1 + cut_h + + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def cutmix_generate_bbox_adjust_lam(image_shape, lam, minmax=None, correct_lam=True, count=None): + """Generate bbox and apply correction for lambda + If the mimmax is None, apply the standard cutmix by lam value, + If the minmax is set, apply the cutmix by min and max percentage values. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + minmax: tuple/list, min and max percentage values of image size + correct_lam: bool, if True, correct the lam value by the generated bbox + count: int, number of bbox to generate + """ + if minmax is not None: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox_minmax(image_shape, minmax, count) + else: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox(image_shape, lam, count) + + if correct_lam or minmax is not None: + image_h, image_w = image_shape[-2:] + bbox_area = (bbox_y2 - bbox_y1) * (bbox_x2 - bbox_x1) + lam = 1. - bbox_area / float(image_h * image_w) + return (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam + + +def one_hot(x, num_classes, on_value=1., off_value=0.): + """ Generate one-hot vector for label smoothing + Args: + x: tensor, contains label/class indices + num_classes: int, num of classes (len of the one-hot vector) + on_value: float, the vector value at label index, default=1. + off_value: float, the vector value at non-label indices, default=0. + Returns: + one_hot: tensor, tensor with on value at label index and off value + at non-label indices. + """ + x = x.reshape_([-1, 1]) + x_smoothed = paddle.full((x.shape[0], num_classes), fill_value=off_value) + for i in range(x.shape[0]): + x_smoothed[i, x[i]] = on_value + return x_smoothed + + +def mixup_one_hot(label, num_classes, lam=1., smoothing=0.): + """ mixup and label smoothing in batch + label smoothing is firstly applied, then + mixup is applied by mixing the bacth and its flip, + with a mixup rate. + + Args: + label: tensor, label tensor with shape [N], contains the class indices + num_classes: int, num of all classes + lam: float, mixup rate, default=1.0 + smoothing: float, label smoothing rate + """ + off_value = smoothing / num_classes + on_value = 1. 
- smoothing + off_value + y1 = one_hot(label, num_classes, on_value, off_value) + y2 = one_hot(label.flip(axis=[0]), num_classes, on_value, off_value) + return y2 * (1 - lam) + y1 * lam + + +class Mixup: + """Mixup class + Args: + mixup_alpha: float, mixup alpha for beta distribution, default=1.0, + cutmix_alpha: float, cutmix alpha for beta distribution, default=0.0, + cutmix_minmax: list/tuple, min and max value for cutmix ratio, default=None, + prob: float, if random prob < prob, do not use mixup, default=1.0, + switch_prob: float, prob of switching mixup and cutmix, default=0.5, + mode: string, mixup up, now only 'batch' is supported, default='batch', + correct_lam: bool, if True, apply correction of lam, default=True, + label_smoothing: float, label smoothing rate, default=0.1, + num_classes: int, num of classes, default=1000 + """ + def __init__(self, + mixup_alpha=1.0, + cutmix_alpha=0.0, + cutmix_minmax=None, + prob=1.0, + switch_prob=0.5, + mode='batch', + correct_lam=True, + label_smoothing=0.1, + num_classes=1000): + self.mixup_alpha = mixup_alpha + self.cutmix_alpha = cutmix_alpha + self.cutmix_minmax = cutmix_minmax + if cutmix_minmax is not None: + assert len(cutmix_minmax) == 2 + self.cutmix_alpha = 1.0 + self.mix_prob = prob + self.switch_prob = switch_prob + self.label_smoothing = label_smoothing + self.num_classes = num_classes + self.mode = mode + self.correct_lam = correct_lam + assert mode == 'batch', 'Now only batch mode is supported!' + + def __call__(self, x, target): + assert x.shape[0] % 2 == 0, "Batch size should be even" + lam = self._mix_batch(x) + target = mixup_one_hot(target, self.num_classes, lam, self.label_smoothing) + return x, target + + def get_params(self): + """Decide to use cutmix or regular mixup by sampling and + sample lambda for mixup + """ + lam = 1. + use_cutmix = False + use_mixup = np.random.rand() < self.mix_prob + if use_mixup: + if self.mixup_alpha > 0. and self.cutmix_alpha > 0.: + use_cutmix = np.random.rand() < self.switch_prob + alpha = self.cutmix_alpha if use_cutmix else self.mixup_alpha + lam_mix = np.random.beta(alpha, alpha) + elif self.mixup_alpha == 0. and self.cutmix_alpha > 0.: + use_cutmix=True + lam_mix = np.random.beta(self.cutmix_alpha, self.cutmix_alpha) + elif self.mixup_alpha > 0. 
and self.cutmix_alpha == 0.: + lam_mix = np.random.beta(self.mixup_alpha, self.mixup_alpha) + else: + raise ValueError('mixup_alpha and cutmix_alpha cannot be all 0') + lam = float(lam_mix) + return lam, use_cutmix + + def _mix_batch(self, x): + """mixup/cutmix by adding batch data and its flipped version""" + lam, use_cutmix = self.get_params() + if lam == 1.: + return lam + if use_cutmix: + (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam = cutmix_generate_bbox_adjust_lam( + x.shape, + lam, + minmax=self.cutmix_minmax, + correct_lam=self.correct_lam) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # but in pytorch, it will return [] tensor without errors + if int(bbox_x1) != int(bbox_x2) and int(bbox_y1) != int(bbox_y2): + x[:, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] = x.flip(axis=[0])[ + :, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] + else: + x_flipped = x.flip(axis=[0]) + x_flipped = x_flipped * (1 - lam) + x.set_value(x * (lam) + x_flipped) + return lam diff --git a/image_classification/ResMLP/port_weights/__init__.py b/image_classification/ResMLP/port_weights/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/image_classification/ResMLP/random_erasing.py b/image_classification/ResMLP/random_erasing.py new file mode 100644 index 00000000..31eea465 --- /dev/null +++ b/image_classification/ResMLP/random_erasing.py @@ -0,0 +1,118 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Random Erasing for image tensor""" + +import random +import math +import paddle + + +def _get_pixels(per_pixel, rand_color, patch_size, dtype="float32"): + if per_pixel: + return paddle.normal(shape=patch_size).astype(dtype) + if rand_color: + return paddle.normal(shape=(patch_size[0], 1, 1)).astype(dtype) + return paddle.zeros((patch_size[0], 1, 1)).astype(dtype) + + +class RandomErasing(object): + """ + Args: + prob: probability of performing random erasing + min_area: Minimum percentage of erased area wrt input image area + max_area: Maximum percentage of erased area wrt input image area + min_aspect: Minimum aspect ratio of earsed area + max_aspect: Maximum aspect ratio of earsed area + mode: pixel color mode, in ['const', 'rand', 'pixel'] + 'const' - erase block is constant valued 0 for all channels + 'rand' - erase block is valued random color (same per-channel) + 'pixel' - erase block is vauled random color per pixel + min_count: Minimum # of ereasing blocks per image. + max_count: Maximum # of ereasing blocks per image. 
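
Conceptually, the batch-mode `Mixup` defined above blends each batch with its flipped copy using a single `lam`, and mixes the smoothed one-hot targets the same way. A minimal usage sketch mirroring how the training scripts call it (constructor values are the defaults from `config.py`; shapes are illustrative):

```python
import paddle
from mixup import Mixup  # the class defined above

mixup_fn = Mixup(mixup_alpha=0.8, cutmix_alpha=1.0, prob=1.0,
                 switch_prob=0.5, mode='batch',
                 label_smoothing=0.1, num_classes=1000)

images = paddle.rand([8, 3, 224, 224])           # batch size must be even
labels = paddle.randint(0, 1000, [8])            # class indices, shape [N]
images, soft_targets = mixup_fn(images, labels)  # soft_targets: [8, 1000]
```
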
Area per box is scaled by count + per-image count is randomly chosen between min_count to max_count + """ + def __init__(self, prob=0.5, min_area=0.02, max_area=1/3, min_aspect=0.3, max_aspect=None, + mode='const', min_count=1, max_count=None, num_splits=0): + self.prob = prob + self.min_area = min_area + self.max_area = max_area + max_aspect = max_aspect or 1 / min_aspect + self.log_aspect_ratio = (math.log(min_aspect), math.log(max_aspect)) + self.min_count = min_count + self.max_count = max_count or min_count + self.num_splits = num_splits + mode = mode.lower() + self.rand_color = False + self.per_pixel = False + if mode == "rand": + self.rand_color = True + elif mode == "pixel": + self.per_pixel = True + else: + assert not mode or mode == "const" + + def _erase(self, img, chan, img_h, img_w, dtype): + if random.random() > self.prob: + return + area = img_h * img_w + count = self.min_count if self.min_count == self.max_count else \ + random.randint(self.min_count, self.max_count) + for _ in range(count): + for attempt in range(10): + target_area = random.uniform(self.min_area, self.max_area) * area / count + aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio)) + h = int(round(math.sqrt(target_area * aspect_ratio))) + w = int(round(math.sqrt(target_area / aspect_ratio))) + if w < img_w and h < img_h: + top = random.randint(0, img_h - h) + left = random.randint(0, img_w - w) + img[:, top:top+h, left:left+w] = _get_pixels( + self.per_pixel, self.rand_color, (chan, h, w), + dtype=dtype) + break + + def __call__(self, input): + if len(input.shape) == 3: + self._erase(input, *input.shape, input.dtype) + else: + batch_size, chan, img_h, img_w = input.shape + batch_start = batch_size // self.num_splits if self.num_splits > 1 else 0 + for i in range(batch_start, batch_size): + self._erase(input[i], chan, img_h, img_w, input.dtype) + return input + + + +#def main(): +# re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='rand') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='const') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='pixel') +# import PIL.Image as Image +# import numpy as np +# paddle.set_device('cpu') +# img = paddle.to_tensor(np.asarray(Image.open('./lenna.png'))).astype('float32') +# img = img / 255.0 +# img = paddle.transpose(img, [2, 0, 1]) +# new_img = re(img) +# new_img = new_img * 255.0 +# new_img = paddle.transpose(new_img, [1, 2, 0]) +# new_img = new_img.cpu().numpy() +# new_img = Image.fromarray(new_img.astype('uint8')) +# new_img.save('./res.png') +# +# +# +#if __name__ == "__main__": +# main() diff --git a/image_classification/ResMLP/resmlp.py b/image_classification/ResMLP/resmlp.py index 2f83ea9a..9ea3f200 100644 --- a/image_classification/ResMLP/resmlp.py +++ b/image_classification/ResMLP/resmlp.py @@ -1,3 +1,21 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +""" +Implement MLP Class for ResMLP +""" + import math import copy import paddle @@ -209,5 +227,5 @@ def build_res_mlp(config): embed_dim=config.MODEL.MIXER.HIDDEN_SIZE, mlp_ratio=4, dropout=config.MODEL.DROPOUT, - droppath=config.MODEL.DROPPATH) + droppath=config.MODEL.DROP_PATH) return model diff --git a/image_classification/ResMLP/run_train.sh b/image_classification/ResMLP/run_train.sh index 8ac87545..f4a78fb1 100644 --- a/image_classification/ResMLP/run_train.sh +++ b/image_classification/ResMLP/run_train.sh @@ -1,6 +1,7 @@ -CUDA_VISIBLE_DEVICES=7 \ +CUDA_VISIBLE_DEVICES=0 \ python main_single_gpu.py \ -cfg='./configs/resmlp_24_224.yaml' \ -dataset='imagenet2012' \ --batch_size=32 \ +-batch_size=8 \ -data_path='/dataset/imagenet' \ +-amp diff --git a/image_classification/ResMLP/run_train_multi.sh b/image_classification/ResMLP/run_train_multi.sh index 21b8f546..f767b45d 100644 --- a/image_classification/ResMLP/run_train_multi.sh +++ b/image_classification/ResMLP/run_train_multi.sh @@ -1,7 +1,7 @@ -CUDA_VISIBLE_DEVICES=4,5,6,7 \ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ python main_multi_gpu.py \ -cfg='./configs/resmlp_24_224.yaml' \ -dataset='imagenet2012' \ --batch_size=32 \ +-batch_size=8 \ -data_path='/dataset/imagenet' \ --ngpus=4 +-amp diff --git a/image_classification/ResMLP/transforms.py b/image_classification/ResMLP/transforms.py new file mode 100644 index 00000000..5a046912 --- /dev/null +++ b/image_classification/ResMLP/transforms.py @@ -0,0 +1,14 @@ +import random +import paddle +import paddle.nn +import paddle.vision.transforms as T + + +class RandomHorizontalFlip(): + def __init__(self, p=0.5): + self.p = p + + def __call__(self, image): + if random.random() < self.p: + return T.hflip(image) + return image diff --git a/image_classification/Shuffle_Transformer/.config.py.swp b/image_classification/Shuffle_Transformer/.config.py.swp deleted file mode 100644 index e144ebf1..00000000 Binary files a/image_classification/Shuffle_Transformer/.config.py.swp and /dev/null differ diff --git a/image_classification/Shuffle_Transformer/README.md b/image_classification/Shuffle_Transformer/README.md index 302b7483..108f1fa3 100644 --- a/image_classification/Shuffle_Transformer/README.md +++ b/image_classification/Shuffle_Transformer/README.md @@ -14,14 +14,15 @@ This implementation is developed by [PaddleViT](https://github.com/BR-IDL/Paddle

### Update -Update (2021-08-11): Code is released and ported weights are uploaded. +- Update (2021-08-11): Model FLOPs and # params are uploaded. +- Update (2021-08-11): Code is released and ported weights are uploaded. ## Models Zoo -| Model | Acc@1 | Acc@5 | Image Size | Crop_pct | Interpolation | Link | -|--------------------------------|-------|-------|------------|----------|---------------|--------------| -| shuffle_vit_tiny_patch4_window7| 82.39 | 96.05 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1ffJ-tG_CGVXztPEPQMaT_lUoc4hxFy__/view?usp=sharing)/[baidu](https://pan.baidu.com/s/19DhlLIFyPGOWtyq_c83ZGQ)(8a1i) | -| shuffle_vit_small_patch4_window7| 83.53 | 96.57 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1du9H0SKr0QH9GQjhWDOXOnhpSVpfbb8X/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1rM2J8BVwxQ3kRZoHngwNZA)(xwh3) | -| shuffle_vit_base_patch4_window7| 83.95 | 96.91 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1sYh808AyTG3-_qv6nfN6gCmyagsNAE6q/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1fks_IYDdnXdAkCFuYHW_Nw)(1gsr) | +| Model | Acc@1 | Acc@5 | #Params | FLOPs | Image Size | Crop_pct | Interpolation | Link | +|-------------------------------|-------|-------|---------|--------|------------|----------|---------------|--------------| +| shuffle_vit_tiny | 82.39 | 96.05 | 28.5M | 4.6G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1ffJ-tG_CGVXztPEPQMaT_lUoc4hxFy__/view?usp=sharing)/[baidu](https://pan.baidu.com/s/19DhlLIFyPGOWtyq_c83ZGQ)(8a1i) | +| shuffle_vit_small | 83.53 | 96.57 | 50.1M | 8.8G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1du9H0SKr0QH9GQjhWDOXOnhpSVpfbb8X/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1rM2J8BVwxQ3kRZoHngwNZA)(xwh3) | +| shuffle_vit_base | 83.95 | 96.91 | 88.4M | 15.5G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1sYh808AyTG3-_qv6nfN6gCmyagsNAE6q/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1fks_IYDdnXdAkCFuYHW_Nw)(1gsr) | > *The results are evaluated on ImageNet2012 validation set. @@ -66,8 +67,8 @@ from shuffle_transformer import build_shuffle_transformer as build_model config = get_config('./configs/shuffle_vit_base_patch4_window7_224.yaml') # build model model = build_model(config) -# load pretrained weights, .pdparams is NOT needed -model_state_dict = paddle.load('./shuffle_vit_base_patch4_window7_224') +# load pretrained weights +model_state_dict = paddle.load('./shuffle_vit_base_patch4_window7_224.pdparams') model.set_dict(model_state_dict) ``` @@ -80,12 +81,12 @@ or ```shell CUDA_VISIBLE_DEVICES=0 \ python main_single_gpu.py \ - -cfg='./configs/shuffle_vit_base_patch4_window7_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/shuffle_vit_base_patch4_window7_224.yaml \ + -dataset=imagenet2012 \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/val \ -eval \ - -pretrained='./shuffle_vit_base_patch4_window7_224' + -pretrained=/path/to/pretrained/model/shuffle_vit_base_patch4_window7_224 # .pdparams is NOT needed ```
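As a quick sanity check before running the full evaluation script, a single image can be classified directly in Python. This is a minimal sketch that reuses the loading snippet above and the validation-style preprocessing used elsewhere in this PR (bicubic resize to 256 for a 0.875 crop ratio, center crop, ImageNet mean/std); `./demo.jpg` is a placeholder image path.

```python
import paddle
import paddle.nn.functional as F
from PIL import Image
from paddle.vision import transforms
from config import get_config
from shuffle_transformer import build_shuffle_transformer as build_model

# build the model and load the ported weights (paths as in the snippet above)
config = get_config('./configs/shuffle_vit_base_patch4_window7_224.yaml')
model = build_model(config)
model.set_dict(paddle.load('./shuffle_vit_base_patch4_window7_224.pdparams'))
model.eval()

# validation-style preprocessing: resize -> center crop -> to tensor -> normalize
val_transforms = transforms.Compose([
    transforms.Resize(256, interpolation='bicubic'),   # 224 / 0.875 crop_pct
    transforms.CenterCrop((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = val_transforms(Image.open('./demo.jpg').convert('RGB'))  # [3, 224, 224]
with paddle.no_grad():
    logits = model(image.unsqueeze(0))                           # [1, num_classes]
prob = F.softmax(logits, axis=-1)
print('top-1 class id:', int(paddle.argmax(prob, axis=-1)[0]))
```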
@@ -102,12 +103,12 @@ or ```shell CUDA_VISIBLE_DEVICES=0,1,2,3 \ python main_multi_gpu.py \ - -cfg='./configs/shuffle_vit_base_patch4_window7_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/shuffle_vit_base_patch4_window7_224.yaml \ + -dataset=imagenet2012 \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/val \ -eval \ - -pretrained='./shuffle_vit_base_patch4_window7_224' + -pretrained=/path/to/pretrained/model/shuffle_vit_base_patch4_window7_224 # .pdparams is NOT needed ```
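The #Params and FLOPs columns added to the table above can be checked locally with Paddle's built-in profiler helper. This is a sketch assuming `paddle.flops` is available in your installed PaddlePaddle version and that a single 3x224x224 input matches the table's image size.

```python
import paddle
from config import get_config
from shuffle_transformer import build_shuffle_transformer as build_model

config = get_config('./configs/shuffle_vit_base_patch4_window7_224.yaml')
model = build_model(config)

# report the parameter count and FLOPs for one 224x224 RGB input
paddle.flops(model, [1, 3, 224, 224], print_detail=False)
```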
@@ -122,10 +123,10 @@ or ```shell CUDA_VISIBLE_DEVICES=0 \ python main_single_gpu.py \ - -cfg='./configs/shuffle_vit_base_patch4_window7_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/shuffle_vit_base_patch4_window7_224.yaml \ + -dataset=imagenet2012 \ -batch_size=32 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/train ``` @@ -143,10 +144,10 @@ or ```shell CUDA_VISIBLE_DEVICES=0,1,2,3 \ python main_multi_gpu.py \ - -cfg='./configs/shuffle_vit_base_patch4_window7_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/shuffle_vit_base_patch4_window7_224.yaml \ + -dataset=imagenet2012 \ -batch_size=32 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/train ``` diff --git a/image_classification/Shuffle_Transformer/__init__.py b/image_classification/Shuffle_Transformer/__init__.py new file mode 100644 index 00000000..e2cbd538 --- /dev/null +++ b/image_classification/Shuffle_Transformer/__init__.py @@ -0,0 +1 @@ +#init diff --git a/image_classification/Shuffle_Transformer/augment.py b/image_classification/Shuffle_Transformer/augment.py new file mode 100644 index 00000000..7a7f081c --- /dev/null +++ b/image_classification/Shuffle_Transformer/augment.py @@ -0,0 +1,285 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +"""Augmentation""" +""" Rand Augmentation """ +# reference: RandAugment: Practical automated data augmentation with a reduced search space +# https://arxiv.org/abs/1909.13719 + +""" Auto Augmentation """ +# reference: AutoAugment: Learning Augmentation Policies from Data +# https://arxiv.org/abs/1805.09501 + +import random +import numpy as np +from PIL import Image, ImageEnhance, ImageOps + + +def auto_augment_policy_original(): + """25 types of augment policies in original paper""" + policy = [ + [('Posterize', 0.4, 8), ('Rotate', 0.6, 9)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + [('Posterize', 0.6, 7), ('Posterize', 0.6, 6)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Equalize', 0.4, 4), ('Rotate', 0.8, 8)], + [('Solarize', 0.6, 3), ('Equalize', 0.6, 7)], + [('Posterize', 0.8, 5), ('Equalize', 1.0, 2)], + [('Rotate', 0.2, 3), ('Solarize', 0.6, 8)], + [('Equalize', 0.6, 8), ('Posterize', 0.4, 6)], + [('Rotate', 0.8, 8), ('Color', 0.4, 0)], + [('Rotate', 0.4, 9), ('Equalize', 0.6, 2)], + [('Equalize', 0.0, 7), ('Equalize', 0.8, 8)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Rotate', 0.8, 8), ('Color', 1.0, 2)], + [('Color', 0.8, 8), ('Solarize', 0.8, 7)], + [('Sharpness', 0.4, 7), ('Invert', 0.6, 8)], + [('ShearX', 0.6, 5), ('Equalize', 1.0, 9)], + [('Color', 0.4, 0), ('Equalize', 0.6, 3)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + ] + policy = [[SubPolicy(*args) for args in subpolicy] for subpolicy in policy] + return policy + + +def rand_augment_policy_original(magnitude_idx=9): + """ + 14 types of augment policies in original paper + Args: + magnitude_idx: M + """ + policy = [ + ('Posterize', 1, magnitude_idx), ('Rotate', 1, magnitude_idx), + ('Solarize', 1, magnitude_idx), ('AutoContrast', 1, magnitude_idx), + ('Equalize', 1, magnitude_idx), ('Contrast', 1, magnitude_idx), + ('Color', 1, magnitude_idx), ('Invert', 1, magnitude_idx), + ('Sharpness', 1, magnitude_idx), ('Brightness', 1, magnitude_idx), + ('ShearX', 1, magnitude_idx), ('ShearY', 1, magnitude_idx), + ('TranslateX', 1, magnitude_idx), ('TranslateY', 1, magnitude_idx), + ] + policy = [SubPolicy(*args) for args in policy] + return policy + + +class AutoAugment(): + """Auto Augment + Randomly choose a tuple of augment ops from a list of policy + Then apply the tuple of augment ops to input image + + Examples: + policy = auto_augment_policy_original() + augment = AutoAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy): + self.policy = policy + + def __call__(self, image, policy_idx=None): + if policy_idx is None: + policy_idx = random.randint(0, len(self.policy) - 1) + + sub_policy = self.policy[policy_idx] + for op in sub_policy: + image = op(image) + return image + + +class RandAugment(): + """Rand Augment + Randomly choose N augment ops from a list of K policies + Then apply the N ops to input image + + Examples: + policy = rand_augment_policy_original(magnitude_idx) + augment = RandAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy, num_layers=2): + """ + Args: + policy: list of SubPolicy + num_layers: int + """ + self.policy = policy + self.num_layers = num_layers + + def __call__(self, image): + selected_idx = 
np.random.choice(len(self.policy), self.num_layers) + + for policy_idx in selected_idx: + sub_policy = self.policy[policy_idx] + image = sub_policy(image) + return image + + +class SubPolicy: + """Subpolicy + Read augment name and magnitude, apply augment with probability + Args: + op_name: str, augment operation name + prob: float, if prob > random prob, apply augment + magnitude_idx: int, index of magnitude in preset magnitude ranges + """ + + def __init__(self, op_name, prob, magnitude_idx): + # ranges of operations' magnitude + ranges = { + 'ShearX': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'ShearY': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'TranslateX': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'TranslateY': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'Rotate': np.linspace(0, 30, 10), # [-30, 30] (by random negative) + 'Color': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Posterize': np.round(np.linspace(8, 4, 10), 0).astype(np.int), # [0, 4] + 'Solarize': np.linspace(256, 0, 10), # [0, 256] + 'Contrast': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Sharpness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Brightness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'AutoContrast': [0] * 10, # no range + 'Equalize': [0] * 10, # no range + 'Invert': [0] * 10, # no range + } + + # augmentation operations + # Lambda is not pickleable for DDP + # image_ops = { + # 'ShearX': lambda image, magnitude: shear_x(image, magnitude), + # 'ShearY': lambda image, magnitude: shear_y(image, magnitude), + # 'TranslateX': lambda image, magnitude: translate_x(image, magnitude), + # 'TranslateY': lambda image, magnitude: translate_y(image, magnitude), + # 'Rotate': lambda image, magnitude: rotate(image, magnitude), + # 'AutoContrast': lambda image, magnitude: auto_contrast(image, magnitude), + # 'Invert': lambda image, magnitude: invert(image, magnitude), + # 'Equalize': lambda image, magnitude: equalize(image, magnitude), + # 'Solarize': lambda image, magnitude: solarize(image, magnitude), + # 'Posterize': lambda image, magnitude: posterize(image, magnitude), + # 'Contrast': lambda image, magnitude: contrast(image, magnitude), + # 'Color': lambda image, magnitude: color(image, magnitude), + # 'Brightness': lambda image, magnitude: brightness(image, magnitude), + # 'Sharpness': lambda image, magnitude: sharpness(image, magnitude), + # } + image_ops = { + 'ShearX': shear_x, + 'ShearY': shear_y, + 'TranslateX': translate_x_relative, + 'TranslateY': translate_y_relative, + 'Rotate': rotate, + 'AutoContrast': auto_contrast, + 'Invert': invert, + 'Equalize': equalize, + 'Solarize': solarize, + 'Posterize': posterize, + 'Contrast': contrast, + 'Color': color, + 'Brightness': brightness, + 'Sharpness': sharpness, + } + + self.prob = prob + self.magnitude = ranges[op_name][magnitude_idx] + self.op = image_ops[op_name] + + def __call__(self, image): + if self.prob > random.random(): + image = self.op(image, self.magnitude) + return image + + +# PIL Image transforms +# https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.transform +def shear_x(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, factor, 0, 0, 1, 0), fillcolor=fillcolor) + + +def shear_y(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * 
random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, factor, 1, 0), fillcolor=fillcolor) + + +def translate_x_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, pixels, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, pixels), fillcolor=fillcolor) + + +def translate_x_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, magnitude, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, magnitude), fillcolor=fillcolor) + + +def rotate(image, magnitude): + rot = image.convert("RGBA").rotate(magnitude) + return Image.composite(rot, + Image.new('RGBA', rot.size, (128,) * 4), + rot).convert(image.mode) + + +def auto_contrast(image, magnitude=None): + return ImageOps.autocontrast(image) + + +def invert(image, magnitude=None): + return ImageOps.invert(image) + + +def equalize(image, magnitude=None): + return ImageOps.equalize(image) + + +def solarize(image, magnitude): + return ImageOps.solarize(image, magnitude) + + +def posterize(image, magnitude): + return ImageOps.posterize(image, magnitude) + + +def contrast(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Contrast(image).enhance(1 + magnitude) + + +def color(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Color(image).enhance(1 + magnitude) + + +def brightness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Brightness(image).enhance(1 + magnitude) + + +def sharpness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Sharpness(image).enhance(1 + magnitude) + diff --git a/image_classification/Shuffle_Transformer/config.py b/image_classification/Shuffle_Transformer/config.py index ab6f07bf..55931dcd 100644 --- a/image_classification/Shuffle_Transformer/config.py +++ b/image_classification/Shuffle_Transformer/config.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
@@ -19,6 +19,7 @@ """ + import os from yacs.config import CfgNode as CN import yaml @@ -34,7 +35,9 @@ _C.DATA.DATASET = 'imagenet2012' # dataset name _C.DATA.IMAGE_SIZE = 224 # input image size _C.DATA.CROP_PCT = 0.9 # input image scale ratio, scale is applied before centercrop in eval mode -_C.DATA.NUM_WORKERS = 2 # number of data loading threads +_C.DATA.NUM_WORKERS = 4 # number of data loading threads +_C.DATA.IMAGENET_MEAN = [0.485, 0.456, 0.406] # [0.5, 0.5, 0.5] +_C.DATA.IMAGENET_STD = [0.229, 0.224, 0.225] # [0.5, 0.5, 0.5] # model settings _C.MODEL = CN() @@ -68,10 +71,11 @@ _C.TRAIN.WARMUP_EPOCHS = 20 _C.TRAIN.WEIGHT_DECAY = 0.05 _C.TRAIN.BASE_LR = 0.001 -_C.TRAIN.WARMUP_START_LR = 0.0 -_C.TRAIN.END_LR = 0.0 -_C.TRAIN.GRAD_CLIP = 1.0 -_C.TRAIN.ACCUM_ITER = 2 +_C.TRAIN.WARMUP_START_LR = 5e-7 +_C.TRAIN.END_LR = 5e-6 +_C.TRAIN.GRAD_CLIP = 5.0 +_C.TRAIN.ACCUM_ITER = 1 +_C.TRAIN.LINEAR_SCALED_LR = None _C.TRAIN.LR_SCHEDULER = CN() _C.TRAIN.LR_SCHEDULER.NAME = 'warmupcosine' @@ -80,33 +84,38 @@ _C.TRAIN.LR_SCHEDULER.DECAY_RATE = 0.1 # only used in StepLRScheduler _C.TRAIN.OPTIMIZER = CN() -_C.TRAIN.OPTIMIZER.NAME = 'SGD' +_C.TRAIN.OPTIMIZER.NAME = 'AdamW' _C.TRAIN.OPTIMIZER.EPS = 1e-8 -_C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999) +_C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999) # for adamW _C.TRAIN.OPTIMIZER.MOMENTUM = 0.9 -# augmentation -_C.AUG = CN() -_C.AUG.COLOR_JITTER = 0.4 # color jitter factor -_C.AUG.AUTO_AUGMENT = 'rand-m9-mstd0.5-inc1' -_C.AUG.RE_PROB = 0.25 # random earse prob -_C.AUG.RE_MODE = 'pixel' # random earse mode -_C.AUG.RE_COUNT = 1 # random earse count -_C.AUG.MIXUP = 0.8 # mixup alpha, enabled if >0 -_C.AUG.CUTMIX = 1.0 # cutmix alpha, enabled if >0 -_C.AUG.CUTMIX_MINMAX = None # cutmix min/max ratio, overrides alpha -_C.AUG.MIXUP_PROB = 1.0 # prob of mixup or cutmix when either/both is enabled -_C.AUG.MIXUP_SWITCH_PROB = 0.5 # prob of switching cutmix when both mixup and cutmix enabled -_C.AUG.MIXUP_MODE = 'batch' #how to apply mixup/curmix params, per 'batch', 'pair', or 'elem' +# train augmentation +_C.TRAIN.MIXUP_ALPHA = 0.8 +_C.TRAIN.CUTMIX_ALPHA = 1.0 +_C.TRAIN.CUTMIX_MINMAX = None +_C.TRAIN.MIXUP_PROB = 1.0 +_C.TRAIN.MIXUP_SWITCH_PROB = 0.5 +_C.TRAIN.MIXUP_MODE = 'batch' + +_C.TRAIN.SMOOTHING = 0.1 +_C.TRAIN.COLOR_JITTER = 0.4 +_C.TRAIN.AUTO_AUGMENT = False #'rand-m9-mstd0.5-inc1' +_C.TRAIN.RAND_AUGMENT = False + +_C.TRAIN.RANDOM_ERASE_PROB = 0.25 +_C.TRAIN.RANDOM_ERASE_MODE = 'pixel' +_C.TRAIN.RANDOM_ERASE_COUNT = 1 +_C.TRAIN.RANDOM_ERASE_SPLIT = False # misc _C.SAVE = "./output" _C.TAG = "default" -_C.SAVE_FREQ = 20 # freq to save chpt +_C.SAVE_FREQ = 1 # freq to save chpt _C.REPORT_FREQ = 50 # freq to logging info -_C.VALIDATE_FREQ = 20 # freq to do validation -_C.SEED = 0 +_C.VALIDATE_FREQ = 10 # freq to do validation +_C.SEED = 42 _C.EVAL = False # run evaluation only +_C.AMP = False _C.LOCAL_RANK = 0 _C.NGPUS = -1 @@ -124,6 +133,7 @@ def _update_config_from_file(config, cfg_file): config.merge_from_file(cfg_file) config.freeze() + def update_config(config, args): """Update config by ArgumentParser Args: @@ -140,8 +150,12 @@ def update_config(config, args): config.DATA.BATCH_SIZE = args.batch_size if args.image_size: config.DATA.IMAGE_SIZE = args.image_size + if args.num_classes: + config.MODEL.NUM_CLASSES = args.num_classes if args.data_path: config.DATA.DATA_PATH = args.data_path + if args.output is not None: + config.SAVE = args.output if args.ngpus: config.NGPUS = args.ngpus if args.eval: @@ -153,6 +167,11 @@ def update_config(config, args): 
config.MODEL.RESUME = args.resume if args.last_epoch: config.TRAIN.LAST_EPOCH = args.last_epoch + if args.amp: # only during training + if config.EVAL is True: + config.AMP = False + else: + config.AMP = True #config.freeze() return config diff --git a/image_classification/Shuffle_Transformer/datasets.py b/image_classification/Shuffle_Transformer/datasets.py index 78a3db09..6406193a 100644 --- a/image_classification/Shuffle_Transformer/datasets.py +++ b/image_classification/Shuffle_Transformer/datasets.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -19,8 +19,19 @@ import os import math -from paddle.io import Dataset, DataLoader, DistributedBatchSampler -from paddle.vision import transforms, datasets, image_load +from paddle.io import Dataset +from paddle.io import DataLoader +from paddle.io import DistributedBatchSampler +from paddle.vision import transforms +from paddle.vision import datasets +from paddle.vision import image_load +from augment import auto_augment_policy_original +from augment import AutoAugment +from augment import rand_augment_policy_original +from augment import RandAugment +from transforms import RandomHorizontalFlip +from random_erasing import RandomErasing + class ImageNet2012Dataset(Dataset): """Build ImageNet2012 dataset @@ -80,13 +91,36 @@ def get_train_transforms(config): transforms_train: training transforms """ - transforms_train = transforms.Compose([ + aug_op_list = [] + # STEP1: random crop and resize + aug_op_list.append( transforms.RandomResizedCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE), - scale=(0.05, 1.0)), - transforms.ToTensor(), - #transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]), - transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]), - ]) + scale=(0.05, 1.0), interpolation='bicubic')) + # STEP2: auto_augment or color jitter + if config.TRAIN.AUTO_AUGMENT: + policy = auto_augment_policy_original() + auto_augment = AutoAugment(policy) + aug_op_list.append(auto_augment) + elif config.TRAIN.RAND_AUGMENT: + policy = rand_augment_policy_original() + rand_augment = RandAugment(policy) + aug_op_list.append(rand_augment) + else: + jitter = (float(config.TRAIN.COLOR_JITTER), ) * 3 + aug_op_list.append(transforms.ColorJitter(*jitter)) + # STEP3: other ops + aug_op_list.append(transforms.ToTensor()) + aug_op_list.append(transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, + std=config.DATA.IMAGENET_STD)) + # STEP4: random erasing + if config.TRAIN.RANDOM_ERASE_PROB > 0.: + random_erasing = RandomErasing(prob=config.TRAIN.RANDOM_ERASE_PROB, + mode=config.TRAIN.RANDOM_ERASE_MODE, + max_count=config.TRAIN.RANDOM_ERASE_COUNT, + num_splits=config.TRAIN.RANDOM_ERASE_SPLIT) + aug_op_list.append(random_erasing) + # Final: compose transforms and return + transforms_train = transforms.Compose(aug_op_list) return transforms_train @@ -106,11 +140,10 @@ def get_val_transforms(config): scale_size = int(math.floor(config.DATA.IMAGE_SIZE / config.DATA.CROP_PCT)) transforms_val = transforms.Compose([ - transforms.Resize(scale_size, 'bicubic'), + transforms.Resize(scale_size, interpolation='bicubic'), transforms.CenterCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE)), transforms.ToTensor(), - #transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]), - transforms.Normalize(mean=[0.485, 0.456, 
0.406], std=[0.229, 0.224, 0.225]), + transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD), ]) return transforms_val diff --git a/image_classification/Shuffle_Transformer/losses.py b/image_classification/Shuffle_Transformer/losses.py new file mode 100644 index 00000000..082467a3 --- /dev/null +++ b/image_classification/Shuffle_Transformer/losses.py @@ -0,0 +1,123 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" Implement Loss functions """ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class LabelSmoothingCrossEntropyLoss(nn.Layer): + """ cross entropy loss for label smoothing + Args: + smoothing: float, smoothing rate + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, target label with shape [N] + Return: + loss: float, cross entropy loss value + """ + def __init__(self, smoothing=0.1): + super().__init__() + assert 0 <= smoothing < 1.0 + self.smoothing = smoothing + self.confidence = 1 - smoothing + + def forward(self, x, target): + log_probs = F.log_softmax(x) # [N, num_classes] + # target_index is used to get prob for each of the N samples + target_index = paddle.zeros([x.shape[0], 2], dtype='int64') # [N, 2] + target_index[:, 0] = paddle.arange(x.shape[0]) + target_index[:, 1] = target + + nll_loss = -log_probs.gather_nd(index=target_index) # index: [N] + smooth_loss = -log_probs.mean(axis=-1) + loss = self.confidence * nll_loss + self.smoothing * smooth_loss + return loss.mean() + + +class SoftTargetCrossEntropyLoss(nn.Layer): + """ cross entropy loss for soft target + Args: + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, soft target with shape [N, num_classes] + Returns: + loss: float, the mean loss value + """ + def __init__(self): + super().__init__() + + def forward(self, x, target): + loss = paddle.sum(-target * F.log_softmax(x, axis=-1), axis=-1) + return loss.mean() + + +class DistillationLoss(nn.Layer): + """Distillation loss function + This layer includes the orginal loss (criterion) and a extra + distillation loss (criterion), which computes the loss with + different type options, between current model and + a teacher model as its supervision. 
+ + Args: + base_criterion: nn.Layer, the original criterion + teacher_model: nn.Layer, the teacher model as supervision + distillation_type: str, one of ['none', 'soft', 'hard'] + alpha: float, ratio of base loss (* (1-alpha)) + and distillation loss (* alpha) + tau: float, temperature in distillation + """ + def __init__(self, + base_criterion, + teacher_model, + distillation_type, + alpha, + tau): + super().__init__() + assert distillation_type in ['none', 'soft', 'hard'] + self.base_criterion = base_criterion + self.teacher_model = teacher_model + self.type = distillation_type + self.alpha = alpha + self.tau = tau + + def forward(self, inputs, outputs, targets): + """ + Args: + inputs: tensor, the original model inputs + outputs: tensor, the outputs of the model + outputs_kd: tensor, the distillation outputs of the model, + this is usually obtained by a separate branch + in the last layer of the model + targets: tensor, the labels for the base criterion + """ + outputs, outputs_kd = outputs[0], outputs[1] + base_loss = self.base_criterion(outputs, targets) + if self.type == 'none': + return base_loss + + with paddle.no_grad(): + teacher_outputs = self.teacher_model(inputs) + + if self.type == 'soft': + distillation_loss = F.kl_div( + F.log_softmax(outputs_kd / self.tau, axis=1), + F.log_softmax(teacher_outputs / self.tau, axis=1), + reduction='sum') * (self.tau * self.tau) / outputs_kd.numel() + elif self.type == 'hard': + distillation_loss = F.cross_entropy(outputs_kd, teacher_outputs.argmax(axis=1)) + + loss = base_loss * (1 - self.alpha) + distillation_loss * self.alpha + return loss + + diff --git a/image_classification/Shuffle_Transformer/main_multi_gpu.py b/image_classification/Shuffle_Transformer/main_multi_gpu.py index 4dbe0ccb..890d2ada 100644 --- a/image_classification/Shuffle_Transformer/main_multi_gpu.py +++ b/image_classification/Shuffle_Transformer/main_multi_gpu.py @@ -12,7 +12,7 @@ # See the License for the specific language governing permissions and # limitations under the License.
-"""Shuffle Transformer training/validation using multiple GPU """ +"""Swin training/validation using multiple GPU """ import sys import os @@ -25,52 +25,56 @@ import paddle.nn as nn import paddle.nn.functional as F import paddle.distributed as dist -from datasets import get_dataloader, get_dataset -from shuffle_transformer import build_shuffle_transformer as build_model +from datasets import get_dataloader +from datasets import get_dataset from utils import AverageMeter from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn from config import get_config from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from shuffle_transformer import build_shuffle_transformer as build_model -parser = argparse.ArgumentParser('Shuffle Transformer') -parser.add_argument('-cfg', type=str, default=None) -parser.add_argument('-dataset', type=str, default=None) -parser.add_argument('-batch_size', type=int, default=None) -parser.add_argument('-image_size', type=int, default=None) -parser.add_argument('-data_path', type=str, default=None) -parser.add_argument('-ngpus', type=int, default=None) -parser.add_argument('-pretrained', type=str, default=None) -parser.add_argument('-resume', type=str, default=None) -parser.add_argument('-last_epoch', type=int, default=None) -parser.add_argument('-eval', action='store_true') -arguments = parser.parse_args() - - -log_format = "%(asctime)s %(message)s" -logging.basicConfig(stream=sys.stdout, level=logging.INFO, - format=log_format, datefmt="%m%d %I:%M:%S %p") - -# get default config -config = get_config() -# update config by arguments -config = update_config(config, arguments) - -# set output folder -if not config.EVAL: - config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) -else: - config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) - -if not os.path.exists(config.SAVE): - os.makedirs(config.SAVE, exist_ok=True) - -# set logging format -logger = logging.getLogger() -fh = logging.FileHandler(os.path.join(config.SAVE, 'log.txt')) -fh.setFormatter(logging.Formatter(log_format)) -logger.addHandler(fh) -logger.info(f'config= {config}') +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('Shuffle Transformer') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + 
logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger def train(dataloader, @@ -78,83 +82,152 @@ def train(dataloader, criterion, optimizer, epoch, + total_epochs, total_batch, debug_steps=100, - accum_iter=1): + accum_iter=1, + mixup_fn=None, + amp=False, + local_logger=None, + master_logger=None): """Training for one epoch Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion epoch: int, current epoch - total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info - accum_iter: int, num of iters for accumulating gradients + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None Returns: - train_loss_meter.avg - train_acc_meter.avg - train_time + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + master_train_loss_meter.avg: float, average loss on all processes/gpus + master_train_acc_meter.avg: float, average top1 accuracy on all processes/gpus + train_time: float, training time """ model.train() train_loss_meter = AverageMeter() train_acc_meter = AverageMeter() + master_train_loss_meter = AverageMeter() + master_train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) time_st = time.time() for batch_id, data in enumerate(dataloader): image = data[0] label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() - output = model(image) - loss = criterion(output, label) - - #NOTE: division may be needed depending on the loss function - # Here no division is needed: - # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' - # - #loss = loss / accum_iter + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() - loss.backward() + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) - if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): - optimizer.step() - optimizer.clear_grad() + batch_size 
= paddle.to_tensor(image.shape[0]) - pred = F.softmax(output) - acc = paddle.metric.accuracy(pred, label.unsqueeze(1)) + # sync from other gpus for overall loss and acc + master_loss = loss.clone() + master_acc = acc.clone() + master_batch_size = batch_size.clone() + dist.all_reduce(master_loss) + dist.all_reduce(master_acc) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc = master_acc / dist.get_world_size() + master_train_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_train_acc_meter.update(master_acc.numpy()[0], master_batch_size.numpy()[0]) - batch_size = image.shape[0] - train_loss_meter.update(loss.numpy()[0], batch_size) - train_acc_meter.update(acc.numpy()[0], batch_size) + train_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + train_acc_meter.update(acc.numpy()[0], batch_size.numpy()[0]) if batch_id % debug_steps == 0: - logger.info( - f"Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + - f"Step[{batch_id:04d}/{total_batch:04d}], " + - f"Avg Loss: {train_loss_meter.avg:.4f}, " + - f"Avg Acc: {train_acc_meter.avg:.4f}") + if local_logger: + local_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_train_loss_meter.avg:.4f}, " + + f"Avg Acc: {master_train_acc_meter.avg:.4f}") train_time = time.time() - time_st - return train_loss_meter.avg, train_acc_meter.avg, train_time - - -def validate(dataloader, model, criterion, total_batch, debug_steps=100): + return (train_loss_meter.avg, + train_acc_meter.avg, + master_train_loss_meter.avg, + master_train_acc_meter.avg, + train_time) + + +def validate(dataloader, + model, + criterion, + total_batch, + debug_steps=100, + local_logger=None, + master_logger=None): """Validation for whole dataset Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info + debug_steps: int, num of iters to log info, default: 100 + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None Returns: - val_loss_meter.avg - val_acc1_meter.avg - val_acc5_meter.avg - val_time + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + master_val_loss_meter.avg: float, average loss on all processes/gpus + master_val_acc1_meter.avg: float, average top1 accuracy on all processes/gpus + master_val_acc5_meter.avg: float, average top5 accuracy on all processes/gpus + val_time: float, validation time """ model.eval() val_loss_meter = AverageMeter() val_acc1_meter = AverageMeter() val_acc5_meter = AverageMeter() + master_val_loss_meter = AverageMeter() + master_val_acc1_meter = AverageMeter() + master_val_acc5_meter = AverageMeter() time_st = time.time() with paddle.no_grad(): @@ -169,56 +242,140 @@ def validate(dataloader, model, criterion, total_batch, debug_steps=100): acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) - 
dist.all_reduce(loss) - dist.all_reduce(acc1) - dist.all_reduce(acc5) - loss = loss / dist.get_world_size() - acc1 = acc1 / dist.get_world_size() - acc5 = acc5 / dist.get_world_size() - batch_size = paddle.to_tensor(image.shape[0]) - dist.all_reduce(batch_size) + + master_loss = loss.clone() + master_acc1 = acc1.clone() + master_acc5 = acc5.clone() + master_batch_size = batch_size.clone() + + dist.all_reduce(master_loss) + dist.all_reduce(master_acc1) + dist.all_reduce(master_acc5) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc1 = master_acc1 / dist.get_world_size() + master_acc5 = master_acc5 / dist.get_world_size() + + master_val_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc1_meter.update(master_acc1.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc5_meter.update(master_acc5.numpy()[0], master_batch_size.numpy()[0]) val_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) val_acc1_meter.update(acc1.numpy()[0], batch_size.numpy()[0]) val_acc5_meter.update(acc5.numpy()[0], batch_size.numpy()[0]) if batch_id % debug_steps == 0: - logger.info( - f"Val Step[{batch_id:04d}/{total_batch:04d}], " + - f"Avg Loss: {val_loss_meter.avg:.4f}, " + - f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + - f"Avg Acc@5: {val_acc5_meter.avg:.4f}") - + if local_logger: + local_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {master_val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {master_val_acc5_meter.avg:.4f}") val_time = time.time() - time_st - return val_loss_meter.avg, val_acc1_meter.avg, val_acc5_meter.avg, val_time + return (val_loss_meter.avg, + val_acc1_meter.avg, + val_acc5_meter.avg, + master_val_loss_meter.avg, + master_val_acc1_meter.avg, + master_val_acc5_meter.avg, + val_time) def main_worker(*args): - # 0. Preparation + # STEP 0: Preparation + config = args[0] dist.init_parallel_env() last_epoch = config.TRAIN.LAST_EPOCH - world_size = paddle.distributed.get_world_size() - local_rank = paddle.distributed.get_rank() - logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + world_size = dist.get_world_size() + local_rank = dist.get_rank() seed = config.SEED + local_rank paddle.seed(seed) np.random.seed(seed) random.seed(seed) - # 1. Create model + # logger for each process/gpu + local_logger = get_logger( + filename=os.path.join(config.SAVE, 'log_{}.txt'.format(local_rank)), + logger_name='local_logger') + # overall logger + if local_rank == 0: + master_logger = get_logger( + filename=os.path.join(config.SAVE, 'log.txt'), + logger_name='master_logger') + master_logger.info(f'\n{config}') + else: + master_logger = None + local_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + if local_rank == 0: + master_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + + # STEP 1: Create model model = build_model(config) model = paddle.DataParallel(model) - # 2. 
Create train and val dataloader - dataset_train, dataset_val = args[0], args[1] - dataloader_train = get_dataloader(config, dataset_train, 'train', True) + + # STEP 2: Create train and val dataloader + dataset_train, dataset_val = args[1], args[2] + # Create training dataloader + if not config.EVAL: + dataloader_train = get_dataloader(config, dataset_train, 'train', True) + total_batch_train = len(dataloader_train) + local_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + if local_rank == 0: + master_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + # Create validation dataloader dataloader_val = get_dataloader(config, dataset_val, 'test', True) - total_batch_train = len(dataloader_train) total_batch_val = len(dataloader_val) - logging.info(f'----- Total # of train batch (single gpu): {total_batch_train}') - logging.info(f'----- Total # of val batch (single gpu): {total_batch_val}') - # 3. Define criterion - criterion = nn.CrossEntropyLoss() - # 4. Define optimizer and lr_scheduler + local_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + if local_rank == 0: + master_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + scheduler = None if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, @@ -240,7 +397,9 @@ def main_worker(*args): gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, last_epoch=last_epoch) else: - logging.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + local_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") raise NotImplementedError(f"Unsupported 
Scheduler: {config.TRAIN.LR_SCHEDULER}.") if config.TRAIN.OPTIMIZER.NAME == "SGD": @@ -267,79 +426,124 @@ def main_worker(*args): weight_decay=config.TRAIN.WEIGHT_DECAY, epsilon=config.TRAIN.OPTIMIZER.EPS, grad_clip=clip, - #apply_decay_param_fun=get_exclude_from_weight_decay_fn(['pos_embed', 'cls_token']), + apply_decay_param_fun=get_exclude_from_weight_decay_fn([ + 'absolute_pos_embed', 'relative_position_bias_table']), ) else: - logging.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + local_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") - # 5. Load pretrained model / load resumt model and optimizer states + # STEP 6: Load pretrained model / load resumt model and optimizer states if config.MODEL.PRETRAINED: if (config.MODEL.PRETRAINED).endswith('.pdparams'): raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') model.set_dict(model_state) - logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + local_logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + if local_rank == 0: + master_logger.info( + f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") if config.MODEL.RESUME: - assert os.path.isfile(config.MODEL.RESUME+'.pdparams') is True - assert os.path.isfile(config.MODEL.RESUME+'.pdopt') is True - model_state = paddle.load(config.MODEL.RESUME+'.pdparams') + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') model.set_dict(model_state) opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') optimizer.set_state_dict(opt_state) - logger.info( - f"----- Resume Training: Load model and optmizer states from {config.MODEL.RESUME}") + local_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + if local_rank == 0: + master_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") - # 6. Validation + # STEP 7: Validation (eval mode) if config.EVAL: - logger.info('----- Start Validating') - val_loss, val_acc1, val_acc5, val_time = validate( + local_logger.info('----- Start Validating') + if local_rank == 0: + master_logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=criterion, + criterion=criterion_val, total_batch=total_batch_val, - debug_steps=config.REPORT_FREQ) - logger.info(f"Validation Loss: {val_loss:.4f}, " + - f"Validation Acc@1: {val_acc1:.4f}, " + - f"Validation Acc@5: {val_acc5:.4f}, " + - f"time: {val_time:.2f}") + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") return - # 6. 
Start training and validation - logging.info(f"Start training from epoch {last_epoch+1}.") + # STEP 8: Start training and validation (train mode) + local_logger.info(f"Start training from epoch {last_epoch+1}.") + if local_rank == 0: + master_logger.info(f"Start training from epoch {last_epoch+1}.") for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): # train - logging.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") - train_loss, train_acc, train_time = train(dataloader=dataloader_train, - model=model, - criterion=criterion, - optimizer=optimizer, - epoch=epoch, - total_batch=total_batch_train, - debug_steps=config.REPORT_FREQ, - accum_iter=config.TRAIN.ACCUM_ITER) + local_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + if local_rank == 0: + master_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + train_loss, train_acc, avg_loss, avg_acc, train_time = train( + dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=total_batch_train, + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + mixup_fn=mixup_fn, + amp=config.AMP, + local_logger=local_logger, + master_logger=master_logger) + scheduler.step() - logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + - f"Train Loss: {train_loss:.4f}, " + - f"Train Acc: {train_acc:.4f}, " + - f"time: {train_time:.2f}") + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {avg_loss:.4f}, " + + f"Train Acc: {avg_acc:.4f}, " + + f"time: {train_time:.2f}") + # validation if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: - logger.info(f'----- Validation after Epoch: {epoch}') - val_loss, val_acc1, val_acc5, val_time = validate( + local_logger.info(f'----- Validation after Epoch: {epoch}') + if local_rank == 0: + master_logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=criterion, + criterion=criterion_val, total_batch=total_batch_val, - debug_steps=config.REPORT_FREQ) - logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + - f"Validation Loss: {val_loss:.4f}, " + - f"Validation Acc@1: {val_acc1:.4f}, " + - f"Validation Acc@5: {val_acc5:.4f}, " + - f"time: {val_time:.2f}") + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") # model save if local_rank == 0: if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: @@ -347,15 +551,33 @@ def main_worker(*args): config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") paddle.save(model.state_dict(), model_path + '.pdparams') 
paddle.save(optimizer.state_dict(), model_path + '.pdopt') - logger.info(f"----- Save model: {model_path}.pdparams") - logger.info(f"----- Save optim: {model_path}.pdopt") + master_logger.info(f"----- Save model: {model_path}.pdparams") + master_logger.info(f"----- Save optim: {model_path}.pdopt") def main(): - dataset_train = get_dataset(config, mode='train') + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + + # get dataset and start DDP + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + else: + dataset_train = None dataset_val = get_dataset(config, mode='val') config.NGPUS = len(paddle.static.cuda_places()) if config.NGPUS == -1 else config.NGPUS - dist.spawn(main_worker, args=(dataset_train, dataset_val, ), nprocs=config.NGPUS) + dist.spawn(main_worker, args=(config, dataset_train, dataset_val, ), nprocs=config.NGPUS) if __name__ == "__main__": diff --git a/image_classification/Shuffle_Transformer/main_single_gpu.py b/image_classification/Shuffle_Transformer/main_single_gpu.py index bc77ef27..c21f55e2 100644 --- a/image_classification/Shuffle_Transformer/main_single_gpu.py +++ b/image_classification/Shuffle_Transformer/main_single_gpu.py @@ -1,3 +1,4 @@ + # Copyright (c) 2021 PPViT Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); @@ -26,53 +27,54 @@ import paddle.nn.functional as F from datasets import get_dataloader from datasets import get_dataset -from shuffle_transformer import build_shuffle_transformer as build_model from utils import AverageMeter from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn from config import get_config from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from shuffle_transformer import build_shuffle_transformer as build_model -parser = argparse.ArgumentParser('Shuffle Transformer') -parser.add_argument('-cfg', type=str, default=None) -parser.add_argument('-dataset', type=str, default=None) -parser.add_argument('-batch_size', type=int, default=None) -parser.add_argument('-image_size', type=int, default=None) -parser.add_argument('-data_path', type=str, default=None) -parser.add_argument('-ngpus', type=int, default=None) -parser.add_argument('-pretrained', type=str, default=None) -parser.add_argument('-resume', type=str, default=None) -parser.add_argument('-last_epoch', type=int, default=None) -parser.add_argument('-eval', action='store_true') -args = parser.parse_args() - - -log_format = "%(asctime)s %(message)s" -logging.basicConfig(stream=sys.stdout, level=logging.INFO, - format=log_format, datefmt="%m%d %I:%M:%S %p") - -# get default config -config = get_config() -# update config by arguments -config = update_config(config, args) - -# set output folder -if not config.EVAL: - config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) -else: - config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) - -#config.freeze() - -if not os.path.exists(config.SAVE): - 
os.makedirs(config.SAVE, exist_ok=True) - -# set logging format -logger = logging.getLogger() -fh = logging.FileHandler(os.path.join(config.SAVE, 'log.txt')) -fh.setFormatter(logging.Formatter(log_format)) -logger.addHandler(fh) -logger.info(f'config= {config}') +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('Shuffle Transformer') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger def train(dataloader, @@ -80,56 +82,82 @@ def train(dataloader, criterion, optimizer, epoch, + total_epochs, total_batch, debug_steps=100, - accum_iter=1): + accum_iter=1, + mixup_fn=None, + amp=False, + logger=None): """Training for one epoch Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion epoch: int, current epoch - total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info - accum_iter: int, num of iters for accumulating gradients + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + logger: logger for logging, default: None Returns: - train_loss_meter.avg - train_acc_meter.avg - train_time + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + train_time: float, training time """ model.train() train_loss_meter = AverageMeter() train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) time_st = time.time() for batch_id, data in enumerate(dataloader): image = data[0] label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = 
criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() - output = model(image) - loss = criterion(output, label) - - #NOTE: division may be needed depending on the loss function - # Here no division is needed: - # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' - #loss = loss / accum_iter - - loss.backward() - - if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): - optimizer.step() - optimizer.clear_grad() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() pred = F.softmax(output) - acc = paddle.metric.accuracy(pred, label.unsqueeze(1)) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) batch_size = image.shape[0] train_loss_meter.update(loss.numpy()[0], batch_size) train_acc_meter.update(acc.numpy()[0], batch_size) - if batch_id % debug_steps == 0: + if logger and batch_id % debug_steps == 0: logger.info( - f"Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + f"Step[{batch_id:04d}/{total_batch:04d}], " + f"Avg Loss: {train_loss_meter.avg:.4f}, " + f"Avg Acc: {train_acc_meter.avg:.4f}") @@ -138,19 +166,20 @@ def train(dataloader, return train_loss_meter.avg, train_acc_meter.avg, train_time -def validate(dataloader, model, criterion, total_batch, debug_steps=100): +def validate(dataloader, model, criterion, total_batch, debug_steps=100, logger=None): """Validation for whole dataset Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion - total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + logger: logger for logging, default: None Returns: - val_loss_meter.avg - val_acc1_meter.avg - val_acc5_meter.avg - val_time + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + val_time: float, valitaion time """ model.eval() val_loss_meter = AverageMeter() @@ -175,7 +204,7 @@ def validate(dataloader, model, criterion, total_batch, debug_steps=100): val_acc1_meter.update(acc1.numpy()[0], batch_size) val_acc5_meter.update(acc5.numpy()[0], batch_size) - if batch_id % debug_steps == 0: + if logger and batch_id % debug_steps == 0: logger.info( f"Val Step[{batch_id:04d}/{total_batch:04d}], " + f"Avg Loss: {val_loss_meter.avg:.4f}, " + @@ -187,24 +216,77 @@ def validate(dataloader, model, criterion, total_batch, debug_steps=100): def main(): - # 0. 
Preparation + # STEP 0: Preparation + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) last_epoch = config.TRAIN.LAST_EPOCH seed = config.SEED paddle.seed(seed) np.random.seed(seed) random.seed(seed) - #paddle.set_device('gpu:0') - # 1. Create model + logger = get_logger(filename=os.path.join(config.SAVE, 'log.txt')) + logger.info(f'\n{config}') + + # STEP 1: Create model model = build_model(config) - #model = paddle.DataParallel(model) - # 2. Create train and val dataloader - dataset_train = get_dataset(config, mode='train') + + # STEP 2: Create train and val dataloader + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + dataloader_train = get_dataloader(config, dataset_train, 'train', False) dataset_val = get_dataset(config, mode='val') - dataloader_train = get_dataloader(config, dataset_train, 'train', False) dataloader_val = get_dataloader(config, dataset_val, 'val', False) - # 3. Define criterion - criterion = nn.CrossEntropyLoss() - # 4. Define lr_scheduler + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + scheduler = None if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, @@ -213,8 +295,7 @@ def main(): end_lr=config.TRAIN.END_LR, warmup_epochs=config.TRAIN.WARMUP_EPOCHS, total_epochs=config.TRAIN.NUM_EPOCHS, - last_epoch=config.TRAIN.LAST_EPOCH, - ) + last_epoch=config.TRAIN.LAST_EPOCH) elif 
config.TRAIN.LR_SCHEDULER.NAME == "cosine": scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, T_max=config.TRAIN.NUM_EPOCHS, @@ -226,9 +307,9 @@ def main(): gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, last_epoch=last_epoch) else: - logging.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") - # 5. Define optimizer + if config.TRAIN.OPTIMIZER.NAME == "SGD": if config.TRAIN.GRAD_CLIP: clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) @@ -248,58 +329,70 @@ def main(): optimizer = paddle.optimizer.AdamW( parameters=model.parameters(), learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, - weight_decay=config.TRAIN.WEIGHT_DECAY, beta1=config.TRAIN.OPTIMIZER.BETAS[0], beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, epsilon=config.TRAIN.OPTIMIZER.EPS, - grad_clip=clip) + grad_clip=clip, + apply_decay_param_fun=get_exclude_from_weight_decay_fn([ + 'absolute_pos_embed', 'relative_position_bias_table']), + ) else: - logging.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") - # 6. Load pretrained model or load resume model and optimizer states + + # STEP 6: Load pretrained model or load resume model and optimizer states if config.MODEL.PRETRAINED: - assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') model.set_dict(model_state) logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") if config.MODEL.RESUME: - assert os.path.isfile(config.MODEL.RESUME+'.pdparams') is True - assert os.path.isfile(config.MODEL.RESUME+'.pdopt') is True - model_state = paddle.load(config.MODEL.RESUME+'.pdparams') + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') model.set_dict(model_state) opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') optimizer.set_state_dict(opt_state) logger.info( f"----- Resume: Load model and optmizer from {config.MODEL.RESUME}") - # 7. Validation + + # STEP 7: Validation (eval mode) if config.EVAL: logger.info('----- Start Validating') val_loss, val_acc1, val_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=criterion, + criterion=criterion_val, total_batch=len(dataloader_val), - debug_steps=config.REPORT_FREQ) + debug_steps=config.REPORT_FREQ, + logger=logger) logger.info(f"Validation Loss: {val_loss:.4f}, " + f"Validation Acc@1: {val_acc1:.4f}, " + f"Validation Acc@5: {val_acc5:.4f}, " + f"time: {val_time:.2f}") return - # 8. Start training and validation - logging.info(f"Start training from epoch {last_epoch+1}.") + + # STEP 8: Start training and validation (train mode) + logger.info(f"Start training from epoch {last_epoch+1}.") for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): # train - logging.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") + logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") train_loss, train_acc, train_time = train(dataloader=dataloader_train, model=model, criterion=criterion, optimizer=optimizer, epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, total_batch=len(dataloader_train), debug_steps=config.REPORT_FREQ, accum_iter=config.TRAIN.ACCUM_ITER, - ) + mixup_fn=mixup_fn, + amp=config.AMP, + logger=logger) scheduler.step() logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + f"Train Loss: {train_loss:.4f}, " + @@ -311,9 +404,10 @@ def main(): val_loss, val_acc1, val_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=criterion, + criterion=criterion_val, total_batch=len(dataloader_val), - debug_steps=config.REPORT_FREQ) + debug_steps=config.REPORT_FREQ, + logger=logger) logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + f"Validation Loss: {val_loss:.4f}, " + f"Validation Acc@1: {val_acc1:.4f}, " + diff --git a/image_classification/Shuffle_Transformer/mixup.py b/image_classification/Shuffle_Transformer/mixup.py new file mode 100644 index 00000000..1d2db493 --- /dev/null +++ b/image_classification/Shuffle_Transformer/mixup.py @@ -0,0 +1,225 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""mixup and cutmix for batch data""" +import numpy as np +import paddle + + +def rand_bbox(image_shape, lam, count=None): + """ CutMix bbox by lam value + Generate 1 random bbox by value lam. lam is the cut size rate. + The cut_size is computed by sqrt(1-lam) * image_size. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + count: int, number of bbox to generate + """ + image_h, image_w = image_shape[-2:] + cut_rate = np.sqrt(1. - lam) + cut_h = int(cut_rate * image_h) + cut_w = int(cut_rate * image_w) + + # get random bbox center + cy = np.random.randint(0, image_h, size=count) + cx = np.random.randint(0, image_w, size=count) + + # get bbox coords + bbox_x1 = np.clip(cx - cut_w // 2, 0, image_w) + bbox_y1 = np.clip(cy - cut_h // 2, 0, image_h) + bbox_x2 = np.clip(cx + cut_w // 2, 0, image_w) + bbox_y2 = np.clip(cy + cut_h // 2, 0, image_h) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # while in pytorch, it will return [] tensor + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def rand_bbox_minmax(image_shape, minmax, count=None): + """ CutMix bbox by min and max value + Generate 1 random bbox by min and max percentage values. + Minmax is a tuple/list of min and max percentage vlaues + applied to the image width and height. 
+ + Args: + image_shape: tuple/list, image height and width + minmax: tuple/list, min and max percentage values of image size + count: int, number of bbox to generate + """ + assert len(minmax) == 2 + image_h, image_w = image_shape[-2:] + min_ratio = minmax[0] + max_ratio = minmax[1] + cut_h = np.random.randint(int(image_h * min_ratio), int(image_h * max_ratio), size=count) + cut_w = np.random.randint(int(image_w * min_ratio), int(image_w * max_ratio), size=count) + + bbox_x1 = np.random.randint(0, image_w - cut_w, size=count) + bbox_y1 = np.random.randint(0, image_h - cut_h, size=count) + bbox_x2 = bbox_x1 + cut_w + bbox_y2 = bbox_y1 + cut_h + + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def cutmix_generate_bbox_adjust_lam(image_shape, lam, minmax=None, correct_lam=True, count=None): + """Generate bbox and apply correction for lambda + If the mimmax is None, apply the standard cutmix by lam value, + If the minmax is set, apply the cutmix by min and max percentage values. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + minmax: tuple/list, min and max percentage values of image size + correct_lam: bool, if True, correct the lam value by the generated bbox + count: int, number of bbox to generate + """ + if minmax is not None: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox_minmax(image_shape, minmax, count) + else: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox(image_shape, lam, count) + + if correct_lam or minmax is not None: + image_h, image_w = image_shape[-2:] + bbox_area = (bbox_y2 - bbox_y1) * (bbox_x2 - bbox_x1) + lam = 1. - bbox_area / float(image_h * image_w) + return (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam + + +def one_hot(x, num_classes, on_value=1., off_value=0.): + """ Generate one-hot vector for label smoothing + Args: + x: tensor, contains label/class indices + num_classes: int, num of classes (len of the one-hot vector) + on_value: float, the vector value at label index, default=1. + off_value: float, the vector value at non-label indices, default=0. + Returns: + one_hot: tensor, tensor with on value at label index and off value + at non-label indices. + """ + x = x.reshape_([-1, 1]) + x_smoothed = paddle.full((x.shape[0], num_classes), fill_value=off_value) + for i in range(x.shape[0]): + x_smoothed[i, x[i]] = on_value + return x_smoothed + + +def mixup_one_hot(label, num_classes, lam=1., smoothing=0.): + """ mixup and label smoothing in batch + label smoothing is firstly applied, then + mixup is applied by mixing the bacth and its flip, + with a mixup rate. + + Args: + label: tensor, label tensor with shape [N], contains the class indices + num_classes: int, num of all classes + lam: float, mixup rate, default=1.0 + smoothing: float, label smoothing rate + """ + off_value = smoothing / num_classes + on_value = 1. 
- smoothing + off_value + y1 = one_hot(label, num_classes, on_value, off_value) + y2 = one_hot(label.flip(axis=[0]), num_classes, on_value, off_value) + return y2 * (1 - lam) + y1 * lam + + +class Mixup: + """Mixup class + Args: + mixup_alpha: float, mixup alpha for beta distribution, default=1.0, + cutmix_alpha: float, cutmix alpha for beta distribution, default=0.0, + cutmix_minmax: list/tuple, min and max value for cutmix ratio, default=None, + prob: float, if random prob < prob, do not use mixup, default=1.0, + switch_prob: float, prob of switching mixup and cutmix, default=0.5, + mode: string, mixup up, now only 'batch' is supported, default='batch', + correct_lam: bool, if True, apply correction of lam, default=True, + label_smoothing: float, label smoothing rate, default=0.1, + num_classes: int, num of classes, default=1000 + """ + def __init__(self, + mixup_alpha=1.0, + cutmix_alpha=0.0, + cutmix_minmax=None, + prob=1.0, + switch_prob=0.5, + mode='batch', + correct_lam=True, + label_smoothing=0.1, + num_classes=1000): + self.mixup_alpha = mixup_alpha + self.cutmix_alpha = cutmix_alpha + self.cutmix_minmax = cutmix_minmax + if cutmix_minmax is not None: + assert len(cutmix_minmax) == 2 + self.cutmix_alpha = 1.0 + self.mix_prob = prob + self.switch_prob = switch_prob + self.label_smoothing = label_smoothing + self.num_classes = num_classes + self.mode = mode + self.correct_lam = correct_lam + assert mode == 'batch', 'Now only batch mode is supported!' + + def __call__(self, x, target): + assert x.shape[0] % 2 == 0, "Batch size should be even" + lam = self._mix_batch(x) + target = mixup_one_hot(target, self.num_classes, lam, self.label_smoothing) + return x, target + + def get_params(self): + """Decide to use cutmix or regular mixup by sampling and + sample lambda for mixup + """ + lam = 1. + use_cutmix = False + use_mixup = np.random.rand() < self.mix_prob + if use_mixup: + if self.mixup_alpha > 0. and self.cutmix_alpha > 0.: + use_cutmix = np.random.rand() < self.switch_prob + alpha = self.cutmix_alpha if use_cutmix else self.mixup_alpha + lam_mix = np.random.beta(alpha, alpha) + elif self.mixup_alpha == 0. and self.cutmix_alpha > 0.: + use_cutmix=True + lam_mix = np.random.beta(self.cutmix_alpha, self.cutmix_alpha) + elif self.mixup_alpha > 0. 
and self.cutmix_alpha == 0.: + lam_mix = np.random.beta(self.mixup_alpha, self.mixup_alpha) + else: + raise ValueError('mixup_alpha and cutmix_alpha cannot be all 0') + lam = float(lam_mix) + return lam, use_cutmix + + def _mix_batch(self, x): + """mixup/cutmix by adding batch data and its flipped version""" + lam, use_cutmix = self.get_params() + if lam == 1.: + return lam + if use_cutmix: + (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam = cutmix_generate_bbox_adjust_lam( + x.shape, + lam, + minmax=self.cutmix_minmax, + correct_lam=self.correct_lam) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # but in pytorch, it will return [] tensor without errors + if int(bbox_x1) != int(bbox_x2) and int(bbox_y1) != int(bbox_y2): + x[:, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] = x.flip(axis=[0])[ + :, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] + else: + x_flipped = x.flip(axis=[0]) + x_flipped = x_flipped * (1 - lam) + x.set_value(x * (lam) + x_flipped) + return lam diff --git a/image_classification/Shuffle_Transformer/port_weights/__init__.py b/image_classification/Shuffle_Transformer/port_weights/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/image_classification/Shuffle_Transformer/random_erasing.py b/image_classification/Shuffle_Transformer/random_erasing.py new file mode 100644 index 00000000..31eea465 --- /dev/null +++ b/image_classification/Shuffle_Transformer/random_erasing.py @@ -0,0 +1,118 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Random Erasing for image tensor""" + +import random +import math +import paddle + + +def _get_pixels(per_pixel, rand_color, patch_size, dtype="float32"): + if per_pixel: + return paddle.normal(shape=patch_size).astype(dtype) + if rand_color: + return paddle.normal(shape=(patch_size[0], 1, 1)).astype(dtype) + return paddle.zeros((patch_size[0], 1, 1)).astype(dtype) + + +class RandomErasing(object): + """ + Args: + prob: probability of performing random erasing + min_area: Minimum percentage of erased area wrt input image area + max_area: Maximum percentage of erased area wrt input image area + min_aspect: Minimum aspect ratio of earsed area + max_aspect: Maximum aspect ratio of earsed area + mode: pixel color mode, in ['const', 'rand', 'pixel'] + 'const' - erase block is constant valued 0 for all channels + 'rand' - erase block is valued random color (same per-channel) + 'pixel' - erase block is vauled random color per pixel + min_count: Minimum # of ereasing blocks per image. + max_count: Maximum # of ereasing blocks per image. 
Area per box is scaled by count + per-image count is randomly chosen between min_count to max_count + """ + def __init__(self, prob=0.5, min_area=0.02, max_area=1/3, min_aspect=0.3, max_aspect=None, + mode='const', min_count=1, max_count=None, num_splits=0): + self.prob = prob + self.min_area = min_area + self.max_area = max_area + max_aspect = max_aspect or 1 / min_aspect + self.log_aspect_ratio = (math.log(min_aspect), math.log(max_aspect)) + self.min_count = min_count + self.max_count = max_count or min_count + self.num_splits = num_splits + mode = mode.lower() + self.rand_color = False + self.per_pixel = False + if mode == "rand": + self.rand_color = True + elif mode == "pixel": + self.per_pixel = True + else: + assert not mode or mode == "const" + + def _erase(self, img, chan, img_h, img_w, dtype): + if random.random() > self.prob: + return + area = img_h * img_w + count = self.min_count if self.min_count == self.max_count else \ + random.randint(self.min_count, self.max_count) + for _ in range(count): + for attempt in range(10): + target_area = random.uniform(self.min_area, self.max_area) * area / count + aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio)) + h = int(round(math.sqrt(target_area * aspect_ratio))) + w = int(round(math.sqrt(target_area / aspect_ratio))) + if w < img_w and h < img_h: + top = random.randint(0, img_h - h) + left = random.randint(0, img_w - w) + img[:, top:top+h, left:left+w] = _get_pixels( + self.per_pixel, self.rand_color, (chan, h, w), + dtype=dtype) + break + + def __call__(self, input): + if len(input.shape) == 3: + self._erase(input, *input.shape, input.dtype) + else: + batch_size, chan, img_h, img_w = input.shape + batch_start = batch_size // self.num_splits if self.num_splits > 1 else 0 + for i in range(batch_start, batch_size): + self._erase(input[i], chan, img_h, img_w, input.dtype) + return input + + + +#def main(): +# re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='rand') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='const') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='pixel') +# import PIL.Image as Image +# import numpy as np +# paddle.set_device('cpu') +# img = paddle.to_tensor(np.asarray(Image.open('./lenna.png'))).astype('float32') +# img = img / 255.0 +# img = paddle.transpose(img, [2, 0, 1]) +# new_img = re(img) +# new_img = new_img * 255.0 +# new_img = paddle.transpose(new_img, [1, 2, 0]) +# new_img = new_img.cpu().numpy() +# new_img = Image.fromarray(new_img.astype('uint8')) +# new_img.save('./res.png') +# +# +# +#if __name__ == "__main__": +# main() diff --git a/image_classification/Shuffle_Transformer/run_train.sh b/image_classification/Shuffle_Transformer/run_train.sh index 8c2484d8..49e214d1 100644 --- a/image_classification/Shuffle_Transformer/run_train.sh +++ b/image_classification/Shuffle_Transformer/run_train.sh @@ -2,5 +2,6 @@ CUDA_VISIBLE_DEVICES=0 \ python main_single_gpu.py \ -cfg='./configs/shuffle_vit_tiny_patch4_window7_224.yaml' \ -dataset='imagenet2012' \ --batch_size=64 \ +-batch_size=8 \ -data_path='/dataset/imagenet' \ +-amp diff --git a/image_classification/Shuffle_Transformer/run_train_multi.sh b/image_classification/Shuffle_Transformer/run_train_multi.sh index eaaa7a61..679fd0b2 100644 --- a/image_classification/Shuffle_Transformer/run_train_multi.sh +++ b/image_classification/Shuffle_Transformer/run_train_multi.sh @@ -2,5 +2,6 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 \ python main_multi_gpu.py \ -cfg='./configs/shuffle_vit_tiny_patch4_window7_224.yaml' \ 
-dataset='imagenet2012' \ --batch_size=32 \ +-batch_size=8 \ -data_path='/dataset/imagenet' \ +#-amp diff --git a/image_classification/Shuffle_Transformer/shuffle_transformer.py b/image_classification/Shuffle_Transformer/shuffle_transformer.py index dc419852..6f6287fd 100644 --- a/image_classification/Shuffle_Transformer/shuffle_transformer.py +++ b/image_classification/Shuffle_Transformer/shuffle_transformer.py @@ -52,14 +52,16 @@ def __init__(self, embed_dim=48, in_channels=3): super().__init__() + w_attr_1, b_attr_1 = self._init_weights_batchnorm() self.conv1 = nn.Sequential( nn.Conv2D(in_channels, inter_dim, kernel_size=3, stride=2, padding=1), - nn.BatchNorm2D(inter_dim), + nn.BatchNorm2D(inter_dim, weight_attr=w_attr_1, bias_attr=b_attr_1), nn.ReLU6()) + w_attr_2, b_attr_2 = self._init_weights_batchnorm() self.conv2 = nn.Sequential( nn.Conv2D(inter_dim, embed_dim, kernel_size=3, stride=2, padding=1), - nn.BatchNorm2D(embed_dim), + nn.BatchNorm2D(embed_dim, weight_attr=w_attr_2, bias_attr=b_attr_2), nn.ReLU6()) self.conv3 = nn.Conv2D(embed_dim, embed_dim, kernel_size=1, stride=1, padding=0) @@ -67,6 +69,11 @@ def __init__(self, # 4 = stride * stride self.num_patches = (image_size // 4) * (image_size // 4) + def _init_weights_batchnorm(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(1.0)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + def forward(self, inputs): out = self.conv1(inputs) out = self.conv2(out) @@ -291,7 +298,8 @@ def __init__(self, attention_dropout=0., droppath=0.): super().__init__() - self.norm1 = nn.BatchNorm2D(dim) + w_attr_1, b_attr_1 = self._init_weights_batchnorm() + self.norm1 = nn.BatchNorm2D(dim, weight_attr=w_attr_1, bias_attr=b_attr_1) self.attn = WindowAttention(dim, num_heads=num_heads, window_size=window_size, @@ -308,10 +316,17 @@ def __init__(self, padding=window_size // 2, groups=dim) self.drop_path = DropPath(droppath) - self.norm2 = nn.BatchNorm2D(dim) + w_attr_2, b_attr_2 = self._init_weights_batchnorm() + self.norm2 = nn.BatchNorm2D(dim, weight_attr=w_attr_2, bias_attr=b_attr_2) mlp_hidden_dim = int(dim * mlp_ratio) self.mlp = MLP(dim, mlp_hidden_dim, out_dim, dropout) - self.norm3 = nn.BatchNorm2D(dim) + w_attr_3, b_attr_3 = self._init_weights_batchnorm() + self.norm3 = nn.BatchNorm2D(dim, weight_attr=w_attr_3, bias_attr=b_attr_3) + + def _init_weights_batchnorm(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(1)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0)) + return weight_attr, bias_attr def forward(self, x): # attention @@ -341,7 +356,8 @@ class PatchMerging(nn.Layer): """ def __init__(self, in_dim=32, out_dim=64): super().__init__() - self.norm = nn.BatchNorm2D(in_dim) + w_attr_1, b_attr_1 = self._init_weights_batchnorm() + self.norm = nn.BatchNorm2D(in_dim, weight_attr=w_attr_1, bias_attr=b_attr_1) self.reduction = nn.Conv2D(in_dim, out_dim, kernel_size=2, @@ -349,6 +365,11 @@ def __init__(self, in_dim=32, out_dim=64): padding=0, bias_attr=False) + def _init_weights_batchnorm(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(1)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0)) + return weight_attr, bias_attr + def forward(self, inputs): out = self.norm(inputs) out = self.reduction(out) @@ -477,7 +498,13 @@ def __init__(self, dropout=dropout, droppath=dprs[i])) self.avgpool = nn.AdaptiveAvgPool2D(1) - self.head = 
nn.Linear(dims[-1], num_classes) + w_attr_1, b_attr_1 = self._init_weights() + self.head = nn.Linear(dims[-1], num_classes, weight_attr=w_attr_1, bias_attr=b_attr_1) + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0)) + return weight_attr, bias_attr def forward_features(self, inputs): out = self.patch_embedding(inputs) @@ -500,6 +527,7 @@ def build_shuffle_transformer(config): """ build shuffle transformer using config""" model = ShuffleTransformer(image_size=config.DATA.IMAGE_SIZE, embed_dim=config.MODEL.TRANS.EMBED_DIM, + num_classes=config.MODEL.NUM_CLASSES, mlp_ratio=config.MODEL.TRANS.MLP_RATIO, layers=config.MODEL.TRANS.DEPTHS, num_heads=config.MODEL.TRANS.NUM_HEADS, diff --git a/image_classification/Shuffle_Transformer/stat.py b/image_classification/Shuffle_Transformer/stat.py new file mode 100644 index 00000000..892185b6 --- /dev/null +++ b/image_classification/Shuffle_Transformer/stat.py @@ -0,0 +1,64 @@ +import os +import glob +import paddle +from config import get_config +from shuffle_transformer import build_shuffle_transformer as build_model + +def count_gelu(layer, input, output): + activation_flops = 8 + x = input[0] + num = x.numel() + layer.total_ops += num * activation_flops + + +def count_softmax(layer, input, output): + softmax_flops = 5 # max/substract, exp, sum, divide + x = input[0] + num = x.numel() + layer.total_ops += num * softmax_flops + + +def count_layernorm(layer, input, output): + layer_norm_flops = 5 # get mean (sum), get variance (square and sum), scale(multiply) + x = input[0] + num = x.numel() + layer.total_ops += num * layer_norm_flops + + +for cfg in glob.glob('./configs/*_224.yaml'): + #cfg = './configs/pvtv2_b0.yaml' + #input_size = (1, 3, 512, 512) + #input_size = (1, 3, 448, 448) + #input_size = (1, 3, 384, 384) + input_size = (1, 3, 224, 224) + config = get_config(cfg) + model = build_model(config) + + custom_ops = {paddle.nn.GELU: count_gelu, + paddle.nn.LayerNorm: count_layernorm, + paddle.nn.Softmax: count_softmax, + } + print(os.path.basename(cfg)) + paddle.flops(model, + input_size=input_size, + custom_ops=custom_ops, + print_detail=False) + + +#for cfg in glob.glob('./configs/*.yaml'): +# #cfg = './configs/swin_base_patch4_window7_224.yaml' +# input_size = (1, 3, int(cfg[-8:-5]), int(cfg[-8:-5])) +# config = get_config(cfg) +# model = build_model(config) +# +# +# custom_ops = {paddle.nn.GELU: count_gelu, +# paddle.nn.LayerNorm: count_layernorm, +# paddle.nn.Softmax: count_softmax, +# } +# print(os.path.basename(cfg)) +# paddle.flops(model, +# input_size=input_size, +# custom_ops=custom_ops, +# print_detail=False) +# print('-----------') diff --git a/image_classification/Shuffle_Transformer/transforms.py b/image_classification/Shuffle_Transformer/transforms.py new file mode 100644 index 00000000..676fe1ff --- /dev/null +++ b/image_classification/Shuffle_Transformer/transforms.py @@ -0,0 +1,13 @@ +import paddle +import paddle.nn +import paddle.vision.transforms as T + + +class RandomHorizontalFlip(): + def __init__(self, p=0.5): + self.p = p + + def __call__(self, image): + if random.random() < self.p: + return T.hflip(image) + return image diff --git a/image_classification/SwinTransformer/README.md b/image_classification/SwinTransformer/README.md index 92081ccd..2fbc02b4 100644 --- a/image_classification/SwinTransformer/README.md +++ b/image_classification/SwinTransformer/README.md @@ -13,17 
+13,30 @@ This implementation is developed by [PaddleViT](https://github.com/BR-IDL/Paddle

### Update -Update (2021-08-11): Code is released and ported weights are uploaded. +* Update (2021-10-11): New main function for single and multiplt gpus are updated. +* Update (2021-10-11): Training from scratch is available. +* Update (2021-09-27): Model FLOPs and num params are uploaded. +* Update (2021-09-10): More ported weights are uploaded. +* Update (2021-08-11): Code is released and ported weights are uploaded. ## Models Zoo -| Model | Acc@1 | Acc@5 | Image Size | Crop_pct | Interpolation | Link | -|--------------------------------|-------|-------|------------|----------|---------------|--------------| -| swin_base_patch4_window7_224 | 85.27 | 97.56 | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1yjZFJoJeDFIfsxh9x10XGqCb8s2-Gtbp/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1AseY3CKmJvlxoSwXnxHEwA)(wyck) | -| swin_base_patch4_window12_384 | 86.43 | 98.07 | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1ThmGsTDZ8217-Zuo9o5EGLfzw8AI6N0w/view?usp=sharing)/[baidu](https://pan.baidu.com/s/10E3F9jqBeBTcasIvJ8iMzg)(4a95) | -| swin_large_patch4_window12_384 | 87.14 | 98.23 | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1f30Mt80g5yLfEiViT4-kMLpyDjTUTV5B/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1w5w8QNfg0zY3nSfGs-Tx3A)(j71u) | - +| Model | Acc@1 | Acc@5 | #Params | FLOPs | Image Size | Crop_pct | Interpolation | Link | +|-------------------------------|-------|-------|---------|--------|------------|----------|---------------|--------------| +| swin_t_224 | 81.37 | 95.54 | 28.3M | 4.4G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1v_wzWv3TaQ0RKkKwRQwuDPzwpOb_jGEs/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1tbc751RVh3fIRsrLzrmeOw)(h2ac) | +| swin_s_224 | 83.21 | 96.32 | 49.6M | 8.6G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1lrODzr8zIOU9sBrH2x3zolMOS4mv4o7x/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1rlXL0tjLWbWnkIt_2Ne8Jw)(ydyx) | +| swin_b_224 | 83.60 | 96.46 | 87.7M | 15.3G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1hjEVODThNEDAlIqkg8C1KzUh3KsVNu6R/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1ucSHBiuiG2sHAmR1N1JENQ)(h4y6) | +| swin_b_384 | 84.48 | 96.89 | 87.7M | 45.5G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1szLgwhB6WJu02Me6Uyz94egk8SqKlNsd/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1t0oXbqKNwpUAMJV7VTzcNw)(7nym) | +| swin_b_224_22kto1k | 85.27 | 97.56 | 87.7M | 15.3G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1FhdlheMUlJzrZ7EQobpGRxd3jt3aQniU/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1KBocL_M6YNW1ZsK-GYFiNw)(6ur8) | +| swin_b_384_22kto1k | 86.43 | 98.07 | 87.7M | 45.5G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1zVwIrJmtuBSiSVQhUeblRQzCKx-yWNCA/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1NziwdsEJtmjfGCeUFgtZXA)(9squ) | +| swin_l_224_22kto1k | 86.32 | 97.90 | 196.4M | 34.3G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1yo7rkxKbQ4izy2pY5oQ5QAnkyv7zKcch/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1GsUJbSkGxlGsBYsayyKjVg)(nd2f) | +| swin_l_384_22kto1k | 87.14 | 98.23 | 196.4M | 100.9G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1-6DEvkb-FMz72MyKtq9vSPKYBqINxoKK/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1JLdS0aTl3I37oDzGKLFSqA)(5g5e) | > *The results are evaluated on ImageNet2012 validation set. 
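The `Crop_pct` and `Interpolation` columns feed the validation transforms in `datasets.py`: the image is first resized to roughly `image_size / crop_pct` with the listed interpolation, then center-cropped. A minimal sketch of that pipeline, assuming the usual `scale_size = floor(image_size / crop_pct)` convention (the exact computation lives in `get_val_transforms`):

```python
# Illustration only: how Crop_pct = 0.9 maps to the eval transform for the 224 models.
# Assumes the common scale_size = floor(image_size / crop_pct) convention.
import math
from paddle.vision import transforms

image_size, crop_pct = 224, 0.9
scale_size = int(math.floor(image_size / crop_pct))  # 248

val_transforms = transforms.Compose([
    transforms.Resize(scale_size, interpolation='bicubic'),
    transforms.CenterCrop((image_size, image_size)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```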
+### Models trained from scratch using PaddleViT +| Model | Acc@1 | Acc@5 | #Params | FLOPs | Image Size | Crop_pct | Interpolation | Link | +|-------------------------------|-------|-------|---------|--------|------------|----------|---------------|--------------| +| swin_t_224 | 79.67 | 94.72 | 28.3M | 4.4G | 224 | 0.9 | bicubic | coming soon | + ## Notebooks We provide a few notebooks in aistudio to help you get started: @@ -65,8 +78,8 @@ from swin import build_swin as build_model config = get_config('./configs/swin_base_patch4_window7_224.yaml') # build model model = build_model(config) -# load pretrained weights, .pdparams is NOT needed -model_state_dict = paddle.load('./swin_base_patch4_window7_224') +# load pretrained weights +model_state_dict = paddle.load('./swin_base_patch4_window7_224.pdparams') model.set_dict(model_state_dict) ``` @@ -79,12 +92,12 @@ or ```shell CUDA_VISIBLE_DEVICES=0 \ python main_single_gpu.py \ - -cfg='./configs/swin_base_patch4_window7_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/swin_base_patch4_window7_224.yaml \ + -dataset=imagenet2012 \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/val \ -eval \ - -pretrained='./swin_base_patch4_window7_224' + -pretrained=/path/to/pretrained/model/swin_base_patch4_window7_224 # .pdparams is NOT needed ```
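The `.pdparams` suffix is omitted from `-pretrained` because the scripts append it themselves before calling `paddle.load`. A minimal sketch of that pattern as added in this PR (shown for the single-GPU script; the multi-GPU path is analogous):

```python
# Sketch of the pretrained-weight handling added in this PR: the config stores the
# checkpoint path without the .pdparams suffix and the script appends it here.
import os
import paddle

pretrained = '/path/to/pretrained/model/swin_base_patch4_window7_224'  # no suffix
assert os.path.isfile(pretrained + '.pdparams')
model_state = paddle.load(pretrained + '.pdparams')
model.set_dict(model_state)  # `model` is the result of build_model(config), as shown above
```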
@@ -101,12 +114,12 @@ or ```shell CUDA_VISIBLE_DEVICES=0,1,2,3 \ python main_multi_gpu.py \ - -cfg='./configs/swin_base_patch4_window7_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/swin_base_patch4_window7_224.yaml \ + -dataset=imagenet2012 \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/val \ -eval \ - -pretrained='./swin_base_patch4_window7_224' + -pretrained=/path/to/pretrained/model/swin_base_patch4_window7_224 # .pdparams is NOT needed ```
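For multi-GPU evaluation, each process keeps its own meters and the master logger additionally reports metrics averaged across all GPUs (the `avg_loss` / `avg_acc1` values in `main_multi_gpu.py`). A hypothetical sketch of how such cross-process averaging can be done with `paddle.distributed.all_reduce`; the actual helper in this PR may differ:

```python
# Hypothetical helper: average a scalar metric over all processes.
import paddle
import paddle.distributed as dist

def all_reduce_mean(value):
    t = paddle.to_tensor([value], dtype='float32')
    dist.all_reduce(t)                       # sums the value over all processes
    return float(t.numpy()[0]) / dist.get_world_size()

# e.g. avg_loss = all_reduce_mean(val_loss_meter.avg)
```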
@@ -121,10 +134,10 @@ or ```shell CUDA_VISIBLE_DEVICES=0 \ python main_single_gpu.py \ - -cfg='./configs/swin_base_patch4_window7_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/swin_base_patch4_window7_224.yaml \ + -dataset=imagenet2012 \ -batch_size=32 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/train ```
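Mixed-precision training can be enabled with the `-amp` option introduced by this PR (the flag is ignored in eval mode); this assumes the argparse flag is wired into the Swin scripts the same way as in the Shuffle Transformer scripts earlier in this diff. Inside `train()` it switches to the `auto_cast`/`GradScaler` pattern added in this PR, sketched here without gradient accumulation:

```python
# Sketch of the AMP branch of train() added in this PR (simplified); assumes
# model, criterion, optimizer and dataloader are already built as in main().
import paddle

scaler = paddle.amp.GradScaler(init_loss_scaling=1024)
for image, label in dataloader:
    with paddle.amp.auto_cast():
        output = model(image)
        loss = criterion(output, label)
    scaled = scaler.scale(loss)
    scaled.backward()
    scaler.minimize(optimizer, scaled)  # unscales gradients and applies the update
    optimizer.clear_grad()
```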
@@ -141,10 +154,10 @@ or ```shell CUDA_VISIBLE_DEVICES=0,1,2,3 \ python main_multi_gpu.py \ - -cfg='./configs/swin_base_patch4_window7_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/swin_base_patch4_window7_224.yaml \ + -dataset=imagenet2012 \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/train ```
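This PR also moves the mixup/cutmix and label-smoothing settings under `TRAIN.*` in `config.py`. A minimal sketch of how they are used per batch, mirroring the Shuffle Transformer training loop in this PR: `Mixup` blends the batch with its flipped copy and returns soft labels, so training uses the soft-target loss while validation keeps plain cross entropy.

```python
# Sketch of the per-batch mixup usage added in this PR; assumes model, criterion,
# mixup_fn and an (image, label) batch already exist as in train().
import paddle
import paddle.nn.functional as F

label_orig = label.clone()
if mixup_fn is not None:
    image, label = mixup_fn(image, label_orig)   # label becomes a soft target
output = model(image)
loss = criterion(output, label)                  # SoftTargetCrossEntropyLoss when mixup is on
pred = F.softmax(output)
# top-1 accuracy is still computed against the original hard labels
acc = paddle.metric.accuracy(pred, label_orig if mixup_fn else label_orig.unsqueeze(1))
```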
diff --git a/image_classification/SwinTransformer/__init__.py b/image_classification/SwinTransformer/__init__.py new file mode 100644 index 00000000..e2cbd538 --- /dev/null +++ b/image_classification/SwinTransformer/__init__.py @@ -0,0 +1 @@ +#init diff --git a/image_classification/SwinTransformer/augment.py b/image_classification/SwinTransformer/augment.py new file mode 100644 index 00000000..7a7f081c --- /dev/null +++ b/image_classification/SwinTransformer/augment.py @@ -0,0 +1,285 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Augmentation""" +""" Rand Augmentation """ +# reference: RandAugment: Practical automated data augmentation with a reduced search space +# https://arxiv.org/abs/1909.13719 + +""" Auto Augmentation """ +# reference: AutoAugment: Learning Augmentation Policies from Data +# https://arxiv.org/abs/1805.09501 + +import random +import numpy as np +from PIL import Image, ImageEnhance, ImageOps + + +def auto_augment_policy_original(): + """25 types of augment policies in original paper""" + policy = [ + [('Posterize', 0.4, 8), ('Rotate', 0.6, 9)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + [('Posterize', 0.6, 7), ('Posterize', 0.6, 6)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Equalize', 0.4, 4), ('Rotate', 0.8, 8)], + [('Solarize', 0.6, 3), ('Equalize', 0.6, 7)], + [('Posterize', 0.8, 5), ('Equalize', 1.0, 2)], + [('Rotate', 0.2, 3), ('Solarize', 0.6, 8)], + [('Equalize', 0.6, 8), ('Posterize', 0.4, 6)], + [('Rotate', 0.8, 8), ('Color', 0.4, 0)], + [('Rotate', 0.4, 9), ('Equalize', 0.6, 2)], + [('Equalize', 0.0, 7), ('Equalize', 0.8, 8)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Rotate', 0.8, 8), ('Color', 1.0, 2)], + [('Color', 0.8, 8), ('Solarize', 0.8, 7)], + [('Sharpness', 0.4, 7), ('Invert', 0.6, 8)], + [('ShearX', 0.6, 5), ('Equalize', 1.0, 9)], + [('Color', 0.4, 0), ('Equalize', 0.6, 3)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + ] + policy = [[SubPolicy(*args) for args in subpolicy] for subpolicy in policy] + return policy + + +def rand_augment_policy_original(magnitude_idx=9): + """ + 14 types of augment policies in original paper + Args: + magnitude_idx: M + """ + policy = [ + ('Posterize', 1, magnitude_idx), ('Rotate', 1, magnitude_idx), + ('Solarize', 1, magnitude_idx), ('AutoContrast', 1, magnitude_idx), + ('Equalize', 1, magnitude_idx), ('Contrast', 1, magnitude_idx), + ('Color', 1, magnitude_idx), ('Invert', 1, magnitude_idx), + ('Sharpness', 1, magnitude_idx), ('Brightness', 1, magnitude_idx), + ('ShearX', 1, magnitude_idx), ('ShearY', 1, magnitude_idx), + ('TranslateX', 1, magnitude_idx), ('TranslateY', 1, magnitude_idx), + ] + policy = [SubPolicy(*args) for args in policy] + 
return policy + + +class AutoAugment(): + """Auto Augment + Randomly choose a tuple of augment ops from a list of policy + Then apply the tuple of augment ops to input image + + Examples: + policy = auto_augment_policy_original() + augment = AutoAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy): + self.policy = policy + + def __call__(self, image, policy_idx=None): + if policy_idx is None: + policy_idx = random.randint(0, len(self.policy) - 1) + + sub_policy = self.policy[policy_idx] + for op in sub_policy: + image = op(image) + return image + + +class RandAugment(): + """Rand Augment + Randomly choose N augment ops from a list of K policies + Then apply the N ops to input image + + Examples: + policy = rand_augment_policy_original(magnitude_idx) + augment = RandAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy, num_layers=2): + """ + Args: + policy: list of SubPolicy + num_layers: int + """ + self.policy = policy + self.num_layers = num_layers + + def __call__(self, image): + selected_idx = np.random.choice(len(self.policy), self.num_layers) + + for policy_idx in selected_idx: + sub_policy = self.policy[policy_idx] + image = sub_policy(image) + return image + + +class SubPolicy: + """Subpolicy + Read augment name and magnitude, apply augment with probability + Args: + op_name: str, augment operation name + prob: float, if prob > random prob, apply augment + magnitude_idx: int, index of magnitude in preset magnitude ranges + """ + + def __init__(self, op_name, prob, magnitude_idx): + # ranges of operations' magnitude + ranges = { + 'ShearX': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'ShearY': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'TranslateX': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'TranslateY': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'Rotate': np.linspace(0, 30, 10), # [-30, 30] (by random negative) + 'Color': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Posterize': np.round(np.linspace(8, 4, 10), 0).astype(np.int), # [0, 4] + 'Solarize': np.linspace(256, 0, 10), # [0, 256] + 'Contrast': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Sharpness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Brightness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'AutoContrast': [0] * 10, # no range + 'Equalize': [0] * 10, # no range + 'Invert': [0] * 10, # no range + } + + # augmentation operations + # Lambda is not pickleable for DDP + # image_ops = { + # 'ShearX': lambda image, magnitude: shear_x(image, magnitude), + # 'ShearY': lambda image, magnitude: shear_y(image, magnitude), + # 'TranslateX': lambda image, magnitude: translate_x(image, magnitude), + # 'TranslateY': lambda image, magnitude: translate_y(image, magnitude), + # 'Rotate': lambda image, magnitude: rotate(image, magnitude), + # 'AutoContrast': lambda image, magnitude: auto_contrast(image, magnitude), + # 'Invert': lambda image, magnitude: invert(image, magnitude), + # 'Equalize': lambda image, magnitude: equalize(image, magnitude), + # 'Solarize': lambda image, magnitude: solarize(image, magnitude), + # 'Posterize': lambda image, magnitude: posterize(image, magnitude), + # 'Contrast': lambda image, magnitude: contrast(image, magnitude), + # 'Color': lambda image, magnitude: color(image, magnitude), + # 'Brightness': lambda image, magnitude: brightness(image, magnitude), + # 'Sharpness': 
lambda image, magnitude: sharpness(image, magnitude), + # } + image_ops = { + 'ShearX': shear_x, + 'ShearY': shear_y, + 'TranslateX': translate_x_relative, + 'TranslateY': translate_y_relative, + 'Rotate': rotate, + 'AutoContrast': auto_contrast, + 'Invert': invert, + 'Equalize': equalize, + 'Solarize': solarize, + 'Posterize': posterize, + 'Contrast': contrast, + 'Color': color, + 'Brightness': brightness, + 'Sharpness': sharpness, + } + + self.prob = prob + self.magnitude = ranges[op_name][magnitude_idx] + self.op = image_ops[op_name] + + def __call__(self, image): + if self.prob > random.random(): + image = self.op(image, self.magnitude) + return image + + +# PIL Image transforms +# https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.transform +def shear_x(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, factor, 0, 0, 1, 0), fillcolor=fillcolor) + + +def shear_y(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, factor, 1, 0), fillcolor=fillcolor) + + +def translate_x_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, pixels, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, pixels), fillcolor=fillcolor) + + +def translate_x_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, magnitude, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, magnitude), fillcolor=fillcolor) + + +def rotate(image, magnitude): + rot = image.convert("RGBA").rotate(magnitude) + return Image.composite(rot, + Image.new('RGBA', rot.size, (128,) * 4), + rot).convert(image.mode) + + +def auto_contrast(image, magnitude=None): + return ImageOps.autocontrast(image) + + +def invert(image, magnitude=None): + return ImageOps.invert(image) + + +def equalize(image, magnitude=None): + return ImageOps.equalize(image) + + +def solarize(image, magnitude): + return ImageOps.solarize(image, magnitude) + + +def posterize(image, magnitude): + return ImageOps.posterize(image, magnitude) + + +def contrast(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Contrast(image).enhance(1 + magnitude) + + +def color(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Color(image).enhance(1 + magnitude) + + +def brightness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Brightness(image).enhance(1 + magnitude) + + +def sharpness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Sharpness(image).enhance(1 + magnitude) + diff --git 
a/image_classification/SwinTransformer/augmentation.py b/image_classification/SwinTransformer/augmentation.py deleted file mode 100644 index 811a0cea..00000000 --- a/image_classification/SwinTransformer/augmentation.py +++ /dev/null @@ -1,3 +0,0 @@ -import paddle -import paddle.nn as nn - diff --git a/image_classification/SwinTransformer/config.py b/image_classification/SwinTransformer/config.py index 871d7858..6a041129 100644 --- a/image_classification/SwinTransformer/config.py +++ b/image_classification/SwinTransformer/config.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -35,7 +35,9 @@ _C.DATA.DATASET = 'imagenet2012' # dataset name _C.DATA.IMAGE_SIZE = 224 # input image size _C.DATA.CROP_PCT = 0.9 # input image scale ratio, scale is applied before centercrop in eval mode -_C.DATA.NUM_WORKERS = 2 # number of data loading threads +_C.DATA.NUM_WORKERS = 4 # number of data loading threads +_C.DATA.IMAGENET_MEAN = [0.485, 0.456, 0.406] # [0.5, 0.5, 0.5] +_C.DATA.IMAGENET_STD = [0.229, 0.224, 0.225] # [0.5, 0.5, 0.5] # model settings _C.MODEL = CN() @@ -69,10 +71,11 @@ _C.TRAIN.WARMUP_EPOCHS = 20 _C.TRAIN.WEIGHT_DECAY = 0.05 _C.TRAIN.BASE_LR = 0.001 -_C.TRAIN.WARMUP_START_LR = 0.0 -_C.TRAIN.END_LR = 0.0 -_C.TRAIN.GRAD_CLIP = 1.0 -_C.TRAIN.ACCUM_ITER = 2 +_C.TRAIN.WARMUP_START_LR = 5e-7 +_C.TRAIN.END_LR = 5e-6 +_C.TRAIN.GRAD_CLIP = 5.0 +_C.TRAIN.ACCUM_ITER = 1 +_C.TRAIN.LINEAR_SCALED_LR = None _C.TRAIN.LR_SCHEDULER = CN() _C.TRAIN.LR_SCHEDULER.NAME = 'warmupcosine' @@ -86,28 +89,33 @@ _C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999) # for adamW _C.TRAIN.OPTIMIZER.MOMENTUM = 0.9 -# augmentation -_C.AUG = CN() -_C.AUG.COLOR_JITTER = 0.4 # color jitter factor -_C.AUG.AUTO_AUGMENT = 'rand-m9-mstd0.5-inc1' -_C.AUG.RE_PROB = 0.25 # random earse prob -_C.AUG.RE_MODE = 'pixel' # random earse mode -_C.AUG.RE_COUNT = 1 # random earse count -_C.AUG.MIXUP = 0.8 # mixup alpha, enabled if >0 -_C.AUG.CUTMIX = 1.0 # cutmix alpha, enabled if >0 -_C.AUG.CUTMIX_MINMAX = None # cutmix min/max ratio, overrides alpha -_C.AUG.MIXUP_PROB = 1.0 # prob of mixup or cutmix when either/both is enabled -_C.AUG.MIXUP_SWITCH_PROB = 0.5 # prob of switching cutmix when both mixup and cutmix enabled -_C.AUG.MIXUP_MODE = 'batch' #how to apply mixup/curmix params, per 'batch', 'pair', or 'elem' +# train augmentation +_C.TRAIN.MIXUP_ALPHA = 0.8 +_C.TRAIN.CUTMIX_ALPHA = 1.0 +_C.TRAIN.CUTMIX_MINMAX = None +_C.TRAIN.MIXUP_PROB = 1.0 +_C.TRAIN.MIXUP_SWITCH_PROB = 0.5 +_C.TRAIN.MIXUP_MODE = 'batch' + +_C.TRAIN.SMOOTHING = 0.1 +_C.TRAIN.COLOR_JITTER = 0.4 +_C.TRAIN.AUTO_AUGMENT = False #'rand-m9-mstd0.5-inc1' +_C.TRAIN.RAND_AUGMENT = False + +_C.TRAIN.RANDOM_ERASE_PROB = 0.25 +_C.TRAIN.RANDOM_ERASE_MODE = 'pixel' +_C.TRAIN.RANDOM_ERASE_COUNT = 1 +_C.TRAIN.RANDOM_ERASE_SPLIT = False # misc _C.SAVE = "./output" _C.TAG = "default" -_C.SAVE_FREQ = 20 # freq to save chpt +_C.SAVE_FREQ = 1 # freq to save chpt _C.REPORT_FREQ = 50 # freq to logging info -_C.VALIDATE_FREQ = 20 # freq to do validation +_C.VALIDATE_FREQ = 10 # freq to do validation _C.SEED = 0 _C.EVAL = False # run evaluation only +_C.AMP = False _C.LOCAL_RANK = 0 _C.NGPUS = -1 @@ -125,6 +133,7 @@ def _update_config_from_file(config, cfg_file): config.merge_from_file(cfg_file) config.freeze() + def update_config(config, args): """Update config by ArgumentParser Args: 
@@ -141,8 +150,12 @@ def update_config(config, args): config.DATA.BATCH_SIZE = args.batch_size if args.image_size: config.DATA.IMAGE_SIZE = args.image_size + if args.num_classes: + config.MODEL.NUM_CLASSES = args.num_classes if args.data_path: config.DATA.DATA_PATH = args.data_path + if args.output is not None: + config.SAVE = args.output if args.ngpus: config.NGPUS = args.ngpus if args.eval: @@ -154,6 +167,11 @@ def update_config(config, args): config.MODEL.RESUME = args.resume if args.last_epoch: config.TRAIN.LAST_EPOCH = args.last_epoch + if args.amp: # only during training + if config.EVAL is True: + config.AMP = False + else: + config.AMP = True #config.freeze() return config diff --git a/image_classification/SwinTransformer/configs/swin_base_patch4_window12_384_21k_1k.yaml b/image_classification/SwinTransformer/configs/swin_base_patch4_window12_384_21k_1k.yaml new file mode 100644 index 00000000..90b01a6f --- /dev/null +++ b/image_classification/SwinTransformer/configs/swin_base_patch4_window12_384_21k_1k.yaml @@ -0,0 +1,13 @@ +DATA: + IMAGE_SIZE: 384 + CROP_PCT: 1.0 +MODEL: + TYPE: swin + NAME: swin_base_patch4_window12_384 + DROP_PATH: 0.5 + TRANS: + PATCH_SIZE: 4 + WINDOW_SIZE: 12 + EMBED_DIM: 128 + STAGE_DEPTHS: [2, 2, 18, 2] + NUM_HEADS: [4, 8, 16, 32] diff --git a/image_classification/SwinTransformer/configs/swin_base_patch4_window7_224_21k_1k.yaml b/image_classification/SwinTransformer/configs/swin_base_patch4_window7_224_21k_1k.yaml new file mode 100644 index 00000000..9a1d075e --- /dev/null +++ b/image_classification/SwinTransformer/configs/swin_base_patch4_window7_224_21k_1k.yaml @@ -0,0 +1,14 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.90 +MODEL: + TYPE: swin + NAME: swin_base_patch4_window7_224 + DROP_PATH: 0.5 + TRANS: + EMBED_DIM: 128 + STAGE_DEPTHS: [2, 2, 18, 2] + NUM_HEADS: [4, 8, 16, 32] + WINDOW_SIZE: 7 + PATCH_SIZE: 4 + diff --git a/image_classification/SwinTransformer/configs/swin_large_patch4_window7_224.yaml b/image_classification/SwinTransformer/configs/swin_large_patch4_window7_224.yaml new file mode 100644 index 00000000..58069f47 --- /dev/null +++ b/image_classification/SwinTransformer/configs/swin_large_patch4_window7_224.yaml @@ -0,0 +1,13 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.9 +MODEL: + TYPE: swin + NAME: swin_large_patch4_window7_224 + TRANS: + EMBED_DIM: 192 + STAGE_DEPTHS: [2, 2, 18, 2] + NUM_HEADS: [6, 12, 24, 48] + WINDOW_SIZE: 7 + PATCH_SIZE: 4 + diff --git a/image_classification/SwinTransformer/configs/swin_small_patch4_window7_224.yaml b/image_classification/SwinTransformer/configs/swin_small_patch4_window7_224.yaml new file mode 100644 index 00000000..3d7984dc --- /dev/null +++ b/image_classification/SwinTransformer/configs/swin_small_patch4_window7_224.yaml @@ -0,0 +1,14 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.90 +MODEL: + TYPE: swin + NAME: swin_small_patch4_window7_224 + DROP_PATH: 0.3 + TRANS: + EMBED_DIM: 96 + STAGE_DEPTHS: [2, 2, 18, 2] + NUM_HEADS: [3, 6, 12, 24] + WINDOW_SIZE: 7 + PATCH_SIZE: 4 + diff --git a/image_classification/SwinTransformer/configs/swin_tiny_patch4_window7_224.yaml b/image_classification/SwinTransformer/configs/swin_tiny_patch4_window7_224.yaml new file mode 100644 index 00000000..ea71c88e --- /dev/null +++ b/image_classification/SwinTransformer/configs/swin_tiny_patch4_window7_224.yaml @@ -0,0 +1,14 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: swin + NAME: swin_tiny_patch4_window7_224 + DROP_PATH: 0.2 + TRANS: + EMBED_DIM: 96 + STAGE_DEPTHS: [2, 2, 6, 2] + NUM_HEADS: [3, 6, 12, 24] + 
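# NOTE (illustrative annotation, not part of the patch): the new yaml files follow the
# standard Swin variants, differing only in width, depth and head counts:
#   tiny : EMBED_DIM 96,  STAGE_DEPTHS [2, 2, 6, 2],  NUM_HEADS [3, 6, 12, 24]
#   small: EMBED_DIM 96,  STAGE_DEPTHS [2, 2, 18, 2], NUM_HEADS [3, 6, 12, 24]
#   base : EMBED_DIM 128, STAGE_DEPTHS [2, 2, 18, 2], NUM_HEADS [4, 8, 16, 32]
#   large: EMBED_DIM 192, STAGE_DEPTHS [2, 2, 18, 2], NUM_HEADS [6, 12, 24, 48]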
WINDOW_SIZE: 7 + PATCH_SIZE: 4 + diff --git a/image_classification/SwinTransformer/datasets.py b/image_classification/SwinTransformer/datasets.py index 6472a6b5..304df9a3 100644 --- a/image_classification/SwinTransformer/datasets.py +++ b/image_classification/SwinTransformer/datasets.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -19,8 +19,20 @@ import os import math -from paddle.io import Dataset, DataLoader, DistributedBatchSampler -from paddle.vision import transforms, datasets, image_load +from PIL import Image +from paddle.io import Dataset +from paddle.io import DataLoader +from paddle.io import DistributedBatchSampler +from paddle.vision import transforms +from paddle.vision import datasets +from paddle.vision import image_load +from augment import auto_augment_policy_original +from augment import AutoAugment +from augment import rand_augment_policy_original +from augment import RandAugment +from transforms import RandomHorizontalFlip +from random_erasing import RandomErasing + class ImageNet2012Dataset(Dataset): """Build ImageNet2012 dataset @@ -80,13 +92,36 @@ def get_train_transforms(config): transforms_train: training transforms """ - transforms_train = transforms.Compose([ + aug_op_list = [] + # STEP1: random crop and resize + aug_op_list.append( transforms.RandomResizedCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE), - scale=(0.05, 1.0)), - transforms.ToTensor(), - #transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]), - transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]), - ]) + scale=(0.05, 1.0), interpolation='bicubic')) + # STEP2: auto_augment or color jitter + if config.TRAIN.AUTO_AUGMENT: + policy = auto_augment_policy_original() + auto_augment = AutoAugment(policy) + aug_op_list.append(auto_augment) + elif config.TRAIN.RAND_AUGMENT: + policy = rand_augment_policy_original() + rand_augment = RandAugment(policy) + aug_op_list.append(rand_augment) + else: + jitter = (float(config.TRAIN.COLOR_JITTER), ) * 3 + aug_op_list.append(transforms.ColorJitter(*jitter)) + # STEP3: other ops + aug_op_list.append(transforms.ToTensor()) + aug_op_list.append(transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, + std=config.DATA.IMAGENET_STD)) + # STEP4: random erasing + if config.TRAIN.RANDOM_ERASE_PROB > 0.: + random_erasing = RandomErasing(prob=config.TRAIN.RANDOM_ERASE_PROB, + mode=config.TRAIN.RANDOM_ERASE_MODE, + max_count=config.TRAIN.RANDOM_ERASE_COUNT, + num_splits=config.TRAIN.RANDOM_ERASE_SPLIT) + aug_op_list.append(random_erasing) + # Final: compose transforms and return + transforms_train = transforms.Compose(aug_op_list) return transforms_train @@ -109,8 +144,7 @@ def get_val_transforms(config): transforms.Resize(scale_size, interpolation='bicubic'), transforms.CenterCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE)), transforms.ToTensor(), - #transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]), - transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]), + transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD), ]) return transforms_val @@ -131,11 +165,13 @@ def get_dataset(config, mode='train'): if mode == 'train': dataset = datasets.Cifar10(mode=mode, transform=get_train_transforms(config)) else: + mode = 'test' dataset = datasets.Cifar10(mode=mode, 
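# NOTE (illustrative annotation, not part of the patch): with the default values added
# in config.py, the training pipeline built in get_train_transforms() runs in this order:
#   RandomResizedCrop(224, scale=(0.05, 1.0), interpolation='bicubic')
#   AutoAugment / RandAugment / ColorJitter(0.4, 0.4, 0.4)   # exactly one of the three
#   ToTensor()
#   Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD)
#   RandomErasing(prob=0.25, mode='pixel', max_count=1)      # skipped if RANDOM_ERASE_PROB == 0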
transform=get_val_transforms(config)) elif config.DATA.DATASET == "cifar100": if mode == 'train': dataset = datasets.Cifar100(mode=mode, transform=get_train_transforms(config)) else: + mode = 'test' dataset = datasets.Cifar100(mode=mode, transform=get_val_transforms(config)) elif config.DATA.DATASET == "imagenet2012": if mode == 'train': diff --git a/image_classification/SwinTransformer/losses.py b/image_classification/SwinTransformer/losses.py new file mode 100644 index 00000000..082467a3 --- /dev/null +++ b/image_classification/SwinTransformer/losses.py @@ -0,0 +1,123 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" Implement Loss functions """ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class LabelSmoothingCrossEntropyLoss(nn.Layer): + """ cross entropy loss for label smoothing + Args: + smoothing: float, smoothing rate + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, target label with shape [N] + Return: + loss: float, cross entropy loss value + """ + def __init__(self, smoothing=0.1): + super().__init__() + assert 0 <= smoothing < 1.0 + self.smoothing = smoothing + self.confidence = 1 - smoothing + + def forward(self, x, target): + log_probs = F.log_softmax(x) # [N, num_classes] + # target_index is used to get prob for each of the N samples + target_index = paddle.zeros([x.shape[0], 2], dtype='int64') # [N, 2] + target_index[:, 0] = paddle.arange(x.shape[0]) + target_index[:, 1] = target + + nll_loss = -log_probs.gather_nd(index=target_index) # index: [N] + smooth_loss = -log_probs.mean(axis=-1) + loss = self.confidence * nll_loss + self.smoothing * smooth_loss + return loss.mean() + + +class SoftTargetCrossEntropyLoss(nn.Layer): + """ cross entropy loss for soft target + Args: + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, soft target with shape [N, num_classes] + Returns: + loss: float, the mean loss value + """ + def __init__(self): + super().__init__() + + def forward(self, x, target): + loss = paddle.sum(-target * F.log_softmax(x, axis=-1), axis=-1) + return loss.mean() + + +class DistillationLoss(nn.Layer): + """Distillation loss function + This layer includes the orginal loss (criterion) and a extra + distillation loss (criterion), which computes the loss with + different type options, between current model and + a teacher model as its supervision. 
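# NOTE (illustrative annotation, not part of the patch): a quick shape check for the two
# losses defined above; the sizes (N=4 samples, 10 classes) are made up for the example.
#   import paddle
#   import paddle.nn.functional as F
#   logits = paddle.randn([4, 10])
#   labels = paddle.randint(0, 10, [4])
#   print(LabelSmoothingCrossEntropyLoss(smoothing=0.1)(logits, labels))  # scalar
#   soft = F.one_hot(labels, 10) * 0.90 + 0.01   # smoothed one-hot, each row sums to 1
#   print(SoftTargetCrossEntropyLoss()(logits, soft))                     # scalar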
+ + Args: + base_criterion: nn.Layer, the original criterion + teacher_model: nn.Layer, the teacher model as supervision + distillation_type: str, one of ['none', 'soft', 'hard'] + alpha: float, ratio of base loss (* (1-alpha)) + and distillation loss( * alpha) + tao: float, temperature in distillation + """ + def __init__(self, + base_criterion, + teacher_model, + distillation_type, + alpha, + tau): + super().__init__() + assert distillation_type in ['none', 'soft', 'hard'] + self.base_criterion = base_criterion + self.teacher_model = teacher_model + self.type = distillation_type + self.alpha = alpha + self.tau = tau + + def forward(self, inputs, outputs, targets): + """ + Args: + inputs: tensor, the orginal model inputs + outputs: tensor, the outputs of the model + outputds_kd: tensor, the distillation outputs of the model, + this is usually obtained by a separate branch + in the last layer of the model + targets: tensor, the labels for the base criterion + """ + outputs, outputs_kd = outputs[0], outputs[1] + base_loss = self.base_criterion(outputs, targets) + if self.type == 'none': + return base_loss + + with paddle.no_grad(): + teacher_outputs = self.teacher_model(inputs) + + if self.type == 'soft': + distillation_loss = F.kl_div( + F.log_softmax(outputs_kd / self.tau, axis=1), + F.log_softmax(teacher_outputs / self.tau, axis=1), + reduction='sum') * (self.tau * self.tau) / outputs_kd.numel() + elif self.type == 'hard': + distillation_loss = F.cross_entropy(outputs_kd, teacher_outputs.argmax(axis=1)) + + loss = base_loss * (1 - self.alpha) + distillation_loss * self.alpha + return loss + + diff --git a/image_classification/SwinTransformer/main_multi_gpu.py b/image_classification/SwinTransformer/main_multi_gpu.py index 5992f6c7..66de5514 100644 --- a/image_classification/SwinTransformer/main_multi_gpu.py +++ b/image_classification/SwinTransformer/main_multi_gpu.py @@ -25,55 +25,56 @@ import paddle.nn as nn import paddle.nn.functional as F import paddle.distributed as dist -from datasets import get_dataloader, get_dataset -from swin_transformer import build_swin as build_model +from datasets import get_dataloader +from datasets import get_dataset from utils import AverageMeter from utils import WarmupCosineScheduler from utils import get_exclude_from_weight_decay_fn from config import get_config from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from swin_transformer import build_swin as build_model -parser = argparse.ArgumentParser('Swin') -parser.add_argument('-cfg', type=str, default=None) -parser.add_argument('-dataset', type=str, default=None) -parser.add_argument('-batch_size', type=int, default=None) -parser.add_argument('-image_size', type=int, default=None) -parser.add_argument('-data_path', type=str, default=None) -parser.add_argument('-ngpus', type=int, default=None) -parser.add_argument('-pretrained', type=str, default=None) -parser.add_argument('-resume', type=str, default=None) -parser.add_argument('-last_epoch', type=int, default=None) -parser.add_argument('-eval', action='store_true') -arguments = parser.parse_args() - - -log_format = "%(asctime)s %(message)s" -logging.basicConfig(stream=sys.stdout, level=logging.INFO, - format=log_format, datefmt="%m%d %I:%M:%S %p") - -# get default config -config = get_config() -# update config by arguments -config = update_config(config, arguments) - -# set output folder -if not config.EVAL: - config.SAVE = 
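# NOTE (illustrative annotation, not part of the patch): DistillationLoss above combines
# the two terms as
#   loss = (1 - alpha) * base_criterion(outputs, targets)
#          + alpha * distill_term(outputs_kd, teacher_model(inputs))
# where distill_term is a temperature-scaled KL divergence for 'soft' and a cross entropy
# against the teacher's argmax for 'hard'; as far as this diff shows, the Swin training
# scripts themselves do not use it.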
'{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) -else: - config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) - -#config.freeze() - -if not os.path.exists(config.SAVE): - os.makedirs(config.SAVE, exist_ok=True) - -# set logging format -logger = logging.getLogger() -fh = logging.FileHandler(os.path.join(config.SAVE, 'log.txt')) -fh.setFormatter(logging.Formatter(log_format)) -logger.addHandler(fh) -logger.info(f'config= {config}') +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('Swin') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger def train(dataloader, @@ -81,18 +82,28 @@ def train(dataloader, criterion, optimizer, epoch, + total_epochs, total_batch, debug_steps=100, - accum_iter=1): + accum_iter=1, + mixup_fn=None, + amp=False, + local_logger=None, + master_logger=None): """Training for one epoch Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion epoch: int, current epoch - total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info - accum_iter: int, num of iters for accumulating gradients + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None Returns: train_loss_meter.avg train_acc_meter.avg @@ -101,63 +112,120 @@ def train(dataloader, model.train() train_loss_meter = AverageMeter() train_acc_meter = AverageMeter() + master_train_loss_meter = AverageMeter() + master_train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) time_st = time.time() for batch_id, data in enumerate(dataloader): image = data[0] label = 
data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() - output = model(image) - loss = criterion(output, label) - - #NOTE: division may be needed depending on the loss function - # Here no division is needed: - # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' - # - #loss = loss / accum_iter + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() - loss.backward() + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) - if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): - optimizer.step() - optimizer.clear_grad() + batch_size = paddle.to_tensor(image.shape[0]) - pred = F.softmax(output) - acc = paddle.metric.accuracy(pred, label.unsqueeze(1)) + # sync from other gpus for overall loss and acc + master_loss = loss.clone() + master_acc = acc.clone() + master_batch_size = batch_size.clone() + dist.all_reduce(master_loss) + dist.all_reduce(master_acc) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc = master_acc / dist.get_world_size() + master_train_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_train_acc_meter.update(master_acc.numpy()[0], master_batch_size.numpy()[0]) - batch_size = image.shape[0] - train_loss_meter.update(loss.numpy()[0], batch_size) - train_acc_meter.update(acc.numpy()[0], batch_size) + train_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + train_acc_meter.update(acc.numpy()[0], batch_size.numpy()[0]) if batch_id % debug_steps == 0: - logger.info( - f"Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + - f"Step[{batch_id:04d}/{total_batch:04d}], " + - f"Avg Loss: {train_loss_meter.avg:.4f}, " + - f"Avg Acc: {train_acc_meter.avg:.4f}") + if local_logger: + local_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_train_loss_meter.avg:.4f}, " + + f"Avg Acc: {master_train_acc_meter.avg:.4f}") train_time = time.time() - time_st - return train_loss_meter.avg, train_acc_meter.avg, train_time - - -def validate(dataloader, model, criterion, total_batch, debug_steps=100): + return (train_loss_meter.avg, + train_acc_meter.avg, + master_train_loss_meter.avg, + master_train_acc_meter.avg, + train_time) + + +def validate(dataloader, + model, + criterion, + total_batch, + debug_steps=100, + local_logger=None, + master_logger=None): """Validation for whole dataset Args: dataloader: 
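# NOTE (illustrative annotation, not part of the patch): the master_* metrics above are
# per-rank scalars averaged over the process group; a sketch of that pattern with a
# made-up value, assuming a running paddle.distributed group:
#   import paddle
#   import paddle.distributed as dist
#   loss = paddle.to_tensor([0.7])                        # value on this rank
#   master_loss = loss.clone()
#   dist.all_reduce(master_loss)                          # sum across ranks
#   master_loss = master_loss / dist.get_world_size()     # mean across ranks
# The local meters keep the un-reduced values, so log_<rank>.txt and log.txt can differ slightly.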
paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info + debug_steps: int, num of iters to log info, default: 100 + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None Returns: - val_loss_meter.avg - val_acc1_meter.avg - val_acc5_meter.avg - val_time + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + master_val_loss_meter.avg: float, average loss on all processes/gpus + master_val_acc1_meter.avg: float, average top1 accuracy on all processes/gpus + master_val_acc5_meter.avg: float, average top5 accuracy on all processes/gpus + val_time: float, validation time """ model.eval() val_loss_meter = AverageMeter() val_acc1_meter = AverageMeter() val_acc5_meter = AverageMeter() + master_val_loss_meter = AverageMeter() + master_val_acc1_meter = AverageMeter() + master_val_acc5_meter = AverageMeter() time_st = time.time() with paddle.no_grad(): @@ -172,56 +240,140 @@ def validate(dataloader, model, criterion, total_batch, debug_steps=100): acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) - dist.all_reduce(loss) - dist.all_reduce(acc1) - dist.all_reduce(acc5) - loss = loss / dist.get_world_size() - acc1 = acc1 / dist.get_world_size() - acc5 = acc5 / dist.get_world_size() - batch_size = paddle.to_tensor(image.shape[0]) - dist.all_reduce(batch_size) + + master_loss = loss.clone() + master_acc1 = acc1.clone() + master_acc5 = acc5.clone() + master_batch_size = batch_size.clone() + + dist.all_reduce(master_loss) + dist.all_reduce(master_acc1) + dist.all_reduce(master_acc5) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc1 = master_acc1 / dist.get_world_size() + master_acc5 = master_acc5 / dist.get_world_size() + + master_val_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc1_meter.update(master_acc1.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc5_meter.update(master_acc5.numpy()[0], master_batch_size.numpy()[0]) val_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) val_acc1_meter.update(acc1.numpy()[0], batch_size.numpy()[0]) val_acc5_meter.update(acc5.numpy()[0], batch_size.numpy()[0]) if batch_id % debug_steps == 0: - logger.info( - f"Val Step[{batch_id:04d}/{total_batch:04d}], " + - f"Avg Loss: {val_loss_meter.avg:.4f}, " + - f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + - f"Avg Acc@5: {val_acc5_meter.avg:.4f}") - + if local_logger: + local_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {master_val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {master_val_acc5_meter.avg:.4f}") val_time = time.time() - time_st - return val_loss_meter.avg, val_acc1_meter.avg, val_acc5_meter.avg, val_time + return (val_loss_meter.avg, + val_acc1_meter.avg, + val_acc5_meter.avg, + master_val_loss_meter.avg, + master_val_acc1_meter.avg, + master_val_acc5_meter.avg, + 
val_time) def main_worker(*args): - # 0. Preparation + # STEP 0: Preparation + config = args[0] dist.init_parallel_env() last_epoch = config.TRAIN.LAST_EPOCH - world_size = paddle.distributed.get_world_size() - local_rank = paddle.distributed.get_rank() - logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + world_size = dist.get_world_size() + local_rank = dist.get_rank() seed = config.SEED + local_rank paddle.seed(seed) np.random.seed(seed) random.seed(seed) - # 1. Create model + # logger for each process/gpu + local_logger = get_logger( + filename=os.path.join(config.SAVE, 'log_{}.txt'.format(local_rank)), + logger_name='local_logger') + # overall logger + if local_rank == 0: + master_logger = get_logger( + filename=os.path.join(config.SAVE, 'log.txt'), + logger_name='master_logger') + master_logger.info(f'\n{config}') + else: + master_logger = None + local_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + if local_rank == 0: + master_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + + # STEP 1: Create model model = build_model(config) model = paddle.DataParallel(model) - # 2. Create train and val dataloader - dataset_train, dataset_val = args[0], args[1] - dataloader_train = get_dataloader(config, dataset_train, 'train', True) + + # STEP 2: Create train and val dataloader + dataset_train, dataset_val = args[1], args[2] + # Create training dataloader + if not config.EVAL: + dataloader_train = get_dataloader(config, dataset_train, 'train', True) + total_batch_train = len(dataloader_train) + local_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + if local_rank == 0: + master_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + # Create validation dataloader dataloader_val = get_dataloader(config, dataset_val, 'test', True) - total_batch_train = len(dataloader_train) total_batch_val = len(dataloader_val) - logging.info(f'----- Total # of train batch (single gpu): {total_batch_train}') - logging.info(f'----- Total # of val batch (single gpu): {total_batch_val}') - # 3. Define criterion - criterion = nn.CrossEntropyLoss() - # 4. 
Define optimizer and lr_scheduler + local_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + if local_rank == 0: + master_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + scheduler = None if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, @@ -243,7 +395,9 @@ def main_worker(*args): gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, last_epoch=last_epoch) else: - logging.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + local_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") if config.TRAIN.OPTIMIZER.NAME == "SGD": @@ -274,76 +428,120 @@ def main_worker(*args): 'absolute_pos_embed', 'relative_position_bias_table']), ) else: - logging.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + local_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") - # 5. 
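# NOTE (illustrative annotation, not part of the patch): a worked example for the linear
# LR scaling above, with made-up numbers. With BASE_LR=0.001, BATCH_SIZE=128 per GPU,
# world_size=8 and LINEAR_SCALED_LR=512:
#   lr = 0.001 * 128 * 8 / 512 = 0.002
# The warmup start LR and end LR are rescaled the same way, and all three are further
# multiplied by ACCUM_ITER when gradients are accumulated. With the default
# LINEAR_SCALED_LR=None this whole block is skipped.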
Load pretrained model / load resumt model and optimizer states + # STEP 6: Load pretrained model / load resumt model and optimizer states if config.MODEL.PRETRAINED: if (config.MODEL.PRETRAINED).endswith('.pdparams'): raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') model.set_dict(model_state) - logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + local_logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + if local_rank == 0: + master_logger.info( + f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") if config.MODEL.RESUME: - assert os.path.isfile(config.MODEL.RESUME+'.pdparams') is True - assert os.path.isfile(config.MODEL.RESUME+'.pdopt') is True - model_state = paddle.load(config.MODEL.RESUME+'.pdparams') + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') model.set_dict(model_state) opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') optimizer.set_state_dict(opt_state) - logger.info( - f"----- Resume Training: Load model and optmizer states from {config.MODEL.RESUME}") + local_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + if local_rank == 0: + master_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") - # 6. Validation + # STEP 7: Validation (eval mode) if config.EVAL: - logger.info('----- Start Validating') - val_loss, val_acc1, val_acc5, val_time = validate( + local_logger.info('----- Start Validating') + if local_rank == 0: + master_logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=criterion, + criterion=criterion_val, total_batch=total_batch_val, - debug_steps=config.REPORT_FREQ) - logger.info(f"Validation Loss: {val_loss:.4f}, " + - f"Validation Acc@1: {val_acc1:.4f}, " + - f"Validation Acc@5: {val_acc5:.4f}, " + - f"time: {val_time:.2f}") + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") return - # 6. Start training and validation - logging.info(f"Start training from epoch {last_epoch+1}.") + # STEP 8: Start training and validation (train mode) + local_logger.info(f"Start training from epoch {last_epoch+1}.") + if local_rank == 0: + master_logger.info(f"Start training from epoch {last_epoch+1}.") for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): # train - logging.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") - train_loss, train_acc, train_time = train(dataloader=dataloader_train, - model=model, - criterion=criterion, - optimizer=optimizer, - epoch=epoch, - total_batch=total_batch_train, - debug_steps=config.REPORT_FREQ, - accum_iter=config.TRAIN.ACCUM_ITER) + local_logger.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") + if local_rank == 0: + master_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + train_loss, train_acc, avg_loss, avg_acc, train_time = train( + dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=total_batch_train, + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + mixup_fn=mixup_fn, + amp=config.AMP, + local_logger=local_logger, + master_logger=master_logger) + scheduler.step() - logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + - f"Train Loss: {train_loss:.4f}, " + - f"Train Acc: {train_acc:.4f}, " + - f"time: {train_time:.2f}") + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {avg_loss:.4f}, " + + f"Train Acc: {avg_acc:.4f}, " + + f"time: {train_time:.2f}") + # validation if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: - logger.info(f'----- Validation after Epoch: {epoch}') - val_loss, val_acc1, val_acc5, val_time = validate( + local_logger.info(f'----- Validation after Epoch: {epoch}') + if local_rank == 0: + master_logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=criterion, + criterion=criterion_val, total_batch=total_batch_val, - debug_steps=config.REPORT_FREQ) - logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + - f"Validation Loss: {val_loss:.4f}, " + - f"Validation Acc@1: {val_acc1:.4f}, " + - f"Validation Acc@5: {val_acc5:.4f}, " + - f"time: {val_time:.2f}") + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") # model save if local_rank == 0: if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: @@ -351,15 +549,33 @@ def main_worker(*args): config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") paddle.save(model.state_dict(), model_path + '.pdparams') paddle.save(optimizer.state_dict(), model_path + '.pdopt') - logger.info(f"----- Save model: {model_path}.pdparams") - logger.info(f"----- Save optim: {model_path}.pdopt") + master_logger.info(f"----- Save model: {model_path}.pdparams") + master_logger.info(f"----- Save optim: {model_path}.pdopt") def main(): - dataset_train = get_dataset(config, mode='train') + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, 
time.strftime('%Y%m%d-%H-%M-%S')) + + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + + # get dataset and start DDP + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + else: + dataset_train = None dataset_val = get_dataset(config, mode='val') config.NGPUS = len(paddle.static.cuda_places()) if config.NGPUS == -1 else config.NGPUS - dist.spawn(main_worker, args=(dataset_train, dataset_val, ), nprocs=config.NGPUS) + dist.spawn(main_worker, args=(config, dataset_train, dataset_val, ), nprocs=config.NGPUS) if __name__ == "__main__": diff --git a/image_classification/SwinTransformer/main_single_gpu.py b/image_classification/SwinTransformer/main_single_gpu.py index 5f9d9373..922bee47 100644 --- a/image_classification/SwinTransformer/main_single_gpu.py +++ b/image_classification/SwinTransformer/main_single_gpu.py @@ -1,4 +1,3 @@ - # Copyright (c) 2021 PPViT Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); @@ -27,54 +26,54 @@ import paddle.nn.functional as F from datasets import get_dataloader from datasets import get_dataset -from swin_transformer import build_swin as build_model from utils import AverageMeter from utils import WarmupCosineScheduler from utils import get_exclude_from_weight_decay_fn from config import get_config from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from swin_transformer import build_swin as build_model -parser = argparse.ArgumentParser('Swin') -parser.add_argument('-cfg', type=str, default=None) -parser.add_argument('-dataset', type=str, default=None) -parser.add_argument('-batch_size', type=int, default=None) -parser.add_argument('-image_size', type=int, default=None) -parser.add_argument('-data_path', type=str, default=None) -parser.add_argument('-ngpus', type=int, default=None) -parser.add_argument('-pretrained', type=str, default=None) -parser.add_argument('-resume', type=str, default=None) -parser.add_argument('-last_epoch', type=int, default=None) -parser.add_argument('-eval', action='store_true') -args = parser.parse_args() - - -log_format = "%(asctime)s %(message)s" -logging.basicConfig(stream=sys.stdout, level=logging.INFO, - format=log_format, datefmt="%m%d %I:%M:%S %p") - -# get default config -config = get_config() -# update config by arguments -config = update_config(config, args) - -# set output folder -if not config.EVAL: - config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) -else: - config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) - -#config.freeze() - -if not os.path.exists(config.SAVE): - os.makedirs(config.SAVE, exist_ok=True) - -# set logging format -logger = logging.getLogger() -fh = logging.FileHandler(os.path.join(config.SAVE, 'log.txt')) -fh.setFormatter(logging.Formatter(log_format)) -logger.addHandler(fh) -logger.info(f'config= {config}') +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('Swin') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, 
default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger def train(dataloader, @@ -82,56 +81,82 @@ def train(dataloader, criterion, optimizer, epoch, + total_epochs, total_batch, debug_steps=100, - accum_iter=1): + accum_iter=1, + mixup_fn=None, + amp=False, + logger=None): """Training for one epoch Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion epoch: int, current epoch - total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info - accum_iter: int, num of iters for accumulating gradients + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + logger: logger for logging, default: None Returns: - train_loss_meter.avg - train_acc_meter.avg - train_time + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + train_time: float, training time """ model.train() train_loss_meter = AverageMeter() train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) time_st = time.time() for batch_id, data in enumerate(dataloader): image = data[0] label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() - output = model(image) - loss = criterion(output, label) - - #NOTE: division may be needed depending on the loss function - # Here no division is needed: - # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' - #loss = loss / accum_iter - - loss.backward() - - if ((batch_id +1) % accum_iter == 0) or 
(batch_id + 1 == len(dataloader)): - optimizer.step() - optimizer.clear_grad() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() pred = F.softmax(output) - acc = paddle.metric.accuracy(pred, label.unsqueeze(1)) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) batch_size = image.shape[0] train_loss_meter.update(loss.numpy()[0], batch_size) train_acc_meter.update(acc.numpy()[0], batch_size) - if batch_id % debug_steps == 0: + if logger and batch_id % debug_steps == 0: logger.info( - f"Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + f"Step[{batch_id:04d}/{total_batch:04d}], " + f"Avg Loss: {train_loss_meter.avg:.4f}, " + f"Avg Acc: {train_acc_meter.avg:.4f}") @@ -140,19 +165,20 @@ def train(dataloader, return train_loss_meter.avg, train_acc_meter.avg, train_time -def validate(dataloader, model, criterion, total_batch, debug_steps=100): +def validate(dataloader, model, criterion, total_batch, debug_steps=100, logger=None): """Validation for whole dataset Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion - total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + logger: logger for logging, default: None Returns: - val_loss_meter.avg - val_acc1_meter.avg - val_acc5_meter.avg - val_time + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + val_time: float, valitaion time """ model.eval() val_loss_meter = AverageMeter() @@ -177,7 +203,7 @@ def validate(dataloader, model, criterion, total_batch, debug_steps=100): val_acc1_meter.update(acc1.numpy()[0], batch_size) val_acc5_meter.update(acc5.numpy()[0], batch_size) - if batch_id % debug_steps == 0: + if logger and batch_id % debug_steps == 0: logger.info( f"Val Step[{batch_id:04d}/{total_batch:04d}], " + f"Avg Loss: {val_loss_meter.avg:.4f}, " + @@ -189,23 +215,77 @@ def validate(dataloader, model, criterion, total_batch, debug_steps=100): def main(): - # 0. Preparation + # STEP 0: Preparation + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) last_epoch = config.TRAIN.LAST_EPOCH seed = config.SEED paddle.seed(seed) np.random.seed(seed) random.seed(seed) - #paddle.set_device('gpu:0') - # 1. Create model + logger = get_logger(filename=os.path.join(config.SAVE, 'log.txt')) + logger.info(f'\n{config}') + + # STEP 1: Create model model = build_model(config) - # 2. 
Create train and val dataloader - dataset_train = get_dataset(config, mode='train') + + # STEP 2: Create train and val dataloader + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + dataloader_train = get_dataloader(config, dataset_train, 'train', False) dataset_val = get_dataset(config, mode='val') - dataloader_train = get_dataloader(config, dataset_train, 'train', False) dataloader_val = get_dataloader(config, dataset_val, 'val', False) - # 3. Define criterion - criterion = nn.CrossEntropyLoss() - # 4. Define lr_scheduler + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + scheduler = None if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, @@ -214,8 +294,7 @@ def main(): end_lr=config.TRAIN.END_LR, warmup_epochs=config.TRAIN.WARMUP_EPOCHS, total_epochs=config.TRAIN.NUM_EPOCHS, - last_epoch=config.TRAIN.LAST_EPOCH, - ) + last_epoch=config.TRAIN.LAST_EPOCH) elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, T_max=config.TRAIN.NUM_EPOCHS, @@ -227,9 +306,9 @@ def main(): gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, last_epoch=last_epoch) else: - logging.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") - # 5. 
Define optimizer + if config.TRAIN.OPTIMIZER.NAME == "SGD": if config.TRAIN.GRAD_CLIP: clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) @@ -258,52 +337,61 @@ def main(): 'absolute_pos_embed', 'relative_position_bias_table']), ) else: - logging.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") - # 6. Load pretrained model or load resume model and optimizer states + + # STEP 6: Load pretrained model or load resume model and optimizer states if config.MODEL.PRETRAINED: - assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') model.set_dict(model_state) logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") if config.MODEL.RESUME: - assert os.path.isfile(config.MODEL.RESUME+'.pdparams') is True - assert os.path.isfile(config.MODEL.RESUME+'.pdopt') is True - model_state = paddle.load(config.MODEL.RESUME+'.pdparams') + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') model.set_dict(model_state) - opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') + opt_state = paddle.load(config.MODEL.RESUME + '.pdopt') optimizer.set_state_dict(opt_state) logger.info( f"----- Resume: Load model and optmizer from {config.MODEL.RESUME}") - # 7. Validation + + # STEP 7: Validation (eval mode) if config.EVAL: logger.info('----- Start Validating') val_loss, val_acc1, val_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=criterion, + criterion=criterion_val, total_batch=len(dataloader_val), - debug_steps=config.REPORT_FREQ) + debug_steps=config.REPORT_FREQ, + logger=logger) logger.info(f"Validation Loss: {val_loss:.4f}, " + f"Validation Acc@1: {val_acc1:.4f}, " + f"Validation Acc@5: {val_acc5:.4f}, " + f"time: {val_time:.2f}") return - # 8. Start training and validation - logging.info(f"Start training from epoch {last_epoch+1}.") + + # STEP 8: Start training and validation (train mode) + logger.info(f"Start training from epoch {last_epoch+1}.") for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): # train - logging.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + logger.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") train_loss, train_acc, train_time = train(dataloader=dataloader_train, model=model, criterion=criterion, optimizer=optimizer, epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, total_batch=len(dataloader_train), debug_steps=config.REPORT_FREQ, accum_iter=config.TRAIN.ACCUM_ITER, - ) + mixup_fn=mixup_fn, + amp=config.AMP, + logger=logger) scheduler.step() logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + f"Train Loss: {train_loss:.4f}, " + @@ -315,9 +403,10 @@ def main(): val_loss, val_acc1, val_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=criterion, + criterion=criterion_val, total_batch=len(dataloader_val), - debug_steps=config.REPORT_FREQ) + debug_steps=config.REPORT_FREQ, + logger=logger) logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + f"Validation Loss: {val_loss:.4f}, " + f"Validation Acc@1: {val_acc1:.4f}, " + diff --git a/image_classification/SwinTransformer/mixup.py b/image_classification/SwinTransformer/mixup.py new file mode 100644 index 00000000..1d2db493 --- /dev/null +++ b/image_classification/SwinTransformer/mixup.py @@ -0,0 +1,225 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""mixup and cutmix for batch data""" +import numpy as np +import paddle + + +def rand_bbox(image_shape, lam, count=None): + """ CutMix bbox by lam value + Generate 1 random bbox by value lam. lam is the cut size rate. + The cut_size is computed by sqrt(1-lam) * image_size. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + count: int, number of bbox to generate + """ + image_h, image_w = image_shape[-2:] + cut_rate = np.sqrt(1. - lam) + cut_h = int(cut_rate * image_h) + cut_w = int(cut_rate * image_w) + + # get random bbox center + cy = np.random.randint(0, image_h, size=count) + cx = np.random.randint(0, image_w, size=count) + + # get bbox coords + bbox_x1 = np.clip(cx - cut_w // 2, 0, image_w) + bbox_y1 = np.clip(cy - cut_h // 2, 0, image_h) + bbox_x2 = np.clip(cx + cut_w // 2, 0, image_w) + bbox_y2 = np.clip(cy + cut_h // 2, 0, image_h) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # while in pytorch, it will return [] tensor + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def rand_bbox_minmax(image_shape, minmax, count=None): + """ CutMix bbox by min and max value + Generate 1 random bbox by min and max percentage values. + Minmax is a tuple/list of min and max percentage vlaues + applied to the image width and height. 
+ + Args: + image_shape: tuple/list, image height and width + minmax: tuple/list, min and max percentage values of image size + count: int, number of bbox to generate + """ + assert len(minmax) == 2 + image_h, image_w = image_shape[-2:] + min_ratio = minmax[0] + max_ratio = minmax[1] + cut_h = np.random.randint(int(image_h * min_ratio), int(image_h * max_ratio), size=count) + cut_w = np.random.randint(int(image_w * min_ratio), int(image_w * max_ratio), size=count) + + bbox_x1 = np.random.randint(0, image_w - cut_w, size=count) + bbox_y1 = np.random.randint(0, image_h - cut_h, size=count) + bbox_x2 = bbox_x1 + cut_w + bbox_y2 = bbox_y1 + cut_h + + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def cutmix_generate_bbox_adjust_lam(image_shape, lam, minmax=None, correct_lam=True, count=None): + """Generate bbox and apply correction for lambda + If the mimmax is None, apply the standard cutmix by lam value, + If the minmax is set, apply the cutmix by min and max percentage values. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + minmax: tuple/list, min and max percentage values of image size + correct_lam: bool, if True, correct the lam value by the generated bbox + count: int, number of bbox to generate + """ + if minmax is not None: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox_minmax(image_shape, minmax, count) + else: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox(image_shape, lam, count) + + if correct_lam or minmax is not None: + image_h, image_w = image_shape[-2:] + bbox_area = (bbox_y2 - bbox_y1) * (bbox_x2 - bbox_x1) + lam = 1. - bbox_area / float(image_h * image_w) + return (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam + + +def one_hot(x, num_classes, on_value=1., off_value=0.): + """ Generate one-hot vector for label smoothing + Args: + x: tensor, contains label/class indices + num_classes: int, num of classes (len of the one-hot vector) + on_value: float, the vector value at label index, default=1. + off_value: float, the vector value at non-label indices, default=0. + Returns: + one_hot: tensor, tensor with on value at label index and off value + at non-label indices. + """ + x = x.reshape_([-1, 1]) + x_smoothed = paddle.full((x.shape[0], num_classes), fill_value=off_value) + for i in range(x.shape[0]): + x_smoothed[i, x[i]] = on_value + return x_smoothed + + +def mixup_one_hot(label, num_classes, lam=1., smoothing=0.): + """ mixup and label smoothing in batch + label smoothing is firstly applied, then + mixup is applied by mixing the bacth and its flip, + with a mixup rate. + + Args: + label: tensor, label tensor with shape [N], contains the class indices + num_classes: int, num of all classes + lam: float, mixup rate, default=1.0 + smoothing: float, label smoothing rate + """ + off_value = smoothing / num_classes + on_value = 1. 
- smoothing + off_value + y1 = one_hot(label, num_classes, on_value, off_value) + y2 = one_hot(label.flip(axis=[0]), num_classes, on_value, off_value) + return y2 * (1 - lam) + y1 * lam + + +class Mixup: + """Mixup class + Args: + mixup_alpha: float, mixup alpha for beta distribution, default=1.0, + cutmix_alpha: float, cutmix alpha for beta distribution, default=0.0, + cutmix_minmax: list/tuple, min and max value for cutmix ratio, default=None, + prob: float, if random prob < prob, do not use mixup, default=1.0, + switch_prob: float, prob of switching mixup and cutmix, default=0.5, + mode: string, mixup up, now only 'batch' is supported, default='batch', + correct_lam: bool, if True, apply correction of lam, default=True, + label_smoothing: float, label smoothing rate, default=0.1, + num_classes: int, num of classes, default=1000 + """ + def __init__(self, + mixup_alpha=1.0, + cutmix_alpha=0.0, + cutmix_minmax=None, + prob=1.0, + switch_prob=0.5, + mode='batch', + correct_lam=True, + label_smoothing=0.1, + num_classes=1000): + self.mixup_alpha = mixup_alpha + self.cutmix_alpha = cutmix_alpha + self.cutmix_minmax = cutmix_minmax + if cutmix_minmax is not None: + assert len(cutmix_minmax) == 2 + self.cutmix_alpha = 1.0 + self.mix_prob = prob + self.switch_prob = switch_prob + self.label_smoothing = label_smoothing + self.num_classes = num_classes + self.mode = mode + self.correct_lam = correct_lam + assert mode == 'batch', 'Now only batch mode is supported!' + + def __call__(self, x, target): + assert x.shape[0] % 2 == 0, "Batch size should be even" + lam = self._mix_batch(x) + target = mixup_one_hot(target, self.num_classes, lam, self.label_smoothing) + return x, target + + def get_params(self): + """Decide to use cutmix or regular mixup by sampling and + sample lambda for mixup + """ + lam = 1. + use_cutmix = False + use_mixup = np.random.rand() < self.mix_prob + if use_mixup: + if self.mixup_alpha > 0. and self.cutmix_alpha > 0.: + use_cutmix = np.random.rand() < self.switch_prob + alpha = self.cutmix_alpha if use_cutmix else self.mixup_alpha + lam_mix = np.random.beta(alpha, alpha) + elif self.mixup_alpha == 0. and self.cutmix_alpha > 0.: + use_cutmix=True + lam_mix = np.random.beta(self.cutmix_alpha, self.cutmix_alpha) + elif self.mixup_alpha > 0. 
and self.cutmix_alpha == 0.: + lam_mix = np.random.beta(self.mixup_alpha, self.mixup_alpha) + else: + raise ValueError('mixup_alpha and cutmix_alpha cannot be all 0') + lam = float(lam_mix) + return lam, use_cutmix + + def _mix_batch(self, x): + """mixup/cutmix by adding batch data and its flipped version""" + lam, use_cutmix = self.get_params() + if lam == 1.: + return lam + if use_cutmix: + (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam = cutmix_generate_bbox_adjust_lam( + x.shape, + lam, + minmax=self.cutmix_minmax, + correct_lam=self.correct_lam) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # but in pytorch, it will return [] tensor without errors + if int(bbox_x1) != int(bbox_x2) and int(bbox_y1) != int(bbox_y2): + x[:, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] = x.flip(axis=[0])[ + :, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] + else: + x_flipped = x.flip(axis=[0]) + x_flipped = x_flipped * (1 - lam) + x.set_value(x * (lam) + x_flipped) + return lam diff --git a/image_classification/SwinTransformer/port_weights/__init__.py b/image_classification/SwinTransformer/port_weights/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/image_classification/SwinTransformer/random_erasing.py b/image_classification/SwinTransformer/random_erasing.py new file mode 100644 index 00000000..31eea465 --- /dev/null +++ b/image_classification/SwinTransformer/random_erasing.py @@ -0,0 +1,118 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Random Erasing for image tensor""" + +import random +import math +import paddle + + +def _get_pixels(per_pixel, rand_color, patch_size, dtype="float32"): + if per_pixel: + return paddle.normal(shape=patch_size).astype(dtype) + if rand_color: + return paddle.normal(shape=(patch_size[0], 1, 1)).astype(dtype) + return paddle.zeros((patch_size[0], 1, 1)).astype(dtype) + + +class RandomErasing(object): + """ + Args: + prob: probability of performing random erasing + min_area: Minimum percentage of erased area wrt input image area + max_area: Maximum percentage of erased area wrt input image area + min_aspect: Minimum aspect ratio of earsed area + max_aspect: Maximum aspect ratio of earsed area + mode: pixel color mode, in ['const', 'rand', 'pixel'] + 'const' - erase block is constant valued 0 for all channels + 'rand' - erase block is valued random color (same per-channel) + 'pixel' - erase block is vauled random color per pixel + min_count: Minimum # of ereasing blocks per image. + max_count: Maximum # of ereasing blocks per image. 
Area per box is scaled by count + per-image count is randomly chosen between min_count to max_count + """ + def __init__(self, prob=0.5, min_area=0.02, max_area=1/3, min_aspect=0.3, max_aspect=None, + mode='const', min_count=1, max_count=None, num_splits=0): + self.prob = prob + self.min_area = min_area + self.max_area = max_area + max_aspect = max_aspect or 1 / min_aspect + self.log_aspect_ratio = (math.log(min_aspect), math.log(max_aspect)) + self.min_count = min_count + self.max_count = max_count or min_count + self.num_splits = num_splits + mode = mode.lower() + self.rand_color = False + self.per_pixel = False + if mode == "rand": + self.rand_color = True + elif mode == "pixel": + self.per_pixel = True + else: + assert not mode or mode == "const" + + def _erase(self, img, chan, img_h, img_w, dtype): + if random.random() > self.prob: + return + area = img_h * img_w + count = self.min_count if self.min_count == self.max_count else \ + random.randint(self.min_count, self.max_count) + for _ in range(count): + for attempt in range(10): + target_area = random.uniform(self.min_area, self.max_area) * area / count + aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio)) + h = int(round(math.sqrt(target_area * aspect_ratio))) + w = int(round(math.sqrt(target_area / aspect_ratio))) + if w < img_w and h < img_h: + top = random.randint(0, img_h - h) + left = random.randint(0, img_w - w) + img[:, top:top+h, left:left+w] = _get_pixels( + self.per_pixel, self.rand_color, (chan, h, w), + dtype=dtype) + break + + def __call__(self, input): + if len(input.shape) == 3: + self._erase(input, *input.shape, input.dtype) + else: + batch_size, chan, img_h, img_w = input.shape + batch_start = batch_size // self.num_splits if self.num_splits > 1 else 0 + for i in range(batch_start, batch_size): + self._erase(input[i], chan, img_h, img_w, input.dtype) + return input + + + +#def main(): +# re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='rand') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='const') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='pixel') +# import PIL.Image as Image +# import numpy as np +# paddle.set_device('cpu') +# img = paddle.to_tensor(np.asarray(Image.open('./lenna.png'))).astype('float32') +# img = img / 255.0 +# img = paddle.transpose(img, [2, 0, 1]) +# new_img = re(img) +# new_img = new_img * 255.0 +# new_img = paddle.transpose(new_img, [1, 2, 0]) +# new_img = new_img.cpu().numpy() +# new_img = Image.fromarray(new_img.astype('uint8')) +# new_img.save('./res.png') +# +# +# +#if __name__ == "__main__": +# main() diff --git a/image_classification/SwinTransformer/run_eval_multi_tiny.sh b/image_classification/SwinTransformer/run_eval_multi_tiny.sh new file mode 100644 index 00000000..14472fdc --- /dev/null +++ b/image_classification/SwinTransformer/run_eval_multi_tiny.sh @@ -0,0 +1,13 @@ +#CUDA_VISIBLE_DEVICES=0,1,2,3 \ +CUDA_VISIBLE_DEVICES=4,5,6,7 \ +python main_multi_gpu.py \ +-cfg='./configs/swin_tiny_patch4_window7_224.yaml' \ +-dataset='imagenet2012' \ +-batch_size=256 \ +-data_path='/dataset/imagenet' \ +-eval \ +-pretrained='./output/train-20211006-19-17-58/swin-Epoch-298-Loss-3.0057902509114243' \ +#-pretrained='./output/train-20210929-21-17-57/swin-Epoch-286-Loss-3.018214573891194' \ +#-pretrained='./output/train-20210929-21-17-57/swin-Epoch-298-Loss-3.021707329043735' \ +#-pretrained='./output/train-20210929-21-17-57/swin-Epoch-150-Loss-3.256427281403651' \ 
+#-pretrained='./output/train-20210929-21-17-57/swin-Epoch-128-Loss-3.30339895277557' \ diff --git a/image_classification/SwinTransformer/run_train.sh b/image_classification/SwinTransformer/run_train.sh index 016141c3..bfb8b070 100644 --- a/image_classification/SwinTransformer/run_train.sh +++ b/image_classification/SwinTransformer/run_train.sh @@ -1,6 +1,6 @@ CUDA_VISIBLE_DEVICES=0 \ python main_single_gpu.py \ --cfg='./configs/swin_base_patch4_window7_224.yaml' \ +-cfg='./configs/swin_tiny_patch4_window7_224.yaml' \ -dataset='imagenet2012' \ -batch_size=4 \ -data_path='/dataset/imagenet' \ diff --git a/image_classification/SwinTransformer/run_train_multi.sh b/image_classification/SwinTransformer/run_train_multi.sh index ef47eed2..bb9b7f5b 100644 --- a/image_classification/SwinTransformer/run_train_multi.sh +++ b/image_classification/SwinTransformer/run_train_multi.sh @@ -1,7 +1,7 @@ -CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ python main_multi_gpu.py \ --cfg='./configs/swin_base_patch4_window7_224.yaml' \ +-cfg='./configs/swin_tiny_patch4_window7_224.yaml' \ -dataset='imagenet2012' \ -batch_size=16 \ -data_path='/dataset/imagenet' \ --ngpus=8 +#-amp diff --git a/image_classification/SwinTransformer/run_train_multi_base.sh b/image_classification/SwinTransformer/run_train_multi_base.sh new file mode 100644 index 00000000..472a4387 --- /dev/null +++ b/image_classification/SwinTransformer/run_train_multi_base.sh @@ -0,0 +1,7 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ +python main_multi_gpu.py \ +-cfg='./configs/swin_tiny_patch4_window7_224.yaml' \ +-dataset='imagenet2012' \ +-batch_size=64 \ +-data_path='/dataset/imagenet' \ +#-amp diff --git a/image_classification/SwinTransformer/stat_define.py b/image_classification/SwinTransformer/stat_define.py new file mode 100644 index 00000000..8ab389b1 --- /dev/null +++ b/image_classification/SwinTransformer/stat_define.py @@ -0,0 +1,60 @@ +import os +import glob +import paddle +from config import get_config +from swin_transformer import build_swin as build_model + +def count_gelu(layer, input, output): + activation_flops = 8 + x = input[0] + num = x.numel() + layer.total_ops += num * activation_flops + + +def count_softmax(layer, input, output): + softmax_flops = 5 # max/substract, exp, sum, divide + x = input[0] + num = x.numel() + layer.total_ops += num * softmax_flops + + +def count_layernorm(layer, input, output): + layer_norm_flops = 5 # get mean (sum), get variance (square and sum), scale(multiply) + x = input[0] + num = x.numel() + layer.total_ops += num * layer_norm_flops + + +cfg = './configs/swin_tiny_patch4_window7_224.yaml' +input_size = (1, 3, 224, 224) +config = get_config(cfg) +model = build_model(config) + +custom_ops = {paddle.nn.GELU: count_gelu, + paddle.nn.LayerNorm: count_layernorm, + paddle.nn.Softmax: count_softmax, + } +print(os.path.basename(cfg)) +paddle.flops(model, + input_size=input_size, + custom_ops=custom_ops, + print_detail=False) + + +#for cfg in glob.glob('./configs/*.yaml'): +# #cfg = './configs/swin_base_patch4_window7_224.yaml' +# input_size = (1, 3, int(cfg[-8:-5]), int(cfg[-8:-5])) +# config = get_config(cfg) +# model = build_model(config) +# +# +# custom_ops = {paddle.nn.GELU: count_gelu, +# paddle.nn.LayerNorm: count_layernorm, +# paddle.nn.Softmax: count_softmax, +# } +# print(os.path.basename(cfg)) +# paddle.flops(model, +# input_size=input_size, +# custom_ops=custom_ops, +# print_detail=False) +# print('-----------') diff --git 
a/image_classification/SwinTransformer/swin_transformer.py b/image_classification/SwinTransformer/swin_transformer.py index 554d2c0f..2dde9459 100644 --- a/image_classification/SwinTransformer/swin_transformer.py +++ b/image_classification/SwinTransformer/swin_transformer.py @@ -61,7 +61,16 @@ def __init__(self, image_size=224, patch_size=4, in_channels=3, embed_dim=96): out_channels=embed_dim, kernel_size=patch_size, stride=patch_size) - self.norm = nn.LayerNorm(embed_dim) + + w_attr, b_attr = self._init_weights_layernorm() + self.norm = nn.LayerNorm(embed_dim, + weight_attr=w_attr, + bias_attr=b_attr) + + def _init_weights_layernorm(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(1.0)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr def forward(self, x): x = self.patch_embed(x) # [batch, embed_dim, h, w] h,w = patch_resolution @@ -89,8 +98,26 @@ def __init__(self, input_resolution, dim): super(PatchMerging, self).__init__() self.input_resolution = input_resolution self.dim = dim - self.reduction = nn.Linear(4*dim, 2*dim, bias_attr=False) - self.norm = nn.LayerNorm(4*dim) + w_attr_1, b_attr_1 = self._init_weights() + self.reduction = nn.Linear(4 * dim, + 2 * dim, + weight_attr=w_attr_1, + bias_attr=False) + + w_attr_2, b_attr_2 = self._init_weights_layernorm() + self.norm = nn.LayerNorm(4*dim, + weight_attr=w_attr_2, + bias_attr=b_attr_2) + + def _init_weights_layernorm(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(1.0)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr def forward(self, x): h, w = self.input_resolution @@ -141,8 +168,8 @@ def __init__(self, in_features, hidden_features, dropout): self.dropout = nn.Dropout(dropout) def _init_weights(self): - weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.XavierUniform()) - bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Normal(std=1e-6)) + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) return weight_attr, bias_attr def forward(self, x): @@ -194,7 +221,7 @@ def __init__(self, coords_w = paddle.arange(self.window_size[1]) coords = paddle.stack(paddle.meshgrid([coords_h, coords_w])) # [2, window_h, window_w] coords_flatten = paddle.flatten(coords, 1) # [2, window_h * window_w] - # 2, window_h * window_w, window_h * window_h + # 2, window_h * window_w, window_h * window_w relative_coords = coords_flatten.unsqueeze(2) - coords_flatten.unsqueeze(1) # winwod_h*window_w, window_h*window_w, 2 relative_coords = relative_coords.transpose([1, 2, 0]) @@ -205,12 +232,27 @@ def __init__(self, relative_position_index = relative_coords.sum(-1) self.register_buffer("relative_position_index", relative_position_index) - self.qkv = nn.Linear(dim, dim * 3, bias_attr=qkv_bias) + w_attr_1, b_attr_1 = self._init_weights() + self.qkv = nn.Linear(dim, + dim * 3, + weight_attr=w_attr_1, + bias_attr=b_attr_1 if qkv_bias else False) + self.attn_dropout = nn.Dropout(attention_dropout) - self.proj = nn.Linear(dim, dim) + + w_attr_2, b_attr_2 = self._init_weights() + self.proj = nn.Linear(dim, 
+ dim, + weight_attr=w_attr_2, + bias_attr=b_attr_2) self.proj_dropout = nn.Dropout(dropout) self.softmax = nn.Softmax(axis=-1) + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + def transpose_multihead(self, x): new_shape = x.shape[:-1] + [self.num_heads, self.dim_head] x = x.reshape(new_shape) @@ -228,21 +270,21 @@ def get_relative_pos_bias_from_pos_index(self): return relative_position_bias def forward(self, x, mask=None): - qkv = self.qkv(x).chunk(3, axis=-1) - q, k, v = map(self.transpose_multihead, qkv) + qkv = self.qkv(x).chunk(3, axis=-1) # {list:3} + q, k, v = map(self.transpose_multihead, qkv) # [512,3,49,32] -> [128,6,49,32]-> [32,12,49,32]->[8,24,49,32] q = q * self.scale - attn = paddle.matmul(q, k, transpose_y=True) + attn = paddle.matmul(q, k, transpose_y=True) # [512,3,49,49] -> [128,6,49,49] -> [32,12,49,49] -> [8,24,49,49] - relative_position_bias = self.get_relative_pos_bias_from_pos_index() + relative_position_bias = self.get_relative_pos_bias_from_pos_index() #[2401,3]->[2401,6]->[2401,12]->[2401,24] relative_position_bias = relative_position_bias.reshape( [self.window_size[0] * self.window_size[1], self.window_size[0] * self.window_size[1], - -1]) + -1]) # [49,49,3]->[49,49,6]->[49,49,12]->[49,49,24] # nH, window_h*window_w, window_h*window_w - relative_position_bias = relative_position_bias.transpose([2, 0, 1]) - attn = attn + relative_position_bias.unsqueeze(0) + relative_position_bias = relative_position_bias.transpose([2, 0, 1]) # [3,49,49]->[6,49,49]->[12,49,49]->[24,49,49] + attn = attn + relative_position_bias.unsqueeze(0) if mask is not None: nW = mask.shape[0] @@ -254,14 +296,14 @@ def forward(self, x, mask=None): else: attn = self.softmax(attn) - attn = self.attn_dropout(attn) + attn = self.attn_dropout(attn) # [512,3,49,49]->[128,6,49,49]->[32,12,49,49]->[8,24,49,49] - z = paddle.matmul(attn, v) + z = paddle.matmul(attn, v) # [512,3,49,32]->[128,6,49,32]->[32,12,49,32]->[8,24,49,32] z = z.transpose([0, 2, 1, 3]) new_shape = z.shape[:-2] + [self.dim] z = z.reshape(new_shape) z = self.proj(z) - z = self.proj_dropout(z) + z = self.proj_dropout(z) # [512,49,96]->[128,49,192]->[32,49,384]->[8,49,768] return z @@ -276,9 +318,9 @@ def windows_partition(x, window_size): """ B, H, W, C = x.shape - x = x.reshape([B, H//window_size, window_size, W//window_size, window_size, C]) - x = x.transpose([0, 1, 3, 2, 4, 5]) - x = x.reshape([-1, window_size, window_size, C]) #(num_windows*B, window_size, window_size, C) + x = x.reshape([B, H//window_size, window_size, W//window_size, window_size, C]) # [bs,num_window,window_size,num_window,window_size,C] + x = x.transpose([0, 1, 3, 2, 4, 5]) # [bs,num_window,num_window,window_size,window_Size,C] + x = x.reshape([-1, window_size, window_size, C]) #(bs*num_windows,window_size, window_size, C) return x @@ -296,9 +338,9 @@ def windows_reverse(windows, window_size, H, W): """ B = int(windows.shape[0] / (H * W / window_size / window_size)) - x = windows.reshape([B, H // window_size, W // window_size, window_size, window_size, -1]) - x = x.transpose([0, 1, 3, 2, 4, 5]) - x = x.reshape([B, H, W, -1]) + x = windows.reshape([B, H // window_size, W // window_size, window_size, window_size, -1]) # [bs,num_window,num_window,window_size,window_Size,C] + x = x.transpose([0, 1, 3, 2, 4, 5]) # [bs,num_window,window_size,num_window,window_size,C] + x = x.reshape([B, 
H, W, -1]) #(bs,num_windows*window_size, num_windows*window_size, C) return x @@ -309,7 +351,7 @@ class SwinTransformerBlock(nn.Layer): Attributes: dim: int, input dimension (channels) - input_resolution: int, input resoultion + input_resolution: tuple, input resoultion num_heads: int, number of attention heads windos_size: int, window size, default: 7 shift_size: int, shift size for SW-MSA, default: 0 @@ -335,7 +377,11 @@ def __init__(self, dim, input_resolution, num_heads, window_size=7, shift_size=0 self.shift_size = 0 self.window_size = min(self.input_resolution) - self.norm1 = nn.LayerNorm(dim) + w_attr_1, b_attr_1 = self._init_weights_layernorm() + self.norm1 = nn.LayerNorm(dim, + weight_attr=w_attr_1, + bias_attr=b_attr_1) + self.attn = WindowAttention(dim, window_size=(self.window_size, self.window_size), num_heads=num_heads, @@ -344,7 +390,12 @@ def __init__(self, dim, input_resolution, num_heads, window_size=7, shift_size=0 attention_dropout=attention_dropout, dropout=dropout) self.drop_path = DropPath(droppath) if droppath > 0. else None - self.norm2 = nn.LayerNorm(dim) + + w_attr_2, b_attr_2 = self._init_weights_layernorm() + self.norm2 = nn.LayerNorm(dim, + weight_attr=w_attr_2, + bias_attr=b_attr_2) + self.mlp = Mlp(in_features=dim, hidden_features=int(dim*mlp_ratio), dropout=dropout) @@ -378,29 +429,34 @@ def __init__(self, dim, input_resolution, num_heads, window_size=7, shift_size=0 self.register_buffer("attn_mask", attn_mask) + def _init_weights_layernorm(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(1.0)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + def forward(self, x): H, W = self.input_resolution B, L, C = x.shape h = x - x = self.norm1(x) + x = self.norm1(x) # [bs,H*W,C] new_shape = [B, H, W, C] - x = x.reshape(new_shape) + x = x.reshape(new_shape) # [bs,H,W,C] if self.shift_size > 0: shifted_x = paddle.roll(x, shifts=(-self.shift_size, -self.shift_size), - axis=(1, 2)) + axis=(1, 2)) # [bs,H,W,C] else: shifted_x = x - x_windows = windows_partition(shifted_x, self.window_size) - x_windows = x_windows.reshape([-1, self.window_size * self.window_size, C]) + x_windows = windows_partition(shifted_x, self.window_size) # [bs*num_windows,7,7,C] + x_windows = x_windows.reshape([-1, self.window_size * self.window_size, C]) # [bs*num_windows,7*7,C] - attn_windows = self.attn(x_windows, mask=self.attn_mask) - attn_windows = attn_windows.reshape([-1, self.window_size, self.window_size, C]) + attn_windows = self.attn(x_windows, mask=self.attn_mask) # [bs*num_windows,7*7,C] + attn_windows = attn_windows.reshape([-1, self.window_size, self.window_size, C]) # [bs*num_windows,7,7,C] - shifted_x = windows_reverse(attn_windows, self.window_size, H, W) + shifted_x = windows_reverse(attn_windows, self.window_size, H, W) # [bs,H,W,C] # reverse cyclic shift if self.shift_size > 0: @@ -410,15 +466,15 @@ def forward(self, x): else: x = shifted_x - x = x.reshape([B, H*W, C]) + x = x.reshape([B, H*W, C]) # [bs,H*W,C] if self.drop_path is not None: x = h + self.drop_path(x) else: x = h + x - h = x - x = self.norm2(x) - x = self.mlp(x) + h = x # [bs,H*W,C] + x = self.norm2(x) # [bs,H*W,C] + x = self.mlp(x) # [bs,H*W,C] if self.drop_path is not None: x = h + self.drop_path(x) else: @@ -467,9 +523,9 @@ def __init__(self, dim, input_resolution, depth, num_heads, window_size, def forward(self, x): for block in self.blocks: - x = block(x) + x = block(x) # [bs,56*56,96] -> [bs,28*28,96*2] -> 
[bs,14*14,96*4] -> [bs,7*7,96*8] if self.downsample is not None: - x = self.downsample(x) + x = self.downsample(x) # [bs,28*28,96*2] -> [bs,14*14,96*4] -> [bs,7*7,96*8] return x @@ -564,28 +620,46 @@ def __init__(self, ) self.stages.append(stage) - self.norm = nn.LayerNorm(self.num_features) + w_attr_1, b_attr_1 = self._init_weights_layernorm() + self.norm = nn.LayerNorm(self.num_features, + weight_attr=w_attr_1, + bias_attr=b_attr_1) + self.avgpool = nn.AdaptiveAvgPool1D(1) - self.fc = nn.Linear(self.num_features, self.num_classes) + w_attr_2, b_attr_2 = self._init_weights() + self.fc = nn.Linear(self.num_features, + self.num_classes, + weight_attr=w_attr_2, + bias_attr=b_attr_2) + + def _init_weights_layernorm(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(1.0)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr def forward_features(self, x): - x = self.patch_embedding(x) + x = self.patch_embedding(x) # [bs,H*W,96] if self.ape: x = x + self.absolute_positional_embedding - x = self.position_dropout(x) + x = self.position_dropout(x) # [bs,H*W,96] for stage in self.stages: - x = stage(x) + x = stage(x) # [bs,784,192],[bs,196,384],[bs,49,768],[bs,49,768] - x = self.norm(x) + x = self.norm(x) # [bs,49,768] x = x.transpose([0, 2, 1]) - x = self.avgpool(x) - x = x.flatten(1) + x = self.avgpool(x) # [bs,768,1] + x = x.flatten(1) # [bs,768] return x def forward(self, x): - x = self.forward_features(x) - x = self.fc(x) + x = self.forward_features(x) # [bs,768] + x = self.fc(x) # [bs,1000] return x diff --git a/image_classification/SwinTransformer/transforms.py b/image_classification/SwinTransformer/transforms.py new file mode 100644 index 00000000..5a046912 --- /dev/null +++ b/image_classification/SwinTransformer/transforms.py @@ -0,0 +1,14 @@ +import random +import paddle +import paddle.nn +import paddle.vision.transforms as T + + +class RandomHorizontalFlip(): + def __init__(self, p=0.5): + self.p = p + + def __call__(self, image): + if random.random() < self.p: + return T.hflip(image) + return image diff --git a/image_classification/T2T_ViT/README.md b/image_classification/T2T_ViT/README.md index b05e326f..5c0dec48 100644 --- a/image_classification/T2T_ViT/README.md +++ b/image_classification/T2T_ViT/README.md @@ -1,4 +1,4 @@ -# Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet, [arxiv](https://arxiv.org/abs/2106.13797) +# Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet, [arxiv](https://arxiv.org/abs/2101.11986) PaddlePaddle training/validation code and pretrained models for **T2T-ViT**. @@ -14,21 +14,22 @@ This implementation is developed by [PaddleViT](https://github.com/BR-IDL/Paddle ### Update -Update (2021-08-18): Code is released and ported weights are uploaded. +- Update (2021-09-27): Model FLOPs and # params are uploaded. +- Update (2021-08-18): Code is released and ported weights are uploaded. 
## Models Zoo -| Model | Acc@1 | Acc@5 | Image Size | Crop_pct | Interpolation | Link | -|--------------------------------|-------|-------|------------|----------|---------------|--------------| -| t2t_vit_7 | 71.68 | 90.89 | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1YkuPs1ku7B_udydOf_ls1LQvpJDg_c_j/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1jVNsz37gatLCDaOoU3NaMA)(1hpa) | -| t2t_vit_10 | 75.15 | 92.80 | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1H--55RxliMDlOCekn7FpKrHDGsUkyrJZ/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1nbdb4PFMq4nsIp8HrNxLQg)(ixug) | -| t2t_vit_12 | 76.48 | 93.49 | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1stnIwOwaescaEcztaF1QjI4NK4jaqN7P/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1DcMzq9WeSwrS3epv6jKJXw)(qpbb) | -| t2t_vit_14 | 81.50 | 95.67 | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1HSvN3Csgsy7SJbxJYbkzjUx9guftkfZ1/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1wcfh22uopBv7pS7rKcH_iw)(c2u8) | -| t2t_vit_19 | 81.93 | 95.74 | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1eFnhaL6I33pHCQw2BaEE0Oet9CnjmUf_/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1Hpyc5hBYo1zqoXWpryegnw)(4in3) | -| t2t_vit_24 | 82.28 | 95.89 | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1Z7nZCHeFp0AhIkGYcMAFkKdkGN0yXtpv/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1Hpyc5hBYo1zqoXWpryegnw)(4in3) | -| t2t_vit_t_14 | 81.69 | 95.85 | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/16li4voStt_B8eWDXqJt7s20OT_Z8L263/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1Hpyc5hBYo1zqoXWpryegnw)(4in3) | -| t2t_vit_t_19 | 82.44 | 96.08 | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1Ty-42SYOu15Nk8Uo6VRTJ7J0JV_6t7zJ/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1YdQd6l8tj5xMCWvcHWm7sg)(mier) | -| t2t_vit_t_24 | 82.55 | 96.07 | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1cvvXrGr2buB8Np2WlVL7n_F1_CnI1qow/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1BMU3KX_TRmPxQ1jN5cmWhg)(6vxc) | -| t2t_vit_14_384 | 83.34 | 96.50 | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1Yuso8WD7Q8Lu_9I8dTvAvkcXXtPSkmnm/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1AOMhyVRF9zPqJe-lTrd7pw)(r685) | +| Model | Acc@1 | Acc@5 | #Params | FLOPs | Image Size | Crop_pct | Interpolation | Link | +|-------------------------------|-------|-------|---------|--------|------------|----------|---------------|--------------| +| t2t_vit_7 | 71.68 | 90.89 | 4.3M | 1.0G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1YkuPs1ku7B_udydOf_ls1LQvpJDg_c_j/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1jVNsz37gatLCDaOoU3NaMA)(1hpa) | +| t2t_vit_10 | 75.15 | 92.80 | 5.8M | 1.3G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1H--55RxliMDlOCekn7FpKrHDGsUkyrJZ/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1nbdb4PFMq4nsIp8HrNxLQg)(ixug) | +| t2t_vit_12 | 76.48 | 93.49 | 6.9M | 1.5G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1stnIwOwaescaEcztaF1QjI4NK4jaqN7P/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1DcMzq9WeSwrS3epv6jKJXw)(qpbb) | +| t2t_vit_14 | 81.50 | 95.67 | 21.5M | 4.4G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1HSvN3Csgsy7SJbxJYbkzjUx9guftkfZ1/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1wcfh22uopBv7pS7rKcH_iw)(c2u8) | +| t2t_vit_19 | 81.93 | 95.74 | 39.1M | 7.8G | 224 | 0.9 | bicubic | 
[google](https://drive.google.com/file/d/1eFnhaL6I33pHCQw2BaEE0Oet9CnjmUf_/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1Hpyc5hBYo1zqoXWpryegnw)(4in3) | +| t2t_vit_24 | 82.28 | 95.89 | 64.0M | 12.8G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1Z7nZCHeFp0AhIkGYcMAFkKdkGN0yXtpv/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1Hpyc5hBYo1zqoXWpryegnw)(4in3) | +| t2t_vit_t_14 | 81.69 | 95.85 | 21.5M | 4.4G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/16li4voStt_B8eWDXqJt7s20OT_Z8L263/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1Hpyc5hBYo1zqoXWpryegnw)(4in3) | +| t2t_vit_t_19 | 82.44 | 96.08 | 39.1M | 7.9G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1Ty-42SYOu15Nk8Uo6VRTJ7J0JV_6t7zJ/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1YdQd6l8tj5xMCWvcHWm7sg)(mier) | +| t2t_vit_t_24 | 82.55 | 96.07 | 64.0M | 12.9G | 224 | 0.9 | bicubic | [google](https://drive.google.com/file/d/1cvvXrGr2buB8Np2WlVL7n_F1_CnI1qow/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1BMU3KX_TRmPxQ1jN5cmWhg)(6vxc) | +| t2t_vit_14_384 | 83.34 | 96.50 | 21.5M | 13.0G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1Yuso8WD7Q8Lu_9I8dTvAvkcXXtPSkmnm/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1AOMhyVRF9zPqJe-lTrd7pw)(r685) | > *The results are evaluated on ImageNet2012 validation set. ## Notebooks @@ -72,8 +73,8 @@ from t2t_vit import build_t2t_vit as build_model config = get_config('./configs/t2t_vit_7.yaml') # build model model = build_model(config) -# load pretrained weights, .pdparams is NOT needed -model_state_dict = paddle.load('./t2t_vit_7') +# load pretrained weights +model_state_dict = paddle.load('./t2t_vit_7.pdparams') model.set_dict(model_state_dict) ``` @@ -86,12 +87,12 @@ or ```shell CUDA_VISIBLE_DEVICES=0 \ python main_single_gpu.py \ - -cfg='./configs/t2t_vit_7.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/t2t_vit_7.yaml \ + -dataset=imagenet2012 \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/val \ -eval \ - -pretrained='./t2t_vit_7' + -pretrained=/path/to/pretrained/model/t2t_vit_7 # .pdparams is NOT needed ```
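After loading the weights, a quick sanity check (a minimal sketch, not part of the official scripts; it assumes the usage snippet above has already built `model` and restored `./t2t_vit_7.pdparams`) is to push a dummy batch through the network:

```python
# Illustrative only: confirm the restored model produces [batch_size, num_classes] logits
import paddle

model.eval()
dummy = paddle.randn([1, 3, 224, 224])  # input size for t2t_vit_7
with paddle.no_grad():
    logits = model(dummy)
print(logits.shape)  # expect [1, 1000]
```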
@@ -108,12 +109,12 @@ or ```shell CUDA_VISIBLE_DEVICES=0,1,2,3 \ python main_multi_gpu.py \ - -cfg='./configs/t2t_vit_7.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/t2t_vit_7.yaml \ + -dataset=imagenet2012 \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/val \ -eval \ - -pretrained='./t2t_vit_7' + -pretrained=/path/to/pretrained/model/t2t_vit_7 # .pdparams is NOT needed ```
@@ -128,10 +129,10 @@ or ```shell CUDA_VISIBLE_DEVICES=0 \ python main_single_gpu.py \ - -cfg='./configs/t2t_vit_7.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/t2t_vit_7.yaml \ + -dataset=imagenet2012 \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/train ```
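This PR also adds data augmentation utilities for training (`augment.py`, `random_erasing.py`). Below is a rough, illustrative sketch of how they compose, mirroring the `get_train_transforms` helper in `datasets.py`; the erase settings follow the `TRAIN.RANDOM_ERASE_*` defaults added to `config.py`, and the module paths are assumed to be importable from the model folder:

```python
# Sketch: build a training transform with RandAugment (PIL level) and
# RandomErasing (tensor level), as datasets.get_train_transforms does.
from PIL import Image
from paddle.vision import transforms
from augment import rand_augment_policy_original, RandAugment
from random_erasing import RandomErasing

rand_augment = RandAugment(rand_augment_policy_original(magnitude_idx=9), num_layers=2)
train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.05, 1.0), interpolation='bicubic'),
    rand_augment,                                          # applied on the PIL image
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    RandomErasing(prob=0.25, mode='pixel', max_count=1),   # applied on the tensor
])

image = Image.new('RGB', (256, 256))  # stand-in for a real training image
tensor = train_transforms(image)
print(tensor.shape)  # [3, 224, 224]
```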
@@ -148,10 +149,10 @@ or ```shell CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ python main_multi_gpu.py \ - -cfg='./configs/t2t_vit_7.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/t2t_vit_7.yaml \ + -dataset=imagenet2012 \ -batch_size=32 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/train ```
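When mixup/cutmix is enabled (`TRAIN.MIXUP_ALPHA` / `TRAIN.CUTMIX_ALPHA` in `config.py`), the updated training scripts build a `Mixup` callable and pass it into `train()` as `mixup_fn`; the targets it produces are soft label vectors, which is what the new `SoftTargetCrossEntropyLoss` expects. A minimal, illustrative sketch of that data path (hyper-parameter values mirror the config defaults; `mixup.py` and `losses.py` from this PR are assumed importable, and the model forward is left out):

```python
# Sketch: one training step's data path with Mixup + SoftTargetCrossEntropyLoss
import paddle
from mixup import Mixup
from losses import SoftTargetCrossEntropyLoss

mixup_fn = Mixup(mixup_alpha=0.8,       # TRAIN.MIXUP_ALPHA
                 cutmix_alpha=1.0,      # TRAIN.CUTMIX_ALPHA
                 prob=1.0,              # TRAIN.MIXUP_PROB
                 switch_prob=0.5,       # TRAIN.MIXUP_SWITCH_PROB
                 label_smoothing=0.1,   # TRAIN.SMOOTHING
                 num_classes=1000)
criterion = SoftTargetCrossEntropyLoss()

images = paddle.randn([8, 3, 224, 224])           # batch size must be even
labels = paddle.randint(0, 1000, [8])
images, soft_targets = mixup_fn(images, labels)   # soft_targets: [8, 1000]

# logits = model(images)                 # any of the classification models
# loss = criterion(logits, soft_targets)
```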
diff --git a/image_classification/T2T_ViT/__init__.py b/image_classification/T2T_ViT/__init__.py new file mode 100644 index 00000000..e2cbd538 --- /dev/null +++ b/image_classification/T2T_ViT/__init__.py @@ -0,0 +1 @@ +#init diff --git a/image_classification/T2T_ViT/augment.py b/image_classification/T2T_ViT/augment.py new file mode 100644 index 00000000..7a7f081c --- /dev/null +++ b/image_classification/T2T_ViT/augment.py @@ -0,0 +1,285 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Augmentation""" +""" Rand Augmentation """ +# reference: RandAugment: Practical automated data augmentation with a reduced search space +# https://arxiv.org/abs/1909.13719 + +""" Auto Augmentation """ +# reference: AutoAugment: Learning Augmentation Policies from Data +# https://arxiv.org/abs/1805.09501 + +import random +import numpy as np +from PIL import Image, ImageEnhance, ImageOps + + +def auto_augment_policy_original(): + """25 types of augment policies in original paper""" + policy = [ + [('Posterize', 0.4, 8), ('Rotate', 0.6, 9)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + [('Posterize', 0.6, 7), ('Posterize', 0.6, 6)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Equalize', 0.4, 4), ('Rotate', 0.8, 8)], + [('Solarize', 0.6, 3), ('Equalize', 0.6, 7)], + [('Posterize', 0.8, 5), ('Equalize', 1.0, 2)], + [('Rotate', 0.2, 3), ('Solarize', 0.6, 8)], + [('Equalize', 0.6, 8), ('Posterize', 0.4, 6)], + [('Rotate', 0.8, 8), ('Color', 0.4, 0)], + [('Rotate', 0.4, 9), ('Equalize', 0.6, 2)], + [('Equalize', 0.0, 7), ('Equalize', 0.8, 8)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Rotate', 0.8, 8), ('Color', 1.0, 2)], + [('Color', 0.8, 8), ('Solarize', 0.8, 7)], + [('Sharpness', 0.4, 7), ('Invert', 0.6, 8)], + [('ShearX', 0.6, 5), ('Equalize', 1.0, 9)], + [('Color', 0.4, 0), ('Equalize', 0.6, 3)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + ] + policy = [[SubPolicy(*args) for args in subpolicy] for subpolicy in policy] + return policy + + +def rand_augment_policy_original(magnitude_idx=9): + """ + 14 types of augment policies in original paper + Args: + magnitude_idx: M + """ + policy = [ + ('Posterize', 1, magnitude_idx), ('Rotate', 1, magnitude_idx), + ('Solarize', 1, magnitude_idx), ('AutoContrast', 1, magnitude_idx), + ('Equalize', 1, magnitude_idx), ('Contrast', 1, magnitude_idx), + ('Color', 1, magnitude_idx), ('Invert', 1, magnitude_idx), + ('Sharpness', 1, magnitude_idx), ('Brightness', 1, magnitude_idx), + ('ShearX', 1, magnitude_idx), ('ShearY', 1, magnitude_idx), + ('TranslateX', 1, magnitude_idx), ('TranslateY', 1, magnitude_idx), + ] + policy = [SubPolicy(*args) for args in policy] + return policy + + +class AutoAugment(): + 
"""Auto Augment + Randomly choose a tuple of augment ops from a list of policy + Then apply the tuple of augment ops to input image + + Examples: + policy = auto_augment_policy_original() + augment = AutoAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy): + self.policy = policy + + def __call__(self, image, policy_idx=None): + if policy_idx is None: + policy_idx = random.randint(0, len(self.policy) - 1) + + sub_policy = self.policy[policy_idx] + for op in sub_policy: + image = op(image) + return image + + +class RandAugment(): + """Rand Augment + Randomly choose N augment ops from a list of K policies + Then apply the N ops to input image + + Examples: + policy = rand_augment_policy_original(magnitude_idx) + augment = RandAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy, num_layers=2): + """ + Args: + policy: list of SubPolicy + num_layers: int + """ + self.policy = policy + self.num_layers = num_layers + + def __call__(self, image): + selected_idx = np.random.choice(len(self.policy), self.num_layers) + + for policy_idx in selected_idx: + sub_policy = self.policy[policy_idx] + image = sub_policy(image) + return image + + +class SubPolicy: + """Subpolicy + Read augment name and magnitude, apply augment with probability + Args: + op_name: str, augment operation name + prob: float, if prob > random prob, apply augment + magnitude_idx: int, index of magnitude in preset magnitude ranges + """ + + def __init__(self, op_name, prob, magnitude_idx): + # ranges of operations' magnitude + ranges = { + 'ShearX': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'ShearY': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'TranslateX': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'TranslateY': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'Rotate': np.linspace(0, 30, 10), # [-30, 30] (by random negative) + 'Color': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Posterize': np.round(np.linspace(8, 4, 10), 0).astype(np.int), # [0, 4] + 'Solarize': np.linspace(256, 0, 10), # [0, 256] + 'Contrast': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Sharpness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Brightness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'AutoContrast': [0] * 10, # no range + 'Equalize': [0] * 10, # no range + 'Invert': [0] * 10, # no range + } + + # augmentation operations + # Lambda is not pickleable for DDP + # image_ops = { + # 'ShearX': lambda image, magnitude: shear_x(image, magnitude), + # 'ShearY': lambda image, magnitude: shear_y(image, magnitude), + # 'TranslateX': lambda image, magnitude: translate_x(image, magnitude), + # 'TranslateY': lambda image, magnitude: translate_y(image, magnitude), + # 'Rotate': lambda image, magnitude: rotate(image, magnitude), + # 'AutoContrast': lambda image, magnitude: auto_contrast(image, magnitude), + # 'Invert': lambda image, magnitude: invert(image, magnitude), + # 'Equalize': lambda image, magnitude: equalize(image, magnitude), + # 'Solarize': lambda image, magnitude: solarize(image, magnitude), + # 'Posterize': lambda image, magnitude: posterize(image, magnitude), + # 'Contrast': lambda image, magnitude: contrast(image, magnitude), + # 'Color': lambda image, magnitude: color(image, magnitude), + # 'Brightness': lambda image, magnitude: brightness(image, magnitude), + # 'Sharpness': lambda image, magnitude: sharpness(image, 
magnitude), + # } + image_ops = { + 'ShearX': shear_x, + 'ShearY': shear_y, + 'TranslateX': translate_x_relative, + 'TranslateY': translate_y_relative, + 'Rotate': rotate, + 'AutoContrast': auto_contrast, + 'Invert': invert, + 'Equalize': equalize, + 'Solarize': solarize, + 'Posterize': posterize, + 'Contrast': contrast, + 'Color': color, + 'Brightness': brightness, + 'Sharpness': sharpness, + } + + self.prob = prob + self.magnitude = ranges[op_name][magnitude_idx] + self.op = image_ops[op_name] + + def __call__(self, image): + if self.prob > random.random(): + image = self.op(image, self.magnitude) + return image + + +# PIL Image transforms +# https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.transform +def shear_x(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, factor, 0, 0, 1, 0), fillcolor=fillcolor) + + +def shear_y(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, factor, 1, 0), fillcolor=fillcolor) + + +def translate_x_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, pixels, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, pixels), fillcolor=fillcolor) + + +def translate_x_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, magnitude, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, magnitude), fillcolor=fillcolor) + + +def rotate(image, magnitude): + rot = image.convert("RGBA").rotate(magnitude) + return Image.composite(rot, + Image.new('RGBA', rot.size, (128,) * 4), + rot).convert(image.mode) + + +def auto_contrast(image, magnitude=None): + return ImageOps.autocontrast(image) + + +def invert(image, magnitude=None): + return ImageOps.invert(image) + + +def equalize(image, magnitude=None): + return ImageOps.equalize(image) + + +def solarize(image, magnitude): + return ImageOps.solarize(image, magnitude) + + +def posterize(image, magnitude): + return ImageOps.posterize(image, magnitude) + + +def contrast(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Contrast(image).enhance(1 + magnitude) + + +def color(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Color(image).enhance(1 + magnitude) + + +def brightness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Brightness(image).enhance(1 + magnitude) + + +def sharpness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Sharpness(image).enhance(1 + magnitude) + diff --git a/image_classification/T2T_ViT/config.py b/image_classification/T2T_ViT/config.py index 
3cfecca4..c4eba120 100644 --- a/image_classification/T2T_ViT/config.py +++ b/image_classification/T2T_ViT/config.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -17,8 +17,8 @@ Configuration for data, model archtecture, and training, etc. Config can be set by .yaml file or by argparser(limited usage) - """ + import os from yacs.config import CfgNode as CN import yaml @@ -28,13 +28,15 @@ # data settings _C.DATA = CN() -_C.DATA.BATCH_SIZE = 256 #256 # train batch_size for single GPU -_C.DATA.BATCH_SIZE_EVAL = 8 #64 # val batch_size for single GPU +_C.DATA.BATCH_SIZE = 8 #1024 batch_size for single GPU +_C.DATA.BATCH_SIZE_EVAL = 8 #1024 batch_size for single GPU _C.DATA.DATA_PATH = '/dataset/imagenet/' # path to dataset _C.DATA.DATASET = 'imagenet2012' # dataset name -_C.DATA.IMAGE_SIZE = 224 # input image size: 224 for pretrain, 384 for finetune -_C.DATA.CROP_PCT = 0.875 # input image scale ratio, scale is applied before centercrop in eval mode -_C.DATA.NUM_WORKERS = 2 # number of data loading threads +_C.DATA.IMAGE_SIZE = 224 # input image size +_C.DATA.CROP_PCT = 0.9 # input image scale ratio, scale is applied before centercrop in eval mode +_C.DATA.NUM_WORKERS = 4 # number of data loading threads +_C.DATA.IMAGENET_MEAN = [0.485, 0.456, 0.406] # [0.5, 0.5, 0.5] +_C.DATA.IMAGENET_STD = [0.229, 0.224, 0.225] # [0.5, 0.5, 0.5] # model settings _C.MODEL = CN() @@ -43,7 +45,8 @@ _C.MODEL.RESUME = None _C.MODEL.PRETRAINED = None _C.MODEL.NUM_CLASSES = 1000 -_C.MODEL.DROPOUT = 0.1 +_C.MODEL.DROPOUT = 0.0 +_C.MODEL.DROPPATH = 0.1 _C.MODEL.ATTENTION_DROPOUT = 0.0 # transformer settings @@ -59,14 +62,17 @@ # training settings _C.TRAIN = CN() _C.TRAIN.LAST_EPOCH = 0 -_C.TRAIN.NUM_EPOCHS = 300 -_C.TRAIN.WARMUP_EPOCHS = 3 #34 # ~ 10k steps for 4096 batch size -_C.TRAIN.WEIGHT_DECAY = 0.05 #0.3 # 0.0 for finetune -_C.TRAIN.BASE_LR = 0.001 #0.003 for pretrain # 0.03 for finetune -_C.TRAIN.WARMUP_START_LR = 1e-6 #0.0 -_C.TRAIN.END_LR = 5e-4 -_C.TRAIN.GRAD_CLIP = 1.0 -_C.TRAIN.ACCUM_ITER = 2 #1 +_C.TRAIN.NUM_EPOCHS = 310 +_C.TRAIN.WARMUP_EPOCHS = 5 +_C.TRAIN.WEIGHT_DECAY = 3e-2 +_C.TRAIN.BASE_LR = 1e-3 +_C.TRAIN.WARMUP_START_LR = 1e-6 +_C.TRAIN.END_LR = 1e-5 +_C.TRAIN.GRAD_CLIP = None +_C.TRAIN.ACCUM_ITER = 1 +_C.TRAIN.MODEL_EMA = True +_C.TRAIN.MODEL_EMA_DECAY = 0.99996 +_C.TRAIN.LINEAR_SCALED_LR = None _C.TRAIN.LR_SCHEDULER = CN() _C.TRAIN.LR_SCHEDULER.NAME = 'warmupcosine' @@ -80,14 +86,34 @@ _C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999) # for adamW _C.TRAIN.OPTIMIZER.MOMENTUM = 0.9 +# train augmentation +_C.TRAIN.MIXUP_ALPHA = 0.8 # mixup alpha, enabled if >0 +_C.TRAIN.CUTMIX_ALPHA = 1.0 # cutmix alpha, enabled if >0 +_C.TRAIN.CUTMIX_MINMAX = None # cutmix min/max ratio, overrides alpha +_C.TRAIN.MIXUP_PROB = 1.0 # prob of mixup or cutmix when either/both is enabled +_C.TRAIN.MIXUP_SWITCH_PROB = 0.5 # prob of switching cutmix when both mixup and cutmix enabled +_C.TRAIN.MIXUP_MODE = 'batch' # how to apply mixup/cutmix params, per 'batch', 'pair' or 'elem' + +_C.TRAIN.SMOOTHING = 0.1 +_C.TRAIN.COLOR_JITTER = 0.4 # color jitter factor +_C.TRAIN.AUTO_AUGMENT = False #'rand-m9-mstd0.5-inc1' +_C.TRAIN.RAND_AUGMENT = False + +_C.TRAIN.RANDOM_ERASE_PROB = 0.25 # random erase prob +_C.TRAIN.RANDOM_ERASE_MODE = 'pixel' # random erase mode +_C.TRAIN.RANDOM_ERASE_COUNT = 1 # random erase count 
+_C.TRAIN.RANDOM_ERASE_SPLIT = False + + # misc _C.SAVE = "./output" _C.TAG = "default" -_C.SAVE_FREQ = 10 # freq to save chpt -_C.REPORT_FREQ = 100 # freq to logging info -_C.VALIDATE_FREQ = 50 # freq to do validation +_C.SAVE_FREQ = 1 # freq to save chpt +_C.REPORT_FREQ = 50 # freq to logging info +_C.VALIDATE_FREQ = 10 # freq to do validation _C.SEED = 0 _C.EVAL = False # run evaluation only +_C.AMP = False _C.LOCAL_RANK = 0 _C.NGPUS = -1 @@ -121,8 +147,12 @@ def update_config(config, args): config.DATA.BATCH_SIZE = args.batch_size if args.image_size: config.DATA.IMAGE_SIZE = args.image_size + if args.num_classes: + config.MODEL.NUM_CLASSES = args.num_classes if args.data_path: config.DATA.DATA_PATH = args.data_path + if args.output is not None: + config.SAVE = args.output if args.ngpus: config.NGPUS = args.ngpus if args.eval: @@ -134,6 +164,11 @@ def update_config(config, args): config.MODEL.RESUME = args.resume if args.last_epoch: config.TRAIN.LAST_EPOCH = args.last_epoch + if args.amp: # only during training + if config.EVAL is True: + config.AMP = False + else: + config.AMP = True #config.freeze() return config diff --git a/image_classification/T2T_ViT/configs/t2t_vit_14_384.yaml_bk b/image_classification/T2T_ViT/configs/t2t_vit_14_384.yaml_bk new file mode 100644 index 00000000..df83aff5 --- /dev/null +++ b/image_classification/T2T_ViT/configs/t2t_vit_14_384.yaml_bk @@ -0,0 +1,23 @@ +DATA: + IMAGE_SIZE: 384 + CROP_PCT: 1.0 +MODEL: + TYPE: T2T-ViT + NAME: t2t-vit-14-384 + TRANS: + EMBED_DIM: 384 + DEPTH: 14 + NUM_HEADS: 6 + MLP_RATIO: 3.0 + QKV_BIAS: False + QK_SCALE: None +TRAIN: + NUM_EPOCHS: 300 + WARMUP_EPOCHS: 3 + WEIGHT_DECAY: 0.3 + BASE_LR: 0.003 + WARMUP_START_LR: 1e-6 + END_LR: 5e-4 + ACCUM_ITER: 2 + + diff --git a/image_classification/T2T_ViT/configs/t2t_vit_7.yaml b/image_classification/T2T_ViT/configs/t2t_vit_7.yaml index 0ff59bef..74e82823 100644 --- a/image_classification/T2T_ViT/configs/t2t_vit_7.yaml +++ b/image_classification/T2T_ViT/configs/t2t_vit_7.yaml @@ -13,11 +13,11 @@ MODEL: QK_SCALE: None #256 ** -0.5 TRAIN: NUM_EPOCHS: 300 - WARMUP_EPOCHS: 3 - WEIGHT_DECAY: 0.3 - BASE_LR: 0.003 + WARMUP_EPOCHS: 10 + WEIGHT_DECAY: 0.03 + BASE_LR: 1e-3 WARMUP_START_LR: 1e-6 - END_LR: 5e-4 - ACCUM_ITER: 2 + END_LR: 1e-5 + ACCUM_ITER: 1 diff --git a/image_classification/T2T_ViT/datasets.py b/image_classification/T2T_ViT/datasets.py index 78a3db09..7e178b57 100644 --- a/image_classification/T2T_ViT/datasets.py +++ b/image_classification/T2T_ViT/datasets.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
@@ -19,8 +19,20 @@ import os import math -from paddle.io import Dataset, DataLoader, DistributedBatchSampler -from paddle.vision import transforms, datasets, image_load +from PIL import Image +from paddle.io import Dataset +from paddle.io import DataLoader +from paddle.io import DistributedBatchSampler +from paddle.vision import transforms +from paddle.vision import datasets +from paddle.vision import image_load +from augment import auto_augment_policy_original +from augment import AutoAugment +from augment import rand_augment_policy_original +from augment import RandAugment +from transforms import RandomHorizontalFlip +from random_erasing import RandomErasing + class ImageNet2012Dataset(Dataset): """Build ImageNet2012 dataset @@ -60,7 +72,7 @@ def __len__(self): return len(self.label_list) def __getitem__(self, index): - data = image_load(self.img_path_list[index]).convert('RGB') + data = Image.open(self.img_path_list[index]).convert('RGB') data = self.transform(data) label = self.label_list[index] @@ -80,13 +92,36 @@ def get_train_transforms(config): transforms_train: training transforms """ - transforms_train = transforms.Compose([ + aug_op_list = [] + # STEP1: random crop and resize + aug_op_list.append( transforms.RandomResizedCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE), - scale=(0.05, 1.0)), - transforms.ToTensor(), - #transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]), - transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]), - ]) + scale=(0.05, 1.0), interpolation='bicubic')) + # STEP2: auto_augment or color jitter + if config.TRAIN.AUTO_AUGMENT: + policy = auto_augment_policy_original() + auto_augment = AutoAugment(policy) + aug_op_list.append(auto_augment) + elif config.TRAIN.RAND_AUGMENT: + policy = rand_augment_policy_original() + rand_augment = RandAugment(policy) + aug_op_list.append(rand_augment) + else: + jitter = (float(config.TRAIN.COLOR_JITTER), ) * 3 + aug_op_list.append(transforms.ColorJitter(*jitter)) + # STEP3: other ops + aug_op_list.append(transforms.ToTensor()) + aug_op_list.append(transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, + std=config.DATA.IMAGENET_STD)) + # STEP4: random erasing + if config.TRAIN.RANDOM_ERASE_PROB > 0.: + random_erasing = RandomErasing(prob=config.TRAIN.RANDOM_ERASE_PROB, + mode=config.TRAIN.RANDOM_ERASE_MODE, + max_count=config.TRAIN.RANDOM_ERASE_COUNT, + num_splits=config.TRAIN.RANDOM_ERASE_SPLIT) + aug_op_list.append(random_erasing) + # Final: compose transforms and return + transforms_train = transforms.Compose(aug_op_list) return transforms_train @@ -106,11 +141,10 @@ def get_val_transforms(config): scale_size = int(math.floor(config.DATA.IMAGE_SIZE / config.DATA.CROP_PCT)) transforms_val = transforms.Compose([ - transforms.Resize(scale_size, 'bicubic'), + transforms.Resize(scale_size, interpolation='bicubic'), transforms.CenterCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE)), transforms.ToTensor(), - #transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]), - transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]), + transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD), ]) return transforms_val diff --git a/image_classification/T2T_ViT/droppath.py b/image_classification/T2T_ViT/droppath.py index 25b8d5ff..08472aea 100644 --- a/image_classification/T2T_ViT/droppath.py +++ b/image_classification/T2T_ViT/droppath.py @@ -16,6 +16,7 @@ Droppath, reimplement from https://github.com/yueatsprograms/Stochastic_Depth """ +import 
numpy as np import paddle import paddle.nn as nn @@ -43,7 +44,7 @@ def drop_path(self, inputs): shape = (inputs.shape[0], ) + (1, ) * (inputs.ndim - 1) # shape=(N, 1, 1, 1) random_tensor = keep_prob + paddle.rand(shape, dtype=inputs.dtype) random_tensor = random_tensor.floor() # mask - output = inputs.divide(keep_prob) * random_tensor #divide is to keep same output expectation + output = inputs.divide(keep_prob) * random_tensor # divide is to keep same output expectation return output def forward(self, inputs): @@ -53,8 +54,9 @@ def forward(self, inputs): #def main(): # tmp = paddle.to_tensor(np.random.rand(8, 16, 8, 8), dtype='float32') # dp = DropPath(0.5) -# out = dp(tmp) -# print(out) +# for i in range(100): +# out = dp(tmp) +# print(out) # #if __name__ == "__main__": # main() diff --git a/image_classification/T2T_ViT/load_pth_weights/__init__.py b/image_classification/T2T_ViT/load_pth_weights/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/image_classification/T2T_ViT/losses.py b/image_classification/T2T_ViT/losses.py new file mode 100644 index 00000000..082467a3 --- /dev/null +++ b/image_classification/T2T_ViT/losses.py @@ -0,0 +1,123 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +""" Implement Loss functions """ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class LabelSmoothingCrossEntropyLoss(nn.Layer): + """ cross entropy loss for label smoothing + Args: + smoothing: float, smoothing rate + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, target label with shape [N] + Return: + loss: float, cross entropy loss value + """ + def __init__(self, smoothing=0.1): + super().__init__() + assert 0 <= smoothing < 1.0 + self.smoothing = smoothing + self.confidence = 1 - smoothing + + def forward(self, x, target): + log_probs = F.log_softmax(x) # [N, num_classes] + # target_index is used to get prob for each of the N samples + target_index = paddle.zeros([x.shape[0], 2], dtype='int64') # [N, 2] + target_index[:, 0] = paddle.arange(x.shape[0]) + target_index[:, 1] = target + + nll_loss = -log_probs.gather_nd(index=target_index) # index: [N] + smooth_loss = -log_probs.mean(axis=-1) + loss = self.confidence * nll_loss + self.smoothing * smooth_loss + return loss.mean() + + +class SoftTargetCrossEntropyLoss(nn.Layer): + """ cross entropy loss for soft target + Args: + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, soft target with shape [N, num_classes] + Returns: + loss: float, the mean loss value + """ + def __init__(self): + super().__init__() + + def forward(self, x, target): + loss = paddle.sum(-target * F.log_softmax(x, axis=-1), axis=-1) + return loss.mean() + + +class DistillationLoss(nn.Layer): + """Distillation loss function + This layer includes the orginal loss (criterion) and a extra + distillation loss (criterion), which computes the loss with + different type options, between current model and + a teacher model as its supervision. 
+ + Args: + base_criterion: nn.Layer, the original criterion + teacher_model: nn.Layer, the teacher model as supervision + distillation_type: str, one of ['none', 'soft', 'hard'] + alpha: float, ratio of base loss (* (1-alpha)) + and distillation loss (* alpha) + tau: float, temperature in distillation + """ + def __init__(self, + base_criterion, + teacher_model, + distillation_type, + alpha, + tau): + super().__init__() + assert distillation_type in ['none', 'soft', 'hard'] + self.base_criterion = base_criterion + self.teacher_model = teacher_model + self.type = distillation_type + self.alpha = alpha + self.tau = tau + + def forward(self, inputs, outputs, targets): + """ + Args: + inputs: tensor, the original model inputs + outputs: tensor, the outputs of the model + outputs_kd: tensor, the distillation outputs of the model, + this is usually obtained by a separate branch + in the last layer of the model + targets: tensor, the labels for the base criterion + """ + outputs, outputs_kd = outputs[0], outputs[1] + base_loss = self.base_criterion(outputs, targets) + if self.type == 'none': + return base_loss + + with paddle.no_grad(): + teacher_outputs = self.teacher_model(inputs) + + if self.type == 'soft': + distillation_loss = F.kl_div( + F.log_softmax(outputs_kd / self.tau, axis=1), + F.log_softmax(teacher_outputs / self.tau, axis=1), + reduction='sum') * (self.tau * self.tau) / outputs_kd.numel() + elif self.type == 'hard': + distillation_loss = F.cross_entropy(outputs_kd, teacher_outputs.argmax(axis=1)) + + loss = base_loss * (1 - self.alpha) + distillation_loss * self.alpha + return loss + + diff --git a/image_classification/T2T_ViT/main_multi_gpu.py b/image_classification/T2T_ViT/main_multi_gpu.py index 616ee793..59719268 100644 --- a/image_classification/T2T_ViT/main_multi_gpu.py +++ b/image_classification/T2T_ViT/main_multi_gpu.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -12,13 +12,12 @@ # See the License for the specific language governing permissions and # limitations under the License.
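`DistillationLoss` expects the student network to return a pair `(outputs, outputs_kd)` and blends the base loss with a soft (KL) or hard (argmax) distillation term weighted by `alpha`. The following usage sketch is hypothetical: `TinyStudent` and the linear teacher are placeholders only, and it assumes the script is run from `image_classification/T2T_ViT` so that `losses.py` is importable.

```python
# Hedged usage sketch, not part of the patch; the models below are placeholders.
import paddle
import paddle.nn as nn
from losses import DistillationLoss


class TinyStudent(nn.Layer):
    """Placeholder student that returns (cls_logits, distill_logits)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.head = nn.Linear(16, num_classes)
        self.head_kd = nn.Linear(16, num_classes)

    def forward(self, x):
        return self.head(x), self.head_kd(x)


student = TinyStudent()
teacher = nn.Linear(16, 10)   # placeholder teacher, normally a pretrained model

criterion = DistillationLoss(base_criterion=nn.CrossEntropyLoss(),
                             teacher_model=teacher,
                             distillation_type='hard',
                             alpha=0.5,
                             tau=1.0)

images = paddle.randn([4, 16])
targets = paddle.randint(0, 10, [4])
loss = criterion(images, student(images), targets)  # blends CE and distillation
loss.backward()
```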
-"""T2T-ViT Transformer training/validation using multiple GPU """ +"""T2T-ViT training/validation using multiple GPU """ import sys import os import time import logging -import copy import argparse import random import numpy as np @@ -28,51 +27,56 @@ import paddle.distributed as dist from datasets import get_dataloader from datasets import get_dataset -from t2t_vit import build_t2t_vit as build_model from utils import AverageMeter from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn from config import get_config from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from losses import DistillationLoss +from model_ema import ModelEma +from t2t_vit import build_t2t_vit as build_model -parser = argparse.ArgumentParser('T2T-ViT Transformer') -parser.add_argument('-cfg', type=str, default=None) -parser.add_argument('-dataset', type=str, default=None) -parser.add_argument('-batch_size', type=int, default=None) -parser.add_argument('-image_size', type=int, default=None) -parser.add_argument('-data_path', type=str, default=None) -parser.add_argument('-ngpus', type=int, default=None) -parser.add_argument('-pretrained', type=str, default=None) -parser.add_argument('-resume', type=str, default=None) -parser.add_argument('-last_epoch', type=int, default=None) -parser.add_argument('-eval', action='store_true') -arguments = parser.parse_args() - - -log_format = "%(asctime)s %(message)s" -logging.basicConfig(stream=sys.stdout, level=logging.INFO, - format=log_format, datefmt="%m%d %I:%M:%S %p") - -# get default config -config = get_config() -# update config by arguments -config = update_config(config, arguments) - -# set output folder -if not config.EVAL: - config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) -else: - config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) - -if not os.path.exists(config.SAVE): - os.makedirs(config.SAVE, exist_ok=True) - -# set logging format -logger = logging.getLogger() -fh = logging.FileHandler(os.path.join(config.SAVE, 'log.txt')) -fh.setFormatter(logging.Formatter(log_format)) -logger.addHandler(fh) -logger.info(f'config= {config}') +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('T2T-ViT') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + 
logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger def train(dataloader, @@ -80,83 +84,157 @@ def train(dataloader, criterion, optimizer, epoch, + total_epochs, total_batch, debug_steps=100, - accum_iter=1): + accum_iter=1, + model_ema=None, + mixup_fn=None, + amp=False, + local_logger=None, + master_logger=None): """Training for one epoch Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion epoch: int, current epoch - total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info - accum_iter: int, num of iters for accumulating gradients + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + model_ema: ModelEma, model moving average instance + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None Returns: - train_loss_meter.avg - train_acc_meter.avg - train_time + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + master_train_loss_meter.avg: float, average loss on all processes/gpus + master_train_acc_meter.avg: float, average top1 accuracy on all processes/gpus + train_time: float, training time """ model.train() train_loss_meter = AverageMeter() train_acc_meter = AverageMeter() + master_train_loss_meter = AverageMeter() + master_train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) time_st = time.time() for batch_id, data in enumerate(dataloader): image = data[0] label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() - output = model(image) - loss = criterion(output, label) + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() - #NOTE: division may be needed depending on the loss function - # Here no division is needed: - # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' - # - #loss = loss / accum_iter + if model_ema is not None and dist.get_rank() == 0: + model_ema.update(model) - loss.backward() + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, 
label_orig.unsqueeze(1)) - if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): - optimizer.step() - optimizer.clear_grad() + batch_size = paddle.to_tensor(image.shape[0]) - pred = F.softmax(output) - acc = paddle.metric.accuracy(pred, label.unsqueeze(1)) + # sync from other gpus for overall loss and acc + master_loss = loss.clone() + master_acc = acc.clone() + master_batch_size = batch_size.clone() + dist.all_reduce(master_loss) + dist.all_reduce(master_acc) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc = master_acc / dist.get_world_size() + master_train_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_train_acc_meter.update(master_acc.numpy()[0], master_batch_size.numpy()[0]) - batch_size = image.shape[0] - train_loss_meter.update(loss.numpy()[0], batch_size) - train_acc_meter.update(acc.numpy()[0], batch_size) + train_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + train_acc_meter.update(acc.numpy()[0], batch_size.numpy()[0]) if batch_id % debug_steps == 0: - logger.info( - f"Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + - f"Step[{batch_id:04d}/{total_batch:04d}], " + - f"Avg Loss: {train_loss_meter.avg:.4f}, " + - f"Avg Acc: {train_acc_meter.avg:.4f}") + if local_logger: + local_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_train_loss_meter.avg:.4f}, " + + f"Avg Acc: {master_train_acc_meter.avg:.4f}") train_time = time.time() - time_st - return train_loss_meter.avg, train_acc_meter.avg, train_time - - -def validate(dataloader, model, criterion, total_batch, debug_steps=100): + return (train_loss_meter.avg, + train_acc_meter.avg, + master_train_loss_meter.avg, + master_train_acc_meter.avg, + train_time) + + +def validate(dataloader, + model, + criterion, + total_batch, + debug_steps=100, + local_logger=None, + master_logger=None): """Validation for whole dataset Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info + debug_steps: int, num of iters to log info, default: 100 + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None Returns: - val_loss_meter.avg - val_acc1_meter.avg - val_acc5_meter.avg - val_time + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + master_val_loss_meter.avg: float, average loss on all processes/gpus + master_val_acc1_meter.avg: float, average top1 accuracy on all processes/gpus + master_val_acc5_meter.avg: float, average top5 accuracy on all processes/gpus + val_time: float, validation time """ model.eval() val_loss_meter = AverageMeter() val_acc1_meter = AverageMeter() val_acc5_meter = AverageMeter() + master_val_loss_meter = AverageMeter() + master_val_acc1_meter = AverageMeter() + master_val_acc5_meter = AverageMeter() time_st = time.time() with paddle.no_grad(): @@ -171,56 +249,144 @@ def validate(dataloader, model, criterion, 
total_batch, debug_steps=100): acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) - dist.all_reduce(loss) - dist.all_reduce(acc1) - dist.all_reduce(acc5) - loss = loss / dist.get_world_size() - acc1 = acc1 / dist.get_world_size() - acc5 = acc5 / dist.get_world_size() - batch_size = paddle.to_tensor(image.shape[0]) - dist.all_reduce(batch_size) + + master_loss = loss.clone() + master_acc1 = acc1.clone() + master_acc5 = acc5.clone() + master_batch_size = batch_size.clone() + + dist.all_reduce(master_loss) + dist.all_reduce(master_acc1) + dist.all_reduce(master_acc5) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc1 = master_acc1 / dist.get_world_size() + master_acc5 = master_acc5 / dist.get_world_size() + + master_val_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc1_meter.update(master_acc1.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc5_meter.update(master_acc5.numpy()[0], master_batch_size.numpy()[0]) val_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) val_acc1_meter.update(acc1.numpy()[0], batch_size.numpy()[0]) val_acc5_meter.update(acc5.numpy()[0], batch_size.numpy()[0]) if batch_id % debug_steps == 0: - logger.info( - f"Val Step[{batch_id:04d}/{total_batch:04d}], " + - f"Avg Loss: {val_loss_meter.avg:.4f}, " + - f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + - f"Avg Acc@5: {val_acc5_meter.avg:.4f}") - + if local_logger: + local_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {master_val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {master_val_acc5_meter.avg:.4f}") val_time = time.time() - time_st - return val_loss_meter.avg, val_acc1_meter.avg, val_acc5_meter.avg, val_time + return (val_loss_meter.avg, + val_acc1_meter.avg, + val_acc5_meter.avg, + master_val_loss_meter.avg, + master_val_acc1_meter.avg, + master_val_acc5_meter.avg, + val_time) def main_worker(*args): - # 0. Preparation + # STEP 0: Preparation + config = args[0] dist.init_parallel_env() last_epoch = config.TRAIN.LAST_EPOCH - world_size = paddle.distributed.get_world_size() - local_rank = paddle.distributed.get_rank() - logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + world_size = dist.get_world_size() + local_rank = dist.get_rank() seed = config.SEED + local_rank paddle.seed(seed) np.random.seed(seed) random.seed(seed) - # 1. 
Create model + # logger for each process/gpu + local_logger = get_logger( + filename=os.path.join(config.SAVE, 'log_{}.txt'.format(local_rank)), + logger_name='local_logger') + # overall logger + if local_rank == 0: + master_logger = get_logger( + filename=os.path.join(config.SAVE, 'log.txt'), + logger_name='master_logger') + master_logger.info(f'\n{config}') + else: + master_logger = None + local_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + if local_rank == 0: + master_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + + # STEP 1: Create model model = build_model(config) + # define model ema + model_ema = None + if not config.EVAL and config.TRAIN.MODEL_EMA and local_rank == 0: + model_ema = ModelEma(model, decay=config.TRAIN.MODEL_EMA_DECAY) model = paddle.DataParallel(model) - # 2. Create train and val dataloader - dataset_train, dataset_val = args[0], args[1] - dataloader_train = get_dataloader(config, dataset_train, 'train', True) + + # STEP 2: Create train and val dataloader + dataset_train, dataset_val = args[1], args[2] + # Create training dataloader + if not config.EVAL: + dataloader_train = get_dataloader(config, dataset_train, 'train', True) + total_batch_train = len(dataloader_train) + local_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + if local_rank == 0: + master_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + # Create validation dataloader dataloader_val = get_dataloader(config, dataset_val, 'test', True) - total_batch_train = len(dataloader_train) total_batch_val = len(dataloader_val) - logging.info(f'----- Total # of train batch (single gpu): {total_batch_train}') - logging.info(f'----- Total # of val batch (single gpu): {total_batch_val}') - # 3. Define criterion - criterion = nn.CrossEntropyLoss() - # 4. 
Define optimizer and lr_scheduler + local_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + if local_rank == 0: + master_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + scheduler = None if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, @@ -242,7 +408,9 @@ def main_worker(*args): gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, last_epoch=last_epoch) else: - logging.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + local_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") if config.TRAIN.OPTIMIZER.NAME == "SGD": @@ -269,79 +437,131 @@ def main_worker(*args): weight_decay=config.TRAIN.WEIGHT_DECAY, epsilon=config.TRAIN.OPTIMIZER.EPS, grad_clip=clip, - #apply_decay_param_fun=get_exclude_from_weight_decay_fn(['pos_embed', 'cls_token']), + apply_decay_param_fun=get_exclude_from_weight_decay_fn(['cls_token']), ) else: - logging.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + local_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") - # 5. 
Load pretrained model / load resumt model and optimizer states + # STEP 6: Load pretrained model / load resumt model and optimizer states if config.MODEL.PRETRAINED: if (config.MODEL.PRETRAINED).endswith('.pdparams'): raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') model.set_dict(model_state) - logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + local_logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + if local_rank == 0: + master_logger.info( + f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") if config.MODEL.RESUME: - assert os.path.isfile(config.MODEL.RESUME+'.pdparams') is True - assert os.path.isfile(config.MODEL.RESUME+'.pdopt') is True - model_state = paddle.load(config.MODEL.RESUME+'.pdparams') + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') model.set_dict(model_state) opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') optimizer.set_state_dict(opt_state) - logger.info( - f"----- Resume Training: Load model and optmizer states from {config.MODEL.RESUME}") + local_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + if local_rank == 0: + master_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + # load ema model + if model_ema is not None and os.path.isfile(config.MODEL.RESUME + '-EMA.pdparams'): + model_ema_state = paddle.load(config.MODEL.RESUME + '-EMA.pdparams') + model_ema.module.set_state_dict(model_ema_state) + local_logger.info(f'----- Load model ema from {config.MODEL.RESUME}-EMA.pdparams') + if local_rank == 0: + master_logger.info(f'----- Load model ema from {config.MODEL.RESUME}-EMA.pdparams') - # 6. Validation + # STEP 7: Validation (eval mode) if config.EVAL: - logger.info('----- Start Validating') - val_loss, val_acc1, val_acc5, val_time = validate( + local_logger.info('----- Start Validating') + if local_rank == 0: + master_logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=criterion, + criterion=criterion_val, total_batch=total_batch_val, - debug_steps=config.REPORT_FREQ) - logger.info(f"Validation Loss: {val_loss:.4f}, " + - f"Validation Acc@1: {val_acc1:.4f}, " + - f"Validation Acc@5: {val_acc5:.4f}, " + - f"time: {val_time:.2f}") + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") return - # 6. 
Start training and validation - logging.info(f"Start training from epoch {last_epoch+1}.") + # STEP 8: Start training and validation (train mode) + local_logger.info(f"Start training from epoch {last_epoch+1}.") + if local_rank == 0: + master_logger.info(f"Start training from epoch {last_epoch+1}.") for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): # train - logging.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") - train_loss, train_acc, train_time = train(dataloader=dataloader_train, - model=model, - criterion=criterion, - optimizer=optimizer, - epoch=epoch, - total_batch=total_batch_train, - debug_steps=config.REPORT_FREQ, - accum_iter=config.TRAIN.ACCUM_ITER) + local_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + if local_rank == 0: + master_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + train_loss, train_acc, avg_loss, avg_acc, train_time = train( + dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=total_batch_train, + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + model_ema=model_ema, + mixup_fn=mixup_fn, + amp=config.AMP, + local_logger=local_logger, + master_logger=master_logger) + scheduler.step() - logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + - f"Train Loss: {train_loss:.4f}, " + - f"Train Acc: {train_acc:.4f}, " + - f"time: {train_time:.2f}") + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {avg_loss:.4f}, " + + f"Train Acc: {avg_acc:.4f}, " + + f"time: {train_time:.2f}") + # validation if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: - logger.info(f'----- Validation after Epoch: {epoch}') - val_loss, val_acc1, val_acc5, val_time = validate( + local_logger.info(f'----- Validation after Epoch: {epoch}') + if local_rank == 0: + master_logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=criterion, + criterion=criterion_val, total_batch=total_batch_val, - debug_steps=config.REPORT_FREQ) - logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + - f"Validation Loss: {val_loss:.4f}, " + - f"Validation Acc@1: {val_acc1:.4f}, " + - f"Validation Acc@5: {val_acc5:.4f}, " + - f"time: {val_time:.2f}") + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") # model save if local_rank == 0: if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: @@ -349,15 +569,38 @@ def main_worker(*args): config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") paddle.save(model.state_dict(), 
model_path + '.pdparams') paddle.save(optimizer.state_dict(), model_path + '.pdopt') - logger.info(f"----- Save model: {model_path}.pdparams") - logger.info(f"----- Save optim: {model_path}.pdopt") + master_logger.info(f"----- Save model: {model_path}.pdparams") + master_logger.info(f"----- Save optim: {model_path}.pdopt") + if model_ema is not None: + model_ema_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}-EMA") + paddle.save(model_ema.state_dict(), model_ema_path + '.pdparams') + master_logger.info(f"----- Save ema model: {model_ema_path}.pdparams") def main(): - dataset_train = get_dataset(config, mode='train') + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + + # get dataset and start DDP + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + else: + dataset_train = None dataset_val = get_dataset(config, mode='val') config.NGPUS = len(paddle.static.cuda_places()) if config.NGPUS == -1 else config.NGPUS - dist.spawn(main_worker, args=(dataset_train, dataset_val, ), nprocs=config.NGPUS) + dist.spawn(main_worker, args=(config, dataset_train, dataset_val, ), nprocs=config.NGPUS) if __name__ == "__main__": diff --git a/image_classification/T2T_ViT/main_single_gpu.py b/image_classification/T2T_ViT/main_single_gpu.py index 00ed8711..4c68fcef 100644 --- a/image_classification/T2T_ViT/main_single_gpu.py +++ b/image_classification/T2T_ViT/main_single_gpu.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -12,12 +12,13 @@ # See the License for the specific language governing permissions and # limitations under the License. 
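In `main_multi_gpu.py` above, both `train()` and `validate()` report two sets of metrics: the local per-GPU averages and the cross-GPU averages obtained by cloning each scalar, `all_reduce`-ing it, and dividing by the world size. Below is a condensed sketch of that pattern; the helper name `sync_mean` is made up here and does not appear in the patch.

```python
# Minimal sketch (assumptions noted above) of the cross-GPU averaging pattern
# used in main_multi_gpu.py: clone local metrics, all_reduce them, then divide
# by world size so the master logger reports global averages.
import paddle
import paddle.distributed as dist

def sync_mean(value):
    """Average a scalar tensor over all processes; the local value is untouched."""
    synced = value.clone()
    dist.all_reduce(synced)                 # sum over all GPUs
    return synced / dist.get_world_size()   # turn the sum into a mean

# Inside a training step (loss/acc are scalar tensors on this GPU):
# master_loss = sync_mean(loss)
# master_acc = sync_mean(acc)
# master_train_loss_meter.update(master_loss.numpy()[0], batch_size)
```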
-"""T2T-ViT Transformer training/validation using single GPU """ +"""T2T-ViT training/validation using single GPU """ import sys import os import time import logging +import copy import argparse import random import numpy as np @@ -26,53 +27,56 @@ import paddle.nn.functional as F from datasets import get_dataloader from datasets import get_dataset -from t2t_vit import build_t2t_vit as build_model from utils import AverageMeter from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn from config import get_config from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from losses import DistillationLoss +from model_ema import ModelEma +from t2t_vit import build_t2t_vit as build_model -parser = argparse.ArgumentParser('T2T-ViT Transformer') -parser.add_argument('-cfg', type=str, default=None) -parser.add_argument('-dataset', type=str, default=None) -parser.add_argument('-batch_size', type=int, default=None) -parser.add_argument('-image_size', type=int, default=None) -parser.add_argument('-data_path', type=str, default=None) -parser.add_argument('-ngpus', type=int, default=None) -parser.add_argument('-pretrained', type=str, default=None) -parser.add_argument('-resume', type=str, default=None) -parser.add_argument('-last_epoch', type=int, default=None) -parser.add_argument('-eval', action='store_true') -args = parser.parse_args() - - -log_format = "%(asctime)s %(message)s" -logging.basicConfig(stream=sys.stdout, level=logging.INFO, - format=log_format, datefmt="%m%d %I:%M:%S %p") - -# get default config -config = get_config() -# update config by arguments -config = update_config(config, args) - -# set output folder -if not config.EVAL: - config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) -else: - config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) - -#config.freeze() - -if not os.path.exists(config.SAVE): - os.makedirs(config.SAVE, exist_ok=True) - -# set logging format -logger = logging.getLogger() -fh = logging.FileHandler(os.path.join(config.SAVE, 'log.txt')) -fh.setFormatter(logging.Formatter(log_format)) -logger.addHandler(fh) -logger.info(f'config= {config}') +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('T2T-ViT') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + 
logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger def train(dataloader, @@ -80,56 +84,87 @@ def train(dataloader, criterion, optimizer, epoch, + total_epochs, total_batch, debug_steps=100, - accum_iter=1): + accum_iter=1, + model_ema=None, + mixup_fn=None, + amp=False, + logger=None): """Training for one epoch Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion epoch: int, current epoch - total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info - accum_iter: int, num of iters for accumulating gradients + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + model_ema: ModelEma, model moving average instance + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + logger: logger for logging, default: None Returns: - train_loss_meter.avg - train_acc_meter.avg - train_time + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + train_time: float, training time """ model.train() train_loss_meter = AverageMeter() train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) time_st = time.time() for batch_id, data in enumerate(dataloader): image = data[0] label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() - output = model(image) - loss = criterion(output, label) - - #NOTE: division may be needed depending on the loss function - # Here no division is needed: - # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' - #loss = loss / accum_iter - - loss.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() - if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): - optimizer.step() - optimizer.clear_grad() + if model_ema is not None: + model_ema.update(model) pred = F.softmax(output) - acc = paddle.metric.accuracy(pred, label.unsqueeze(1)) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) batch_size = image.shape[0] train_loss_meter.update(loss.numpy()[0], batch_size) train_acc_meter.update(acc.numpy()[0], batch_size) - if batch_id % debug_steps == 0: + if 
logger and batch_id % debug_steps == 0: logger.info( - f"Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + f"Step[{batch_id:04d}/{total_batch:04d}], " + f"Avg Loss: {train_loss_meter.avg:.4f}, " + f"Avg Acc: {train_acc_meter.avg:.4f}") @@ -138,19 +173,20 @@ def train(dataloader, return train_loss_meter.avg, train_acc_meter.avg, train_time -def validate(dataloader, model, criterion, total_batch, debug_steps=100): +def validate(dataloader, model, criterion, total_batch, debug_steps=100, logger=None): """Validation for whole dataset Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion - total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + logger: logger for logging, default: None Returns: - val_loss_meter.avg - val_acc1_meter.avg - val_acc5_meter.avg - val_time + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + val_time: float, valitaion time """ model.eval() val_loss_meter = AverageMeter() @@ -175,7 +211,7 @@ def validate(dataloader, model, criterion, total_batch, debug_steps=100): val_acc1_meter.update(acc1.numpy()[0], batch_size) val_acc5_meter.update(acc5.numpy()[0], batch_size) - if batch_id % debug_steps == 0: + if logger and batch_id % debug_steps == 0: logger.info( f"Val Step[{batch_id:04d}/{total_batch:04d}], " + f"Avg Loss: {val_loss_meter.avg:.4f}, " + @@ -187,24 +223,81 @@ def validate(dataloader, model, criterion, total_batch, debug_steps=100): def main(): - # 0. Preparation + # STEP 0: Preparation + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) last_epoch = config.TRAIN.LAST_EPOCH seed = config.SEED paddle.seed(seed) np.random.seed(seed) random.seed(seed) - #paddle.set_device('gpu:0') - # 1. Create model + logger = get_logger(filename=os.path.join(config.SAVE, 'log.txt')) + logger.info(f'\n{config}') + + # STEP 1: Create model model = build_model(config) - #model = paddle.DataParallel(model) - # 2. Create train and val dataloader - dataset_train = get_dataset(config, mode='train') + # define model ema + model_ema = None + if not config.EVAL and config.TRAIN.MODEL_EMA: + model_ema = ModelEma(model, decay=config.TRAIN.MODEL_EMA_DECAY) + + # STEP 2: Create train and val dataloader + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + dataloader_train = get_dataloader(config, dataset_train, 'train', False) dataset_val = get_dataset(config, mode='val') - dataloader_train = get_dataloader(config, dataset_train, 'train', False) dataloader_val = get_dataloader(config, dataset_val, 'val', False) - # 3. Define criterion - criterion = nn.CrossEntropyLoss() - # 4. 
Define lr_scheduler + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + scheduler = None if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, @@ -213,8 +306,7 @@ def main(): end_lr=config.TRAIN.END_LR, warmup_epochs=config.TRAIN.WARMUP_EPOCHS, total_epochs=config.TRAIN.NUM_EPOCHS, - last_epoch=config.TRAIN.LAST_EPOCH, - ) + last_epoch=config.TRAIN.LAST_EPOCH) elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, T_max=config.TRAIN.NUM_EPOCHS, @@ -226,9 +318,9 @@ def main(): gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, last_epoch=last_epoch) else: - logging.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") - # 5. 
Define optimizer + if config.TRAIN.OPTIMIZER.NAME == "SGD": if config.TRAIN.GRAD_CLIP: clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) @@ -248,58 +340,76 @@ def main(): optimizer = paddle.optimizer.AdamW( parameters=model.parameters(), learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, - weight_decay=config.TRAIN.WEIGHT_DECAY, beta1=config.TRAIN.OPTIMIZER.BETAS[0], beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, epsilon=config.TRAIN.OPTIMIZER.EPS, - grad_clip=clip) + grad_clip=clip, + #apply_decay_param_fun=get_exclude_from_weight_decay_fn([ + # 'absolute_pos_embed', 'relative_position_bias_table']), + ) else: - logging.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") - # 6. Load pretrained model or load resume model and optimizer states + + # STEP 6: Load pretrained model or load resume model and optimizer states if config.MODEL.PRETRAINED: - assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') model.set_dict(model_state) logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") if config.MODEL.RESUME: - assert os.path.isfile(config.MODEL.RESUME+'.pdparams') is True - assert os.path.isfile(config.MODEL.RESUME+'.pdopt') is True - model_state = paddle.load(config.MODEL.RESUME+'.pdparams') + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') model.set_dict(model_state) opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') - optimizer.set_dict(opt_state) + optimizer.set_state_dict(opt_state) logger.info( - f"----- Resume: Load model and optmizer from {config.MODEL.RESUME}") - # 7. Validation + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + # load ema model + if model_ema is not None and os.path.isfile(config.MODEL.RESUME + '-EMA.pdparams'): + model_ema_state = paddle.load(config.MODEL.RESUME + '-EMA.pdparams') + model_ema.module.set_state_dict(model_ema_state) + logger.info(f'----- Load model ema from {config.MODEL.RESUME}-EMA.pdparams') + + # STEP 7: Validation (eval mode) if config.EVAL: logger.info('----- Start Validating') val_loss, val_acc1, val_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=criterion, + criterion=criterion_val, total_batch=len(dataloader_val), - debug_steps=config.REPORT_FREQ) + debug_steps=config.REPORT_FREQ, + logger=logger) logger.info(f"Validation Loss: {val_loss:.4f}, " + f"Validation Acc@1: {val_acc1:.4f}, " + f"Validation Acc@5: {val_acc5:.4f}, " + f"time: {val_time:.2f}") return - # 8. Start training and validation - logging.info(f"Start training from epoch {last_epoch+1}.") + + # STEP 8: Start training and validation (train mode) + logger.info(f"Start training from epoch {last_epoch+1}.") for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): # train - logging.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + logger.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") train_loss, train_acc, train_time = train(dataloader=dataloader_train, model=model, criterion=criterion, optimizer=optimizer, epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, total_batch=len(dataloader_train), debug_steps=config.REPORT_FREQ, accum_iter=config.TRAIN.ACCUM_ITER, - ) + model_ema=model_ema, + mixup_fn=mixup_fn, + amp=config.AMP, + logger=logger) scheduler.step() logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + f"Train Loss: {train_loss:.4f}, " + @@ -311,9 +421,10 @@ def main(): val_loss, val_acc1, val_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=criterion, + criterion=criterion_val, total_batch=len(dataloader_val), - debug_steps=config.REPORT_FREQ) + debug_steps=config.REPORT_FREQ, + logger=logger) logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + f"Validation Loss: {val_loss:.4f}, " + f"Validation Acc@1: {val_acc1:.4f}, " + @@ -327,6 +438,11 @@ def main(): paddle.save(optimizer.state_dict(), model_path + '.pdopt') logger.info(f"----- Save model: {model_path}.pdparams") logger.info(f"----- Save optim: {model_path}.pdopt") + if model_ema is not None: + model_ema_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}-EMA") + paddle.save(model_ema.state_dict(), model_ema_path + '.pdparams') + logger.info(f"----- Save ema model: {model_ema_path}.pdparams") if __name__ == "__main__": diff --git a/image_classification/T2T_ViT/mixup.py b/image_classification/T2T_ViT/mixup.py new file mode 100644 index 00000000..1d2db493 --- /dev/null +++ b/image_classification/T2T_ViT/mixup.py @@ -0,0 +1,225 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""mixup and cutmix for batch data""" +import numpy as np +import paddle + + +def rand_bbox(image_shape, lam, count=None): + """ CutMix bbox by lam value + Generate 1 random bbox by value lam. lam is the cut size rate. + The cut_size is computed by sqrt(1-lam) * image_size. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + count: int, number of bbox to generate + """ + image_h, image_w = image_shape[-2:] + cut_rate = np.sqrt(1. - lam) + cut_h = int(cut_rate * image_h) + cut_w = int(cut_rate * image_w) + + # get random bbox center + cy = np.random.randint(0, image_h, size=count) + cx = np.random.randint(0, image_w, size=count) + + # get bbox coords + bbox_x1 = np.clip(cx - cut_w // 2, 0, image_w) + bbox_y1 = np.clip(cy - cut_h // 2, 0, image_h) + bbox_x2 = np.clip(cx + cut_w // 2, 0, image_w) + bbox_y2 = np.clip(cy + cut_h // 2, 0, image_h) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # while in pytorch, it will return [] tensor + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def rand_bbox_minmax(image_shape, minmax, count=None): + """ CutMix bbox by min and max value + Generate 1 random bbox by min and max percentage values. 
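For orientation, the relationship used by `rand_bbox` and the lambda correction applied later in `cutmix_generate_bbox_adjust_lam` can be checked numerically: the cut edge is `sqrt(1 - lam) * image_size`, and after the box is clipped to the image, lam is recomputed from the remaining area. The sizes below are hypothetical and the snippet is numpy only, not part of the patch.

```python
import numpy as np

image_h, image_w = 224, 224
lam = 0.7                                   # keep roughly 70% of the original image
cut_h = int(np.sqrt(1.0 - lam) * image_h)   # ~122 pixels
cut_w = int(np.sqrt(1.0 - lam) * image_w)

cy, cx = 10, 10                             # a center near the border forces clipping
y1, y2 = np.clip(cy - cut_h // 2, 0, image_h), np.clip(cy + cut_h // 2, 0, image_h)
x1, x2 = np.clip(cx - cut_w // 2, 0, image_w), np.clip(cx + cut_w // 2, 0, image_w)

corrected_lam = 1.0 - ((y2 - y1) * (x2 - x1)) / float(image_h * image_w)
print(cut_h, cut_w, corrected_lam)  # corrected lam > 0.7 because the box was clipped
```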
+ Minmax is a tuple/list of min and max percentage vlaues + applied to the image width and height. + + Args: + image_shape: tuple/list, image height and width + minmax: tuple/list, min and max percentage values of image size + count: int, number of bbox to generate + """ + assert len(minmax) == 2 + image_h, image_w = image_shape[-2:] + min_ratio = minmax[0] + max_ratio = minmax[1] + cut_h = np.random.randint(int(image_h * min_ratio), int(image_h * max_ratio), size=count) + cut_w = np.random.randint(int(image_w * min_ratio), int(image_w * max_ratio), size=count) + + bbox_x1 = np.random.randint(0, image_w - cut_w, size=count) + bbox_y1 = np.random.randint(0, image_h - cut_h, size=count) + bbox_x2 = bbox_x1 + cut_w + bbox_y2 = bbox_y1 + cut_h + + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def cutmix_generate_bbox_adjust_lam(image_shape, lam, minmax=None, correct_lam=True, count=None): + """Generate bbox and apply correction for lambda + If the mimmax is None, apply the standard cutmix by lam value, + If the minmax is set, apply the cutmix by min and max percentage values. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + minmax: tuple/list, min and max percentage values of image size + correct_lam: bool, if True, correct the lam value by the generated bbox + count: int, number of bbox to generate + """ + if minmax is not None: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox_minmax(image_shape, minmax, count) + else: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox(image_shape, lam, count) + + if correct_lam or minmax is not None: + image_h, image_w = image_shape[-2:] + bbox_area = (bbox_y2 - bbox_y1) * (bbox_x2 - bbox_x1) + lam = 1. - bbox_area / float(image_h * image_w) + return (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam + + +def one_hot(x, num_classes, on_value=1., off_value=0.): + """ Generate one-hot vector for label smoothing + Args: + x: tensor, contains label/class indices + num_classes: int, num of classes (len of the one-hot vector) + on_value: float, the vector value at label index, default=1. + off_value: float, the vector value at non-label indices, default=0. + Returns: + one_hot: tensor, tensor with on value at label index and off value + at non-label indices. + """ + x = x.reshape_([-1, 1]) + x_smoothed = paddle.full((x.shape[0], num_classes), fill_value=off_value) + for i in range(x.shape[0]): + x_smoothed[i, x[i]] = on_value + return x_smoothed + + +def mixup_one_hot(label, num_classes, lam=1., smoothing=0.): + """ mixup and label smoothing in batch + label smoothing is firstly applied, then + mixup is applied by mixing the bacth and its flip, + with a mixup rate. + + Args: + label: tensor, label tensor with shape [N], contains the class indices + num_classes: int, num of all classes + lam: float, mixup rate, default=1.0 + smoothing: float, label smoothing rate + """ + off_value = smoothing / num_classes + on_value = 1. 
- smoothing + off_value + y1 = one_hot(label, num_classes, on_value, off_value) + y2 = one_hot(label.flip(axis=[0]), num_classes, on_value, off_value) + return y2 * (1 - lam) + y1 * lam + + +class Mixup: + """Mixup class + Args: + mixup_alpha: float, mixup alpha for beta distribution, default=1.0, + cutmix_alpha: float, cutmix alpha for beta distribution, default=0.0, + cutmix_minmax: list/tuple, min and max value for cutmix ratio, default=None, + prob: float, if random prob < prob, do not use mixup, default=1.0, + switch_prob: float, prob of switching mixup and cutmix, default=0.5, + mode: string, mixup up, now only 'batch' is supported, default='batch', + correct_lam: bool, if True, apply correction of lam, default=True, + label_smoothing: float, label smoothing rate, default=0.1, + num_classes: int, num of classes, default=1000 + """ + def __init__(self, + mixup_alpha=1.0, + cutmix_alpha=0.0, + cutmix_minmax=None, + prob=1.0, + switch_prob=0.5, + mode='batch', + correct_lam=True, + label_smoothing=0.1, + num_classes=1000): + self.mixup_alpha = mixup_alpha + self.cutmix_alpha = cutmix_alpha + self.cutmix_minmax = cutmix_minmax + if cutmix_minmax is not None: + assert len(cutmix_minmax) == 2 + self.cutmix_alpha = 1.0 + self.mix_prob = prob + self.switch_prob = switch_prob + self.label_smoothing = label_smoothing + self.num_classes = num_classes + self.mode = mode + self.correct_lam = correct_lam + assert mode == 'batch', 'Now only batch mode is supported!' + + def __call__(self, x, target): + assert x.shape[0] % 2 == 0, "Batch size should be even" + lam = self._mix_batch(x) + target = mixup_one_hot(target, self.num_classes, lam, self.label_smoothing) + return x, target + + def get_params(self): + """Decide to use cutmix or regular mixup by sampling and + sample lambda for mixup + """ + lam = 1. + use_cutmix = False + use_mixup = np.random.rand() < self.mix_prob + if use_mixup: + if self.mixup_alpha > 0. and self.cutmix_alpha > 0.: + use_cutmix = np.random.rand() < self.switch_prob + alpha = self.cutmix_alpha if use_cutmix else self.mixup_alpha + lam_mix = np.random.beta(alpha, alpha) + elif self.mixup_alpha == 0. and self.cutmix_alpha > 0.: + use_cutmix=True + lam_mix = np.random.beta(self.cutmix_alpha, self.cutmix_alpha) + elif self.mixup_alpha > 0. 
and self.cutmix_alpha == 0.: + lam_mix = np.random.beta(self.mixup_alpha, self.mixup_alpha) + else: + raise ValueError('mixup_alpha and cutmix_alpha cannot be all 0') + lam = float(lam_mix) + return lam, use_cutmix + + def _mix_batch(self, x): + """mixup/cutmix by adding batch data and its flipped version""" + lam, use_cutmix = self.get_params() + if lam == 1.: + return lam + if use_cutmix: + (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam = cutmix_generate_bbox_adjust_lam( + x.shape, + lam, + minmax=self.cutmix_minmax, + correct_lam=self.correct_lam) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # but in pytorch, it will return [] tensor without errors + if int(bbox_x1) != int(bbox_x2) and int(bbox_y1) != int(bbox_y2): + x[:, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] = x.flip(axis=[0])[ + :, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] + else: + x_flipped = x.flip(axis=[0]) + x_flipped = x_flipped * (1 - lam) + x.set_value(x * (lam) + x_flipped) + return lam diff --git a/image_classification/T2T_ViT/model_ema.py b/image_classification/T2T_ViT/model_ema.py new file mode 100644 index 00000000..d12383b2 --- /dev/null +++ b/image_classification/T2T_ViT/model_ema.py @@ -0,0 +1,62 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" Implement the Exponential Model Averaging +This is paddle hack from: +https://github.com/rwightman/pytorch-image-models/blob/master/timm/utils/model_ema.py +""" + +import copy +from collections import OrderedDict +import paddle +import paddle.nn as nn + + +class ModelEma: + """Model Ema + A moving average is kept of model weights and buffers. + Note that for multiple gpu, ema must be defined after mode init, + but before DataParallel. 
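The `ModelEma` class whose docstring continues below applies the standard exponential moving average update to every parameter and buffer. A tiny plain-Python illustration of that update rule, with hypothetical numbers (not part of the patch):

```python
# Illustration only: the rule ModelEma.update applies per parameter and buffer is
#     ema_new = decay * ema_old + (1 - decay) * current_value
decay = 0.999
ema_value, param_value = 1.0, 0.0

for step in range(1000):
    ema_value = decay * ema_value + (1 - decay) * param_value

print(ema_value)  # ~0.368 after 1000 steps: the EMA tracks the raw weight slowly
```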
+ + Args: + model: nn.Layer, original modela with learnable params + decay: float, decay rate for each update, default: 0.999 + """ + def __init__(self, model, decay=0.999): + self.module = copy.deepcopy(model) + self.module.eval() + self.module.to('cpu') + self.decay = decay + + @paddle.no_grad() + def _update(self, model, update_fn): + # update ema model parameters by model parameters + for (_, ema_param), (_, model_param) in zip( + self.module.named_parameters(), model.named_parameters()): + ema_param.set_value(copy.deepcopy(update_fn(ema_param, model_param))) + + # update ema model buffers by model buffers + for (_, ema_buf), (_, model_buf) in zip( + self.module.named_buffers(), model.named_buffers()): + ema_buf.set_value(copy.deepcopy(update_fn(ema_buf, model_buf))) + + def update(self, model): + self._update(model, update_fn=lambda e, m: self.decay * e + (1 - self.decay) * m) + + def set(self, model): + self._update(model, update_fn=lambda e, m: m) + + def state_dict(self): + return self.module.state_dict() + diff --git a/image_classification/T2T_ViT/random_erasing.py b/image_classification/T2T_ViT/random_erasing.py new file mode 100644 index 00000000..31eea465 --- /dev/null +++ b/image_classification/T2T_ViT/random_erasing.py @@ -0,0 +1,118 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Random Erasing for image tensor""" + +import random +import math +import paddle + + +def _get_pixels(per_pixel, rand_color, patch_size, dtype="float32"): + if per_pixel: + return paddle.normal(shape=patch_size).astype(dtype) + if rand_color: + return paddle.normal(shape=(patch_size[0], 1, 1)).astype(dtype) + return paddle.zeros((patch_size[0], 1, 1)).astype(dtype) + + +class RandomErasing(object): + """ + Args: + prob: probability of performing random erasing + min_area: Minimum percentage of erased area wrt input image area + max_area: Maximum percentage of erased area wrt input image area + min_aspect: Minimum aspect ratio of earsed area + max_aspect: Maximum aspect ratio of earsed area + mode: pixel color mode, in ['const', 'rand', 'pixel'] + 'const' - erase block is constant valued 0 for all channels + 'rand' - erase block is valued random color (same per-channel) + 'pixel' - erase block is vauled random color per pixel + min_count: Minimum # of ereasing blocks per image. + max_count: Maximum # of ereasing blocks per image. 
Area per box is scaled by count + per-image count is randomly chosen between min_count to max_count + """ + def __init__(self, prob=0.5, min_area=0.02, max_area=1/3, min_aspect=0.3, max_aspect=None, + mode='const', min_count=1, max_count=None, num_splits=0): + self.prob = prob + self.min_area = min_area + self.max_area = max_area + max_aspect = max_aspect or 1 / min_aspect + self.log_aspect_ratio = (math.log(min_aspect), math.log(max_aspect)) + self.min_count = min_count + self.max_count = max_count or min_count + self.num_splits = num_splits + mode = mode.lower() + self.rand_color = False + self.per_pixel = False + if mode == "rand": + self.rand_color = True + elif mode == "pixel": + self.per_pixel = True + else: + assert not mode or mode == "const" + + def _erase(self, img, chan, img_h, img_w, dtype): + if random.random() > self.prob: + return + area = img_h * img_w + count = self.min_count if self.min_count == self.max_count else \ + random.randint(self.min_count, self.max_count) + for _ in range(count): + for attempt in range(10): + target_area = random.uniform(self.min_area, self.max_area) * area / count + aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio)) + h = int(round(math.sqrt(target_area * aspect_ratio))) + w = int(round(math.sqrt(target_area / aspect_ratio))) + if w < img_w and h < img_h: + top = random.randint(0, img_h - h) + left = random.randint(0, img_w - w) + img[:, top:top+h, left:left+w] = _get_pixels( + self.per_pixel, self.rand_color, (chan, h, w), + dtype=dtype) + break + + def __call__(self, input): + if len(input.shape) == 3: + self._erase(input, *input.shape, input.dtype) + else: + batch_size, chan, img_h, img_w = input.shape + batch_start = batch_size // self.num_splits if self.num_splits > 1 else 0 + for i in range(batch_start, batch_size): + self._erase(input[i], chan, img_h, img_w, input.dtype) + return input + + + +#def main(): +# re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='rand') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='const') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='pixel') +# import PIL.Image as Image +# import numpy as np +# paddle.set_device('cpu') +# img = paddle.to_tensor(np.asarray(Image.open('./lenna.png'))).astype('float32') +# img = img / 255.0 +# img = paddle.transpose(img, [2, 0, 1]) +# new_img = re(img) +# new_img = new_img * 255.0 +# new_img = paddle.transpose(new_img, [1, 2, 0]) +# new_img = new_img.cpu().numpy() +# new_img = Image.fromarray(new_img.astype('uint8')) +# new_img.save('./res.png') +# +# +# +#if __name__ == "__main__": +# main() diff --git a/image_classification/T2T_ViT/run_train.sh b/image_classification/T2T_ViT/run_train.sh index 65ee7da8..e1799e57 100644 --- a/image_classification/T2T_ViT/run_train.sh +++ b/image_classification/T2T_ViT/run_train.sh @@ -4,3 +4,4 @@ python main_single_gpu.py \ -dataset='imagenet2012' \ -batch_size=8 \ -data_path='/dataset/imagenet' \ +-amp diff --git a/image_classification/T2T_ViT/run_train_multi.sh b/image_classification/T2T_ViT/run_train_multi.sh index cbdcb75a..c6d502b5 100644 --- a/image_classification/T2T_ViT/run_train_multi.sh +++ b/image_classification/T2T_ViT/run_train_multi.sh @@ -4,3 +4,4 @@ python main_multi_gpu.py \ -dataset='imagenet2012' \ -batch_size=16 \ -data_path='/dataset/imagenet' \ +#-amp diff --git a/image_classification/T2T_ViT/stat.py b/image_classification/T2T_ViT/stat.py new file mode 100644 index 00000000..514bc66e --- /dev/null +++ b/image_classification/T2T_ViT/stat.py @@ -0,0 +1,64 @@ 
+import os +import glob +import paddle +from config import get_config +from t2t_vit import build_t2t_vit as build_model + +def count_gelu(layer, input, output): + activation_flops = 8 + x = input[0] + num = x.numel() + layer.total_ops += num * activation_flops + + +def count_softmax(layer, input, output): + softmax_flops = 5 # max/substract, exp, sum, divide + x = input[0] + num = x.numel() + layer.total_ops += num * softmax_flops + + +def count_layernorm(layer, input, output): + layer_norm_flops = 5 # get mean (sum), get variance (square and sum), scale(multiply) + x = input[0] + num = x.numel() + layer.total_ops += num * layer_norm_flops + + +for cfg in sorted(glob.glob('./configs/*.yaml')): + #cfg = './configs/pvtv2_b0.yaml' + #input_size = (1, 3, 512, 512) + #input_size = (1, 3, 448, 448) + #input_size = (1, 3, 384, 384) + input_size = (1, 3, 224, 224) + config = get_config(cfg) + model = build_model(config) + + custom_ops = {paddle.nn.GELU: count_gelu, + paddle.nn.LayerNorm: count_layernorm, + paddle.nn.Softmax: count_softmax, + } + print(os.path.basename(cfg)) + paddle.flops(model, + input_size=input_size, + custom_ops=custom_ops, + print_detail=False) + + +#for cfg in glob.glob('./configs/*.yaml'): +# #cfg = './configs/swin_base_patch4_window7_224.yaml' +# input_size = (1, 3, int(cfg[-8:-5]), int(cfg[-8:-5])) +# config = get_config(cfg) +# model = build_model(config) +# +# +# custom_ops = {paddle.nn.GELU: count_gelu, +# paddle.nn.LayerNorm: count_layernorm, +# paddle.nn.Softmax: count_softmax, +# } +# print(os.path.basename(cfg)) +# paddle.flops(model, +# input_size=input_size, +# custom_ops=custom_ops, +# print_detail=False) +# print('-----------') diff --git a/image_classification/T2T_ViT/t2t_vit.py b/image_classification/T2T_ViT/t2t_vit.py index 030dfe8c..549d13c8 100644 --- a/image_classification/T2T_ViT/t2t_vit.py +++ b/image_classification/T2T_ViT/t2t_vit.py @@ -18,11 +18,11 @@ import copy import math -#from scipy.stats import ortho_group import numpy as np import paddle import paddle.nn as nn from droppath import DropPath +from utils import orthogonal class Identity(nn.Layer): @@ -76,7 +76,11 @@ def __init__(self, num_heads=1, mlp_ratio=1.0) - self.proj = nn.Linear(token_dim * 3 * 3, embed_dim) + w_attr_1, b_attr_1 = self._init_weights() # init for linear + self.proj = nn.Linear(token_dim * 3 * 3, + embed_dim, + weight_attr=w_attr_1, + bias_attr=b_attr_1) elif token_type == 'performer': # paddle v 2.1 has bugs on nn.Unfold, @@ -93,7 +97,11 @@ def __init__(self, in_dim=token_dim, kernel_ratio=0.5) - self.proj = nn.Linear(token_dim * 3 * 3, embed_dim) + w_attr_1, b_attr_1 = self._init_weights() # init for linear + self.proj = nn.Linear(token_dim * 3 * 3, + embed_dim, + weight_attr=w_attr_1, + bias_attr=b_attr_1) elif token_type == 'convolution': # NOTE: currently not supported!!! # 1st conv @@ -120,6 +128,11 @@ def __init__(self, # 3 soft splits, each has stride 4, 2, 2, respectively. 
self.num_patches = (image_size // (4 * 2 * 2)) * (image_size // (4 * 2 * 2)) + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + def forward(self, x): # x = self.soft_split0(x) # input x: [B, C, IMAGE_H, IMAGE_W] @@ -182,8 +195,8 @@ def __init__(self, in_features, hidden_features=None, out_features=None, dropout self.dropout = nn.Dropout(dropout) def _init_weights(self): - weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.XavierUniform()) - bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Normal(std=1e-6)) + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) return weight_attr, bias_attr def forward(self, x): @@ -223,16 +236,29 @@ def __init__(self, self.dim_head = dim // num_heads self.scale = qk_scale or self.dim_head ** -0.5 # same as original repo - self.qkv = nn.Linear(dim, self.in_dim * 3, bias_attr=qkv_bias) + w_attr_1, b_attr_1 = self._init_weights() # init for linear + self.qkv = nn.Linear(dim, + self.in_dim * 3, + weight_attr=w_attr_1, + bias_attr=b_attr_1 if qkv_bias else False) self.attn_dropout = nn.Dropout(attention_dropout) - self.proj = nn.Linear(self.in_dim, self.in_dim) + w_attr_2, b_attr_2 = self._init_weights() # init for linear + self.proj = nn.Linear(self.in_dim, + self.in_dim, + weight_attr=w_attr_2, + bias_attr=b_attr_2) self.proj_dropout = nn.Dropout(dropout) self.softmax = nn.Softmax(axis=-1) # use V to do skip connection, used in TokenTransformer self.skip = skip_connection + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + def transpose_multihead(self, x): if self.skip: # token transformer new_shape = x.shape[:-1] + [self.num_heads, self.in_dim] @@ -293,7 +319,8 @@ def __init__(self, attention_dropout=0., droppath=0.): super().__init__() - self.norm1 = nn.LayerNorm(dim, epsilon=1e-6) + w_attr_1, b_attr_1 = self._init_weights_layernorm() # init for layernorm + self.norm1 = nn.LayerNorm(dim, epsilon=1e-6, weight_attr=w_attr_1, bias_attr=b_attr_1) self.attn = Attention(dim, num_heads=num_heads, qkv_bias=qkv_bias, @@ -301,11 +328,17 @@ def __init__(self, dropout=dropout, attention_dropout=attention_dropout) self.drop_path = DropPath(droppath) if droppath > 0. 
else Identity() - self.norm2 = nn.LayerNorm(dim, epsilon=1e-6) + w_attr_2, b_attr_2 = self._init_weights_layernorm() # init for layernorm + self.norm2 = nn.LayerNorm(dim, epsilon=1e-6, weight_attr=w_attr_2, bias_attr=b_attr_2) self.mlp = Mlp(in_features=dim, hidden_features=int(dim * mlp_ratio), dropout=dropout) + def _init_weights_layernorm(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(1.0)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + def forward(self, x): h = x x = self.norm1(x) @@ -339,29 +372,54 @@ class TokenPerformer(nn.Layer): def __init__(self, dim, in_dim, num_heads=1, kernel_ratio=0.5, dropout=0.1): super().__init__() self.embed_dim = in_dim * num_heads - self.kqv = nn.Linear(dim, 3 * self.embed_dim) + w_attr_1, b_attr_1 = self._init_weights() # init for linear + self.kqv = nn.Linear(dim, 3 * self.embed_dim, weight_attr=w_attr_1, bias_attr=b_attr_1) self.dropout = nn.Dropout(dropout) - self.proj = nn.Linear(self.embed_dim, self.embed_dim) + w_attr_2, b_attr_2 = self._init_weights() # init for linear + self.proj = nn.Linear(self.embed_dim, + self.embed_dim, + weight_attr=w_attr_2, + bias_attr=b_attr_2) self.num_heads = num_heads - self.norm1 = nn.LayerNorm(dim, epsilon=1e-6) - self.norm2 = nn.LayerNorm(self.embed_dim, epsilon=1e-6) - - self.mlp = nn.Sequential(nn.Linear(self.embed_dim, self.embed_dim), + w_attr_3, b_attr_3 = self._init_weights_layernorm() # init for layernorm + w_attr_4, b_attr_4 = self._init_weights_layernorm() # init for layernorm + self.norm1 = nn.LayerNorm(dim, epsilon=1e-6, weight_attr=w_attr_3, bias_attr=b_attr_3) + self.norm2 = nn.LayerNorm(self.embed_dim, epsilon=1e-6, weight_attr=w_attr_4, bias_attr=b_attr_4) + + w_attr_5, b_attr_5 = self._init_weights() # init for linear + w_attr_6, b_attr_6 = self._init_weights() # init for linear + self.mlp = nn.Sequential(nn.Linear(self.embed_dim, + self.embed_dim, + weight_attr=w_attr_5, + bias_attr=b_attr_5), nn.GELU(), - nn.Linear(self.embed_dim, self.embed_dim), + nn.Linear(self.embed_dim, + self.embed_dim, + weight_attr=w_attr_6, + bias_attr=b_attr_6), nn.Dropout(dropout)) self.m = int(self.embed_dim * kernel_ratio) self.w = np.random.random(size=(int(self.embed_dim * kernel_ratio), self.embed_dim)) - # TODO: init with orthognal matrix - #self.w, _ = np.linalg.qr(self.w) + # init with orthognal matrix + self.w = orthogonal(self.w) self.w = paddle.create_parameter( shape=[int(self.embed_dim * kernel_ratio), self.embed_dim], dtype='float32', default_initializer=nn.initializer.Assign(self.w / math.sqrt(self.m))) + def _init_weights_layernorm(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(1.0)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + # paddle version 2.1 does not support einsum def prm_exp(self, x): # x: [B, T, hs] @@ -383,7 +441,7 @@ def single_attention(self, x): # same as einsum('bti,bi->bt, qp, kp.sum(axi=1).unsqueeze(2)') D = paddle.matmul(qp, kp.sum(axis=1).unsqueeze(2)) - # same as einsum('bti,bim->bnm') + # same as einsum('bin,bim->bnm') kptv = paddle.matmul(v, kp, transpose_x=True) # same as einsum('bti,bni->btn') y = paddle.matmul(qp, kptv, transpose_y=True) @@ 
-435,7 +493,8 @@ def __init__(self, attention_dropout=0, droppath=0.): super().__init__() - self.norm1 = nn.LayerNorm(dim, epsilon=1e-6) + w_attr_1, b_attr_1 = self._init_weights_layernorm() + self.norm1 = nn.LayerNorm(dim, epsilon=1e-6, weight_attr=w_attr_1, bias_attr=b_attr_1) self.attn = Attention(dim, in_dim=in_dim, num_heads=num_heads, @@ -445,12 +504,18 @@ def __init__(self, attention_dropout=attention_dropout, skip_connection=True) self.drop_path = DropPath(droppath) if droppath > 0. else Identity() - self.norm2 = nn.LayerNorm(in_dim, epsilon=1e-6) + w_attr_2, b_attr_2 = self._init_weights_layernorm() + self.norm2 = nn.LayerNorm(in_dim, epsilon=1e-6, weight_attr=w_attr_2, bias_attr=b_attr_2) self.mlp = Mlp(in_features=in_dim, hidden_features=int(in_dim * mlp_ratio), out_features=in_dim, dropout=dropout) + def _init_weights_layernorm(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(1.0)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.0)) + return weight_attr, bias_attr + def forward(self, x): x = self.norm1(x) x = self.attn(x) @@ -532,9 +597,24 @@ def __init__(self, droppath=depth_decay[i]) layer_list.append(copy.deepcopy(block_layers)) self.blocks = nn.LayerList(layer_list) - self.norm = nn.LayerNorm(embed_dim, epsilon=1e-6) + w_attr_1, b_attr_1 = self._init_weights_layernorm() + self.norm = nn.LayerNorm(embed_dim, epsilon=1e-6, weight_attr=w_attr_1, bias_attr=b_attr_1) # classifier head - self.head = nn.Linear(embed_dim, num_classes) if num_classes > 0 else Identity() + w_attr_2, b_attr_2 = self._init_weights() + self.head = nn.Linear(embed_dim, + num_classes, + weight_attr=w_attr_2, + bias_attr=b_attr_2) if num_classes > 0 else Identity() + + def _init_weights_layernorm(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(1)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0)) + return weight_attr, bias_attr + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0)) + return weight_attr, bias_attr def forward_features(self, x): # Patch Embedding @@ -561,11 +641,17 @@ def forward(self, x): def build_t2t_vit(config): """build t2t-vit model using config""" model = T2TViT(image_size=config.DATA.IMAGE_SIZE, + in_channels=3, + num_classes=config.MODEL.NUM_CLASSES, token_type=config.MODEL.TRANS.TOKEN_TYPE, embed_dim=config.MODEL.TRANS.EMBED_DIM, depth=config.MODEL.TRANS.DEPTH, num_heads=config.MODEL.TRANS.NUM_HEADS, mlp_ratio=config.MODEL.TRANS.MLP_RATIO, qk_scale=config.MODEL.TRANS.QK_SCALE, - qkv_bias=config.MODEL.TRANS.QKV_BIAS) + qkv_bias=config.MODEL.TRANS.QKV_BIAS, + dropout=config.MODEL.DROPOUT, + attention_dropout=config.MODEL.ATTENTION_DROPOUT, + droppath=config.MODEL.DROPPATH, + token_dim=64) return model diff --git a/image_classification/T2T_ViT/transforms.py b/image_classification/T2T_ViT/transforms.py new file mode 100644 index 00000000..676fe1ff --- /dev/null +++ b/image_classification/T2T_ViT/transforms.py @@ -0,0 +1,13 @@ +import paddle +import paddle.nn +import paddle.vision.transforms as T + + +class RandomHorizontalFlip(): + def __init__(self, p=0.5): + self.p = p + + def __call__(self, image): + if random.random() < self.p: + return T.hflip(image) + return image diff --git a/image_classification/T2T_ViT/utils.py b/image_classification/T2T_ViT/utils.py index 44800527..24313440 100644 --- 
a/image_classification/T2T_ViT/utils.py +++ b/image_classification/T2T_ViT/utils.py @@ -20,6 +20,8 @@ """ import math +import numpy as np +import paddle from paddle.optimizer.lr import LRScheduler @@ -118,3 +120,34 @@ def get_lr(self): val = max(0.0, 0.5 * (1. + math.cos(math.pi * float(self.cycles) * 2.0 * progress))) val = max(0.0, val * (self.start_lr - self.end_lr) + self.end_lr) return val + + +def orthogonal(t, gain=1.): + if t.ndim < 2: + raise ValueError("Only tensors with 2 or more dimensions are supported") + + gain = paddle.to_tensor(gain) + rows = t.shape[0] + cols = np.size(t) // rows + #cols = paddle.numel(t) // rows + flattened = paddle.normal(0, 1, [rows, cols]) + + if rows < cols: + flattened = flattened.transpose([1, 0]) + + # Compute the qr factorization + q, r = np.linalg.qr(flattened.cpu().numpy()) + q = paddle.to_tensor(q) + r = paddle.to_tensor(r) + d = paddle.diag(r, 0) + ph = d.sign() + q *= ph + + if rows < cols: + q = q.transpose([1, 0]) + + with paddle.no_grad(): + t = q + #t.view_as(q).copy_(q) + t = t.multiply(gain) + return t diff --git a/image_classification/VOLO/README.md b/image_classification/VOLO/README.md index 21fa2b12..9a07659b 100644 --- a/image_classification/VOLO/README.md +++ b/image_classification/VOLO/README.md @@ -1,4 +1,4 @@ -# VOLO: Vision Outlooker for Visual Recognition, [arxiv](https://arxiv.org/abs/2103.17239) +# VOLO: Vision Outlooker for Visual Recognition, [arxiv](https://arxiv.org/abs/2106.13112) PaddlePaddle training/validation code and pretrained models for **VOLO**. @@ -13,13 +13,23 @@ This implementation is developed by [PaddleViT](https://github.com/BR-IDL/Paddle

### Update -Update (2021-08-11): Code is released and ported weights are uploaded. +- Update (2021-09-27): More weights are uploaded. +- Update (2021-08-11): Code is released and ported weights are uploaded. ## Models Zoo -| Model | Acc@1 | Acc@5 | Image Size | Crop_pct | Interpolation | Link | -|--------------------------------|-------|-------|------------|----------|---------------|--------------| -| volo_d5_224_86.10 | 86.08 | 97.58 | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1GBOBPCBJYZfWybK-Xp0Otn0N4NXpct0G/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1t9gPLRAOkdXaG55fVADQZg)(td49) | -| volo_d5_512_87.07 | 87.05 | 97.97 | 512 | 1.15 | bicubic | [google](https://drive.google.com/file/d/1Phf_wHsjRZ1QrZ8oFrqsYuhDr4TXrVkc/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1X-WjpNqvWva2M977jgHosg)(irik) | +| Model | Acc@1 | Acc@5 | #Params | FLOPs | Image Size | Crop_pct | Interpolation | Link | +|-------------------------------|-------|-------|---------|--------|------------|----------|---------------|--------------| +| volo_d1_224 | 84.12 | 96.78 | 26.6M | 6.6G | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1kNNtTh7MUWJpFSDe_7IoYTOpsZk5QSR9/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1EKlKl2oHi_24eaiES67Bgw)(xaim) | +| volo_d1_384 | 85.24 | 97.21 | 26.6M | 19.5G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1fku9-11O_gQI7UpZTjagVeND-pcHbV0C/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1qZWoFA7J89i2aujPItEdDQ)(rr7p) | +| volo_d2_224 | 85.11 | 97.19 | 58.6M | 13.7G | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1KjKzGpyPKq6ekmeEwttHlvOnQXqHK1we/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1JCK0iaYtiOZA6kn7e0wzUQ)(d82f) | +| volo_d2_384 | 86.04 | 97.57 | 58.6M | 40.7G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1uLLbvwNK8N0y6Wrq_Bo8vyBGSVhehVmq/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1e7H5aa6miGpCTCgpK0rm0w)(9cf3) | +| volo_d3_224 | 85.41 | 97.26 | 86.2M | 19.8G | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1OtOX7C29fJ20ESKQnYGevp4euxhmXKAT/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1vhARtV2wfI6EFf0Ap71xwg)(a5a4) | +| volo_d3_448 | 86.50 | 97.71 | 86.2M | 80.3G | 448 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1lHlYhra1NNp0dp4NWaQ9SMNNmw-AxBNZ/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1Q6KiQw4Vu1GPm5RF9_eycg)(uudu) | +| volo_d4_224 | 85.89 | 97.54 | 192.8M | 42.9G | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/16oXN7xuy-mkpfeD-loIVOK95PfptHhpX/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1PE83ZLd5evkKmHJ1V2KDsg)(vcf2) | +| volo_d4_448 | 86.70 | 97.85 | 192.8M | 172.5G | 448 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1N9-1OhPewA5TBR9CX5oA10obDS8e4Cfa/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1QoJ2Sqe1SK9hxbmV4uZiyg)(nd4n) | +| volo_d5_224 | 86.08 | 97.58 | 295.3M | 70.6G | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1fcrvOGbAmKUhqJT-pU3MVJZQJIe4Qina/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1nqDcXMW00v9PKr3RQI-g1w)(ymdg) | +| volo_d5_448 | 86.92 | 97.88 | 295.3M | 283.8G | 448 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1aFXEkpfLhmQlDQHUYCuFL8SobhxUzrZX/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1K4FBv6fnyMGcAXhyyybhgw)(qfcc) | +| volo_d5_512 | 87.05 | 97.97 | 295.3M | 371.3G | 512 | 1.15 | bicubic | 
[google](https://drive.google.com/file/d/1CS4-nv2c9FqOjMz7gdW5i9pguI79S6zk/view?usp=sharing)/[baidu](https://pan.baidu.com/s/16Wseyiqvv0MQJV8wwFDfSA)(353h) | > *The results are evaluated on ImageNet2012 validation set. @@ -64,8 +74,8 @@ from volo import build_volo as build_model config = get_config('./configs/volo_d5_224.yaml') # build model model = build_model(config) -# load pretrained weights, .pdparams is NOT needed -model_state_dict = paddle.load('./volo_d5_224') +# load pretrained weights +model_state_dict = paddle.load('./volo_d5_224.pdparams') model.set_dict(model_state_dict) ``` @@ -78,12 +88,12 @@ or ```shell CUDA_VISIBLE_DEVICES=0 \ python main_single_gpu.py \ - -cfg='./configs/volo_d5_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/volo_d5_224.yaml \ + -dataset=imagenet2012 \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/val \ -eval \ - -pretrained='./volo_d5_224' + -pretrained=/path/to/pretrained/model/volo_d5_224 # .pdparams is NOT needed ```
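
After loading the ported weights, a quick forward pass on a dummy batch is a simple sanity check before running full evaluation. This is a minimal sketch, not part of the PR: the config and `.pdparams` paths are placeholders, and depending on the model settings the forward pass may return auxiliary outputs alongside the class logits.

```python
import paddle
from config import get_config
from volo import build_volo as build_model

# build the model and load the ported weights (paths are placeholders)
config = get_config('./configs/volo_d5_224.yaml')
model = build_model(config)
model.set_dict(paddle.load('./volo_d5_224.pdparams'))
model.eval()

# dummy batch matching DATA.IMAGE_SIZE from the config: [N, C, H, W]
x = paddle.randn([1, 3, config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE])
with paddle.no_grad():
    out = model(x)

# the class logits are expected to have shape [1, 1000]
print(out[0].shape if isinstance(out, (tuple, list)) else out.shape)
```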
@@ -100,12 +110,12 @@ or ```shell CUDA_VISIBLE_DEVICES=0,1,2,3 \ python main_multi_gpu.py \ - -cfg='./configs/volo_d5_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/volo_d5_224.yaml \ + -dataset=imagenet2012 \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/val \ -eval \ - -pretrained='./volo_d5_224' + -pretrained=/path/to/pretrained/model/volo_d5_224 # .pdparams is NOT needed ```
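
This PR also adds an `-amp` switch to the training entry points (see the `run_train.sh` and `main_*_gpu.py` changes below). When enabled, the training step wraps the forward pass in `paddle.amp.auto_cast()` and scales the loss with `paddle.amp.GradScaler`. The following is a simplified sketch of that branch, assuming the usual `dataloader`, `model`, `criterion`, and `optimizer` objects from the training scripts; it is not the exact training loop.

```python
import paddle

def train_one_epoch_amp(dataloader, model, criterion, optimizer, accum_iter=1):
    # simplified sketch of the mixed-precision branch enabled by the -amp flag
    scaler = paddle.amp.GradScaler(init_loss_scaling=1024)
    for batch_id, (image, label) in enumerate(dataloader):
        with paddle.amp.auto_cast():       # run the forward pass in fp16 where safe
            output = model(image)
            loss = criterion(output, label)
        scaled = scaler.scale(loss)        # scale loss to avoid fp16 gradient underflow
        scaled.backward()
        if ((batch_id + 1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)):
            scaler.minimize(optimizer, scaled)  # unscale gradients and take the step
            optimizer.clear_grad()
```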
@@ -119,10 +129,10 @@ or ```shell CUDA_VISIBLE_DEVICES=0 \ python main_single_gpu.py \ - -cfg='./configs/volo_d5_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/volo_d5_224.yaml \ + -dataset=imagenet2012 \ -batch_size=32 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/train ``` @@ -140,10 +150,10 @@ or ```shell CUDA_VISIBLE_DEVICES=0,1,2,3 \ python main_multi_gpu.py \ - -cfg='./configs/volo_d5_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/volo_d5_224.yaml \ + -dataset=imagenet2012 \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/train ``` diff --git a/image_classification/VOLO/__init__.py b/image_classification/VOLO/__init__.py new file mode 100644 index 00000000..e2cbd538 --- /dev/null +++ b/image_classification/VOLO/__init__.py @@ -0,0 +1 @@ +#init diff --git a/image_classification/VOLO/config.py b/image_classification/VOLO/config.py index b40287e1..10f9ad9f 100644 --- a/image_classification/VOLO/config.py +++ b/image_classification/VOLO/config.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -35,6 +35,8 @@ _C.DATA.IMAGE_SIZE = 224 # input image size: 224 for pretrain, 384 for finetune _C.DATA.CROP_PCT = 0.875 # input image scale ratio, scale is applied before centercrop in eval mode _C.DATA.NUM_WORKERS = 2 # number of data loading threads +_C.DATA.IMAGENET_MEAN = [0.485, 0.456, 0.406] # [0.5, 0.5, 0.5] +_C.DATA.IMAGENET_STD = [0.229, 0.224, 0.225] # [0.5, 0.5, 0.5] # model settings _C.MODEL = CN() @@ -93,6 +95,7 @@ _C.VALIDATE_FREQ = 100 # freq to do validation _C.SEED = 0 _C.EVAL = False # run evaluation only +_C.AMP = False # mix precision training _C.LOCAL_RANK = 0 _C.NGPUS = -1 @@ -128,6 +131,8 @@ def update_config(config, args): config.DATA.IMAGE_SIZE = args.image_size if args.data_path: config.DATA.DATA_PATH = args.data_path + if args.output is not None: + config.SAVE = args.output if args.ngpus: config.NGPUS = args.ngpus if args.eval: @@ -139,6 +144,11 @@ def update_config(config, args): config.MODEL.RESUME = args.resume if args.last_epoch: config.TRAIN.LAST_EPOCH = args.last_epoch + if args.amp: # only during training + if config.EVAL is True: + config.AMP = False + else: + config.AMP = True #config.freeze() return config diff --git a/image_classification/VOLO/configs/volo_d1_224.yaml b/image_classification/VOLO/configs/volo_d1_224.yaml new file mode 100644 index 00000000..4ea0161b --- /dev/null +++ b/image_classification/VOLO/configs/volo_d1_224.yaml @@ -0,0 +1,14 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 1.0 +MODEL: + TYPE: volo + NAME: volo_d1_224 + TRANS: + LAYERS: [4, 4, 8, 2] + EMBED_DIMS: [192, 384, 384, 384] + NUM_HEADS: [6, 12, 12, 12] + MLP_RATIOS: [3, 3, 3, 3] + DOWNSAMPLES: [True, False, False, False] + OUTLOOK_ATTENTION: [True, False, False, False] + STEM_HIDDEN_DIM: 64 diff --git a/image_classification/VOLO/configs/volo_d1_384.yaml b/image_classification/VOLO/configs/volo_d1_384.yaml new file mode 100644 index 00000000..30840e04 --- /dev/null +++ b/image_classification/VOLO/configs/volo_d1_384.yaml @@ -0,0 +1,14 @@ +DATA: + IMAGE_SIZE: 384 + CROP_PCT: 1.0 +MODEL: + TYPE: volo + NAME: volo_d1_384 + TRANS: + LAYERS: [4, 4, 8, 2] + EMBED_DIMS: [192, 384, 384, 384] + NUM_HEADS: [6, 12, 12, 12] + MLP_RATIOS: [3, 3, 3, 3] + DOWNSAMPLES: [True, False, False, 
False] + OUTLOOK_ATTENTION: [True, False, False, False] + STEM_HIDDEN_DIM: 64 diff --git a/image_classification/VOLO/configs/volo_d2_224.yaml b/image_classification/VOLO/configs/volo_d2_224.yaml new file mode 100644 index 00000000..c800419d --- /dev/null +++ b/image_classification/VOLO/configs/volo_d2_224.yaml @@ -0,0 +1,14 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 1.0 +MODEL: + TYPE: volo + NAME: volo_d2_224 + TRANS: + LAYERS: [6, 4, 10, 4] + EMBED_DIMS: [256, 512, 512, 512] + NUM_HEADS: [8, 16, 16, 16] + MLP_RATIOS: [3, 3, 3, 3] + DOWNSAMPLES: [True, False, False, False] + OUTLOOK_ATTENTION: [True, False, False, False] + STEM_HIDDEN_DIM: 64 diff --git a/image_classification/VOLO/configs/volo_d2_384.yaml b/image_classification/VOLO/configs/volo_d2_384.yaml new file mode 100644 index 00000000..ce0d473b --- /dev/null +++ b/image_classification/VOLO/configs/volo_d2_384.yaml @@ -0,0 +1,14 @@ +DATA: + IMAGE_SIZE: 384 + CROP_PCT: 1.0 +MODEL: + TYPE: volo + NAME: volo_d2_384 + TRANS: + LAYERS: [6, 4, 10, 4] + EMBED_DIMS: [256, 512, 512, 512] + NUM_HEADS: [8, 16, 16, 16] + MLP_RATIOS: [3, 3, 3, 3] + DOWNSAMPLES: [True, False, False, False] + OUTLOOK_ATTENTION: [True, False, False, False] + STEM_HIDDEN_DIM: 64 diff --git a/image_classification/VOLO/configs/volo_d3_224.yaml b/image_classification/VOLO/configs/volo_d3_224.yaml new file mode 100644 index 00000000..36910c65 --- /dev/null +++ b/image_classification/VOLO/configs/volo_d3_224.yaml @@ -0,0 +1,14 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 1.0 +MODEL: + TYPE: volo + NAME: volo_d3_224 + TRANS: + LAYERS: [8, 8, 16, 4] + EMBED_DIMS: [256, 512, 512, 512] + NUM_HEADS: [8, 16, 16, 16] + MLP_RATIOS: [3, 3, 3, 3] + DOWNSAMPLES: [True, False, False, False] + OUTLOOK_ATTENTION: [True, False, False, False] + STEM_HIDDEN_DIM: 64 diff --git a/image_classification/VOLO/configs/volo_d3_448.yaml b/image_classification/VOLO/configs/volo_d3_448.yaml new file mode 100644 index 00000000..5dd594ae --- /dev/null +++ b/image_classification/VOLO/configs/volo_d3_448.yaml @@ -0,0 +1,14 @@ +DATA: + IMAGE_SIZE: 448 + CROP_PCT: 1.0 +MODEL: + TYPE: volo + NAME: volo_d3_448 + TRANS: + LAYERS: [8, 8, 16, 4] + EMBED_DIMS: [256, 512, 512, 512] + NUM_HEADS: [8, 16, 16, 16] + MLP_RATIOS: [3, 3, 3, 3] + DOWNSAMPLES: [True, False, False, False] + OUTLOOK_ATTENTION: [True, False, False, False] + STEM_HIDDEN_DIM: 64 diff --git a/image_classification/VOLO/configs/volo_d4_224.yaml b/image_classification/VOLO/configs/volo_d4_224.yaml new file mode 100644 index 00000000..c734b0a0 --- /dev/null +++ b/image_classification/VOLO/configs/volo_d4_224.yaml @@ -0,0 +1,14 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 1.0 +MODEL: + TYPE: volo + NAME: volo_d4_224 + TRANS: + LAYERS: [8, 8, 16, 4] + EMBED_DIMS: [384, 768, 768, 768] + NUM_HEADS: [12, 16, 16, 16] + MLP_RATIOS: [3, 3, 3, 3] + DOWNSAMPLES: [True, False, False, False] + OUTLOOK_ATTENTION: [True, False, False, False] + STEM_HIDDEN_DIM: 64 diff --git a/image_classification/VOLO/configs/volo_d4_448.yaml b/image_classification/VOLO/configs/volo_d4_448.yaml new file mode 100644 index 00000000..4468f99d --- /dev/null +++ b/image_classification/VOLO/configs/volo_d4_448.yaml @@ -0,0 +1,14 @@ +DATA: + IMAGE_SIZE: 448 + CROP_PCT: 1.0 +MODEL: + TYPE: volo + NAME: volo_d4_448 + TRANS: + LAYERS: [8, 8, 16, 4] + EMBED_DIMS: [384, 768, 768, 768] + NUM_HEADS: [12, 16, 16, 16] + MLP_RATIOS: [3, 3, 3, 3] + DOWNSAMPLES: [True, False, False, False] + OUTLOOK_ATTENTION: [True, False, False, False] + STEM_HIDDEN_DIM: 64 diff --git 
a/image_classification/VOLO/configs/volo_d5_448.yaml b/image_classification/VOLO/configs/volo_d5_448.yaml new file mode 100644 index 00000000..ba52e6b7 --- /dev/null +++ b/image_classification/VOLO/configs/volo_d5_448.yaml @@ -0,0 +1,14 @@ +DATA: + IMAGE_SIZE: 448 + CROP_PCT: 1.0 +MODEL: + TYPE: volo + NAME: volo_d5_448 + TRANS: + LAYERS: [12, 12, 20, 4] + EMBED_DIMS: [384, 768, 768, 768] + NUM_HEADS: [12, 16, 16, 16] + MLP_RATIOS: [4, 4, 4, 4] + DOWNSAMPLES: [True, False, False, False] + OUTLOOK_ATTENTION: [True, False, False, False] + STEM_HIDDEN_DIM: 128 diff --git a/image_classification/VOLO/datasets.py b/image_classification/VOLO/datasets.py index eeb16f89..1e6ea8c9 100644 --- a/image_classification/VOLO/datasets.py +++ b/image_classification/VOLO/datasets.py @@ -85,8 +85,7 @@ def get_train_transforms(config): transforms.RandomResizedCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE), scale=(0.05, 1.0)), transforms.ToTensor(), - #transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]), - transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]), + transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD), ]) return transforms_train @@ -111,8 +110,7 @@ def get_val_transforms(config): transforms.Resize(scale_size, 'bicubic'), transforms.CenterCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE)), transforms.ToTensor(), - #transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]), - transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]), + transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD), ]) return transforms_val diff --git a/image_classification/VOLO/losses.py b/image_classification/VOLO/losses.py new file mode 100644 index 00000000..41f69614 --- /dev/null +++ b/image_classification/VOLO/losses.py @@ -0,0 +1,223 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +""" Implement Loss functions """ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class TokenLabelGTCrossEntropy(nn.Layer): + def __init__(self, + dense_weight=1.0, + cls_weight=1.0, + mixup_activate=True, + smoothing=0.1, + classes=1000): + super().__init__() + self.CE = SoftTargetCrossEntropy() + + self.dense_weight = dense_weight + self.smoothing = smoothing + self.mixup_activate = mixup_activate + self.classes = classes + self.cls_weight = cls_weight + assert dense_weight + cls_weight > 0 + + def forward(self, x, target): + output, aux_output, bb = x + bbx1, bby1, bbx2, bby2 = bb + B, N, C = aux_output.shape + if len(target.shape) == 2: + target_cls = target +#TODO: fix bugs + target_aux = target.expand([1, N]).reshape((B*N, C)) + else: + ground_truth = target[:, :, 0] + target_cls = target[:, :, 1] + ratio = (0.9 - 0.4 * (ground_truth.max(-1)[1] == target_cls.max(-1)[1])).unsqueeze(-1) + target_cls = target_cls * ratio + ground_truth * (1 - ratio) + target_aux = target[:, :, 2:] + target_aux = target_aux.transpose([0, 2, 1]).reshape((-1, C)) + lam = 1 - ((bbx2 - bbx1) * (bby2 - bby1) / N) + if lam < 1: + target_cls = lam * target_cls + (1 - lam) * target_cls.flip(0) + + aux_output = aux_output.reshape((-1, C)) + + loss_cls = self.CE(output, target_cls) + loss_aux = self.CE(aux_output, target_aux) + + return self.cls_weigth * loss_cls + self.dense_weight * loss_aux + + + +class TokenLabelCrossEntropy(nn.Layer): + def __init__(self, + dense_weight=1.0, + cls_weight=1.0, + mixup_activate=True, + classes=1000): + super().__init__() + self.CE = SoftTargetCrossEntropy() + + self.dense_weight = dense_weight + self.mixup_activate = mixup_activate + self.classes = classes + self.cls_weight = cls_weight + assert dense_weight + cls_weight > 0 + + def forward(self, x, target): + output, aux_output, bb = x + bbx1, bby1, bbx2, bby2 = bb + B, N, C = aux_output.shape + if len(target.shape) == 2: + target_cls = target +#TODO: fix bugs + target_aux = target.expand([1, N]).reshape((B*N, C)) + else: + target_cls = target[:, :, 1] + target_aux = target[:, :, 2:] + target_aux = target_aux.transpose([0, 2, 1]).reshape((-1, C)) + lam = 1 - ((bbx2 - bbx1) * (bby2 - bby1) / N) + if lam < 1: + target_cls = lam * target_cls + (1 - lam) * target_cls.flip(0) + + aux_output = aux_output.reshape((-1, C)) + + loss_cls = self.CE(output, target_cls) + loss_aux = self.CE(aux_output, target_aux) + + return self.cls_weigth * loss_cls + self.dense_weight * loss_aux + + +class TokenLabelSoftTargetCrossEntropy(nn.Module): + def __init__(self): + super().__init__() + + def forward(self, x, target): + N_rep = x.shape[0] + N = target.shape[0] + if not N == N_rep: +# TODO: + target = target.repeat(N_rep // N, 1) + if len(target.shape) == 3 and target.shape[-1] == 2: + target = target[:, :, 1] + loss = paddle.sum(-target * F.log_softmax(x, dim=-1), dim=-1) + return loss.mean() + + +class LabelSmoothingCrossEntropyLoss(nn.Layer): + """ cross entropy loss for label smoothing + Args: + smoothing: float, smoothing rate + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, target label with shape [N] + Return: + loss: float, cross entropy loss value + """ + def __init__(self, smoothing=0.1): + super().__init__() + assert 0 <= smoothing < 1.0 + self.smoothing = smoothing + self.confidence = 1 - smoothing + + def forward(self, x, target): + log_probs = F.log_softmax(x) # [N, num_classes] + # target_index is used to get prob for each of the N samples + target_index = 
paddle.zeros([x.shape[0], 2], dtype='int64') # [N, 2] + target_index[:, 0] = paddle.arange(x.shape[0]) + target_index[:, 1] = target + + nll_loss = -log_probs.gather_nd(index=target_index) # index: [N] + smooth_loss = -log_probs.mean(axis=-1) + loss = self.confidence * nll_loss + self.smoothing * smooth_loss + return loss.mean() + + +class SoftTargetCrossEntropyLoss(nn.Layer): + """ cross entropy loss for soft target + Args: + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, soft target with shape [N, num_classes] + Returns: + loss: float, the mean loss value + """ + def __init__(self): + super().__init__() + + def forward(self, x, target): + loss = paddle.sum(-target * F.log_softmax(x, axis=-1), axis=-1) + return loss.mean() + + +class DistillationLoss(nn.Layer): + """Distillation loss function + This layer includes the orginal loss (criterion) and a extra + distillation loss (criterion), which computes the loss with + different type options, between current model and + a teacher model as its supervision. + + Args: + base_criterion: nn.Layer, the original criterion + teacher_model: nn.Layer, the teacher model as supervision + distillation_type: str, one of ['none', 'soft', 'hard'] + alpha: float, ratio of base loss (* (1-alpha)) + and distillation loss( * alpha) + tao: float, temperature in distillation + """ + def __init__(self, + base_criterion, + teacher_model, + distillation_type, + alpha, + tau): + super().__init__() + assert distillation_type in ['none', 'soft', 'hard'] + self.base_criterion = base_criterion + self.teacher_model = teacher_model + self.type = distillation_type + self.alpha = alpha + self.tau = tau + + def forward(self, inputs, outputs, targets): + """ + Args: + inputs: tensor, the orginal model inputs + outputs: tensor, the outputs of the model + outputds_kd: tensor, the distillation outputs of the model, + this is usually obtained by a separate branch + in the last layer of the model + targets: tensor, the labels for the base criterion + """ + outputs, outputs_kd = outputs[0], outputs[1] + base_loss = self.base_criterion(outputs, targets) + if self.type == 'none': + return base_loss + + with paddle.no_grad(): + teacher_outputs = self.teacher_model(inputs) + + if self.type == 'soft': + distillation_loss = F.kl_div( + F.log_softmax(outputs_kd / self.tau, axis=1), + F.log_softmax(teacher_outputs / self.tau, axis=1), + reduction='sum') * (self.tau * self.tau) / outputs_kd.numel() + elif self.type == 'hard': + distillation_loss = F.cross_entropy(outputs_kd, teacher_outputs.argmax(axis=1)) + + loss = base_loss * (1 - self.alpha) + distillation_loss * self.alpha + return loss + + diff --git a/image_classification/VOLO/main_multi_gpu.py b/image_classification/VOLO/main_multi_gpu.py index 2c0bb7c4..d8e3ed13 100644 --- a/image_classification/VOLO/main_multi_gpu.py +++ b/image_classification/VOLO/main_multi_gpu.py @@ -25,7 +25,8 @@ import paddle.nn as nn import paddle.nn.functional as F import paddle.distributed as dist -from datasets import get_dataloader, get_dataset +from datasets import get_dataloader +from datasets import get_dataset from volo import build_volo as build_model from utils import AverageMeter from utils import WarmupCosineScheduler @@ -39,11 +40,13 @@ parser.add_argument('-batch_size', type=int, default=None) parser.add_argument('-image_size', type=int, default=None) parser.add_argument('-data_path', type=str, default=None) +parser.add_argument('-output', type=str, default=None) parser.add_argument('-ngpus', type=int, 
default=None) parser.add_argument('-pretrained', type=str, default=None) parser.add_argument('-resume', type=str, default=None) parser.add_argument('-last_epoch', type=int, default=None) parser.add_argument('-eval', action='store_true') +parser.add_argument('-amp', action='store_true') arguments = parser.parse_args() @@ -80,7 +83,8 @@ def train(dataloader, epoch, total_batch, debug_steps=100, - accum_iter=1): + accum_iter=1, + amp=False): """Training for one epoch Args: dataloader: paddle.io.DataLoader, dataloader instance @@ -88,8 +92,9 @@ def train(dataloader, criterion: nn.criterion epoch: int, current epoch total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info - accum_iter: int, num of iters for accumulating gradients + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + amp: bool, if True, use mix precision training, default: False Returns: train_loss_meter.avg train_acc_meter.avg @@ -98,26 +103,37 @@ def train(dataloader, model.train() train_loss_meter = AverageMeter() train_acc_meter = AverageMeter() + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) time_st = time.time() for batch_id, data in enumerate(dataloader): image = data[0] label = data[1] - output = model(image) - loss = criterion(output, label) - - #NOTE: division may be needed depending on the loss function - # Here no division is needed: - # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' - # - #loss = loss / accum_iter + if amp is True: + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() - loss.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + # + #loss = loss / accum_iter + loss.backward() - if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): - optimizer.step() - optimizer.clear_grad() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() pred = F.softmax(output) acc = paddle.metric.accuracy(pred, label.unsqueeze(1)) @@ -209,13 +225,20 @@ def main_worker(*args): model = build_model(config) model = paddle.DataParallel(model) # STEP 2. 
Create train and val dataloader - dataset_train, dataset_val = args[0], args[1] - dataloader_train = get_dataloader(config, dataset_train, 'train', True) + dataset_train, dataset_val = args[1], args[2] + # Create training dataloader + if not config.EVAL: + dataloader_train = get_dataloader(config, dataset_train, 'train', True) + total_batch_train = len(dataloader_train) + logging.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + if local_rank == 0: + logging.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + # Create validation dataloader dataloader_val = get_dataloader(config, dataset_val, 'test', True) - total_batch_train = len(dataloader_train) total_batch_val = len(dataloader_val) - logging.info(f'----- Total # of train batch (single gpu): {total_batch_train}') logging.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + if local_rank == 0: + logging.info(f'----- Total # of val batch (single gpu): {total_batch_val}') # STEP 3. Define criterion criterion = nn.CrossEntropyLoss() # STEP 4. Define optimizer and lr_scheduler @@ -319,7 +342,8 @@ def main_worker(*args): epoch=epoch, total_batch=total_batch_train, debug_steps=config.REPORT_FREQ, - accum_iter=config.TRAIN.ACCUM_ITER) + accum_iter=config.TRAIN.ACCUM_ITER, + amp=config.AMP) scheduler.step() logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + @@ -352,7 +376,10 @@ def main_worker(*args): def main(): - dataset_train = get_dataset(config, mode='train') + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + else: + dataset_train = None dataset_val = get_dataset(config, mode='val') config.NGPUS = len(paddle.static.cuda_places()) if config.NGPUS == -1 else config.NGPUS dist.spawn(main_worker, args=(dataset_train, dataset_val, ), nprocs=config.NGPUS) diff --git a/image_classification/VOLO/main_single_gpu.py b/image_classification/VOLO/main_single_gpu.py index 3bc0bb92..38083a23 100644 --- a/image_classification/VOLO/main_single_gpu.py +++ b/image_classification/VOLO/main_single_gpu.py @@ -39,11 +39,13 @@ parser.add_argument('-batch_size', type=int, default=None) parser.add_argument('-image_size', type=int, default=None) parser.add_argument('-data_path', type=str, default=None) +parser.add_argument('-output', type=str, default=None) parser.add_argument('-ngpus', type=int, default=None) parser.add_argument('-pretrained', type=str, default=None) parser.add_argument('-resume', type=str, default=None) parser.add_argument('-last_epoch', type=int, default=None) parser.add_argument('-eval', action='store_true') +parser.add_argument('-amp', action='store_true') args = parser.parse_args() @@ -82,7 +84,8 @@ def train(dataloader, epoch, total_batch, debug_steps=100, - accum_iter=1): + accum_iter=1, + amp=False): """Training for one epoch Args: dataloader: paddle.io.DataLoader, dataloader instance @@ -92,6 +95,7 @@ def train(dataloader, total_epoch: int, total num of epoch, for logging debug_steps: int, num of iters to log info accum_iter: int, num of iters for accumulating gradients + amp: bool, if True, use mix precision training Returns: train_loss_meter.avg train_acc_meter.avg @@ -100,25 +104,40 @@ def train(dataloader, model.train() train_loss_meter = AverageMeter() train_acc_meter = AverageMeter() + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) time_st = time.time() + for batch_id, data in enumerate(dataloader): image = data[0] label = data[1] - output = model(image) - loss = criterion(output, label) + if amp is 
True: + with paddle.amp.auto_cast(): + output = model(image) + output = output[0] + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() - #NOTE: division may be needed depending on the loss function - # Here no division is needed: - # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' - #loss = loss / accum_iter + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() - loss.backward() + else: + output = model(image) + output = output[0] + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() - if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): - optimizer.step() - optimizer.clear_grad() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() pred = F.softmax(output) acc = paddle.metric.accuracy(pred, label.unsqueeze(1)) @@ -180,7 +199,7 @@ def validate(dataloader, model, criterion, total_batch, debug_steps=100): f"Val Step[{batch_id:04d}/{total_batch:04d}], " + f"Avg Loss: {val_loss_meter.avg:.4f}, " + f"Avg Acc@1: {val_acc1_meter.val:.4f} ({val_acc1_meter.avg:.4f}), " + - f"Avg Acc@1: {val_acc5_meter.val:.4f} ({val_acc5_meter.avg:.4f})") + f"Avg Acc@5: {val_acc5_meter.val:.4f} ({val_acc5_meter.avg:.4f})") val_time = time.time() - time_st return val_loss_meter.avg, val_acc1_meter.avg, val_acc5_meter.avg, val_time @@ -199,9 +218,10 @@ def main(): model = build_model(config) # STEP 2. Create train and val dataloader - dataset_train = get_dataset(config, mode='train') + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + dataloader_train = get_dataloader(config, dataset_train, 'train', False) dataset_val = get_dataset(config, mode='val') - dataloader_train = get_dataloader(config, dataset_train, 'train', False) dataloader_val = get_dataloader(config, dataset_val, 'val', False) # STEP 3. 
Define criterion @@ -269,11 +289,11 @@ def main(): logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") if config.MODEL.RESUME: - assert os.path.isfile(config.MODEL.RESUME+'.pdparams') is True - assert os.path.isfile(config.MODEL.RESUME+'.pdopt') is True - model_state = paddle.load(config.MODEL.RESUME+'.pdparams') + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') model.set_dict(model_state) - opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') + opt_state = paddle.load(config.MODEL.RESUME + '.pdopt') optimizer.set_state_dict(opt_state) logger.info( f"----- Resume: Load model and optmizer from {config.MODEL.RESUME}") @@ -306,6 +326,7 @@ def main(): total_batch=len(dataloader_train), debug_steps=config.REPORT_FREQ, accum_iter=config.TRAIN.ACCUM_ITER, + amp=config.AMP, ) scheduler.step() logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + diff --git a/image_classification/VOLO/port_weights/__init__.py b/image_classification/VOLO/port_weights/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/image_classification/VOLO/port_weights/load_pytorch_weights.py b/image_classification/VOLO/port_weights/load_pytorch_weights.py new file mode 100644 index 00000000..5e6ea54a --- /dev/null +++ b/image_classification/VOLO/port_weights/load_pytorch_weights.py @@ -0,0 +1,261 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import numpy as np +import paddle +import torch +from config import * +from volo import * +from pytorch.volo.models.volo import volo_d1, volo_d2, volo_d3, volo_d4, volo_d5 +from pytorch.volo.utils import load_pretrained_weights + + +names = [ + ('volo_d5_448', 'd5_448_87.0', volo_d5), + ('volo_d4_448', 'd4_448_86.79', volo_d4), + ('volo_d4_224', 'd4_224_85.7', volo_d4), + ('volo_d3_448', 'd3_448_86.3', volo_d3), + ('volo_d3_224', 'd3_224_85.4', volo_d3), + ('volo_d2_384', 'd2_384_86.0', volo_d2), + ('volo_d2_224', 'd2_224_85.2', volo_d2), + ('volo_d1_384', 'd1_384_85.2', volo_d1), + ('volo_d1_224', 'd1_224_84.2', volo_d1), + ] +idx = 8 +gmodel_name = names[idx][0] +gmodel_path = names[idx][1] +sz = int(gmodel_name[-3::]) +model_type=names[idx][2] + + +config = get_config() +parser = argparse.ArgumentParser('') +parser.add_argument('-cfg', type=str, default=f'./configs/{gmodel_name}.yaml') +#parser.add_argument('-cfg', type=str, default='./configs/volo_d5_224.yaml') +parser.add_argument('-dataset', type=str, default="imagenet2012") +parser.add_argument('-batch_size', type=int, default=None) +parser.add_argument('-image_size', type=int, default=None) +parser.add_argument('-data_path', type=str, default=None) +parser.add_argument('-ngpus', type=int, default=None) +parser.add_argument('-eval', action="store_true") +parser.add_argument('-pretrained', type=str, default=None) +parser.add_argument('-resume', type=str, default=None) +parser.add_argument('-last_epoch', type=int, default=None) +args = parser.parse_args() + +config = get_config() +config = update_config(config, args) +print(config) + + +def print_model_named_params(model): + print('----------------------------------') + for name, param in model.named_parameters(): + print(name, param.shape) + print('----------------------------------') + +def print_model_named_buffers(model): + print('----------------------------------') + for name, param in model.named_buffers(): + print(name, param.shape) + print('----------------------------------') + +def torch_to_paddle_mapping(): + mapping = [ + ('cls_token', 'cls_token'), + ('pos_embed', 'pos_embed'), + ('patch_embed.proj', 'patch_embed.proj'), + ] + + # patch embedding: + th_prefix = 'patch_embed.conv' + pp_prefix = 'patch_embed.stem' + layer_mapping = [ + (f'{th_prefix}.0.weight', f'{pp_prefix}.0.weight'),#conv + (f'{th_prefix}.1.weight', f'{pp_prefix}.1.weight'),#bn + (f'{th_prefix}.1.bias', f'{pp_prefix}.1.bias'),#bn + (f'{th_prefix}.1.running_mean', f'{pp_prefix}.1._mean'),#bn + (f'{th_prefix}.1.running_var', f'{pp_prefix}.1._variance'),#bn + (f'{th_prefix}.3.weight', f'{pp_prefix}.3.weight'),#conv + (f'{th_prefix}.4.weight', f'{pp_prefix}.4.weight'),#bn + (f'{th_prefix}.4.bias', f'{pp_prefix}.4.bias'),#bn + (f'{th_prefix}.4.running_mean', f'{pp_prefix}.4._mean'),#bn + (f'{th_prefix}.4.running_var', f'{pp_prefix}.4._variance'),#bn + (f'{th_prefix}.6.weight', f'{pp_prefix}.6.weight'),#conv + (f'{th_prefix}.7.weight', f'{pp_prefix}.7.weight'),#bn + (f'{th_prefix}.7.bias', f'{pp_prefix}.7.bias'),#bn + (f'{th_prefix}.7.running_mean', f'{pp_prefix}.7._mean'),#bn + (f'{th_prefix}.7.running_var', f'{pp_prefix}.7._variance'),#bn + ] + mapping.extend(layer_mapping) + + # models + for idx, stage_idx in enumerate([0, 2, 3, 4]): + for layer_idx in range(config.MODEL.TRANS.LAYERS[idx]): + pp_prefix = f'model.{stage_idx}.{layer_idx}' + th_prefix = f'network.{stage_idx}.{layer_idx}' + + if config.MODEL.TRANS.OUTLOOK_ATTENTION[idx]: + layer_mapping = [ + (f'{th_prefix}.norm1', 
f'{pp_prefix}.norm1'), + (f'{th_prefix}.attn.v.weight', f'{pp_prefix}.attn.v.weight'), + (f'{th_prefix}.attn.attn', f'{pp_prefix}.attn.attn'), + (f'{th_prefix}.attn.proj', f'{pp_prefix}.attn.proj'), + (f'{th_prefix}.norm2', f'{pp_prefix}.norm2'), + (f'{th_prefix}.mlp.fc1', f'{pp_prefix}.mlp.fc1'), + (f'{th_prefix}.mlp.fc2', f'{pp_prefix}.mlp.fc2'), + ] + else: + layer_mapping = [ + (f'{th_prefix}.norm1', f'{pp_prefix}.norm1'), + (f'{th_prefix}.attn.qkv.weight', f'{pp_prefix}.attn.qkv.weight'), + (f'{th_prefix}.attn.proj', f'{pp_prefix}.attn.proj'), + (f'{th_prefix}.norm2', f'{pp_prefix}.norm2'), + (f'{th_prefix}.mlp.fc1', f'{pp_prefix}.mlp.fc1'), + (f'{th_prefix}.mlp.fc2', f'{pp_prefix}.mlp.fc2'), + ] + mapping.extend(layer_mapping) + + layer_mapping = [ + ('network.1.proj', 'model.1.proj'), + ] + mapping.extend(layer_mapping) + # Post layers + pp_prefix = f'post_model' + th_prefix = f'post_network' + for idx in range(2): + layer_mapping = [ + (f'{th_prefix}.{idx}.norm1', f'{pp_prefix}.{idx}.norm1'), + (f'{th_prefix}.{idx}.attn.kv.weight', f'{pp_prefix}.{idx}.attn.kv.weight'), + (f'{th_prefix}.{idx}.attn.q.weight', f'{pp_prefix}.{idx}.attn.q.weight'), + (f'{th_prefix}.{idx}.attn.proj', f'{pp_prefix}.{idx}.attn.proj'), + (f'{th_prefix}.{idx}.norm2', f'{pp_prefix}.{idx}.norm2'), + (f'{th_prefix}.{idx}.mlp.fc1', f'{pp_prefix}.{idx}.mlp.fc1'), + (f'{th_prefix}.{idx}.mlp.fc2', f'{pp_prefix}.{idx}.mlp.fc2'), + ] + mapping.extend(layer_mapping) + # Head layers + head_mapping = [ + ('aux_head', 'aux_head'), + ('norm', 'norm'), + ('head', 'head') + ] + mapping.extend(head_mapping) + + return mapping + + +def convert(torch_model, paddle_model): + def _set_value(th_name, pd_name): + th_shape = th_params[th_name].shape + pd_shape = tuple(pd_params[pd_name].shape) # paddle shape default type is list + #assert th_shape == pd_shape, f'{th_shape} != {pd_shape}' + print(f'set {th_name} {th_shape} to {pd_name} {pd_shape}') + value = th_params[th_name].cpu().data.numpy() + if len(value.shape) == 2: + value = value.transpose((1, 0)) + pd_params[pd_name].set_value(value) + + # 1. get paddle and torch model parameters + pd_params = {} + th_params = {} + for name, param in paddle_model.named_parameters(): + pd_params[name] = param + for name, param in paddle_model.named_buffers(): + pd_params[name] = param + + for name, param in torch_model.named_parameters(): + th_params[name] = param + for name, param in torch_model.named_buffers(): + th_params[name] = param + + # 2. get name mapping pairs + mapping = torch_to_paddle_mapping() + # 3. 
set torch param values to paddle params: may needs transpose on weights + for th_name, pd_name in mapping: + if th_name in th_params.keys(): # nn.Parameters + _set_value(th_name, pd_name) + else: # weight & bias + th_name_w = f'{th_name}.weight' + pd_name_w = f'{pd_name}.weight' + _set_value(th_name_w, pd_name_w) + + th_name_b = f'{th_name}.bias' + pd_name_b = f'{pd_name}.bias' + _set_value(th_name_b, pd_name_b) + + return paddle_model + + +def main(): + + paddle.set_device('cpu') + paddle_model = build_volo(config) + paddle_model.eval() + + print_model_named_params(paddle_model) + print_model_named_buffers(paddle_model) + + device = torch.device('cpu') + torch_model = model_type(img_size=config.DATA.IMAGE_SIZE) + + + #torch_model = volo_d5(img_size=config.DATA.IMAGE_SIZE) + load_pretrained_weights(torch_model, f'./pytorch/volo/{gmodel_path}.pth.tar', + #load_pretrained_weights(torch_model, './pytorch/volo/d5_224_86.10.pth.tar', + use_ema=False, strict=False, num_classes=1000) + torch_model = torch_model.to(device) + torch_model.eval() + + print_model_named_params(torch_model) + print_model_named_buffers(torch_model) + + # convert weights + paddle_model = convert(torch_model, paddle_model) + + # check correctness + x = np.random.randn(2, 3, sz, sz).astype('float32') + #x = np.random.randn(2, 3, 224, 224).astype('float32') + x_paddle = paddle.to_tensor(x) + x_torch = torch.Tensor(x).to(device) + + out_torch = torch_model(x_torch) + print('========================================================') + print('========================================================') + print('========================================================') + print('========================================================') + out_paddle = paddle_model(x_paddle) + + out_torch = out_torch.data.cpu().numpy() + out_paddle = out_paddle.cpu().numpy() + + print(out_torch.shape, out_paddle.shape) + print(out_torch[1, 0:100]) + print(out_paddle[1, 0:100]) + assert np.allclose(out_torch[0], out_paddle[0], atol = 1e-3) + print('===== out 0 equal OK') + assert np.allclose(out_torch[1], out_paddle[1], atol = 1e-3) + print('===== out 1 equal OK') + + # save weights for paddle model + print('===== saving .pdparams') + model_path = os.path.join(f'./{gmodel_path}.pdparams') + #model_path = os.path.join('./d5_512_87.07.pdparams') + paddle.save(paddle_model.state_dict(), model_path) + print('all done') + + +if __name__ == "__main__": + main() diff --git a/image_classification/VOLO/run_eval_multi_tmp.sh b/image_classification/VOLO/run_eval_multi_tmp.sh new file mode 100644 index 00000000..461a5868 --- /dev/null +++ b/image_classification/VOLO/run_eval_multi_tmp.sh @@ -0,0 +1,9 @@ +#CUDA_VISIBLE_DEVICES=4,5,6,7 \ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ +-cfg='./configs/volo_d1_224.yaml' \ +-dataset='imagenet2012' \ +-batch_size=64 \ +-data_path='/dataset/imagenet' \ +-eval \ +-pretrained='./d1_224_84.2' \ diff --git a/image_classification/VOLO/run_train.sh b/image_classification/VOLO/run_train.sh index f907e189..94cd7cd4 100644 --- a/image_classification/VOLO/run_train.sh +++ b/image_classification/VOLO/run_train.sh @@ -1,6 +1,7 @@ CUDA_VISIBLE_DEVICES=0 \ python main_single_gpu.py \ --cfg='./configs/volo_d5_224.yaml' \ +-cfg='./configs/volo_d1_224.yaml' \ -dataset='imagenet2012' \ -batch_size=8 \ -data_path='/dataset/imagenet' \ +-amp diff --git a/image_classification/VOLO/stat.py b/image_classification/VOLO/stat.py new file mode 100644 index 00000000..d5033b6a --- /dev/null +++ b/image_classification/VOLO/stat.py @@ 
-0,0 +1,64 @@ +import os +import glob +import paddle +from config import get_config +from volo import build_volo as build_model + +def count_gelu(layer, input, output): + activation_flops = 8 + x = input[0] + num = x.numel() + layer.total_ops += num * activation_flops + + +def count_softmax(layer, input, output): + softmax_flops = 5 # max/substract, exp, sum, divide + x = input[0] + num = x.numel() + layer.total_ops += num * softmax_flops + + +def count_layernorm(layer, input, output): + layer_norm_flops = 5 # get mean (sum), get variance (square and sum), scale(multiply) + x = input[0] + num = x.numel() + layer.total_ops += num * layer_norm_flops + + +for cfg in glob.glob('./configs/*_448.yaml'): + #cfg = './configs/pvtv2_b0.yaml' + #input_size = (1, 3, 512, 512) + #input_size = (1, 3, 224, 224) + #input_size = (1, 3, 384, 384) + input_size = (1, 3, 448, 448) + config = get_config(cfg) + model = build_model(config) + + custom_ops = {paddle.nn.GELU: count_gelu, + paddle.nn.LayerNorm: count_layernorm, + paddle.nn.Softmax: count_softmax, + } + print(os.path.basename(cfg)) + paddle.flops(model, + input_size=input_size, + custom_ops=custom_ops, + print_detail=False) + + +#for cfg in glob.glob('./configs/*.yaml'): +# #cfg = './configs/swin_base_patch4_window7_224.yaml' +# input_size = (1, 3, int(cfg[-8:-5]), int(cfg[-8:-5])) +# config = get_config(cfg) +# model = build_model(config) +# +# +# custom_ops = {paddle.nn.GELU: count_gelu, +# paddle.nn.LayerNorm: count_layernorm, +# paddle.nn.Softmax: count_softmax, +# } +# print(os.path.basename(cfg)) +# paddle.flops(model, +# input_size=input_size, +# custom_ops=custom_ops, +# print_detail=False) +# print('-----------') diff --git a/image_classification/VOLO/volo.py b/image_classification/VOLO/volo.py index 371ccf18..44bf5c6e 100644 --- a/image_classification/VOLO/volo.py +++ b/image_classification/VOLO/volo.py @@ -733,7 +733,8 @@ def forward(self, x): sbby1 = self.pooling_scale * bby1 sbbx2 = self.pooling_scale * bbx2 sbby2 = self.pooling_scale * bby2 - temp_x[:, sbbx1: sbbx2, sbby1: sbby2, :] = x.flip(axis=[0])[:, sbbx1: sbbx2, sbby1: sbby2, :] + if sbbx2 > sbbx1 and sbby2 > sbby1: + temp_x[:, sbbx1: sbbx2, sbby1: sbby2, :] = x.flip(axis=[0])[:, sbbx1: sbbx2, sbby1: sbby2, :] x = temp_x else: bbx1, bby1, bbx2, bby2 = 0, 0, 0, 0 @@ -770,7 +771,8 @@ def forward(self, x): if self.mix_token and self.training: x_aux = x_aux.reshape([x_aux.shape[0], patch_h, patch_w, x_aux.shape[-1]]) temp_x = x_aux.clone() - temp_x[:, bbx1:bbx2, bby1:bby2, :] = x_aux.flip(axis=[0])[:, bbx1:bbx2, bby1:bby2, :] + if bbx2 > bbx1 and bby2 > bby1: + temp_x[:, bbx1:bbx2, bby1:bby2, :] = x_aux.flip(axis=[0])[:, bbx1:bbx2, bby1:bby2, :] x_aux = temp_x x_aux = x_aux.reshape([x_aux.shape[0], patch_h*patch_w, x_aux.shape[-1]]) diff --git a/image_classification/ViP/README.md b/image_classification/ViP/README.md new file mode 100644 index 00000000..8680c635 --- /dev/null +++ b/image_classification/ViP/README.md @@ -0,0 +1,171 @@ +# Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition, [arxiv](https://arxiv.org/abs/2106.12368) + +PaddlePaddle training/validation code and pretrained models for **ViP**. + +The official and 3rd party pytorch implementation are [here](https://github.com/Andrew-Qibin/VisionPermutator). + + +This implementation is developed by [PPViT](https://github.com/BR-IDL/PaddleViT/). + +

+**ViP Model Overview** (architecture figure)

+ + +### Update +- Update (2021-11-03): Code and weights are updated. +- Update (2021-09-23): Code is released and ported weights are uploaded. + +## Models Zoo +| Model | Acc@1 | Acc@5 | #Params | FLOPs | Image Size | Crop_pct | Interpolation | Link | +|-------------------------------|-------|-------|---------|--------|------------|----------|---------------|--------------| +| vip_s7 | 81.50 | 95.76 | 25.1M | 7.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/16bZkqzbnN08_o15k3MzbegK8SBwfQAHF/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1uY0FsNPYaM8cr3ZCdAoVkQ)(mh9b) | +| vip_m7 | 82.75 | 96.05 | 55.3M | 16.4G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/11lvT2OXW0CVGPZdF9dNjY_uaEIMYrmNu/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1j3V0Q40iSqOY15bTKlFFRw)(hvm8) | +| vip_l7 | 83.18 | 96.37 | 87.8M | 24.5G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1bK08JorLPMjYUep_TnFPKGs0e1j0UBKJ/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1I5hnv3wHWEaG3vpDqaNL-w)(tjvh) | +> *The results are evaluated on ImageNet2012 validation set. +> +> Note: ViP weights are ported from [here](https://github.com/Andrew-Qibin/VisionPermutator) + + + +## Notebooks +We provide a few notebooks in aistudio to help you get started: + +**\*(coming soon)\*** + + +## Requirements +- Python>=3.6 +- yaml>=0.2.5 +- [PaddlePaddle](https://www.paddlepaddle.org.cn/documentation/docs/en/install/index_en.html)>=2.1.0 +- [yacs](https://github.com/rbgirshick/yacs)>=0.1.8 + +## Data +ImageNet2012 dataset is used in the following folder structure: +``` +│imagenet/ +├──train/ +│ ├── n01440764 +│ │ ├── n01440764_10026.JPEG +│ │ ├── n01440764_10027.JPEG +│ │ ├── ...... +│ ├── ...... +├──val/ +│ ├── n01440764 +│ │ ├── ILSVRC2012_val_00000293.JPEG +│ │ ├── ILSVRC2012_val_00002138.JPEG +│ │ ├── ...... +│ ├── ...... +``` + +## Usage +To use the model with pretrained weights, download the `.pdparam` weight file and change related file paths in the following python scripts. The model config files are located in `./configs/`. + +For example, assume the downloaded weight file is stored in `./vip_s7.pdparams`, to use the `vip_s7` model in python: +```python +from config import get_config +from vip import build_vip as build_model +# config files in ./configs/ +config = get_config('./configs/vip_s7.yaml') +# build model +model = build_model(config) +# load pretrained weights +model_state_dict = paddle.load('./vip_s7.pdparams') +model.set_dict(model_state_dict) +``` + +## Evaluation +To evaluate ViP model performance on ImageNet2012 with a single GPU, run the following script using command line: +```shell +sh run_eval.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg=./configs/vip_s7.yaml \ + -dataset=imagenet2012 \ + -batch_size=16 \ + -data_path=/path/to/dataset/imagenet/val \ + -eval \ + -pretrained=/path/to/pretrained/model/vip_s7 # .pdparams is NOT needed +``` + +
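+As a quick sanity check of the downloaded weights outside the evaluation scripts, the
+model can also be called directly on a random batch. This is only a minimal sketch that
+reuses the `Usage` snippet above; it assumes the `vip_s7` config/weights and that the
+classifier returns a single logits tensor of shape `[N, 1000]`:
+```python
+import paddle
+from config import get_config
+from vip import build_vip as build_model
+
+config = get_config('./configs/vip_s7.yaml')
+model = build_model(config)
+model.set_dict(paddle.load('./vip_s7.pdparams'))
+model.eval()
+
+# random NCHW batch at the configured input resolution (224 for vip_s7)
+x = paddle.randn([1, 3, config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE])
+with paddle.no_grad():
+    logits = model(x)
+print(logits.shape)  # expected: [1, 1000]
+```
+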
+ + +Run evaluation using multi-GPUs: + + + +```shell +sh run_eval_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg=./configs/vip_s7.yaml \ + -dataset=imagenet2012 \ + -batch_size=16 \ + -data_path=/path/to/dataset/imagenet/val \ + -eval \ + -pretrained=/path/to/pretrained/model/vip_s7 # .pdparams is NOT needed +``` + +
+ +## Training +To train the ViP Transformer model on ImageNet2012 with single GPUs, run the following script using command line: +```shell +sh run_train.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg=./configs/vip_s7.yaml \ + -dataset=imagenet2012 \ + -batch_size=32 \ + -data_path=/path/to/dataset/imagenet/train +``` + +
+ + +Run training using multi-GPUs: + + + +```shell +sh run_train_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg=./configs/vip_s7.yaml \ + -dataset=imagenet2012 \ + -batch_size=16 \ + -data_path=/path/to/dataset/imagenet/train +``` + +
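+Two details that are easy to miss in the commands above: appending `-amp` switches on
+mixed-precision training (it is force-disabled in eval mode by `update_config` in
+`config.py`), and when `TRAIN.LINEAR_SCALED_LR` is set (1024 in all ViP configs) the
+base learning rate is rescaled by the global batch size. A minimal sketch of the scaling
+rule from `main_multi_gpu.py`, plugging in the 4-GPU command above:
+```python
+# values taken from ./configs/vip_s7.yaml and the command above
+base_lr = 1e-3                 # TRAIN.BASE_LR
+linear_scaled_lr_base = 1024   # TRAIN.LINEAR_SCALED_LR
+batch_size, world_size, accum_iter = 16, 4, 1
+
+# WARMUP_START_LR and END_LR are rescaled with the same factor
+lr = base_lr * batch_size * world_size / linear_scaled_lr_base * accum_iter
+print(lr)  # 6.25e-05, the value actually handed to the scheduler/optimizer
+```
+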
+ + +## Visualization Attention Map +**(coming soon)** + +## Reference +``` +@misc{hou2021vision, + title={Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition}, + author={Qibin Hou and Zihang Jiang and Li Yuan and Ming-Ming Cheng and Shuicheng Yan and Jiashi Feng}, + year={2021}, + eprint={2106.12368}, + archivePrefix={arXiv}, + primaryClass={cs.CV} +} +``` diff --git a/image_classification/ViP/__init__.py b/image_classification/ViP/__init__.py new file mode 100644 index 00000000..e2cbd538 --- /dev/null +++ b/image_classification/ViP/__init__.py @@ -0,0 +1 @@ +#init diff --git a/image_classification/ViP/augment.py b/image_classification/ViP/augment.py new file mode 100644 index 00000000..7a7f081c --- /dev/null +++ b/image_classification/ViP/augment.py @@ -0,0 +1,285 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Augmentation""" +""" Rand Augmentation """ +# reference: RandAugment: Practical automated data augmentation with a reduced search space +# https://arxiv.org/abs/1909.13719 + +""" Auto Augmentation """ +# reference: AutoAugment: Learning Augmentation Policies from Data +# https://arxiv.org/abs/1805.09501 + +import random +import numpy as np +from PIL import Image, ImageEnhance, ImageOps + + +def auto_augment_policy_original(): + """25 types of augment policies in original paper""" + policy = [ + [('Posterize', 0.4, 8), ('Rotate', 0.6, 9)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + [('Posterize', 0.6, 7), ('Posterize', 0.6, 6)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Equalize', 0.4, 4), ('Rotate', 0.8, 8)], + [('Solarize', 0.6, 3), ('Equalize', 0.6, 7)], + [('Posterize', 0.8, 5), ('Equalize', 1.0, 2)], + [('Rotate', 0.2, 3), ('Solarize', 0.6, 8)], + [('Equalize', 0.6, 8), ('Posterize', 0.4, 6)], + [('Rotate', 0.8, 8), ('Color', 0.4, 0)], + [('Rotate', 0.4, 9), ('Equalize', 0.6, 2)], + [('Equalize', 0.0, 7), ('Equalize', 0.8, 8)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Rotate', 0.8, 8), ('Color', 1.0, 2)], + [('Color', 0.8, 8), ('Solarize', 0.8, 7)], + [('Sharpness', 0.4, 7), ('Invert', 0.6, 8)], + [('ShearX', 0.6, 5), ('Equalize', 1.0, 9)], + [('Color', 0.4, 0), ('Equalize', 0.6, 3)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + ] + policy = [[SubPolicy(*args) for args in subpolicy] for subpolicy in policy] + return policy + + +def rand_augment_policy_original(magnitude_idx=9): + """ + 14 types of augment policies in original paper + Args: + magnitude_idx: M + """ + policy = [ + ('Posterize', 1, magnitude_idx), ('Rotate', 1, magnitude_idx), + ('Solarize', 1, magnitude_idx), ('AutoContrast', 1, magnitude_idx), + ('Equalize', 1, magnitude_idx), ('Contrast', 1, magnitude_idx), + 
('Color', 1, magnitude_idx), ('Invert', 1, magnitude_idx), + ('Sharpness', 1, magnitude_idx), ('Brightness', 1, magnitude_idx), + ('ShearX', 1, magnitude_idx), ('ShearY', 1, magnitude_idx), + ('TranslateX', 1, magnitude_idx), ('TranslateY', 1, magnitude_idx), + ] + policy = [SubPolicy(*args) for args in policy] + return policy + + +class AutoAugment(): + """Auto Augment + Randomly choose a tuple of augment ops from a list of policy + Then apply the tuple of augment ops to input image + + Examples: + policy = auto_augment_policy_original() + augment = AutoAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy): + self.policy = policy + + def __call__(self, image, policy_idx=None): + if policy_idx is None: + policy_idx = random.randint(0, len(self.policy) - 1) + + sub_policy = self.policy[policy_idx] + for op in sub_policy: + image = op(image) + return image + + +class RandAugment(): + """Rand Augment + Randomly choose N augment ops from a list of K policies + Then apply the N ops to input image + + Examples: + policy = rand_augment_policy_original(magnitude_idx) + augment = RandAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy, num_layers=2): + """ + Args: + policy: list of SubPolicy + num_layers: int + """ + self.policy = policy + self.num_layers = num_layers + + def __call__(self, image): + selected_idx = np.random.choice(len(self.policy), self.num_layers) + + for policy_idx in selected_idx: + sub_policy = self.policy[policy_idx] + image = sub_policy(image) + return image + + +class SubPolicy: + """Subpolicy + Read augment name and magnitude, apply augment with probability + Args: + op_name: str, augment operation name + prob: float, if prob > random prob, apply augment + magnitude_idx: int, index of magnitude in preset magnitude ranges + """ + + def __init__(self, op_name, prob, magnitude_idx): + # ranges of operations' magnitude + ranges = { + 'ShearX': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'ShearY': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'TranslateX': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'TranslateY': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'Rotate': np.linspace(0, 30, 10), # [-30, 30] (by random negative) + 'Color': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Posterize': np.round(np.linspace(8, 4, 10), 0).astype(np.int), # [0, 4] + 'Solarize': np.linspace(256, 0, 10), # [0, 256] + 'Contrast': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Sharpness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Brightness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'AutoContrast': [0] * 10, # no range + 'Equalize': [0] * 10, # no range + 'Invert': [0] * 10, # no range + } + + # augmentation operations + # Lambda is not pickleable for DDP + # image_ops = { + # 'ShearX': lambda image, magnitude: shear_x(image, magnitude), + # 'ShearY': lambda image, magnitude: shear_y(image, magnitude), + # 'TranslateX': lambda image, magnitude: translate_x(image, magnitude), + # 'TranslateY': lambda image, magnitude: translate_y(image, magnitude), + # 'Rotate': lambda image, magnitude: rotate(image, magnitude), + # 'AutoContrast': lambda image, magnitude: auto_contrast(image, magnitude), + # 'Invert': lambda image, magnitude: invert(image, magnitude), + # 'Equalize': lambda image, magnitude: equalize(image, magnitude), + # 'Solarize': lambda image, magnitude: 
solarize(image, magnitude), + # 'Posterize': lambda image, magnitude: posterize(image, magnitude), + # 'Contrast': lambda image, magnitude: contrast(image, magnitude), + # 'Color': lambda image, magnitude: color(image, magnitude), + # 'Brightness': lambda image, magnitude: brightness(image, magnitude), + # 'Sharpness': lambda image, magnitude: sharpness(image, magnitude), + # } + image_ops = { + 'ShearX': shear_x, + 'ShearY': shear_y, + 'TranslateX': translate_x_relative, + 'TranslateY': translate_y_relative, + 'Rotate': rotate, + 'AutoContrast': auto_contrast, + 'Invert': invert, + 'Equalize': equalize, + 'Solarize': solarize, + 'Posterize': posterize, + 'Contrast': contrast, + 'Color': color, + 'Brightness': brightness, + 'Sharpness': sharpness, + } + + self.prob = prob + self.magnitude = ranges[op_name][magnitude_idx] + self.op = image_ops[op_name] + + def __call__(self, image): + if self.prob > random.random(): + image = self.op(image, self.magnitude) + return image + + +# PIL Image transforms +# https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.transform +def shear_x(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, factor, 0, 0, 1, 0), fillcolor=fillcolor) + + +def shear_y(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, factor, 1, 0), fillcolor=fillcolor) + + +def translate_x_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, pixels, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, pixels), fillcolor=fillcolor) + + +def translate_x_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, magnitude, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, magnitude), fillcolor=fillcolor) + + +def rotate(image, magnitude): + rot = image.convert("RGBA").rotate(magnitude) + return Image.composite(rot, + Image.new('RGBA', rot.size, (128,) * 4), + rot).convert(image.mode) + + +def auto_contrast(image, magnitude=None): + return ImageOps.autocontrast(image) + + +def invert(image, magnitude=None): + return ImageOps.invert(image) + + +def equalize(image, magnitude=None): + return ImageOps.equalize(image) + + +def solarize(image, magnitude): + return ImageOps.solarize(image, magnitude) + + +def posterize(image, magnitude): + return ImageOps.posterize(image, magnitude) + + +def contrast(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Contrast(image).enhance(1 + magnitude) + + +def color(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Color(image).enhance(1 + magnitude) + + +def brightness(image, magnitude): + magnitude = magnitude * 
random.choice([-1, 1]) # random negative + return ImageEnhance.Brightness(image).enhance(1 + magnitude) + + +def sharpness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Sharpness(image).enhance(1 + magnitude) + diff --git a/image_classification/ViP/config.py b/image_classification/ViP/config.py new file mode 100644 index 00000000..fd6f36a7 --- /dev/null +++ b/image_classification/ViP/config.py @@ -0,0 +1,178 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Configuration + +Configuration for data, model archtecture, and training, etc. +Config can be set by .yaml file or by argparser(limited usage) + + +""" + +import os +from yacs.config import CfgNode as CN +import yaml + +_C = CN() +_C.BASE = [''] + +# data settings +_C.DATA = CN() +_C.DATA.BATCH_SIZE = 8 #1024 batch_size for single GPU +_C.DATA.BATCH_SIZE_EVAL = 8 #1024 batch_size for single GPU +_C.DATA.DATA_PATH = '/dataset/imagenet/' # path to dataset +_C.DATA.DATASET = 'imagenet2012' # dataset name +_C.DATA.IMAGE_SIZE = 224 # input image size +_C.DATA.CROP_PCT = 0.9 # input image scale ratio, scale is applied before centercrop in eval mode +_C.DATA.NUM_WORKERS = 4 # number of data loading threads +_C.DATA.IMAGENET_MEAN = [0.485, 0.456, 0.406] # [0.5, 0.5, 0.5] +_C.DATA.IMAGENET_STD = [0.229, 0.224, 0.225] # [0.5, 0.5, 0.5] + +# model settings +_C.MODEL = CN() +_C.MODEL.TYPE = 'ViP' +_C.MODEL.NAME = 'ViP' +_C.MODEL.RESUME = None +_C.MODEL.PRETRAINED = None +_C.MODEL.NUM_CLASSES = 1000 +_C.MODEL.DROPPATH = 0.1 +_C.MODEL.DROPOUT = 0. +_C.MODEL.ATTENTION_DROPOUT = 0. 
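+# NOTE: the MIXER defaults below mirror ./configs/vip_s7.yaml; vip_m7 and vip_l7
+# override LAYER, TRANSITIONS, SEGMENT_DIM, EMBED_DIMS and DROPPATH in their yaml files.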
+ +# transformer settings +_C.MODEL.MIXER = CN() +_C.MODEL.MIXER.LAYER = [4, 3, 8, 3] +_C.MODEL.MIXER.EMBED_DIMS = [192, 384, 384, 384] +_C.MODEL.MIXER.TRANSITIONS = [True, False, False, False] +_C.MODEL.MIXER.SEGMENT_DIM = [32, 16, 16, 16] + +# training settings +_C.TRAIN = CN() +_C.TRAIN.LAST_EPOCH = 0 +_C.TRAIN.NUM_EPOCHS = 300 +_C.TRAIN.WARMUP_EPOCHS = 20 +_C.TRAIN.WEIGHT_DECAY = 0.05 +_C.TRAIN.BASE_LR = 1e-3 +_C.TRAIN.WARMUP_START_LR = 1e-6 +_C.TRAIN.END_LR = 5e-6 +_C.TRAIN.GRAD_CLIP = 5.0 +_C.TRAIN.ACCUM_ITER = 1 +_C.TRAIN.LINEAR_SCALED_LR = None + +_C.TRAIN.LR_SCHEDULER = CN() +_C.TRAIN.LR_SCHEDULER.NAME = 'warmupcosine' +_C.TRAIN.LR_SCHEDULER.MILESTONES = "30, 60, 90" # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_EPOCHS = 30 # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_RATE = 0.1 # only used in StepLRScheduler + +_C.TRAIN.OPTIMIZER = CN() +_C.TRAIN.OPTIMIZER.NAME = 'AdamW' +_C.TRAIN.OPTIMIZER.EPS = 1e-8 +_C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999) # for adamW +_C.TRAIN.OPTIMIZER.MOMENTUM = 0.9 + +# train augmentation +_C.TRAIN.MIXUP_ALPHA = 0.8 +_C.TRAIN.CUTMIX_ALPHA = 1.0 +_C.TRAIN.CUTMIX_MINMAX = None +_C.TRAIN.MIXUP_PROB = 1.0 +_C.TRAIN.MIXUP_SWITCH_PROB = 0.5 +_C.TRAIN.MIXUP_MODE = 'batch' + +_C.TRAIN.SMOOTHING = 0.1 +_C.TRAIN.COLOR_JITTER = 0.4 +_C.TRAIN.AUTO_AUGMENT = True #'rand-m9-mstd0.5-inc1' +_C.TRAIN.RAND_AUGMENT = False #'rand-m9-mstd0.5-inc1' + +_C.TRAIN.RANDOM_ERASE_PROB = 0.25 +_C.TRAIN.RANDOM_ERASE_MODE = 'pixel' +_C.TRAIN.RANDOM_ERASE_COUNT = 1 +_C.TRAIN.RANDOM_ERASE_SPLIT = False + + +# misc +_C.SAVE = "./output" +_C.TAG = "default" +_C.SAVE_FREQ = 1 # freq to save chpt +_C.REPORT_FREQ = 50 # freq to logging info +_C.VALIDATE_FREQ = 10 # freq to do validation +_C.SEED = 0 +_C.EVAL = False # run evaluation only +_C.AMP = False +_C.LOCAL_RANK = 0 +_C.NGPUS = -1 + + +def _update_config_from_file(config, cfg_file): + config.defrost() + with open(cfg_file, 'r') as infile: + yaml_cfg = yaml.load(infile, Loader=yaml.FullLoader) + for cfg in yaml_cfg.setdefault('BASE', ['']): + if cfg: + _update_config_from_file( + config, os.path.join(os.path.dirname(cfg_file), cfg) + ) + print('merging config from {}'.format(cfg_file)) + config.merge_from_file(cfg_file) + config.freeze() + +def update_config(config, args): + """Update config by ArgumentParser + Args: + args: ArgumentParser contains options + Return: + config: updated config + """ + if args.cfg: + _update_config_from_file(config, args.cfg) + config.defrost() + if args.dataset: + config.DATA.DATASET = args.dataset + if args.batch_size: + config.DATA.BATCH_SIZE = args.batch_size + if args.image_size: + config.DATA.IMAGE_SIZE = args.image_size + if args.num_classes: + config.MODEL.NUM_CLASSES = args.num_classes + if args.data_path: + config.DATA.DATA_PATH = args.data_path + if args.output is not None: + config.SAVE = args.output + if args.ngpus: + config.NGPUS = args.ngpus + if args.eval: + config.EVAL = True + config.DATA.BATCH_SIZE_EVAL = args.batch_size + if args.pretrained: + config.MODEL.PRETRAINED = args.pretrained + if args.resume: + config.MODEL.RESUME = args.resume + if args.last_epoch: + config.TRAIN.LAST_EPOCH = args.last_epoch + if args.amp: # only during training + if config.EVAL is True: + config.AMP = False + else: + config.AMP = True + + #config.freeze() + return config + + +def get_config(cfg_file=None): + """Return a clone of config or load from yaml file""" + config = _C.clone() + if cfg_file: + _update_config_from_file(config, cfg_file) + return config diff --git 
a/image_classification/ViP/configs/vip_l7.yaml b/image_classification/ViP/configs/vip_l7.yaml new file mode 100644 index 00000000..8be0c505 --- /dev/null +++ b/image_classification/ViP/configs/vip_l7.yaml @@ -0,0 +1,18 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: ViP + NAME: vip_l7 + MIXER: + LAYER: [8, 8, 16, 4] + TRANSITIONS: [True, False, False, False] + SEGMENT_DIM: [32, 16, 16, 16] + EMBED_DIMS: [256, 512, 512, 512] + DROPPATH: 0.3 + DROPOUT: 0.0 + ATTENTION_DROPOUT: 0.0 +TRAIN: + BASE_LR: 1e-3 + LINEAR_SCALED_LR: 1024 + WEIGHT_DECAY: 5e-2 diff --git a/image_classification/ViP/configs/vip_m7.yaml b/image_classification/ViP/configs/vip_m7.yaml new file mode 100644 index 00000000..3eb9b05f --- /dev/null +++ b/image_classification/ViP/configs/vip_m7.yaml @@ -0,0 +1,18 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: ViP + NAME: vip_m7 + MIXER: + LAYER: [4, 3, 14, 3] + TRANSITIONS: [False, True, False, False] + SEGMENT_DIM: [32, 32, 16, 16] + EMBED_DIMS: [256, 256, 512, 512] + DROPPATH: 0.2 + DROPOUT: 0.0 + ATTENTION_DROPOUT: 0.0 +TRAIN: + BASE_LR: 1e-3 + LINEAR_SCALED_LR: 1024 + WEIGHT_DECAY: 5e-2 diff --git a/image_classification/ViP/configs/vip_s7.yaml b/image_classification/ViP/configs/vip_s7.yaml new file mode 100644 index 00000000..14b97259 --- /dev/null +++ b/image_classification/ViP/configs/vip_s7.yaml @@ -0,0 +1,19 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: ViP + NAME: vip_s7 + MIXER: + LAYER: [4, 3, 8, 3] + TRANSITIONS: [True, False, False, False] + SEGMENT_DIM: [32, 16, 16, 16] + EMBED_DIMS: [192, 384, 384, 384] + DROPPATH: 0.1 + DROPOUT: 0.0 + ATTENTION_DROPOUT: 0.0 +TRAIN: + BASE_LR: 1e-3 + LINEAR_SCALED_LR: 1024 + WEIGHT_DECAY: 5e-2 + diff --git a/image_classification/ViP/datasets.py b/image_classification/ViP/datasets.py new file mode 100644 index 00000000..9b8cbd2d --- /dev/null +++ b/image_classification/ViP/datasets.py @@ -0,0 +1,221 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Dataset related classes and methods for ViT training and validation +Cifar10, Cifar100 and ImageNet2012 are supported +""" + +import os +import math +from PIL import Image +from paddle.io import Dataset +from paddle.io import DataLoader +from paddle.io import DistributedBatchSampler +from paddle.vision import transforms +from paddle.vision import datasets +from paddle.vision import image_load +from augment import auto_augment_policy_original +from augment import AutoAugment +from augment import rand_augment_policy_original +from augment import RandAugment +from transforms import RandomHorizontalFlip +from random_erasing import RandomErasing + +class ImageNet2012Dataset(Dataset): + """Build ImageNet2012 dataset + + This class gets train/val imagenet datasets, which loads transfomed data and labels. 
+ + Attributes: + file_folder: path where imagenet images are stored + transform: preprocessing ops to apply on image + img_path_list: list of full path of images in whole dataset + label_list: list of labels of whole dataset + """ + + def __init__(self, file_folder, mode="train", transform=None): + """Init ImageNet2012 Dataset with dataset file path, mode(train/val), and transform""" + super(ImageNet2012Dataset, self).__init__() + assert mode in ["train", "val"] + self.file_folder = file_folder + self.transform = transform + self.img_path_list = [] + self.label_list = [] + + if mode == "train": + self.list_file = os.path.join(self.file_folder, "train_list.txt") + else: + self.list_file = os.path.join(self.file_folder, "val_list.txt") + + with open(self.list_file, 'r') as infile: + for line in infile: + img_path = line.strip().split()[0] + img_label = int(line.strip().split()[1]) + self.img_path_list.append(os.path.join(self.file_folder, img_path)) + self.label_list.append(img_label) + print(f'----- Imagenet2012 image {mode} list len = {len(self.label_list)}') + + def __len__(self): + return len(self.label_list) + + def __getitem__(self, index): + data = Image.open(self.img_path_list[index]).convert('RGB') + data = self.transform(data) + label = self.label_list[index] + + return data, label + + +def get_train_transforms(config): + """ Get training transforms + + For training, a RandomResizedCrop is applied, then normalization is applied with + [0.5, 0.5, 0.5] mean and std. The input pixel values must be rescaled to [0, 1.] + Outputs is converted to tensor + + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + + aug_op_list = [] + # STEP1: random crop and resize + aug_op_list.append( + transforms.RandomResizedCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE), + scale=(0.05, 1.0), interpolation='bicubic')) + # STEP2: auto_augment or color jitter + if config.TRAIN.AUTO_AUGMENT: + policy = auto_augment_policy_original() + auto_augment = AutoAugment(policy) + aug_op_list.append(auto_augment) + elif config.TRAIN.RAND_AUGMENT: + policy = rand_augment_policy_original() + rand_augment = RandAugment(policy) + aug_op_list.append(rand_augment) + else: + jitter = (float(config.TRAIN.COLOR_JITTER),) * 3 + aug_op_list.append(transforms.ColorJitter(*jitter)) + # STEP3: other ops + aug_op_list.append(transforms.ToTensor()) + aug_op_list.append(transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, + std=config.DATA.IMAGENET_STD)) + # STEP4: random erasing + if config.TRAIN.RANDOM_ERASE_PROB > 0.: + random_erasing = RandomErasing(prob=config.TRAIN.RANDOM_ERASE_PROB, + mode=config.TRAIN.RANDOM_ERASE_MODE, + max_count=config.TRAIN.RANDOM_ERASE_COUNT, + num_splits=config.TRAIN.RANDOM_ERASE_SPLIT) + aug_op_list.append(random_erasing) + # Final: compose transforms and return + transforms_train = transforms.Compose(aug_op_list) + return transforms_train + + +def get_val_transforms(config): + """ Get training transforms + + For validation, image is first Resize then CenterCrop to image_size. + Then normalization is applied with [0.5, 0.5, 0.5] mean and std. + The input pixel values must be rescaled to [0, 1.] 
+ Outputs is converted to tensor + + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + + scale_size = int(math.floor(config.DATA.IMAGE_SIZE / config.DATA.CROP_PCT)) + transforms_val = transforms.Compose([ + transforms.Resize(scale_size, interpolation='bicubic'), + transforms.CenterCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE)), + transforms.ToTensor(), + transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD), + ]) + return transforms_val + + +def get_dataset(config, mode='train'): + """ Get dataset from config and mode (train/val) + + Returns the related dataset object according to configs and mode(train/val) + + Args: + config: configs contains dataset related settings. see config.py for details + Returns: + dataset: dataset object + """ + + assert mode in ['train', 'val'] + if config.DATA.DATASET == "cifar10": + if mode == 'train': + dataset = datasets.Cifar10(mode=mode, transform=get_train_transforms(config)) + else: + mode = 'test' + dataset = datasets.Cifar10(mode=mode, transform=get_val_transforms(config)) + elif config.DATA.DATASET == "cifar100": + if mode == 'train': + dataset = datasets.Cifar100(mode=mode, transform=get_train_transforms(config)) + else: + mode = 'test' + dataset = datasets.Cifar100(mode=mode, transform=get_val_transforms(config)) + elif config.DATA.DATASET == "imagenet2012": + if mode == 'train': + dataset = ImageNet2012Dataset(config.DATA.DATA_PATH, + mode=mode, + transform=get_train_transforms(config)) + else: + dataset = ImageNet2012Dataset(config.DATA.DATA_PATH, + mode=mode, + transform=get_val_transforms(config)) + else: + raise NotImplementedError( + "[{config.DATA.DATASET}] Only cifar10, cifar100, imagenet2012 are supported now") + return dataset + + +def get_dataloader(config, dataset, mode='train', multi_process=False): + """Get dataloader with config, dataset, mode as input, allows multiGPU settings. + + Multi-GPU loader is implements as distributedBatchSampler. + + Args: + config: see config.py for details + dataset: paddle.io.dataset object + mode: train/val + multi_process: if True, use DistributedBatchSampler to support multi-processing + Returns: + dataloader: paddle.io.DataLoader object. + """ + + if mode == 'train': + batch_size = config.DATA.BATCH_SIZE + else: + batch_size = config.DATA.BATCH_SIZE_EVAL + + if multi_process is True: + sampler = DistributedBatchSampler(dataset, + batch_size=batch_size, + shuffle=(mode == 'train')) + dataloader = DataLoader(dataset, + batch_sampler=sampler, + num_workers=config.DATA.NUM_WORKERS) + else: + dataloader = DataLoader(dataset, + batch_size=batch_size, + num_workers=config.DATA.NUM_WORKERS, + shuffle=(mode == 'train')) + return dataloader diff --git a/image_classification/ViP/droppath.py b/image_classification/ViP/droppath.py new file mode 100644 index 00000000..65e0a782 --- /dev/null +++ b/image_classification/ViP/droppath.py @@ -0,0 +1,49 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Droppath, reimplement from https://github.com/yueatsprograms/Stochastic_Depth +""" + +import paddle +import paddle.nn as nn + +def drop_path(inputs, drop_prob=0., training=False): + """drop path op + Args: + input: tensor with arbitrary shape + drop_prob: float number of drop path probability, default: 0.0 + training: bool, if current mode is training, default: False + Returns: + output: output tensor after drop path + """ + # if prob is 0 or eval mode, return original input + if drop_prob == 0. or not training: + return inputs + keep_prob = 1 - drop_prob + shape = (inputs.shape[0], ) + (1, ) * (inputs.ndim - 1) # shape=(N, 1, 1, 1) + random_tensor = keep_prob + paddle.rand(shape, dtype=inputs.dtype) + random_tensor = random_tensor.floor() # mask + output = inputs.divide(keep_prob) * random_tensor # divide is to keep same output expectation + return output + + +class DropPath(nn.Layer): + """DropPath class""" + def __init__(self, drop_prob=None): + super(DropPath, self).__init__() + self.drop_prob = drop_prob + + def forward(self, inputs): + return drop_path(inputs, self.drop_prob, self.training) diff --git a/image_classification/ViP/losses.py b/image_classification/ViP/losses.py new file mode 100644 index 00000000..f67780a2 --- /dev/null +++ b/image_classification/ViP/losses.py @@ -0,0 +1,121 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
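+# NOTE: in main_multi_gpu.py the training criterion is chosen from the classes below:
+# SoftTargetCrossEntropyLoss when mixup/cutmix is active, LabelSmoothingCrossEntropyLoss
+# when only label smoothing is set, and paddle.nn.CrossEntropyLoss otherwise;
+# validation always uses paddle.nn.CrossEntropyLoss.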
+ +""" Implement Loss functions """ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class LabelSmoothingCrossEntropyLoss(nn.Layer): + """ cross entropy loss for label smoothing + Args: + smoothing: float, smoothing rate + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, target label with shape [N] + Return: + loss: float, cross entropy loss value + """ + def __init__(self, smoothing=0.1): + super().__init__() + assert 0 <= smoothing < 1.0 + self.smoothing = smoothing + self.confidence = 1 - smoothing + + def forward(self, x, target): + log_probs = F.log_softmax(x) # [N, num_classes] + # target_index is used to get prob for each of the N samples + target_index = paddle.zeros([x.shape[0], 2], dtype='int64') # [N, 2] + target_index[:, 0] = paddle.arange(x.shape[0]) + target_index[:, 1] = target + + nll_loss = -log_probs.gather_nd(index=target_index) # index: [N] + smooth_loss = -log_probs.mean(axis=-1) + loss = self.confidence * nll_loss + self.smoothing * smooth_loss + return loss.mean() + + +class SoftTargetCrossEntropyLoss(nn.Layer): + """ cross entropy loss for soft target + Args: + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, soft target with shape [N, num_classes] + Returns: + loss: float, the mean loss value + """ + def __init__(self): + super().__init__() + + def forward(self, x, target): + loss = paddle.sum(-target * F.log_softmax(x, axis=-1), axis=-1) + return loss.mean() + + +class DistillationLoss(nn.Layer): + """Distillation loss function + This layer includes the orginal loss (criterion) and a extra + distillation loss (criterion), which computes the loss with + different type options, between current model and + a teacher model as its supervision. 
+ + Args: + base_criterion: nn.Layer, the original criterion + teacher_model: nn.Layer, the teacher model as supervision + distillation_type: str, one of ['none', 'soft', 'hard'] + alpha: float, ratio of base loss (* (1-alpha)) + and distillation loss( * alpha) + tao: float, temperature in distillation + """ + def __init__(self, + base_criterion, + teacher_model, + distillation_type, + alpha, + tau): + super().__init__() + assert distillation_type in ['none', 'soft', 'hard'] + self.base_criterion = base_criterion + self.teacher_model = teacher_model + self.type = distillation_type + self.alpha = alpha + self.tau = tau + + def forward(self, inputs, outputs, targets): + """ + Args: + inputs: tensor, the orginal model inputs + outputs: tensor, the outputs of the model + outputds_kd: tensor, the distillation outputs of the model, + this is usually obtained by a separate branch + in the last layer of the model + targets: tensor, the labels for the base criterion + """ + outputs, outputs_kd = outputs[0], outputs[1] + base_loss = self.base_criterion(outputs, targets) + if self.type == 'none': + return base_loss + + with paddle.no_grad(): + teacher_outputs = self.teacher_model(inputs) + + if self.type == 'soft': + distillation_loss = F.kl_div( + F.log_softmax(outputs_kd / self.tau, axis=1), + F.log_softmax(teacher_outputs / self.tau, axis=1), + reduction='sum') * (self.tau * self.tau) / outputs_kd.numel() + elif self.type == 'hard': + distillation_loss = F.cross_entropy(outputs_kd, teacher_outputs.argmax(axis=1)) + + loss = base_loss * (1 - self.alpha) + distillation_loss * self.alpha + return loss diff --git a/image_classification/ViP/main_multi_gpu.py b/image_classification/ViP/main_multi_gpu.py new file mode 100644 index 00000000..6bd4fdf3 --- /dev/null +++ b/image_classification/ViP/main_multi_gpu.py @@ -0,0 +1,585 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
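+# NOTE: main() spawns config.NGPUS worker processes via paddle.distributed.spawn; each
+# worker runs main_worker(), builds its own DistributedBatchSampler-backed dataloader,
+# and all-reduces loss/accuracy so the rank-0 master logger reports global averages.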
+ +"""ViP training/validation using multiple GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +import paddle.distributed as dist +from datasets import get_dataloader +from datasets import get_dataset +from utils import AverageMeter +from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn +from config import get_config +from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from losses import DistillationLoss +from vip import build_vip as build_model + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('Swin') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + mixup_fn=None, + amp=False, + local_logger=None, + master_logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + master_train_loss_meter.avg: float, average loss on all processes/gpus + master_train_acc_meter.avg: float, average top1 accuracy on all processes/gpus + train_time: float, training time + """ + 
model.train() + train_loss_meter = AverageMeter() + train_acc_meter = AverageMeter() + master_train_loss_meter = AverageMeter() + master_train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() + + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = paddle.to_tensor(image.shape[0]) + + # sync from other gpus for overall loss and acc + master_loss = loss.clone() + master_acc = acc.clone() + master_batch_size = batch_size.clone() + dist.all_reduce(master_loss) + dist.all_reduce(master_acc) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc = master_acc / dist.get_world_size() + master_train_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_train_acc_meter.update(master_acc.numpy()[0], master_batch_size.numpy()[0]) + + train_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + train_acc_meter.update(acc.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_train_loss_meter.avg:.4f}, " + + f"Avg Acc: {master_train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return (train_loss_meter.avg, + train_acc_meter.avg, + master_train_loss_meter.avg, + master_train_acc_meter.avg, + train_time) + + +def validate(dataloader, + model, + criterion, + total_batch, + debug_steps=100, + local_logger=None, + master_logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + total_epoch: int, total num of epoch, for logging + debug_steps: int, num of iters to log info, default: 100 + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + master_val_loss_meter.avg: float, average loss on 
all processes/gpus + master_val_acc1_meter.avg: float, average top1 accuracy on all processes/gpus + master_val_acc5_meter.avg: float, average top5 accuracy on all processes/gpus + val_time: float, validation time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = AverageMeter() + master_val_loss_meter = AverageMeter() + master_val_acc1_meter = AverageMeter() + master_val_acc5_meter = AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = paddle.to_tensor(image.shape[0]) + + master_loss = loss.clone() + master_acc1 = acc1.clone() + master_acc5 = acc5.clone() + master_batch_size = batch_size.clone() + + dist.all_reduce(master_loss) + dist.all_reduce(master_acc1) + dist.all_reduce(master_acc5) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc1 = master_acc1 / dist.get_world_size() + master_acc5 = master_acc5 / dist.get_world_size() + + master_val_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc1_meter.update(master_acc1.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc5_meter.update(master_acc5.numpy()[0], master_batch_size.numpy()[0]) + + val_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + val_acc1_meter.update(acc1.numpy()[0], batch_size.numpy()[0]) + val_acc5_meter.update(acc5.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {master_val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {master_val_acc5_meter.avg:.4f}") + val_time = time.time() - time_st + return (val_loss_meter.avg, + val_acc1_meter.avg, + val_acc5_meter.avg, + master_val_loss_meter.avg, + master_val_acc1_meter.avg, + master_val_acc5_meter.avg, + val_time) + + +def main_worker(*args): + # STEP 0: Preparation + config = args[0] + dist.init_parallel_env() + last_epoch = config.TRAIN.LAST_EPOCH + world_size = dist.get_world_size() + local_rank = dist.get_rank() + seed = config.SEED + local_rank + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + # logger for each process/gpu + local_logger = get_logger( + filename=os.path.join(config.SAVE, 'log_{}.txt'.format(local_rank)), + logger_name='local_logger') + # overall logger + if local_rank == 0: + master_logger = get_logger( + filename=os.path.join(config.SAVE, 'log.txt'), + logger_name='master_logger') + master_logger.info(f'\n{config}') + else: + master_logger = None + local_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + if local_rank == 0: + master_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + + # STEP 1: Create model + model = build_model(config) + model = paddle.DataParallel(model) + + # STEP 2: Create train and val dataloader + dataset_train, dataset_val = args[1], args[2] + # Create training dataloader + 
if not config.EVAL: + dataloader_train = get_dataloader(config, dataset_train, 'train', True) + total_batch_train = len(dataloader_train) + local_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + if local_rank == 0: + master_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + # Create validation dataloader + dataloader_val = get_dataloader(config, dataset_val, 'test', True) + total_batch_val = len(dataloader_val) + local_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + if local_rank == 0: + master_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH, + ) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + local_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise 
NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip, + apply_decay_param_fun=get_exclude_from_weight_decay_fn([ + 'absolute_pos_embed', 'relative_position_bias_table']), + ) + else: + local_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 6: Load pretrained model / load resumt model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + model.set_dict(model_state) + local_logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + if local_rank == 0: + master_logger.info( + f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') + optimizer.set_state_dict(opt_state) + local_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + if local_rank == 0: + master_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + + # STEP 7: Validation (eval mode) + if config.EVAL: + local_logger.info('----- Start Validating') + if local_rank == 0: + master_logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 8: Start training and validation (train mode) + local_logger.info(f"Start training from epoch {last_epoch+1}.") + if local_rank == 0: + 
master_logger.info(f"Start training from epoch {last_epoch+1}.") + for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + local_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + if local_rank == 0: + master_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + train_loss, train_acc, avg_loss, avg_acc, train_time = train( + dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=total_batch_train, + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + mixup_fn=mixup_fn, + amp=config.AMP, + local_logger=local_logger, + master_logger=master_logger) + + scheduler.step() + + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {avg_loss:.4f}, " + + f"Train Acc: {avg_acc:.4f}, " + + f"time: {train_time:.2f}") + + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + local_logger.info(f'----- Validation after Epoch: {epoch}') + if local_rank == 0: + master_logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if local_rank == 0: + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + master_logger.info(f"----- Save model: {model_path}.pdparams") + master_logger.info(f"----- Save optim: {model_path}.pdopt") + + +def main(): + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + + # get dataset and start DDP + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + else: + dataset_train = None + dataset_val = get_dataset(config, mode='val') + config.NGPUS = len(paddle.static.cuda_places()) if config.NGPUS == -1 else config.NGPUS + dist.spawn(main_worker, args=(config, dataset_train, dataset_val, ), nprocs=config.NGPUS) + + +if __name__ == "__main__": + main() diff --git 
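The multi-GPU entry point above keeps two sets of meters: local ones for the current process and "master" ones that are averaged over all GPUs with `dist.all_reduce`. A minimal sketch of that reduction pattern, separate from the repo code (the helper name `all_reduce_mean` is illustrative, not part of the repository):

```python
import paddle
import paddle.distributed as dist

def all_reduce_mean(value):
    """Sum a scalar tensor over all processes, then divide by world size.
    Assumes dist.init_parallel_env() has already been called, e.g. inside a
    worker launched by dist.spawn as in main_worker above."""
    value = value.clone()          # keep the local value untouched
    dist.all_reduce(value)         # in-place sum across all GPUs
    return value / dist.get_world_size()
```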
a/image_classification/ViP/main_single_gpu.py b/image_classification/ViP/main_single_gpu.py new file mode 100644 index 00000000..3c0e7de0 --- /dev/null +++ b/image_classification/ViP/main_single_gpu.py @@ -0,0 +1,428 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""ViP training/validation using single GPU """ + +import sys +import os +import time +import logging +import copy +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from datasets import get_dataloader +from datasets import get_dataset +from utils import AverageMeter +from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn +from config import get_config +from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from losses import DistillationLoss +from vip import build_vip as build_model + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('Swin') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + mixup_fn=None, + amp=False, + logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of 
epochs
+        total_batch: int, total num of batches for one epoch
+        debug_steps: int, num of iters to log info, default: 100
+        accum_iter: int, num of iters for accumulating gradients, default: 1
+        mixup_fn: Mixup, mixup instance, default: None
+        amp: bool, if True, use mixed precision training, default: False
+        logger: logger for logging, default: None
+    Returns:
+        train_loss_meter.avg: float, average loss on current process/gpu
+        train_acc_meter.avg: float, average top1 accuracy on current process/gpu
+        train_time: float, training time
+    """
+    model.train()
+    train_loss_meter = AverageMeter()
+    train_acc_meter = AverageMeter()
+
+    if amp is True:
+        scaler = paddle.amp.GradScaler(init_loss_scaling=1024)
+    time_st = time.time()
+
+    for batch_id, data in enumerate(dataloader):
+        image = data[0]
+        label = data[1]
+        label_orig = label.clone()
+
+        if mixup_fn is not None:
+            image, label = mixup_fn(image, label_orig)
+
+        if amp is True: # mixed precision training
+            with paddle.amp.auto_cast():
+                output = model(image)
+                loss = criterion(output, label)
+            scaled = scaler.scale(loss)
+            scaled.backward()
+            if ((batch_id + 1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)):
+                scaler.minimize(optimizer, scaled)
+                optimizer.clear_grad()
+        else: # full precision training
+            output = model(image)
+            loss = criterion(output, label)
+            #NOTE: division may be needed depending on the loss function
+            # Here no division is needed:
+            # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean'
+            #loss = loss / accum_iter
+            loss.backward()
+
+            if ((batch_id + 1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)):
+                optimizer.step()
+                optimizer.clear_grad()
+
+        pred = F.softmax(output)
+        if mixup_fn:
+            acc = paddle.metric.accuracy(pred, label_orig)
+        else:
+            acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1))
+
+        batch_size = image.shape[0]
+        train_loss_meter.update(loss.numpy()[0], batch_size)
+        train_acc_meter.update(acc.numpy()[0], batch_size)
+
+        if logger and batch_id % debug_steps == 0:
+            logger.info(
+                f"Epoch[{epoch:03d}/{total_epochs:03d}], " +
+                f"Step[{batch_id:04d}/{total_batch:04d}], " +
+                f"Avg Loss: {train_loss_meter.avg:.4f}, " +
+                f"Avg Acc: {train_acc_meter.avg:.4f}")
+
+    train_time = time.time() - time_st
+    return train_loss_meter.avg, train_acc_meter.avg, train_time
+
+
+def validate(dataloader, model, criterion, total_batch, debug_steps=100, logger=None):
+    """Validation for whole dataset
+    Args:
+        dataloader: paddle.io.DataLoader, dataloader instance
+        model: nn.Layer, a ViT model
+        criterion: nn.criterion
+        total_batch: int, total num of batches for one epoch
+        debug_steps: int, num of iters to log info, default: 100
+        logger: logger for logging, default: None
+    Returns:
+        val_loss_meter.avg: float, average loss on current process/gpu
+        val_acc1_meter.avg: float, average top1 accuracy on current process/gpu
+        val_acc5_meter.avg: float, average top5 accuracy on current process/gpu
+        val_time: float, validation time
+    """
+    model.eval()
+    val_loss_meter = AverageMeter()
+    val_acc1_meter = AverageMeter()
+    val_acc5_meter = AverageMeter()
+    time_st = time.time()
+
+    with paddle.no_grad():
+        for batch_id, data in enumerate(dataloader):
+            image = data[0]
+            label = data[1]
+
+            output = model(image)
+            loss = criterion(output, label)
+
+            pred = F.softmax(output)
+            acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1))
+            acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5)
+
+            batch_size = image.shape[0]
+            val_loss_meter.update(loss.numpy()[0], batch_size)
+
val_acc1_meter.update(acc1.numpy()[0], batch_size) + val_acc5_meter.update(acc5.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + + val_time = time.time() - time_st + return val_loss_meter.avg, val_acc1_meter.avg, val_acc5_meter.avg, val_time + + +def main(): + # STEP 0: Preparation + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + last_epoch = config.TRAIN.LAST_EPOCH + seed = config.SEED + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + logger = get_logger(filename=os.path.join(config.SAVE, 'log.txt')) + logger.info(f'\n{config}') + + # STEP 1: Create model + model = build_model(config) + + # STEP 2: Create train and val dataloader + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + dataloader_train = get_dataloader(config, dataset_train, 'train', False) + dataset_val = get_dataset(config, mode='val') + dataloader_val = get_dataloader(config, dataset_val, 'val', False) + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + 
end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH, + ) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip, + apply_decay_param_fun=get_exclude_from_weight_decay_fn([ + 'absolute_pos_embed', 'relative_position_bias_table']), + ) + else: + logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 6: Load pretrained model or load resume model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + model.set_dict(model_state) + logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME + '.pdopt') + optimizer.set_state_dict(opt_state) + logger.info( + f"----- Resume: Load model and optmizer from {config.MODEL.RESUME}") + + # STEP 7: Validation (eval mode) + if config.EVAL: + logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 8: Start training and validation (train mode) + logger.info(f"Start training from epoch {last_epoch+1}.") + for epoch 
in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + train_loss, train_acc, train_time = train(dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=len(dataloader_train), + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + mixup_fn=mixup_fn, + amp=config.AMP, + logger=logger) + scheduler.step() + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + logger.info(f"----- Save model: {model_path}.pdparams") + logger.info(f"----- Save optim: {model_path}.pdopt") + + +if __name__ == "__main__": + main() diff --git a/image_classification/ViP/mixup.py b/image_classification/ViP/mixup.py new file mode 100644 index 00000000..1d2db493 --- /dev/null +++ b/image_classification/ViP/mixup.py @@ -0,0 +1,225 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""mixup and cutmix for batch data""" +import numpy as np +import paddle + + +def rand_bbox(image_shape, lam, count=None): + """ CutMix bbox by lam value + Generate 1 random bbox by value lam. lam is the cut size rate. + The cut_size is computed by sqrt(1-lam) * image_size. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + count: int, number of bbox to generate + """ + image_h, image_w = image_shape[-2:] + cut_rate = np.sqrt(1. 
- lam) + cut_h = int(cut_rate * image_h) + cut_w = int(cut_rate * image_w) + + # get random bbox center + cy = np.random.randint(0, image_h, size=count) + cx = np.random.randint(0, image_w, size=count) + + # get bbox coords + bbox_x1 = np.clip(cx - cut_w // 2, 0, image_w) + bbox_y1 = np.clip(cy - cut_h // 2, 0, image_h) + bbox_x2 = np.clip(cx + cut_w // 2, 0, image_w) + bbox_y2 = np.clip(cy + cut_h // 2, 0, image_h) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # while in pytorch, it will return [] tensor + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def rand_bbox_minmax(image_shape, minmax, count=None): + """ CutMix bbox by min and max value + Generate 1 random bbox by min and max percentage values. + Minmax is a tuple/list of min and max percentage vlaues + applied to the image width and height. + + Args: + image_shape: tuple/list, image height and width + minmax: tuple/list, min and max percentage values of image size + count: int, number of bbox to generate + """ + assert len(minmax) == 2 + image_h, image_w = image_shape[-2:] + min_ratio = minmax[0] + max_ratio = minmax[1] + cut_h = np.random.randint(int(image_h * min_ratio), int(image_h * max_ratio), size=count) + cut_w = np.random.randint(int(image_w * min_ratio), int(image_w * max_ratio), size=count) + + bbox_x1 = np.random.randint(0, image_w - cut_w, size=count) + bbox_y1 = np.random.randint(0, image_h - cut_h, size=count) + bbox_x2 = bbox_x1 + cut_w + bbox_y2 = bbox_y1 + cut_h + + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def cutmix_generate_bbox_adjust_lam(image_shape, lam, minmax=None, correct_lam=True, count=None): + """Generate bbox and apply correction for lambda + If the mimmax is None, apply the standard cutmix by lam value, + If the minmax is set, apply the cutmix by min and max percentage values. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + minmax: tuple/list, min and max percentage values of image size + correct_lam: bool, if True, correct the lam value by the generated bbox + count: int, number of bbox to generate + """ + if minmax is not None: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox_minmax(image_shape, minmax, count) + else: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox(image_shape, lam, count) + + if correct_lam or minmax is not None: + image_h, image_w = image_shape[-2:] + bbox_area = (bbox_y2 - bbox_y1) * (bbox_x2 - bbox_x1) + lam = 1. - bbox_area / float(image_h * image_w) + return (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam + + +def one_hot(x, num_classes, on_value=1., off_value=0.): + """ Generate one-hot vector for label smoothing + Args: + x: tensor, contains label/class indices + num_classes: int, num of classes (len of the one-hot vector) + on_value: float, the vector value at label index, default=1. + off_value: float, the vector value at non-label indices, default=0. + Returns: + one_hot: tensor, tensor with on value at label index and off value + at non-label indices. + """ + x = x.reshape_([-1, 1]) + x_smoothed = paddle.full((x.shape[0], num_classes), fill_value=off_value) + for i in range(x.shape[0]): + x_smoothed[i, x[i]] = on_value + return x_smoothed + + +def mixup_one_hot(label, num_classes, lam=1., smoothing=0.): + """ mixup and label smoothing in batch + label smoothing is firstly applied, then + mixup is applied by mixing the bacth and its flip, + with a mixup rate. 
+ + Args: + label: tensor, label tensor with shape [N], contains the class indices + num_classes: int, num of all classes + lam: float, mixup rate, default=1.0 + smoothing: float, label smoothing rate + """ + off_value = smoothing / num_classes + on_value = 1. - smoothing + off_value + y1 = one_hot(label, num_classes, on_value, off_value) + y2 = one_hot(label.flip(axis=[0]), num_classes, on_value, off_value) + return y2 * (1 - lam) + y1 * lam + + +class Mixup: + """Mixup class + Args: + mixup_alpha: float, mixup alpha for beta distribution, default=1.0, + cutmix_alpha: float, cutmix alpha for beta distribution, default=0.0, + cutmix_minmax: list/tuple, min and max value for cutmix ratio, default=None, + prob: float, if random prob < prob, do not use mixup, default=1.0, + switch_prob: float, prob of switching mixup and cutmix, default=0.5, + mode: string, mixup up, now only 'batch' is supported, default='batch', + correct_lam: bool, if True, apply correction of lam, default=True, + label_smoothing: float, label smoothing rate, default=0.1, + num_classes: int, num of classes, default=1000 + """ + def __init__(self, + mixup_alpha=1.0, + cutmix_alpha=0.0, + cutmix_minmax=None, + prob=1.0, + switch_prob=0.5, + mode='batch', + correct_lam=True, + label_smoothing=0.1, + num_classes=1000): + self.mixup_alpha = mixup_alpha + self.cutmix_alpha = cutmix_alpha + self.cutmix_minmax = cutmix_minmax + if cutmix_minmax is not None: + assert len(cutmix_minmax) == 2 + self.cutmix_alpha = 1.0 + self.mix_prob = prob + self.switch_prob = switch_prob + self.label_smoothing = label_smoothing + self.num_classes = num_classes + self.mode = mode + self.correct_lam = correct_lam + assert mode == 'batch', 'Now only batch mode is supported!' + + def __call__(self, x, target): + assert x.shape[0] % 2 == 0, "Batch size should be even" + lam = self._mix_batch(x) + target = mixup_one_hot(target, self.num_classes, lam, self.label_smoothing) + return x, target + + def get_params(self): + """Decide to use cutmix or regular mixup by sampling and + sample lambda for mixup + """ + lam = 1. + use_cutmix = False + use_mixup = np.random.rand() < self.mix_prob + if use_mixup: + if self.mixup_alpha > 0. and self.cutmix_alpha > 0.: + use_cutmix = np.random.rand() < self.switch_prob + alpha = self.cutmix_alpha if use_cutmix else self.mixup_alpha + lam_mix = np.random.beta(alpha, alpha) + elif self.mixup_alpha == 0. and self.cutmix_alpha > 0.: + use_cutmix=True + lam_mix = np.random.beta(self.cutmix_alpha, self.cutmix_alpha) + elif self.mixup_alpha > 0. 
and self.cutmix_alpha == 0.: + lam_mix = np.random.beta(self.mixup_alpha, self.mixup_alpha) + else: + raise ValueError('mixup_alpha and cutmix_alpha cannot be all 0') + lam = float(lam_mix) + return lam, use_cutmix + + def _mix_batch(self, x): + """mixup/cutmix by adding batch data and its flipped version""" + lam, use_cutmix = self.get_params() + if lam == 1.: + return lam + if use_cutmix: + (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam = cutmix_generate_bbox_adjust_lam( + x.shape, + lam, + minmax=self.cutmix_minmax, + correct_lam=self.correct_lam) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # but in pytorch, it will return [] tensor without errors + if int(bbox_x1) != int(bbox_x2) and int(bbox_y1) != int(bbox_y2): + x[:, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] = x.flip(axis=[0])[ + :, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] + else: + x_flipped = x.flip(axis=[0]) + x_flipped = x_flipped * (1 - lam) + x.set_value(x * (lam) + x_flipped) + return lam diff --git a/image_classification/ViP/random_erasing.py b/image_classification/ViP/random_erasing.py new file mode 100644 index 00000000..80d31dd8 --- /dev/null +++ b/image_classification/ViP/random_erasing.py @@ -0,0 +1,118 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Random Erasing for image tensor""" + +import random +import math +import paddle + + +def _get_pixels(per_pixel, rand_color, patch_size, dtype="float32"): + if per_pixel: + return paddle.normal(shape=patch_size).astype(dtype) + if rand_color: + return paddle.normal(shape=(patch_size[0], 1, 1)).astype(dtype) + return paddle.zeros((patch_size[0], 1, 1)).astype(dtype) + + +class RandomErasing(object): + """ + Args: + prob: probability of performing random erasing + min_area: Minimum percentage of erased area wrt input image area + max_area: Maximum percentage of erased area wrt input image area + min_aspect: Minimum aspect ratio of earsed area + max_aspect: Maximum aspect ratio of earsed area + mode: pixel color mode, in ['const', 'rand', 'pixel'] + 'const' - erase block is constant valued 0 for all channels + 'rand' - erase block is valued random color (same per-channel) + 'pixel' - erase block is vauled random color per pixel + min_count: Minimum # of ereasing blocks per image. + max_count: Maximum # of ereasing blocks per image. 
Area per box is scaled by count + per-image count is randomly chosen between min_count to max_count + """ + def __init__(self, prob=0.5, min_area=0.02, max_area=1/3, min_aspect=0.3, max_aspect=None, + mode='const', min_count=1, max_count=None, num_splits=0): + self.prob = prob + self.min_area = min_area + self.max_area = max_area + max_aspect = max_aspect or 1 / min_aspect + self.log_aspect_ratio = (math.log(min_aspect), math.log(max_aspect)) + self.min_count = min_count + self.max_count = max_count or min_count + self.num_splits = num_splits + mode = mode.lower() + self.rand_color = False + self.per_pixel = False + if mode == "rand": + self.rand_color = True + elif mode == "pixel": + self.per_pixel = True + else: + assert not mode or mode == "const" + + def _erase(self, img, chan, img_h, img_w, dtype): + if random.random() > self.prob: + return + area = img_h * img_w + count = self.min_count if self.min_count == self.max_count else \ + random.randint(self.min_count, self.max_count) + for _ in range(count): + for attempt in range(10): + target_area = random.uniform(self.min_area, self.max_area) * area / count + aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio)) + h = int(round(math.sqrt(target_area * aspect_ratio))) + w = int(round(math.sqrt(target_area / aspect_ratio))) + if w < img_w and h < img_h: + top = random.randint(0, img_h - h) + left = random.randint(0, img_w - w) + img[:, top:top+h, left:left+w] = _get_pixels( + self.per_pixel, self.rand_color, (chan, h, w), + dtype=dtype) + break + + def __call__(self, inputs): + if len(inputs.shape) == 3: + self._erase(inputs, *inputs.shape, inputs.dtype) + else: + batch_size, chan, img_h, img_w = inputs.shape + batch_start = batch_size // self.num_splits if self.num_splits > 1 else 0 + for i in range(batch_start, batch_size): + self._erase(inputs[i], chan, img_h, img_w, inputs.dtype) + return inputs + + + +#def main(): +# re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='rand') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='const') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='pixel') +# import PIL.Image as Image +# import numpy as np +# paddle.set_device('cpu') +# img = paddle.to_tensor(np.asarray(Image.open('./lenna.png'))).astype('float32') +# img = img / 255.0 +# img = paddle.transpose(img, [2, 0, 1]) +# new_img = re(img) +# new_img = new_img * 255.0 +# new_img = paddle.transpose(new_img, [1, 2, 0]) +# new_img = new_img.cpu().numpy() +# new_img = Image.fromarray(new_img.astype('uint8')) +# new_img.save('./res.png') +# +# +# +#if __name__ == "__main__": +# main() diff --git a/image_classification/ViP/run_eval.sh b/image_classification/ViP/run_eval.sh new file mode 100644 index 00000000..1b8826f7 --- /dev/null +++ b/image_classification/ViP/run_eval.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ +-cfg='./configs/vip_m7.yaml' \ +-dataset='imagenet2012' \ +-batch_size=32 \ +-data_path='/dataset/imagenet' \ +-eval \ +-pretrained='./vip_m7' diff --git a/image_classification/ViP/run_eval_multi.sh b/image_classification/ViP/run_eval_multi.sh new file mode 100644 index 00000000..6afac9f0 --- /dev/null +++ b/image_classification/ViP/run_eval_multi.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ +-cfg='./configs/vip_s7.yaml' \ +-dataset='imagenet2012' \ +-batch_size=128 \ +-data_path='/dataset/imagenet' \ +-eval \ +-pretrained='./vip_s7' diff --git a/image_classification/ViP/run_train.sh b/image_classification/ViP/run_train.sh 
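Before the training scripts, a small usage sketch for the `RandomErasing` transform defined above, adapted from the commented-out example at the end of that file; the random tensor here stands in for a real CHW image:

```python
import paddle
from random_erasing import RandomErasing

# Erase random rectangles from a CHW float tensor (values in [0, 1]).
random_erase = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='pixel')
img = paddle.rand([3, 224, 224])
out = random_erase(img)
print(out.shape)  # [3, 224, 224]
```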
new file mode 100644 index 00000000..a80e71b8 --- /dev/null +++ b/image_classification/ViP/run_train.sh @@ -0,0 +1,6 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ +-cfg='./configs/vip_s7.yaml' \ +-dataset='imagenet2012' \ +-batch_size=8 \ +-data_path='/dataset/imagenet' \ diff --git a/image_classification/ViP/run_train_multi.sh b/image_classification/ViP/run_train_multi.sh new file mode 100644 index 00000000..c1087bfc --- /dev/null +++ b/image_classification/ViP/run_train_multi.sh @@ -0,0 +1,6 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ +-cfg='./configs/vip_s7.yaml' \ +-dataset='imagenet2012' \ +-batch_size=8 \ +-data_path='/dataset/imagenet' \ diff --git a/image_classification/ViP/transforms.py b/image_classification/ViP/transforms.py new file mode 100644 index 00000000..676fe1ff --- /dev/null +++ b/image_classification/ViP/transforms.py @@ -0,0 +1,13 @@ +import paddle +import paddle.nn +import paddle.vision.transforms as T + + +class RandomHorizontalFlip(): + def __init__(self, p=0.5): + self.p = p + + def __call__(self, image): + if random.random() < self.p: + return T.hflip(image) + return image diff --git a/image_classification/ViP/utils.py b/image_classification/ViP/utils.py new file mode 100644 index 00000000..f5bdb636 --- /dev/null +++ b/image_classification/ViP/utils.py @@ -0,0 +1,120 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""utils for ViT + +Contains AverageMeter for monitoring, get_exclude_from_decay_fn for training +and WarmupCosineScheduler for training + +""" + +import math +from paddle.optimizer.lr import LRScheduler + + +class AverageMeter(): + """ Meter for monitoring losses""" + def __init__(self): + self.avg = 0 + self.sum = 0 + self.cnt = 0 + self.reset() + + def reset(self): + """reset all values to zeros""" + self.avg = 0 + self.sum = 0 + self.cnt = 0 + + def update(self, val, n=1): + """update avg by val and n, where val is the avg of n values""" + self.sum += val * n + self.cnt += n + self.avg = self.sum / self.cnt + + + +def get_exclude_from_weight_decay_fn(exclude_list=[]): + """ Set params with no weight decay during the training + + For certain params, e.g., positional encoding in ViT, weight decay + may not needed during the learning, this method is used to find + these params. + + Args: + exclude_list: a list of params names which need to exclude + from weight decay. + Returns: + exclude_from_weight_decay_fn: a function returns True if param + will be excluded from weight decay + """ + if len(exclude_list) == 0: + exclude_from_weight_decay_fn = None + else: + def exclude_fn(param): + for name in exclude_list: + if param.endswith(name): + return False + return True + exclude_from_weight_decay_fn = exclude_fn + return exclude_from_weight_decay_fn + + +class WarmupCosineScheduler(LRScheduler): + """Warmup Cosine Scheduler + + First apply linear warmup, then apply cosine decay schedule. 
+ Linearly increase learning rate from "warmup_start_lr" to "start_lr" over "warmup_epochs" + Cosinely decrease learning rate from "start_lr" to "end_lr" over remaining + "total_epochs - warmup_epochs" + + Attributes: + learning_rate: the starting learning rate (without warmup), not used here! + warmup_start_lr: warmup starting learning rate + start_lr: the starting learning rate (without warmup) + end_lr: the ending learning rate after whole loop + warmup_epochs: # of epochs for warmup + total_epochs: # of total epochs (include warmup) + """ + def __init__(self, + learning_rate, + warmup_start_lr, + start_lr, + end_lr, + warmup_epochs, + total_epochs, + cycles=0.5, + last_epoch=-1, + verbose=False): + """init WarmupCosineScheduler """ + self.warmup_epochs = warmup_epochs + self.total_epochs = total_epochs + self.warmup_start_lr = warmup_start_lr + self.start_lr = start_lr + self.end_lr = end_lr + self.cycles = cycles + super(WarmupCosineScheduler, self).__init__(learning_rate, last_epoch, verbose) + + def get_lr(self): + """ return lr value """ + if self.last_epoch < self.warmup_epochs: + val = (self.start_lr - self.warmup_start_lr) * float( + self.last_epoch)/float(self.warmup_epochs) + self.warmup_start_lr + return val + + progress = float(self.last_epoch - self.warmup_epochs) / float( + max(1, self.total_epochs - self.warmup_epochs)) + val = max(0.0, 0.5 * (1. + math.cos(math.pi * float(self.cycles) * 2.0 * progress))) + val = max(0.0, val * (self.start_lr - self.end_lr) + self.end_lr) + return val diff --git a/image_classification/ViP/vip.png b/image_classification/ViP/vip.png new file mode 100644 index 00000000..f1b1409e Binary files /dev/null and b/image_classification/ViP/vip.png differ diff --git a/image_classification/ViP/vip.py b/image_classification/ViP/vip.py new file mode 100644 index 00000000..88fe1ac7 --- /dev/null +++ b/image_classification/ViP/vip.py @@ -0,0 +1,327 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
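As an aside before the ViP model definition: a short sketch of how the `WarmupCosineScheduler` from `utils.py` is typically driven, stepping once per epoch as the training loops above do (the epoch counts and learning rates here are illustrative only):

```python
from utils import WarmupCosineScheduler

# 5 warmup epochs from 1e-6 up to 3e-3, then cosine decay towards 5e-4.
scheduler = WarmupCosineScheduler(learning_rate=0.003,
                                  warmup_start_lr=1e-6,
                                  start_lr=0.003,
                                  end_lr=5e-4,
                                  warmup_epochs=5,
                                  total_epochs=50)
for epoch in range(50):
    print(epoch, scheduler.get_lr())
    scheduler.step()  # stepped once per epoch, as in the training loops above
```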
+ +""" +Implement MLP Class for ViP +""" + +import paddle.nn as nn +import paddle.nn.functional as F +from droppath import DropPath + + +trunc_normal_ = nn.initializer.TruncatedNormal(std=0.02) +zeros_ = nn.initializer.Constant(value=0.0) +ones_ = nn.initializer.Constant(value=1.0) + + +class Identity(nn.Layer): + def __init__(self, *args, **kwargs): + super(Identity, self).__init__() + + def forward(self, inputs): + return inputs + + +class Mlp(nn.Layer): + def __init__(self, + in_features, + hidden_features=None, + out_features=None, + act_layer=nn.GELU, + drop=0.0): + super().__init__() + out_features = out_features or in_features + hidden_features = hidden_features or in_features + self.fc1 = nn.Linear(in_features, hidden_features) + self.act = act_layer() + self.fc2 = nn.Linear(hidden_features, out_features) + self.drop = nn.Dropout(drop) + + def forward(self, x): + x = self.fc1(x) + x = self.act(x) + x = self.drop(x) + x = self.fc2(x) + x = self.drop(x) + return x + + +class WeightedPermuteMLP(nn.Layer): + def __init__(self, + dim, + segment_dim=8, + qkv_bias=False, + qk_scale=None, + attn_drop=0.0, + proj_drop=0.0): + super().__init__() + self.segment_dim = segment_dim + + self.mlp_c = nn.Linear(dim, dim, bias_attr=qkv_bias) + self.mlp_h = nn.Linear(dim, dim, bias_attr=qkv_bias) + self.mlp_w = nn.Linear(dim, dim, bias_attr=qkv_bias) + + self.reweight = Mlp(dim, dim // 4, dim * 3) + + self.proj = nn.Linear(dim, dim) + self.proj_drop = nn.Dropout(proj_drop) + + def forward(self, x): + B, H, W, C = x.shape + + S = C // self.segment_dim + h = ( + x.reshape([B, H, W, self.segment_dim, S]) + .transpose([0, 3, 2, 1, 4]) + .reshape([B, self.segment_dim, W, H * S]) + ) + h = ( + self.mlp_h(h) + .reshape([B, self.segment_dim, W, H, S]) + .transpose([0, 3, 2, 1, 4]) + .reshape([B, H, W, C]) + ) + + w = ( + x.reshape([B, H, W, self.segment_dim, S]) + .transpose([0, 1, 3, 2, 4]) + .reshape([B, H, self.segment_dim, W * S]) + ) + w = ( + self.mlp_w(w) + .reshape([B, H, self.segment_dim, W, S]) + .transpose([0, 1, 3, 2, 4]) + .reshape([B, H, W, C]) + ) + + c = self.mlp_c(x) + + a = (h + w + c).transpose([0, 3, 1, 2]).flatten(2).mean(2) + a = self.reweight(a).reshape([B, C, 3]).transpose([2, 0, 1]) + a = F.softmax(a, axis=0).unsqueeze(2).unsqueeze(2) + + x = h * a[0] + w * a[1] + c * a[2] + + x = self.proj(x) + x = self.proj_drop(x) + + return x + + +class PermutatorBlock(nn.Layer): + def __init__(self, + dim, + segment_dim, + mlp_ratio=4.0, + qkv_bias=False, + qk_scale=None, + drop=0.0, + attn_drop=0.0, + drop_path=0.0, + act_layer=nn.GELU, + norm_layer=nn.LayerNorm, + skip_lam=1.0, + mlp_fn=WeightedPermuteMLP): + super().__init__() + self.norm1 = norm_layer(dim) + self.attn = mlp_fn( + dim, + segment_dim=segment_dim, + qkv_bias=qkv_bias, + qk_scale=None, + attn_drop=attn_drop, + ) + + # NOTE: drop path for stochastic depth, we shall see if this is better than dropout here + self.drop_path = DropPath(drop_path) if drop_path > 0.0 else Identity() + + self.norm2 = norm_layer(dim) + mlp_hidden_dim = int(dim * mlp_ratio) + self.mlp = Mlp( + in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer + ) + self.skip_lam = skip_lam + + def forward(self, x): + x = x + self.drop_path(self.attn(self.norm1(x))) / self.skip_lam + x = x + self.drop_path(self.mlp(self.norm2(x))) / self.skip_lam + return x + + +class PatchEmbed(nn.Layer): + """Image to Patch Embedding""" + def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768): + super().__init__() + self.proj = nn.Conv2D(in_chans, + 
embed_dim, + kernel_size=patch_size, + stride=patch_size) + + def forward(self, x): + x = self.proj(x) # B, C, H, W + return x + + +class Downsample(nn.Layer): + """Image to Patch Embedding""" + def __init__(self, in_embed_dim, out_embed_dim, patch_size): + super().__init__() + self.proj = nn.Conv2D(in_embed_dim, + out_embed_dim, + kernel_size=patch_size, + stride=patch_size) + + def forward(self, x): + x = x.transpose([0, 3, 1, 2]) + x = self.proj(x) # B, C, H, W + x = x.transpose([0, 2, 3, 1]) + return x + + +def basic_blocks(dim, + index, + layers, + segment_dim, + mlp_ratio=3.0, + qkv_bias=False, + qk_scale=None, + attn_drop=0, + drop_path_rate=0.0, + skip_lam=1.0, + mlp_fn=WeightedPermuteMLP, + **kwargs): + blocks = [] + for block_idx in range(layers[index]): + block_dpr = ( + drop_path_rate * (block_idx + sum(layers[:index])) / (sum(layers) - 1) + ) + blocks.append( + PermutatorBlock( + dim, + segment_dim, + mlp_ratio=mlp_ratio, + qkv_bias=qkv_bias, + qk_scale=qk_scale, + attn_drop=attn_drop, + drop_path=block_dpr, + skip_lam=skip_lam, + mlp_fn=mlp_fn, + ) + ) + blocks = nn.Sequential(*blocks) + return blocks + + +class VisionPermutator(nn.Layer): + """Vision Permutator""" + def __init__(self, + layers, + img_size=224, + patch_size=4, + in_chans=3, + num_classes=1000, + embed_dims=None, + transitions=None, + segment_dim=None, + mlp_ratios=None, + skip_lam=1.0, + qkv_bias=False, + qk_scale=None, + drop_rate=0.0, + attn_drop_rate=0.0, + drop_path_rate=0.0, + norm_layer=nn.LayerNorm, + mlp_fn=WeightedPermuteMLP): + super().__init__() + self.num_classes = num_classes + + self.patch_embed = PatchEmbed( + img_size=img_size, + patch_size=patch_size, + in_chans=in_chans, + embed_dim=embed_dims[0], + ) + + network = [] + for i in range(len(layers)): + stage = basic_blocks(embed_dims[i], + i, + layers, + segment_dim[i], + mlp_ratio=mlp_ratios[i], + qkv_bias=qkv_bias, + qk_scale=qk_scale, + attn_drop=attn_drop_rate, + drop_path_rate=drop_path_rate, + norm_layer=norm_layer, + skip_lam=skip_lam, + mlp_fn=mlp_fn) + network.append(stage) + if i >= len(layers) - 1: + break + if transitions[i] or embed_dims[i] != embed_dims[i + 1]: + patch_size = 2 if transitions[i] else 1 + network.append(Downsample(embed_dims[i], embed_dims[i + 1], patch_size)) + + self.network = nn.LayerList(network) + self.norm = norm_layer(embed_dims[-1]) + + # Classifier head + self.head = ( + nn.Linear(embed_dims[-1], num_classes) if num_classes > 0 else Identity() + ) + self.apply(self._init_weights) + + def _init_weights(self, m): + if isinstance(m, nn.Linear): + trunc_normal_(m.weight) + if isinstance(m, nn.Linear) and m.bias is not None: + zeros_(m.bias) + elif isinstance(m, nn.LayerNorm): + zeros_(m.bias) + ones_(m.weight) + + def forward_embeddings(self, x): + x = self.patch_embed(x) + # B,C,H,W-> B,H,W,C + x = x.transpose([0, 2, 3, 1]) + return x + + def forward_tokens(self, x): + for _, block in enumerate(self.network): + x = block(x) + B, H, W, C = x.shape + x = x.reshape([B, -1, C]) + return x + + def forward(self, x): + x = self.forward_embeddings(x) + # B, H, W, C -> B, N, C + x = self.forward_tokens(x) + x = self.norm(x) + return self.head(x.mean(1)) + + +def build_vip(config): + """build vip model using config""" + model = VisionPermutator(num_classes=config.MODEL.NUM_CLASSES, + layers=config.MODEL.MIXER.LAYER, + embed_dims=config.MODEL.MIXER.EMBED_DIMS, + patch_size=7, + transitions=config.MODEL.MIXER.TRANSITIONS, + segment_dim=config.MODEL.MIXER.SEGMENT_DIM, + mlp_ratios=[3, 3, 3, 3], + 
mlp_fn=WeightedPermuteMLP) + return model diff --git a/image_classification/ViP/vip_1.png b/image_classification/ViP/vip_1.png new file mode 100644 index 00000000..f1b1409e Binary files /dev/null and b/image_classification/ViP/vip_1.png differ diff --git a/image_classification/ViP/vip_2.png b/image_classification/ViP/vip_2.png new file mode 100644 index 00000000..a82fd3fb Binary files /dev/null and b/image_classification/ViP/vip_2.png differ diff --git a/image_classification/ViT/README.md b/image_classification/ViT/README.md index 76dab359..f5ba0861 100644 --- a/image_classification/ViT/README.md +++ b/image_classification/ViT/README.md @@ -14,14 +14,20 @@ This implementation is developed by [PaddleViT](https://github.com/BR-IDL/Paddle ### Update -Update (2021-08-11): Code is released and ported weights are uploaded. +- Update (2021-09-27): More weights are uploaded. +- Update (2021-08-11): Code is released and ported weights are uploaded. ## Models Zoo -| Model | Acc@1 | Acc@5 | Image Size | Crop_pct | Interpolation | Link | -|--------------------------------|-------|-------|------------|----------|---------------|--------------| -| vit_base_patch16_224 | 84.58 | 97.30 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/13D9FqU4ISsGxWXURgKW9eLOBV-pYPr-L/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1ms3o2fHMQpIoVqnEHitRtA)(qv4n) | -| vit_base_patch16_384 | 85.99 | 98.00 | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1kWKaAgneDx0QsECxtf7EnUdUZej6vSFT/view?usp=sharing)/[baidu](https://pan.baidu.com/s/15ggLdiL98RPcz__SXorrXA)(wsum) | -| vit_large_patch16_224 | 85.81 | 97.82 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1jgwtmtp_cDWEhZE-FuWhs7lCdpqhAMft/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1HRxUJAwEiKgrWnJSjHyU0A)(1bgk) | +| Model | Acc@1 | Acc@5 | #Params | FLOPs | Image Size | Crop_pct | Interpolation | Link | +|-------------------------------|-------|-------|---------|--------|------------|----------|---------------|--------------| +| vit_base_patch32_224 | 80.68 | 95.61 | 88.2M | 4.4G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1DPEhEuu9sDdcmOPukQbR7ZcHq2bxx9cr/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1ppOLj5SWlJmA-NjoLCoYIw)(ubyr) | +| vit_base_patch32_384 | 83.35 | 96.84 | 88.2M | 12.7G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1nCOSwrDiFBFmTkLEThYwjL9SfyzkKoaf/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1jxnL00ocpmdiPM4fOu4lpg)(3c2f) | +| vit_base_patch16_224 | 84.58 | 97.30 | 86.4M | 17.0G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/13D9FqU4ISsGxWXURgKW9eLOBV-pYPr-L/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1ms3o2fHMQpIoVqnEHitRtA)(qv4n) | +| vit_base_patch16_384 | 85.99 | 98.00 | 86.4M | 49.8G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1kWKaAgneDx0QsECxtf7EnUdUZej6vSFT/view?usp=sharing)/[baidu](https://pan.baidu.com/s/15ggLdiL98RPcz__SXorrXA)(wsum) | +| vit_large_patch16_224 | 85.81 | 97.82 | 304.1M | 59.9G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1jgwtmtp_cDWEhZE-FuWhs7lCdpqhAMft/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1HRxUJAwEiKgrWnJSjHyU0A)(1bgk) | +| vit_large_patch16_384 | 87.08 | 98.30 | 304.1M | 175.9G | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1zfw5mdiIm-mPxxQddBFxt0xX-IR-PF2U/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1KvxfIpMeitgXAUZGr5HV8A)(5t91) | +| vit_large_patch32_384 | 81.51 | 96.09 | 306.5M | 44.4G | 384 | 1.0 | bicubic 
| [google](https://drive.google.com/file/d/1Py1EX3E35jL7DComW-29Usg9788BB26j/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1W8sUs0pObOGpohP4vsT05w)(ieg3) | +| | | | | | | | | | > *The results are evaluated on ImageNet2012 validation set. @@ -66,8 +72,8 @@ from visual_transformer import build_vit as build_model config = get_config('./configs/vit_base_patch16_224.yaml') # build model model = build_model(config) -# load pretrained weights, .pdparams is NOT needed -model_state_dict = paddle.load('./vit_base_patch16_224') +# load pretrained weights +model_state_dict = paddle.load('./vit_base_patch16_224.pdparams') model.set_dict(model_state_dict) ``` @@ -83,9 +89,9 @@ python main_single_gpu.py \ -cfg='./configs/vit_base_patch16_224.yaml' \ -dataset='imagenet2012' \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/val \ -eval \ - -pretrained='./vit_base_patch16_224' + -pretrained=/path/to/pretrained/model/vit_base_patch16_224 # .pdparams is NOT needed ```
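A quick way to confirm the weights loaded correctly, using the same Python API as the snippet above (the dummy input is illustrative):

```python
import paddle

# Forward a dummy batch through the model built above; for
# vit_base_patch16_224 this should yield [1, 1000] class logits.
model.eval()
dummy = paddle.randn([1, 3, 224, 224])
with paddle.no_grad():
    logits = model(dummy)
print(logits.shape)  # [1, 1000]
```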
@@ -105,9 +111,9 @@ python main_multi_gpu.py \ -cfg='./configs/vit_base_patch16_224.yaml' \ -dataset='imagenet2012' \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/val \ -eval \ - -pretrained='./vit_base_patch16_224' + -pretrained=/path/to/pretrained/model/vit_base_patch16_224 # .pdparams is NOT needed ```
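The `#Params` and `FLOPs` columns added to the table above can be approximated from the Python API; a rough sketch, assuming `paddle.flops` is available in your Paddle version (counts may differ slightly from the table depending on the profiler):

```python
import numpy as np
import paddle

# Parameter count and FLOPs for a model built via build_model(config).
num_params = sum(np.prod(p.shape) for p in model.parameters())
flops = paddle.flops(model, input_size=[1, 3, 224, 224], print_detail=False)
print(f"params: {num_params / 1e6:.1f}M, flops: {flops / 1e9:.1f}G")
```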
@@ -125,7 +131,7 @@ python main_single_gpu.py \ -cfg='./configs/vit_base_patch16_224.yaml' \ -dataset='imagenet2012' \ -batch_size=32 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/train ``` @@ -146,7 +152,7 @@ python main_multi_gpu.py \ -cfg='./configs/vit_base_patch16_224.yaml' \ -dataset='imagenet2012' \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/train ``` diff --git a/image_classification/ViT/__init__.py b/image_classification/ViT/__init__.py new file mode 100644 index 00000000..e2cbd538 --- /dev/null +++ b/image_classification/ViT/__init__.py @@ -0,0 +1 @@ +#init diff --git a/image_classification/ViT/config.py b/image_classification/ViT/config.py index aed498b0..8ab05bce 100644 --- a/image_classification/ViT/config.py +++ b/image_classification/ViT/config.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -13,12 +13,10 @@ # limitations under the License. """Configuration - Configuration for data, model archtecture, and training, etc. Config can be set by .yaml file or by argparser(limited usage) - - """ + import os from yacs.config import CfgNode as CN import yaml @@ -33,19 +31,22 @@ _C.DATA.DATA_PATH = '/dataset/imagenet/' # path to dataset _C.DATA.DATASET = 'imagenet2012' # dataset name _C.DATA.IMAGE_SIZE = 224 # input image size: 224 for pretrain, 384 for finetune +_C.DATA.IMAGE_CHANNELS = 3 # input image channels _C.DATA.CROP_PCT = 0.875 # input image scale ratio, scale is applied before centercrop in eval mode -_C.DATA.NUM_WORKERS = 2 # number of data loading threads +_C.DATA.NUM_WORKERS = 2 # number of data loading threads +_C.DATA.IMAGENET_MEAN = [0.5, 0.5, 0.5] # [0.485, 0.456, 0.406] +_C.DATA.IMAGENET_STD = [0.5, 0.5, 0.5] # [0.229, 0.224, 0.225] # model settings _C.MODEL = CN() _C.MODEL.TYPE = 'ViT' _C.MODEL.NAME = 'ViT' -_C.MODEL.RESUME = None -_C.MODEL.PRETRAINED = None -_C.MODEL.NUM_CLASSES = 1000 -_C.MODEL.DROPOUT = 0.1 -_C.MODEL.DROPPATH = 0.1 -_C.MODEL.ATTENTION_DROPOUT = 0.1 +_C.MODEL.RESUME = None # model path for resume training +_C.MODEL.PRETRAINED = None # model path for loading pretrained weights +_C.MODEL.NUM_CLASSES = 1000 # num of classes +_C.MODEL.DROPOUT = 0.1 # dropout rate +_C.MODEL.DROPPATH = 0.1 # drop path rate +_C.MODEL.ATTENTION_DROPOUT = 0.1 # dropout rate for attention # transformer settings _C.MODEL.TRANS = CN() @@ -53,20 +54,21 @@ _C.MODEL.TRANS.EMBED_DIM = 768 _C.MODEL.TRANS.MLP_RATIO= 4.0 _C.MODEL.TRANS.NUM_HEADS = 12 +_C.MODEL.TRANS.ATTN_HEAD_SIZE = None _C.MODEL.TRANS.DEPTH = 12 _C.MODEL.TRANS.QKV_BIAS = True # training settings _C.TRAIN = CN() -_C.TRAIN.LAST_EPOCH = 0 -_C.TRAIN.NUM_EPOCHS = 300 +_C.TRAIN.LAST_EPOCH = 0 # set this for resuming training +_C.TRAIN.NUM_EPOCHS = 300 # total num of epochs _C.TRAIN.WARMUP_EPOCHS = 3 #34 # ~ 10k steps for 4096 batch size _C.TRAIN.WEIGHT_DECAY = 0.05 #0.3 # 0.0 for finetune -_C.TRAIN.BASE_LR = 0.001 #0.003 for pretrain # 0.03 for finetune +_C.TRAIN.BASE_LR = 0.003 #0.003 for pretrain # 0.03 for finetune _C.TRAIN.WARMUP_START_LR = 1e-6 #0.0 -_C.TRAIN.END_LR = 5e-4 +_C.TRAIN.END_LR = 5e-4 # ending lr _C.TRAIN.GRAD_CLIP = 1.0 -_C.TRAIN.ACCUM_ITER = 2 #1 +_C.TRAIN.ACCUM_ITER = 1 _C.TRAIN.LR_SCHEDULER = CN() _C.TRAIN.LR_SCHEDULER.NAME = 'warmupcosine' @@ -83,13 +85,14 @@ # misc _C.SAVE = "./output" _C.TAG = 
"default" -_C.SAVE_FREQ = 10 # freq to save chpt -_C.REPORT_FREQ = 100 # freq to logging info -_C.VALIDATE_FREQ = 100 # freq to do validation -_C.SEED = 0 +_C.SAVE_FREQ = 1 # freq to save chpt +_C.REPORT_FREQ = 20 # freq to logging info +_C.VALIDATE_FREQ = 20 # freq to do validation +_C.SEED = 0 # random seed for paddle, numpy and python _C.EVAL = False # run evaluation only +_C.AMP = False # mix precision training _C.LOCAL_RANK = 0 -_C.NGPUS = -1 +_C.NGPUS = -1 # usually set to -1, use CUDA_VISIBLE_DEVICES for GPU selections def _update_config_from_file(config, cfg_file): @@ -105,6 +108,7 @@ def _update_config_from_file(config, cfg_file): config.merge_from_file(cfg_file) config.freeze() + def update_config(config, args): """Update config by ArgumentParser Args: @@ -121,8 +125,12 @@ def update_config(config, args): config.DATA.BATCH_SIZE = args.batch_size if args.image_size: config.DATA.IMAGE_SIZE = args.image_size + if args.num_classes: + config.MODEL.NUM_CLASSES = args.num_classes if args.data_path: config.DATA.DATA_PATH = args.data_path + if args.output is not None: + config.SAVE = args.output if args.ngpus: config.NGPUS = args.ngpus if args.eval: @@ -133,7 +141,12 @@ def update_config(config, args): if args.resume: config.MODEL.RESUME = args.resume if args.last_epoch: - config.MODEL.LAST_EPOCH = args.last_epoch + config.TRAIN.LAST_EPOCH = args.last_epoch + if args.amp: # only during training + if config.EVAL is True: + config.AMP = False + else: + config.AMP = True #config.freeze() return config diff --git a/image_classification/ViT/configs/vit_base_patch16_224.yaml b/image_classification/ViT/configs/vit_base_patch16_224.yaml index eff0fc29..82408aec 100644 --- a/image_classification/ViT/configs/vit_base_patch16_224.yaml +++ b/image_classification/ViT/configs/vit_base_patch16_224.yaml @@ -18,4 +18,4 @@ TRAIN: BASE_LR: 0.003 WARMUP_START_LR: 1e-6 END_LR: 5e-4 - ACCUM_ITER: 2 + ACCUM_ITER: 1 diff --git a/image_classification/ViT/configs/vit_base_patch16_384.yaml b/image_classification/ViT/configs/vit_base_patch16_384.yaml index 04cdfaee..cd449950 100644 --- a/image_classification/ViT/configs/vit_base_patch16_384.yaml +++ b/image_classification/ViT/configs/vit_base_patch16_384.yaml @@ -11,4 +11,3 @@ MODEL: DEPTH: 12 NUM_HEADS: 12 QKV_BIAS: true - diff --git a/image_classification/ViT/configs/vit_base_patch32_224.yaml b/image_classification/ViT/configs/vit_base_patch32_224.yaml new file mode 100644 index 00000000..8b0516d2 --- /dev/null +++ b/image_classification/ViT/configs/vit_base_patch32_224.yaml @@ -0,0 +1,21 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: ViT + NAME: vit_base_patch32_224 + TRANS: + PATCH_SIZE: 32 + EMBED_DIM: 768 + MLP_RATIO: 4.0 + DEPTH: 12 + NUM_HEADS: 12 + QKV_BIAS: true +TRAIN: + NUM_EPOCHS: 300 + WARMUP_EPOCHS: 3 + WEIGHT_DECAY: 0.3 + BASE_LR: 0.003 + WARMUP_START_LR: 1e-6 + END_LR: 5e-4 + ACCUM_ITER: 2 diff --git a/image_classification/ViT/configs/vit_base_patch32_384.yaml b/image_classification/ViT/configs/vit_base_patch32_384.yaml new file mode 100644 index 00000000..5aa3e6f1 --- /dev/null +++ b/image_classification/ViT/configs/vit_base_patch32_384.yaml @@ -0,0 +1,21 @@ +DATA: + IMAGE_SIZE: 384 + CROP_PCT: 1.0 +MODEL: + TYPE: ViT + NAME: vit_base_patch32_384 + TRANS: + PATCH_SIZE: 32 + EMBED_DIM: 768 + MLP_RATIO: 4.0 + DEPTH: 12 + NUM_HEADS: 12 + QKV_BIAS: true +TRAIN: + NUM_EPOCHS: 300 + WARMUP_EPOCHS: 3 + WEIGHT_DECAY: 0.3 + BASE_LR: 0.003 + WARMUP_START_LR: 1e-6 + END_LR: 5e-4 + ACCUM_ITER: 2 diff --git 
a/image_classification/ViT/configs/vit_large_patch16_384.yaml b/image_classification/ViT/configs/vit_large_patch16_384.yaml new file mode 100644 index 00000000..c8c01a6a --- /dev/null +++ b/image_classification/ViT/configs/vit_large_patch16_384.yaml @@ -0,0 +1,14 @@ +DATA: + IMAGE_SIZE: 384 + CROP_PCT: 1.0 +MODEL: + TYPE: ViT + NAME: vit_large_patch16_384 + TRANS: + PATCH_SIZE: 16 + EMBED_DIM: 1024 + MLP_RATIO: 4.0 + DEPTH: 24 + NUM_HEADS: 16 + QKV_BIAS: true + diff --git a/image_classification/ViT/configs/vit_large_patch32_384.yaml b/image_classification/ViT/configs/vit_large_patch32_384.yaml new file mode 100644 index 00000000..6b7f15aa --- /dev/null +++ b/image_classification/ViT/configs/vit_large_patch32_384.yaml @@ -0,0 +1,14 @@ +DATA: + IMAGE_SIZE: 384 + CROP_PCT: 1.0 +MODEL: + TYPE: ViT + NAME: vit_large_patch32_384 + TRANS: + PATCH_SIZE: 32 + EMBED_DIM: 1024 + MLP_RATIO: 4.0 + DEPTH: 24 + NUM_HEADS: 16 + QKV_BIAS: true + diff --git a/image_classification/ViT/datasets.py b/image_classification/ViT/datasets.py index e207f9ba..fc3e8bad 100644 --- a/image_classification/ViT/datasets.py +++ b/image_classification/ViT/datasets.py @@ -84,8 +84,7 @@ def get_train_transforms(config): transforms.RandomResizedCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE), scale=(0.05, 1.0)), transforms.ToTensor(), - transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]), - #transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]), + transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD), ]) return transforms_train @@ -109,8 +108,7 @@ def get_val_transforms(config): transforms.Resize(scale_size, 'bicubic'), # single int for resize shorter side of image transforms.CenterCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE)), transforms.ToTensor(), - transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]), - #transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]), + transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD), ]) return transforms_val diff --git a/image_classification/ViT/droppath.py b/image_classification/ViT/droppath.py index 25b8d5ff..f5d3fcaa 100644 --- a/image_classification/ViT/droppath.py +++ b/image_classification/ViT/droppath.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. diff --git a/image_classification/ViT/main_multi_gpu.py b/image_classification/ViT/main_multi_gpu.py index 496b5957..fc61db3c 100644 --- a/image_classification/ViT/main_multi_gpu.py +++ b/image_classification/ViT/main_multi_gpu.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
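The new `AMP` config field and `-amp` flag enable mixed-precision training in the training loops refactored below. A minimal sketch of the pattern those loops follow, using `paddle.amp.auto_cast` together with a `GradScaler` (`train_step` is an illustrative helper, not part of the repo; gradient accumulation is omitted for brevity):

```python
import paddle

def train_step(model, criterion, optimizer, image, label, scaler=None):
    # scaler is a paddle.amp.GradScaler(init_loss_scaling=1024) when AMP is enabled, else None
    if scaler is not None:  # mixed-precision path
        with paddle.amp.auto_cast():
            output = model(image)
            loss = criterion(output, label)
        scaled = scaler.scale(loss)          # scale the loss to avoid fp16 underflow
        scaled.backward()
        scaler.minimize(optimizer, scaled)   # unscales gradients and runs the optimizer step
    else:                    # full-precision path
        output = model(image)
        loss = criterion(output, label)
        loss.backward()
        optimizer.step()
    optimizer.clear_grad()
    return loss
```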
@@ -25,52 +25,52 @@ import paddle.nn as nn import paddle.nn.functional as F import paddle.distributed as dist -from datasets import get_dataloader, get_dataset -from transformer import build_vit as build_model +from datasets import get_dataloader +from datasets import get_dataset from utils import AverageMeter from utils import WarmupCosineScheduler from config import get_config from config import update_config +from transformer import build_vit as build_model -parser = argparse.ArgumentParser('ViT') -parser.add_argument('-cfg', type=str, default=None) -parser.add_argument('-dataset', type=str, default=None) -parser.add_argument('-batch_size', type=int, default=None) -parser.add_argument('-image_size', type=int, default=None) -parser.add_argument('-data_path', type=str, default=None) -parser.add_argument('-ngpus', type=int, default=None) -parser.add_argument('-pretrained', type=str, default=None) -parser.add_argument('-resume', type=str, default=None) -parser.add_argument('-last_epoch', type=int, default=None) -parser.add_argument('-eval', action='store_true') -arguments = parser.parse_args() - - -log_format = "%(asctime)s %(message)s" -logging.basicConfig(stream=sys.stdout, level=logging.INFO, - format=log_format, datefmt="%m%d %I:%M:%S %p") - -# get default config -config = get_config() -# update config by arguments -config = update_config(config, arguments) - -# set output folder -if not config.EVAL: - config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) -else: - config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) - -if not os.path.exists(config.SAVE): - os.makedirs(config.SAVE, exist_ok=True) - -# set logging format -logger = logging.getLogger() -fh = logging.FileHandler(os.path.join(config.SAVE, 'log.txt')) -fh.setFormatter(logging.Formatter(log_format)) -logger.addHandler(fh) -logger.info(f'config= {config}') +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('ViT') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger def train(dataloader, @@ -78,83 
+78,143 @@ def train(dataloader, criterion, optimizer, epoch, + total_epochs, total_batch, debug_steps=100, - accum_iter=1): + accum_iter=1, + amp=False, + local_logger=None, + master_logger=None): """Training for one epoch Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion epoch: int, current epoch - total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info - accum_iter: int, num of iters for accumulating gradients + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + amp: bool, if True, use mix precision training, default: False + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None Returns: - train_loss_meter.avg - train_acc_meter.avg - train_time + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + master_train_loss_meter.avg: float, average loss on all processes/gpus + master_train_acc_meter.avg: float, average top1 accuracy on all processes/gpus + train_time: float, training time """ model.train() train_loss_meter = AverageMeter() train_acc_meter = AverageMeter() + master_train_loss_meter = AverageMeter() + master_train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) time_st = time.time() for batch_id, data in enumerate(dataloader): image = data[0] label = data[1] - output = model(image) - loss = criterion(output, label) - - #NOTE: division may be needed depending on the loss function - # Here no division is needed: - # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' - # - #loss = loss / accum_iter - - loss.backward() + if amp is True: + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() - if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): - optimizer.step() - optimizer.clear_grad() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() pred = F.softmax(output) acc = paddle.metric.accuracy(pred, label.unsqueeze(1)) - batch_size = image.shape[0] - train_loss_meter.update(loss.numpy()[0], batch_size) - train_acc_meter.update(acc.numpy()[0], batch_size) + batch_size = paddle.to_tensor(image.shape[0]) - if batch_id % debug_steps == 0: - logger.info( - f"Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + - f"Step[{batch_id:04d}/{total_batch:04d}], " + - f"Avg Loss: {train_loss_meter.avg:.4f}, " + - f"Avg Acc: {train_acc_meter.avg:.4f}") + # sync from other gpus for overall loss and acc + master_loss = loss.clone() + master_acc = acc.clone() + master_batch_size = batch_size.clone() + dist.all_reduce(master_loss) + dist.all_reduce(master_acc) + dist.all_reduce(master_batch_size) + master_loss 
= master_loss / dist.get_world_size() + master_acc = master_acc / dist.get_world_size() + master_train_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_train_acc_meter.update(master_acc.numpy()[0], master_batch_size.numpy()[0]) - train_time = time.time() - time_st - return train_loss_meter.avg, train_acc_meter.avg, train_time + train_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + train_acc_meter.update(acc.numpy()[0], batch_size.numpy()[0]) + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_train_loss_meter.avg:.4f}, " + + f"Avg Acc: {master_train_acc_meter.avg:.4f}") -def validate(dataloader, model, criterion, total_batch, debug_steps=100): + train_time = time.time() - time_st + return (train_loss_meter.avg, + train_acc_meter.avg, + master_train_loss_meter.avg, + master_train_acc_meter.avg, + train_time) + + +def validate(dataloader, + model, + criterion, + total_batch, + debug_steps=100, + local_logger=None, + master_logger=None): """Validation for whole dataset Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info + debug_steps: int, num of iters to log info, default: 100 + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None Returns: - val_loss_meter.avg - val_acc1_meter.avg - val_acc5_meter.avg - val_time + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + master_val_loss_meter.avg: float, average loss on all processes/gpus + master_val_acc1_meter.avg: float, average top1 accuracy on all processes/gpus + master_val_acc5_meter.avg: float, average top5 accuracy on all processes/gpus + val_time: float, validation time """ model.eval() val_loss_meter = AverageMeter() val_acc1_meter = AverageMeter() val_acc5_meter = AverageMeter() + master_val_loss_meter = AverageMeter() + master_val_acc1_meter = AverageMeter() + master_val_acc5_meter = AverageMeter() time_st = time.time() with paddle.no_grad(): @@ -169,56 +229,104 @@ def validate(dataloader, model, criterion, total_batch, debug_steps=100): acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) - dist.all_reduce(loss) - dist.all_reduce(acc1) - dist.all_reduce(acc5) - loss = loss / dist.get_world_size() - acc1 = acc1 / dist.get_world_size() - acc5 = acc5 / dist.get_world_size() - batch_size = paddle.to_tensor(image.shape[0]) - dist.all_reduce(batch_size) + + master_loss = loss.clone() + master_acc1 = acc1.clone() + master_acc5 = acc5.clone() + master_batch_size = batch_size.clone() + + dist.all_reduce(master_loss) + dist.all_reduce(master_acc1) + dist.all_reduce(master_acc5) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc1 = master_acc1 / dist.get_world_size() + master_acc5 = master_acc5 / dist.get_world_size() + + 
master_val_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc1_meter.update(master_acc1.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc5_meter.update(master_acc5.numpy()[0], master_batch_size.numpy()[0]) val_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) val_acc1_meter.update(acc1.numpy()[0], batch_size.numpy()[0]) val_acc5_meter.update(acc5.numpy()[0], batch_size.numpy()[0]) if batch_id % debug_steps == 0: - logger.info( - f"Val Step[{batch_id:04d}/{total_batch:04d}], " + - f"Avg Loss: {val_loss_meter.avg:.4f}, " + - f"Avg Acc@1: {val_acc1_meter.avg:.4f}, "+ - f"Avg Acc@5: {val_acc5_meter.avg:.4f}") - + if local_logger: + local_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {master_val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {master_val_acc5_meter.avg:.4f}") val_time = time.time() - time_st - return val_loss_meter.avg, val_acc1_meter.avg, val_acc5_meter.avg, val_time + return (val_loss_meter.avg, + val_acc1_meter.avg, + val_acc5_meter.avg, + master_val_loss_meter.avg, + master_val_acc1_meter.avg, + master_val_acc5_meter.avg, + val_time) def main_worker(*args): - # 0. Preparation + """main method for each process""" + # STEP 0: Preparation + config = args[0] dist.init_parallel_env() last_epoch = config.TRAIN.LAST_EPOCH - world_size = paddle.distributed.get_world_size() - local_rank = paddle.distributed.get_rank() - logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + world_size = dist.get_world_size() + local_rank = dist.get_rank() seed = config.SEED + local_rank paddle.seed(seed) np.random.seed(seed) random.seed(seed) - # 1. Create model + # logger for each process/gpu + local_logger = get_logger( + filename=os.path.join(config.SAVE, 'log_{}.txt'.format(local_rank)), + logger_name='local_logger') + # overall logger + if local_rank == 0: + master_logger = get_logger( + filename=os.path.join(config.SAVE, 'log.txt'), + logger_name='master_logger') + master_logger.info(f'\n{config}') + else: + master_logger = None + local_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + if local_rank == 0: + master_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + + # STEP 1: Create model model = build_model(config) model = paddle.DataParallel(model) - # 2. 
Create train and val dataloader - dataset_train, dataset_val = args[0], args[1] - dataloader_train = get_dataloader(config, dataset_train, 'train', True) + + # STEP 2: Create train and val dataloader + dataset_train, dataset_val = args[1], args[2] + # Create training dataloader + if not config.EVAL: + dataloader_train = get_dataloader(config, dataset_train, 'train', True) + total_batch_train = len(dataloader_train) + local_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + if local_rank == 0: + master_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + # Create validation dataloader dataloader_val = get_dataloader(config, dataset_val, 'test', True) - total_batch_train = len(dataloader_train) total_batch_val = len(dataloader_val) - logging.info(f'----- Total # of train batch (single gpu): {total_batch_train}') - logging.info(f'----- Total # of val batch (single gpu): {total_batch_val}') - # 3. Define criterion + local_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + if local_rank == 0: + master_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + + # STEP 3: Define criterion criterion = nn.CrossEntropyLoss() - # 4. Define optimizer and lr_scheduler + + # STEP 4: Define optimizer and lr_scheduler scheduler = None if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, @@ -240,7 +348,9 @@ def main_worker(*args): gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, last_epoch=last_epoch) else: - logging.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + local_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") if config.TRAIN.OPTIMIZER.NAME == "SGD": @@ -270,76 +380,124 @@ def main_worker(*args): #apply_decay_param_fun=get_exclude_from_weight_decay_fn(['pos_embed', 'cls_token']), ) else: - logging.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + local_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") - # 5. 
Load pretrained model / load resumt model and optimizer states + # STEP 5: Load pretrained model / load resumt model and optimizer states if config.MODEL.PRETRAINED: if (config.MODEL.PRETRAINED).endswith('.pdparams'): raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') model.set_dict(model_state) - logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + local_logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + if local_rank == 0: + master_logger.info( + f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") if config.MODEL.RESUME: - assert os.path.isfile(config.MODEL.RESUME+'.pdparams') is True - assert os.path.isfile(config.MODEL.RESUME+'.pdopt') is True - model_state = paddle.load(config.MODEL.RESUME+'.pdparams') + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') model.set_dict(model_state) opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') optimizer.set_state_dict(opt_state) - logger.info( - f"----- Resume Training: Load model and optmizer states from {config.MODEL.RESUME}") - - # 6. Validation + local_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + if local_rank == 0: + master_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + + # STEP 6: Validation (eval mode) if config.EVAL: - logger.info('----- Start Validating') - val_loss, val_acc1, val_acc5, val_time = validate( + local_logger.info('----- Start Validating') + if local_rank == 0: + master_logger.info('----- Start Validating') + + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( dataloader=dataloader_val, model=model, criterion=criterion, total_batch=total_batch_val, - debug_steps=config.REPORT_FREQ) - logger.info(f"Validation Loss: {val_loss:.4f}, " + - f"Validation Acc@1: {val_acc1:.4f}, " + - f"Validation Acc@5: {val_acc5:.4f}, " + - f"time: {val_time:.2f}") + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + + local_logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") return - # 6. Start training and validation - logging.info(f"Start training from epoch {last_epoch+1}.") + # STEP 7: Start training and validation (train mode) + local_logger.info(f"Start training from epoch {last_epoch+1}.") + if local_rank == 0: + master_logger.info(f"Start training from epoch {last_epoch+1}.") for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): # train - logging.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") - train_loss, train_acc, train_time = train(dataloader=dataloader_train, - model=model, - criterion=criterion, - optimizer=optimizer, - epoch=epoch, - total_batch=total_batch_train, - debug_steps=config.REPORT_FREQ, - accum_iter=config.TRAIN.ACCUM_ITER) + local_logger.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") + if local_rank == 0: + master_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + + train_loss, train_acc, avg_loss, avg_acc, train_time = train( + dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=total_batch_train, + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + amp=config.AMP, + local_logger=local_logger, + master_logger=master_logger) + scheduler.step() - logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + - f"Train Loss: {train_loss:.4f}, " + - f"Train Acc: {train_acc:.4f}, " + - f"time: {train_time:.2f}") + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {avg_loss:.4f}, " + + f"Train Acc: {avg_acc:.4f}, " + + f"time: {train_time:.2f}") + # validation if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: - logger.info(f'----- Validation after Epoch: {epoch}') - val_loss, val_acc1, val_acc5, val_time = validate( + local_logger.info(f'----- Validation after Epoch: {epoch}') + if local_rank == 0: + master_logger.info(f'----- Validation after Epoch: {epoch}') + + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( dataloader=dataloader_val, model=model, criterion=criterion, total_batch=total_batch_val, - debug_steps=config.REPORT_FREQ) - logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + - f"Validation Loss: {val_loss:.4f}, " + - f"Validation Acc@1: {val_acc1:.4f}, " + - f"Validation Acc@5: {val_acc5:.4f}, " + - f"time: {val_time:.2f}") + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") # model save if local_rank == 0: if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: @@ -347,15 +505,37 @@ def main_worker(*args): config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") paddle.save(model.state_dict(), model_path + '.pdparams') paddle.save(optimizer.state_dict(), model_path + '.pdopt') - logger.info(f"----- Save model: {model_path}.pdparams") - logger.info(f"----- Save optim: {model_path}.pdopt") + local_logger.info(f"----- Save model: {model_path}.pdparams") + local_logger.info(f"----- Save optim: {model_path}.pdopt") + if local_rank == 0: + master_logger.info(f"----- Save model: {model_path}.pdparams") + master_logger.info(f"----- Save optim: {model_path}.pdopt") def main(): - dataset_train = get_dataset(config, mode='train') + """main method for spawning multi process training/validation""" + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + + # set output folder + if not config.EVAL: + 
config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + + # get dataset and start DDP + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + else: + dataset_train = None dataset_val = get_dataset(config, mode='val') config.NGPUS = len(paddle.static.cuda_places()) if config.NGPUS == -1 else config.NGPUS - dist.spawn(main_worker, args=(dataset_train, dataset_val, ), nprocs=config.NGPUS) + dist.spawn(main_worker, args=(config, dataset_train, dataset_val, ), nprocs=config.NGPUS) if __name__ == "__main__": diff --git a/image_classification/ViT/main_single_gpu.py b/image_classification/ViT/main_single_gpu.py index ee7e6e1f..a2f26781 100644 --- a/image_classification/ViT/main_single_gpu.py +++ b/image_classification/ViT/main_single_gpu.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -26,53 +26,50 @@ import paddle.nn.functional as F from datasets import get_dataloader from datasets import get_dataset -from transformer import build_vit as build_model from utils import AverageMeter from utils import WarmupCosineScheduler from config import get_config from config import update_config +from transformer import build_vit as build_model -parser = argparse.ArgumentParser('ViT') -parser.add_argument('-cfg', type=str, default=None) -parser.add_argument('-dataset', type=str, default=None) -parser.add_argument('-batch_size', type=int, default=None) -parser.add_argument('-image_size', type=int, default=None) -parser.add_argument('-data_path', type=str, default=None) -parser.add_argument('-ngpus', type=int, default=None) -parser.add_argument('-pretrained', type=str, default=None) -parser.add_argument('-resume', type=str, default=None) -parser.add_argument('-last_epoch', type=int, default=None) -parser.add_argument('-eval', action='store_true') -args = parser.parse_args() - - -log_format = "%(asctime)s %(message)s" -logging.basicConfig(stream=sys.stdout, level=logging.INFO, - format=log_format, datefmt="%m%d %I:%M:%S %p") - -# get default config -config = get_config() -# update config by arguments -config = update_config(config, args) - -# set output folder -if not config.EVAL: - config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) -else: - config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) - -config.freeze() - -if not os.path.exists(config.SAVE): - os.makedirs(config.SAVE, exist_ok=True) - -# set logging format -logger = logging.getLogger() -fh = logging.FileHandler(os.path.join(config.SAVE, 'log.txt')) -fh.setFormatter(logging.Formatter(log_format)) -logger.addHandler(fh) -logger.info(f'config= {config}') +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('ViT') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + 
parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger def train(dataloader, @@ -80,45 +77,62 @@ def train(dataloader, criterion, optimizer, epoch, + total_epochs, total_batch, debug_steps=100, - accum_iter=1): + accum_iter=1, + amp=False, + logger=None): """Training for one epoch Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion epoch: int, current epoch - total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info - accum_iter: int, num of iters for accumulating gradients + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + amp: bool, if True, use mix precision training, default: False + logger: logger for logging, default: None Returns: - train_loss_meter.avg - train_acc_meter.avg - train_time + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + train_time: float, training time """ model.train() train_loss_meter = AverageMeter() train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) time_st = time.time() for batch_id, data in enumerate(dataloader): image = data[0] label = data[1] - output = model(image) - loss = criterion(output, label) - - #NOTE: division may be needed depending on the loss function - # Here no division is needed: - # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' - #loss = loss / accum_iter - - loss.backward() + if amp is True: + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() - if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): - optimizer.step() - optimizer.clear_grad() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + 
optimizer.step() + optimizer.clear_grad() pred = F.softmax(output) acc = paddle.metric.accuracy(pred, label.unsqueeze(1)) @@ -127,9 +141,9 @@ def train(dataloader, train_loss_meter.update(loss.numpy()[0], batch_size) train_acc_meter.update(acc.numpy()[0], batch_size) - if batch_id % debug_steps == 0: + if logger and batch_id % debug_steps == 0: logger.info( - f"Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + f"Step[{batch_id:04d}/{total_batch:04d}], " + f"Avg Loss: {train_loss_meter.avg:.4f}, " + f"Avg Acc: {train_acc_meter.avg:.4f}") @@ -138,19 +152,20 @@ def train(dataloader, return train_loss_meter.avg, train_acc_meter.avg, train_time -def validate(dataloader, model, criterion, total_batch, debug_steps=100): +def validate(dataloader, model, criterion, total_batch, debug_steps=100, logger=None): """Validation for whole dataset Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion - total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + logger: logger for logging, default: None Returns: - val_loss_meter.avg - val_acc1_meter.avg - val_acc5_meter.avg - val_time + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + val_time: float, valitaion time """ model.eval() val_loss_meter = AverageMeter() @@ -175,7 +190,7 @@ def validate(dataloader, model, criterion, total_batch, debug_steps=100): val_acc1_meter.update(acc1.numpy()[0], batch_size) val_acc5_meter.update(acc5.numpy()[0], batch_size) - if batch_id % debug_steps == 0: + if logger and batch_id % debug_steps == 0: logger.info( f"Val Step[{batch_id:04d}/{total_batch:04d}], " + f"Avg Loss: {val_loss_meter.avg:.4f}, " + @@ -187,21 +202,36 @@ def validate(dataloader, model, criterion, total_batch, debug_steps=100): def main(): - # 0. Preparation + # STEP 0: Preparation + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) last_epoch = config.TRAIN.LAST_EPOCH seed = config.SEED paddle.seed(seed) np.random.seed(seed) random.seed(seed) - #paddle.set_device('gpu:0') - # 1. Create model + logger = get_logger(filename=os.path.join(config.SAVE, 'log.txt')) + logger.info(f'\n{config}') + + # STEP 1: Create model model = build_model(config) - #model = paddle.DataParallel(model) - # 2. Create train and val dataloader - dataset_train = get_dataset(config, mode='train') + + # STEP 2: Create train and val dataloader + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + dataloader_train = get_dataloader(config, dataset_train, 'train', False) dataset_val = get_dataset(config, mode='val') - dataloader_train = get_dataloader(config, dataset_train, 'train', False) dataloader_val = get_dataloader(config, dataset_val, 'val', False) + # 3. Define criterion criterion = nn.CrossEntropyLoss() # 4. 
Define lr_scheduler @@ -213,8 +243,7 @@ def main(): end_lr=config.TRAIN.END_LR, warmup_epochs=config.TRAIN.WARMUP_EPOCHS, total_epochs=config.TRAIN.NUM_EPOCHS, - last_epoch=config.TRAIN.LAST_EPOCH, - ) + last_epoch=config.TRAIN.LAST_EPOCH) elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, T_max=config.TRAIN.NUM_EPOCHS, @@ -226,9 +255,9 @@ def main(): gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, last_epoch=last_epoch) else: - logging.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") - # 5. Define optimizer + if config.TRAIN.OPTIMIZER.NAME == "SGD": if config.TRAIN.GRAD_CLIP: clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) @@ -248,31 +277,35 @@ def main(): optimizer = paddle.optimizer.AdamW( parameters=model.parameters(), learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, - weight_decay=config.TRAIN.WEIGHT_DECAY, beta1=config.TRAIN.OPTIMIZER.BETAS[0], beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, epsilon=config.TRAIN.OPTIMIZER.EPS, grad_clip=clip) else: - logging.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") - # 6. Load pretrained model or load resume model and optimizer states + + # STEP 6: Load pretrained model or load resume model and optimizer states if config.MODEL.PRETRAINED: - assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') model.set_dict(model_state) logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") if config.MODEL.RESUME: - assert os.path.isfile(config.MODEL.RESUME+'.pdparams') is True - assert os.path.isfile(config.MODEL.RESUME+'.pdopt') is True - model_state = paddle.load(config.MODEL.RESUME+'.pdparams') + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') model.set_dict(model_state) opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') optimizer.set_state_dict(opt_state) logger.info( f"----- Resume: Load model and optmizer from {config.MODEL.RESUME}") - # 7. Validation + + # STEP 7: Validation (eval mode) if config.EVAL: logger.info('----- Start Validating') val_loss, val_acc1, val_acc5, val_time = validate( @@ -280,26 +313,30 @@ def main(): model=model, criterion=criterion, total_batch=len(dataloader_val), - debug_steps=config.REPORT_FREQ) + debug_steps=config.REPORT_FREQ, + logger=logger) logger.info(f"Validation Loss: {val_loss:.4f}, " + f"Validation Acc@1: {val_acc1:.4f}, " + f"Validation Acc@5: {val_acc5:.4f}, " + f"time: {val_time:.2f}") return - # 8. 
Start training and validation - logging.info(f"Start training from epoch {last_epoch+1}.") + + # STEP 8: Start training and validation (train mode) + logger.info(f"Start training from epoch {last_epoch+1}.") for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): # train - logging.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") train_loss, train_acc, train_time = train(dataloader=dataloader_train, model=model, criterion=criterion, optimizer=optimizer, epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, total_batch=len(dataloader_train), debug_steps=config.REPORT_FREQ, accum_iter=config.TRAIN.ACCUM_ITER, - ) + amp=config.AMP, + logger=logger) scheduler.step() logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + f"Train Loss: {train_loss:.4f}, " + @@ -313,7 +350,8 @@ def main(): model=model, criterion=criterion, total_batch=len(dataloader_val), - debug_steps=config.REPORT_FREQ) + debug_steps=config.REPORT_FREQ, + logger=logger) logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + f"Validation Loss: {val_loss:.4f}, " + f"Validation Acc@1: {val_acc1:.4f}, " + diff --git a/image_classification/ViT/port_weights/__init__.py b/image_classification/ViT/port_weights/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/image_classification/ViT/port_weights/load_pytorch_weights.py b/image_classification/ViT/port_weights/load_pytorch_weights.py index ffe1902c..e4b8a89e 100644 --- a/image_classification/ViT/port_weights/load_pytorch_weights.py +++ b/image_classification/ViT/port_weights/load_pytorch_weights.py @@ -21,8 +21,14 @@ from config import * -config = get_config('./configs/vit_base_patch16_224.yaml') -print(config) +#model_name = 'vit_base_patch32_224' +#model_name = 'vit_base_patch32_384' + +model_name = 'vit_large_patch32_384' +#model_name = 'vit_large_patch16_384' +sz = int(model_name[-3::]) + +config = get_config(f'./configs/{model_name}.yaml') def print_model_named_params(model): @@ -47,7 +53,13 @@ def torch_to_paddle_mapping(): ('patch_embed.proj', f'{prefix}.patch_embedding'), ] - num_layers = 12 + if 'large' in model_name: + num_layers = 24 + elif 'base' in model_name: + num_layers = 12 + else: + raise ValueError('now only support large and base model conversion') + for idx in range(num_layers): pp_prefix = f'encoder.layers.{idx}' th_prefix = f'blocks.{idx}' @@ -129,7 +141,8 @@ def main(): print('+++++++++++++++++++++++++++++++++++') device = torch.device('cpu') - torch_model = timm.create_model('vit_base_patch16_224', pretrained=True) + torch_model = timm.create_model(model_name, pretrained=True) + #torch_model = timm.create_model('vit_base_patch16_224', pretrained=True) torch_model = torch_model.to(device) torch_model.eval() print_model_named_params(torch_model) @@ -139,7 +152,8 @@ def main(): paddle_model = convert(torch_model, paddle_model) # check correctness - x = np.random.randn(2, 3, 224, 224).astype('float32') + x = np.random.randn(2, 3, sz, sz).astype('float32') + #x = np.random.randn(2, 3, 224, 224).astype('float32') x_paddle = paddle.to_tensor(x) x_torch = torch.Tensor(x).to(device) @@ -156,7 +170,8 @@ def main(): assert np.allclose(out_torch, out_paddle, atol = 1e-5) # save weights for paddle model - model_path = os.path.join('./vit_base_patch16_224.pdparams') + model_path = os.path.join(f'./{model_name}.pdparams') + #model_path = os.path.join('./vit_base_patch16_224.pdparams') paddle.save(paddle_model.state_dict(), 
model_path) print('all done') diff --git a/image_classification/ViT/run_eval_multi.sh b/image_classification/ViT/run_eval_multi.sh index efd9c34e..6ebcea8d 100644 --- a/image_classification/ViT/run_eval_multi.sh +++ b/image_classification/ViT/run_eval_multi.sh @@ -2,7 +2,7 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 \ python main_multi_gpu.py \ -cfg='./configs/vit_base_patch16_224.yaml' \ -dataset='imagenet2012' \ --batch_size=32 \ +-batch_size=64 \ -data_path='/dataset/imagenet' \ -eval \ -pretrained='./vit_base_patch16_224' \ diff --git a/image_classification/ViT/run_eval_multi_384.sh b/image_classification/ViT/run_eval_multi_384.sh index 0a771aa8..93c3eaec 100644 --- a/image_classification/ViT/run_eval_multi_384.sh +++ b/image_classification/ViT/run_eval_multi_384.sh @@ -1,8 +1,8 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 \ python main_multi_gpu.py \ --cfg='./configs/vit_base_patch16_384.yaml' \ +-cfg='./configs/vit_base_patch32_384.yaml' \ -dataset='imagenet2012' \ -batch_size=32 \ -data_path='/dataset/imagenet' \ -eval \ --pretrained='./vit_base_patch16_384' +-pretrained='./vit_base_patch32_384' diff --git a/image_classification/ViT/run_train.sh b/image_classification/ViT/run_train.sh index cfb5f0b1..edd7cf9c 100644 --- a/image_classification/ViT/run_train.sh +++ b/image_classification/ViT/run_train.sh @@ -4,3 +4,4 @@ python main_single_gpu.py \ -dataset='imagenet2012' \ -batch_size=8 \ -data_path='/dataset/imagenet' \ +-amp diff --git a/image_classification/ViT/run_train_multi.sh b/image_classification/ViT/run_train_multi.sh index 93488e42..9997bb19 100644 --- a/image_classification/ViT/run_train_multi.sh +++ b/image_classification/ViT/run_train_multi.sh @@ -4,3 +4,4 @@ python main_multi_gpu.py \ -dataset='imagenet2012' \ -batch_size=8 \ -data_path='/dataset/imagenet' \ +-amp diff --git a/image_classification/ViT/stat_define.py b/image_classification/ViT/stat_define.py new file mode 100644 index 00000000..ebce2722 --- /dev/null +++ b/image_classification/ViT/stat_define.py @@ -0,0 +1,61 @@ +import os +import glob +import paddle +from config import get_config +from transformer import build_vit as build_model + +def count_gelu(layer, input, output): + activation_flops = 8 + x = input[0] + num = x.numel() + layer.total_ops += num * activation_flops + + +def count_softmax(layer, input, output): + softmax_flops = 5 # max/substract, exp, sum, divide + x = input[0] + num = x.numel() + layer.total_ops += num * softmax_flops + + +def count_layernorm(layer, input, output): + layer_norm_flops = 5 # get mean (sum), get variance (square and sum), scale(multiply) + x = input[0] + num = x.numel() + layer.total_ops += num * layer_norm_flops + + +cfg = './configs/vit_large_patch32_384.yaml' +#input_size = (1, 3, 224, 224) +input_size = (1, 3, 384, 384) +config = get_config(cfg) +model = build_model(config) + +custom_ops = {paddle.nn.GELU: count_gelu, + paddle.nn.LayerNorm: count_layernorm, + paddle.nn.Softmax: count_softmax, + } +print(os.path.basename(cfg)) +paddle.flops(model, + input_size=input_size, + custom_ops=custom_ops, + print_detail=False) + + +#for cfg in glob.glob('./configs/*.yaml'): +# #cfg = './configs/swin_base_patch4_window7_224.yaml' +# input_size = (1, 3, int(cfg[-8:-5]), int(cfg[-8:-5])) +# config = get_config(cfg) +# model = build_model(config) +# +# +# custom_ops = {paddle.nn.GELU: count_gelu, +# paddle.nn.LayerNorm: count_layernorm, +# paddle.nn.Softmax: count_softmax, +# } +# print(os.path.basename(cfg)) +# paddle.flops(model, +# input_size=input_size, +# custom_ops=custom_ops, +# 
print_detail=False) +# print('-----------') diff --git a/image_classification/ViT/transformer.py b/image_classification/ViT/transformer.py index 24135988..e1fc9d93 100644 --- a/image_classification/ViT/transformer.py +++ b/image_classification/ViT/transformer.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -20,7 +20,6 @@ import paddle import paddle.nn as nn from droppath import DropPath -from config import get_config class Identity(nn.Layer): @@ -29,7 +28,7 @@ class Identity(nn.Layer): Use this layer to avoid using 'if' condition in forward methods """ def __init__(self): - super(Identity, self).__init__() + super().__init__() def forward(self, x): return x @@ -37,16 +36,13 @@ def forward(self, x): class PatchEmbedding(nn.Layer): """Patch Embedding and Position Embedding - Apply patch embedding and position embedding on input images. - Attributes: patch_embddings: impl using a patch_size x patch_size Conv2D operation position_embddings: a parameter with len = num_patch + 1(for cls_token) cls_token: token insert to the patch feature for classification dropout: dropout for embeddings """ - def __init__(self, image_size=224, patch_size=16, @@ -62,7 +58,7 @@ def __init__(self, stride=patch_size) self.position_embeddings = paddle.create_parameter( - shape=[1, n_patches+1, embed_dim], + shape=[1, n_patches + 1, embed_dim], dtype='float32', default_initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) @@ -80,17 +76,15 @@ def forward(self, x): x = x.transpose([0, 2, 1]) x = paddle.concat((cls_tokens, x), axis=1) - embeddings = x + self.position_embeddings # tensor broadcast + embeddings = x + self.position_embeddings # tensor broadcast embeddings = self.dropout(embeddings) return embeddings class Attention(nn.Layer): """ Attention module - Attention module for ViT, here q, k, v are assumed the same. The qkv mappings are stored as one single param. - Attributes: num_heads: number of heads attn_head_size: feature dim of single head @@ -105,24 +99,50 @@ class Attention(nn.Layer): def __init__(self, embed_dim, num_heads, + attn_head_size=None, qkv_bias=True, dropout=0., attention_dropout=0.): super().__init__() - self.num_heads = num_heads - self.attn_head_size = int(embed_dim / self.num_heads) - self.all_head_size = self.attn_head_size * self.num_heads + + assert isinstance(embed_dim, int), ( + f"Expected the type of `embed_dim` to be {int}, but received {type(embed_dim)}.") + assert isinstance(num_heads, int), ( + f"Expected the type of `num_heads` to be {int}, but received {type(num_heads)}.") + + assert embed_dim > 0, ( + f"Expected `embed_dim` to be greater than 0, but received {embed_dim}") + assert num_heads > 0, ( + f"Expected `num_heads` to be greater than 0, but received {num_heads}") + + self.embed_dim = embed_dim + self.num_heads = num_heads + + if attn_head_size is not None: + assert isinstance(attn_head_size, int), ( + f"Expected the type of `attn_head_size` to be {int}, " + f"but received {type(attn_head_size)}.") + assert attn_head_size > 0, f"Expected `attn_head_size` to be greater than 0," \ + f" but received {attn_head_size}." 
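# When attn_head_size is set explicitly, all_head_size = attn_head_size * num_heads may
# differ from embed_dim: qkv projects embed_dim -> 3 * all_head_size and self.out maps
# all_head_size back to embed_dim. When attn_head_size is None (the default), it falls
# back to embed_dim // num_heads, which must divide evenly.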
+ self.attn_head_size = attn_head_size + else: + self.attn_head_size = embed_dim // num_heads + assert self.attn_head_size * num_heads == embed_dim, ( + f"`embed_dim` must be divisible by `num_heads`," + f" but received embed_dim={embed_dim}, num_heads={num_heads}.") + + self.all_head_size = self.attn_head_size * num_heads w_attr_1, b_attr_1 = self._init_weights() self.qkv = nn.Linear(embed_dim, - self.all_head_size*3, #weights for q, k, and v + self.all_head_size * 3, # weights for q, k, and v weight_attr=w_attr_1, bias_attr=b_attr_1 if qkv_bias else False) self.scales = self.attn_head_size ** -0.5 w_attr_2, b_attr_2 = self._init_weights() - self.out = nn.Linear(embed_dim, + self.out = nn.Linear(self.all_head_size, embed_dim, weight_attr=w_attr_2, bias_attr=b_attr_2) @@ -132,8 +152,8 @@ def __init__(self, self.softmax = nn.Softmax(axis=-1) def _init_weights(self): - weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform()) - bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform()) + weight_attr = paddle.ParamAttr(initializer=nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(0.0)) return weight_attr, bias_attr def transpose_multihead(self, x): @@ -149,7 +169,6 @@ def forward(self, x): attn = paddle.matmul(q, k, transpose_y=True) attn = attn * self.scales attn = self.softmax(attn) - attn_weights = attn attn = self.attn_dropout(attn) z = paddle.matmul(attn, v) @@ -159,15 +178,13 @@ def forward(self, x): # reshape z = self.out(z) z = self.proj_dropout(z) - return z, attn_weights + return z class Mlp(nn.Layer): """ MLP module - Impl using nn.Linear and activation is GELU, dropout is applied. Ops: fc -> act -> dropout -> fc -> dropout - Attributes: fc1: nn.Linear fc2: nn.Linear @@ -175,6 +192,7 @@ class Mlp(nn.Layer): dropout1: dropout after fc1 dropout2: dropout after fc2 """ + def __init__(self, embed_dim, mlp_ratio, @@ -197,9 +215,9 @@ def __init__(self, def _init_weights(self): weight_attr = paddle.ParamAttr( - initializer=paddle.nn.initializer.XavierUniform()) #default in pp: xavier + initializer=paddle.nn.initializer.TruncatedNormal(std=0.2)) bias_attr = paddle.ParamAttr( - initializer=paddle.nn.initializer.Normal(std=1e-6)) #default in pp: zero + initializer=paddle.nn.initializer.Constant(0.0)) return weight_attr, bias_attr def forward(self, x): @@ -213,9 +231,7 @@ def forward(self, x): class EncoderLayer(nn.Layer): """Encoder Layer - Encoder layer contains attention, norm, mlp and residual - Attributes: hidden_size: transformer feature dim attn_norm: nn.LayerNorm before attention @@ -226,6 +242,7 @@ class EncoderLayer(nn.Layer): def __init__(self, embed_dim, num_heads, + attn_head_size=None, qkv_bias=True, mlp_ratio=4., dropout=0., @@ -240,6 +257,7 @@ def __init__(self, self.attn = Attention(embed_dim, num_heads, + attn_head_size, qkv_bias, dropout, attention_dropout) @@ -254,14 +272,14 @@ def __init__(self, self.mlp = Mlp(embed_dim, mlp_ratio, dropout) def _init_weights(self): - weight_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(0.0)) - bias_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(1.0)) + weight_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(1.0)) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(0.0)) return weight_attr, bias_attr def forward(self, x): h = x x = self.attn_norm(x) - x, attn = self.attn(x) + x = self.attn(x) x = self.drop_path(x) x = x + h @@ -271,38 +289,39 @@ def forward(self, x): x = self.drop_path(x) x = x + h 
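# second residual branch: mlp_norm -> Mlp -> DropPath, then add the shortcut h;
# with attention weights no longer returned, forward now yields only the encoded features.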
- return x, attn + return x class Encoder(nn.Layer): """Transformer encoder - Encoder encoder contains a list of EncoderLayer, and a LayerNorm. - Attributes: layers: nn.LayerList contains multiple EncoderLayers encoder_norm: nn.LayerNorm which is applied after last encoder layer """ + def __init__(self, embed_dim, num_heads, depth, + attn_head_size=None, qkv_bias=True, mlp_ratio=4.0, dropout=0., attention_dropout=0., droppath=0.): - super(Encoder, self).__init__() + super().__init__() # stochatic depth decay depth_decay = [x.item() for x in paddle.linspace(0, droppath, depth)] layer_list = [] for i in range(depth): encoder_layer = EncoderLayer(embed_dim, num_heads, - qkv_bias=True, - mlp_ratio=4., - dropout=0., - attention_dropout=0., + attn_head_size=attn_head_size, + qkv_bias=qkv_bias, + mlp_ratio=mlp_ratio, + dropout=dropout, + attention_dropout=attention_dropout, droppath=depth_decay[i]) layer_list.append(copy.deepcopy(encoder_layer)) self.layers = nn.LayerList(layer_list) @@ -314,26 +333,22 @@ def __init__(self, epsilon=1e-6) def _init_weights(self): - weight_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(0.0)) - bias_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(1.0)) + weight_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(1.0)) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(0.0)) return weight_attr, bias_attr def forward(self, x): - self_attn = [] for layer in self.layers: - x, attn = layer(x) - self_attn.append(attn) + x = layer(x) out = self.encoder_norm(x) - return out, self_attn + return out class VisualTransformer(nn.Layer): """ViT transformer - ViT Transformer, classifier is a single Linear layer for finetune, For training from scratch, two layer mlp should be used. Classification is done using cls_token. 
- Args: image_size: int, input image size, default: 224 patch_size: int, patch size, default: 16 @@ -356,13 +371,14 @@ def __init__(self, embed_dim=768, depth=12, num_heads=12, + attn_head_size=None, mlp_ratio=4, qkv_bias=True, dropout=0., attention_dropout=0., droppath=0., train_from_scratch=False): - super(VisualTransformer, self).__init__() + super().__init__() # create patch embedding with positional embedding self.patch_embedding = PatchEmbedding(image_size, patch_size, @@ -373,6 +389,7 @@ def __init__(self, self.encoder = Encoder(embed_dim, num_heads, depth, + attn_head_size, qkv_bias, mlp_ratio, dropout, @@ -384,20 +401,20 @@ def __init__(self, w_attr_1, b_attr_1 = self._init_weights() w_attr_2, b_attr_2 = self._init_weights() self.classifier = nn.Sequential( - nn.Linear(config.MODEL.TRANS.HIDDEN_SIZE, - config.MODEL.TRANS.HIDDEN_SIZE, - weight_attr=w_attr_1, - bias_attr=b_attr_1), - nn.ReLU(), - nn.Dropout(config.MODEL.DROPOUT), - nn.Linear(config.MODEL.TRANS.HIDDEN_SIZE, - config.MODEL.NUM_CLASSES, - weight_attr=w_attr_2, - bias_attr=b_attr_2), - nn.Dropout(config.MODEL.DROPOUT), - ) + nn.Linear(embed_dim, + embed_dim, + weight_attr=w_attr_1, + bias_attr=b_attr_1), + nn.ReLU(), + nn.Dropout(dropout), + nn.Linear(embed_dim, + num_classes, + weight_attr=w_attr_2, + bias_attr=b_attr_2), + nn.Dropout(dropout), + ) else: - # classifier head (for finetuning) + # classifier head (for finetuning) w_attr_1, b_attr_1 = self._init_weights() self.classifier = nn.Linear(embed_dim, num_classes, @@ -406,26 +423,28 @@ def __init__(self, def _init_weights(self): weight_attr = paddle.ParamAttr( - initializer=paddle.nn.initializer.KaimingUniform()) + initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) bias_attr = paddle.ParamAttr( - initializer=paddle.nn.initializer.KaimingUniform()) + initializer=paddle.nn.initializer.Constant(0.0)) return weight_attr, bias_attr def forward(self, x): x = self.patch_embedding(x) - x, attn = self.encoder(x) - logits = self.classifier(x[:, 0]) # take only cls_token as classifier + x = self.encoder(x) + logits = self.classifier(x[:, 0]) # take only cls_token as classifier return logits def build_vit(config): + """build vit model from config""" model = VisualTransformer(image_size=config.DATA.IMAGE_SIZE, patch_size=config.MODEL.TRANS.PATCH_SIZE, - in_channels=3, + in_channels=config.DATA.IMAGE_CHANNELS, num_classes=config.MODEL.NUM_CLASSES, embed_dim=config.MODEL.TRANS.EMBED_DIM, depth=config.MODEL.TRANS.DEPTH, num_heads=config.MODEL.TRANS.NUM_HEADS, + attn_head_size=config.MODEL.TRANS.ATTN_HEAD_SIZE, mlp_ratio=config.MODEL.TRANS.MLP_RATIO, qkv_bias=config.MODEL.TRANS.QKV_BIAS, dropout=config.MODEL.DROPOUT, diff --git a/image_classification/ViT/utils.py b/image_classification/ViT/utils.py index 44800527..ab0345aa 100644 --- a/image_classification/ViT/utils.py +++ b/image_classification/ViT/utils.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. diff --git a/image_classification/XCiT/README.md b/image_classification/XCiT/README.md new file mode 100644 index 00000000..1a19f12d --- /dev/null +++ b/image_classification/XCiT/README.md @@ -0,0 +1,194 @@ +# XCiT: Cross-Covariance Image Transformer, [arxiv](https://arxiv.org/pdf/2106.09681.pdf) + +PaddlePaddle training/validation code and pretrained models for **XCiT**. 
+ +The official pytorch implementation is [here](https://github.com/facebookresearch/xcit). + +This implementation is developed by [PaddleViT](https://github.com/BR-IDL/PaddleViT.git). + +

+[Figure: XCiT Model Overview]

+ + + +### Update + +* Update (2021-12-8): Code is updated and ported weights are uploaded. +* Update (2021-11-7): Code is released + +## Models Zoo + +| Model | Acc@1 | Acc@5 | #Params | FLOPs | Image Size | Crop_pct | Interpolation | Link | +| --------------------------- | ------ | ------ | ------- | ----- | ---------- | -------- | ------------- | ---- | +| xcit_nano_12_p16_224_dist | 72.32 | 90.86 | 0.6G | 3.1M | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/14FsYtm48JB-rQFF9CanJsJaPESniWD7q/view?usp=sharing)/[baidu](https://pan.baidu.com/s/15kdY4vzwU2QiBSU5127AYA)(7qvz) | +| xcit_nano_12_p16_384_dist | 75.46 | 92.70 | 1.6G | 3.1M | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1zR-hFQryocF9muG-erzcxFuJme5y_e9f/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1449qtQzEMg6lqdtClyiCRQ)(1y2j) | +| xcit_large_24_p16_224_dist | 84.92 | 97.13 | 35.9G | 189.1M | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1lAtko_KwOagjwaFvUkeXirVClXCV8gt-/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1Gs401mXqG1bifi1hBdXtig)(kfv8) | +| xcit_large_24_p16_384_dist | 85.76 | 97.54 | 105.5G | 189.1M | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/15djnKz_-eooncvyZp_UTwOiHIm1Hxo_G/view?usp=sharing)/[baidu](https://pan.baidu.com/s/14583hbtIVbZ_2ifZepQItQ)(ffq3) | +| xcit_nano_12_p8_224_dist | 76.33 | 93.10 | 2.2G | 3.0M | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1XxRNjskLvSVp6lvhlsnylq6g7vd_5MsI/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1DZJxuahFJyz-rEEsCqhhrA)(jjs7) | +| xcit_nano_12_p8_384_dist | 77.82 | 94.04 | 6.3G | 3.0M | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1P3ln8JqLzMKbJAhCanRbu7i5NMPVFNec/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1ECY9-PVDMNSup8NMQiqBrw)(dmc1) | +| xcit_large_24_p8_224_dist | 85.40 | 97.40 | 141.4G | 188.9M | 224 | 1.0 | bicubic | [google](https://drive.google.com/file/d/14ZoDxEez5NKVNAsbgjTPisjOQEAA30Wy/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1D_zyvjzIVFp6iqx1s7IEbA)(y7gw) | +| xcit_large_24_p8_384_dist | 85.99 | 97.69 | 415.5G | 188.9M | 384 | 1.0 | bicubic | [google](https://drive.google.com/file/d/1stcUwwFNJ38mdaFsNXq24CBMmDenJ_e4/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1lwbBk7GFuqnnP_iU2OuDRw)(9xww) | +> *The results are evaluated on ImageNet2012 validation set. + +## Notebooks + +We provide a few notebooks in aistudio to help you get started: + +**\*(coming soon)\*** + + +## Requirements + +- Python>=3.6 +- yaml>=0.2.5 +- [PaddlePaddle](https://www.paddlepaddle.org.cn/documentation/docs/en/install/index_en.html)>=2.1.0 +- [yacs](https://github.com/rbgirshick/yacs)>=0.1.8 + +## Data + +ImageNet2012 dataset is used in the following folder structure: + +``` +│imagenet/ +├──train/ +│ ├── n01440764 +│ │ ├── n01440764_10026.JPEG +│ │ ├── n01440764_10027.JPEG +│ │ ├── ...... +│ ├── ...... +├──val/ +│ ├── n01440764 +│ │ ├── ILSVRC2012_val_00000293.JPEG +│ │ ├── ILSVRC2012_val_00002138.JPEG +│ │ ├── ...... +│ ├── ...... +``` + +## Usage + +To use the model with pretrained weights, download the `.pdparam` weight file and change related file paths in the following python scripts. The model config files are located in `./configs/`. 
+
+For example, assume the downloaded weight file is stored in `./xcit_nano_12_p16_224_dist.pdparams`, to use the `xcit_nano_12_p16_224` model in python:
+
+```python
+import paddle
+from config import get_config
+from xcit import build_xcit as build_model
+# config files in ./configs/
+config = get_config('./configs/xcit_nano_12_p16_224.yaml')
+# build model
+model = build_model(config)
+# load pretrained weights, .pdparams is NOT needed
+model_state_dict = paddle.load('./xcit_nano_12_p16_224_dist')
+model.set_dict(model_state_dict)
+```
+
+## Evaluation
+
+To evaluate XCiT model performance on ImageNet2012 with a single GPU, run the following script using command line:
+
+```shell
+sh run_eval.sh
+```
+
+or
+
+```shell
+CUDA_VISIBLE_DEVICES=0 \
+python main_single_gpu.py \
+    -cfg='./configs/xcit_nano_12_p16_224.yaml' \
+    -dataset='imagenet2012' \
+    -batch_size=16 \
+    -data_path='/dataset/imagenet' \
+    -eval \
+    -pretrained='./xcit_nano_12_p16_224_dist'
+```
+
+ + + +Run evaluation using multi-GPUs: + + + +```shell +sh run_eval_multi.sh +``` + +or + +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg='./configs/xcit_nano_12_p16_224.yaml' \ + -dataset='imagenet2012' \ + -batch_size=16 \ + -data_path='/dataset/imagenet' \ + -eval \ + -pretrained='./xcit_nano_12_p16_224_dist' +``` + +
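+
+The same commands apply to the other variants in the model zoo: point `-cfg` at the matching file under `./configs/` and pass the corresponding weight prefix. For example, a sketch for evaluating the 384-resolution nano model, assuming the downloaded weight file follows the naming used in the table above:
+
+```shell
+CUDA_VISIBLE_DEVICES=0 \
+python main_single_gpu.py \
+    -cfg='./configs/xcit_nano_12_p16_384.yaml' \
+    -dataset='imagenet2012' \
+    -batch_size=16 \
+    -data_path='/dataset/imagenet' \
+    -eval \
+    -pretrained='./xcit_nano_12_p16_384_dist'
+```
+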
+
+
+## Training
+
+To train the XCiT model on ImageNet2012 with a single GPU, run the following script using command line:
+
+```shell
+sh run_train.sh
+```
+
+or
+
+```shell
+CUDA_VISIBLE_DEVICES=0 \
+python main_single_gpu.py \
+    -cfg='./configs/xcit_nano_12_p16_224.yaml' \
+    -dataset='imagenet2012' \
+    -batch_size=32 \
+    -data_path='/dataset/imagenet'
+```
+
+
+
+
+Run training using multi-GPUs:
+
+
+
+```shell
+sh run_train_multi.sh
+```
+
+or
+
+```shell
+CUDA_VISIBLE_DEVICES=0,1,2,3 \
+python main_multi_gpu.py \
+    -cfg='./configs/xcit_nano_12_p16_224.yaml' \
+    -dataset='imagenet2012' \
+    -batch_size=16 \
+    -data_path='/dataset/imagenet'
+```
+
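+
+Training can also be resumed from a saved checkpoint and run with mixed precision: the launcher `main_multi_gpu.py` accepts `-resume`, `-last_epoch` and `-amp` arguments. The command below is only a sketch; the checkpoint prefix and epoch number are placeholders, and the `.pdparams`/`.pdopt` suffixes are appended by the script itself:
+
+```shell
+# CKPT_PREFIX is a placeholder: path to a saved checkpoint, without the .pdparams/.pdopt extension
+CUDA_VISIBLE_DEVICES=0,1,2,3 \
+python main_multi_gpu.py \
+    -cfg='./configs/xcit_nano_12_p16_224.yaml' \
+    -dataset='imagenet2012' \
+    -batch_size=16 \
+    -data_path='/dataset/imagenet' \
+    -amp \
+    -last_epoch=100 \
+    -resume=$CKPT_PREFIX
+```
+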
+ + +## Visualization Attention Map + +**(coming soon)** + +## Reference + +``` +@article{el2021xcit, + title={XCiT: Cross-Covariance Image Transformers}, + author={El-Nouby, Alaaeldin and Touvron, Hugo and Caron, Mathilde and Bojanowski, Piotr and Douze, Matthijs and Joulin, Armand and Laptev, Ivan and Neverova, Natalia and Synnaeve, Gabriel and Verbeek, Jakob and others}, + journal={arXiv preprint arXiv:2106.09681}, + year={2021} +} +``` diff --git a/image_classification/XCiT/__init__.py b/image_classification/XCiT/__init__.py new file mode 100644 index 00000000..e2cbd538 --- /dev/null +++ b/image_classification/XCiT/__init__.py @@ -0,0 +1 @@ +#init diff --git a/image_classification/XCiT/augment.py b/image_classification/XCiT/augment.py new file mode 100644 index 00000000..7a7f081c --- /dev/null +++ b/image_classification/XCiT/augment.py @@ -0,0 +1,285 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Augmentation""" +""" Rand Augmentation """ +# reference: RandAugment: Practical automated data augmentation with a reduced search space +# https://arxiv.org/abs/1909.13719 + +""" Auto Augmentation """ +# reference: AutoAugment: Learning Augmentation Policies from Data +# https://arxiv.org/abs/1805.09501 + +import random +import numpy as np +from PIL import Image, ImageEnhance, ImageOps + + +def auto_augment_policy_original(): + """25 types of augment policies in original paper""" + policy = [ + [('Posterize', 0.4, 8), ('Rotate', 0.6, 9)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + [('Posterize', 0.6, 7), ('Posterize', 0.6, 6)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Equalize', 0.4, 4), ('Rotate', 0.8, 8)], + [('Solarize', 0.6, 3), ('Equalize', 0.6, 7)], + [('Posterize', 0.8, 5), ('Equalize', 1.0, 2)], + [('Rotate', 0.2, 3), ('Solarize', 0.6, 8)], + [('Equalize', 0.6, 8), ('Posterize', 0.4, 6)], + [('Rotate', 0.8, 8), ('Color', 0.4, 0)], + [('Rotate', 0.4, 9), ('Equalize', 0.6, 2)], + [('Equalize', 0.0, 7), ('Equalize', 0.8, 8)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Rotate', 0.8, 8), ('Color', 1.0, 2)], + [('Color', 0.8, 8), ('Solarize', 0.8, 7)], + [('Sharpness', 0.4, 7), ('Invert', 0.6, 8)], + [('ShearX', 0.6, 5), ('Equalize', 1.0, 9)], + [('Color', 0.4, 0), ('Equalize', 0.6, 3)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + ] + policy = [[SubPolicy(*args) for args in subpolicy] for subpolicy in policy] + return policy + + +def rand_augment_policy_original(magnitude_idx=9): + """ + 14 types of augment policies in original paper + Args: + magnitude_idx: M + """ + policy = [ + ('Posterize', 1, magnitude_idx), ('Rotate', 1, magnitude_idx), + ('Solarize', 1, magnitude_idx), ('AutoContrast', 1, magnitude_idx), + 
('Equalize', 1, magnitude_idx), ('Contrast', 1, magnitude_idx), + ('Color', 1, magnitude_idx), ('Invert', 1, magnitude_idx), + ('Sharpness', 1, magnitude_idx), ('Brightness', 1, magnitude_idx), + ('ShearX', 1, magnitude_idx), ('ShearY', 1, magnitude_idx), + ('TranslateX', 1, magnitude_idx), ('TranslateY', 1, magnitude_idx), + ] + policy = [SubPolicy(*args) for args in policy] + return policy + + +class AutoAugment(): + """Auto Augment + Randomly choose a tuple of augment ops from a list of policy + Then apply the tuple of augment ops to input image + + Examples: + policy = auto_augment_policy_original() + augment = AutoAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy): + self.policy = policy + + def __call__(self, image, policy_idx=None): + if policy_idx is None: + policy_idx = random.randint(0, len(self.policy) - 1) + + sub_policy = self.policy[policy_idx] + for op in sub_policy: + image = op(image) + return image + + +class RandAugment(): + """Rand Augment + Randomly choose N augment ops from a list of K policies + Then apply the N ops to input image + + Examples: + policy = rand_augment_policy_original(magnitude_idx) + augment = RandAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy, num_layers=2): + """ + Args: + policy: list of SubPolicy + num_layers: int + """ + self.policy = policy + self.num_layers = num_layers + + def __call__(self, image): + selected_idx = np.random.choice(len(self.policy), self.num_layers) + + for policy_idx in selected_idx: + sub_policy = self.policy[policy_idx] + image = sub_policy(image) + return image + + +class SubPolicy: + """Subpolicy + Read augment name and magnitude, apply augment with probability + Args: + op_name: str, augment operation name + prob: float, if prob > random prob, apply augment + magnitude_idx: int, index of magnitude in preset magnitude ranges + """ + + def __init__(self, op_name, prob, magnitude_idx): + # ranges of operations' magnitude + ranges = { + 'ShearX': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'ShearY': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'TranslateX': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'TranslateY': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'Rotate': np.linspace(0, 30, 10), # [-30, 30] (by random negative) + 'Color': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Posterize': np.round(np.linspace(8, 4, 10), 0).astype(np.int), # [0, 4] + 'Solarize': np.linspace(256, 0, 10), # [0, 256] + 'Contrast': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Sharpness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Brightness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'AutoContrast': [0] * 10, # no range + 'Equalize': [0] * 10, # no range + 'Invert': [0] * 10, # no range + } + + # augmentation operations + # Lambda is not pickleable for DDP + # image_ops = { + # 'ShearX': lambda image, magnitude: shear_x(image, magnitude), + # 'ShearY': lambda image, magnitude: shear_y(image, magnitude), + # 'TranslateX': lambda image, magnitude: translate_x(image, magnitude), + # 'TranslateY': lambda image, magnitude: translate_y(image, magnitude), + # 'Rotate': lambda image, magnitude: rotate(image, magnitude), + # 'AutoContrast': lambda image, magnitude: auto_contrast(image, magnitude), + # 'Invert': lambda image, magnitude: invert(image, magnitude), + # 'Equalize': lambda image, magnitude: 
equalize(image, magnitude), + # 'Solarize': lambda image, magnitude: solarize(image, magnitude), + # 'Posterize': lambda image, magnitude: posterize(image, magnitude), + # 'Contrast': lambda image, magnitude: contrast(image, magnitude), + # 'Color': lambda image, magnitude: color(image, magnitude), + # 'Brightness': lambda image, magnitude: brightness(image, magnitude), + # 'Sharpness': lambda image, magnitude: sharpness(image, magnitude), + # } + image_ops = { + 'ShearX': shear_x, + 'ShearY': shear_y, + 'TranslateX': translate_x_relative, + 'TranslateY': translate_y_relative, + 'Rotate': rotate, + 'AutoContrast': auto_contrast, + 'Invert': invert, + 'Equalize': equalize, + 'Solarize': solarize, + 'Posterize': posterize, + 'Contrast': contrast, + 'Color': color, + 'Brightness': brightness, + 'Sharpness': sharpness, + } + + self.prob = prob + self.magnitude = ranges[op_name][magnitude_idx] + self.op = image_ops[op_name] + + def __call__(self, image): + if self.prob > random.random(): + image = self.op(image, self.magnitude) + return image + + +# PIL Image transforms +# https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.transform +def shear_x(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, factor, 0, 0, 1, 0), fillcolor=fillcolor) + + +def shear_y(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, factor, 1, 0), fillcolor=fillcolor) + + +def translate_x_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, pixels, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, pixels), fillcolor=fillcolor) + + +def translate_x_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, magnitude, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, magnitude), fillcolor=fillcolor) + + +def rotate(image, magnitude): + rot = image.convert("RGBA").rotate(magnitude) + return Image.composite(rot, + Image.new('RGBA', rot.size, (128,) * 4), + rot).convert(image.mode) + + +def auto_contrast(image, magnitude=None): + return ImageOps.autocontrast(image) + + +def invert(image, magnitude=None): + return ImageOps.invert(image) + + +def equalize(image, magnitude=None): + return ImageOps.equalize(image) + + +def solarize(image, magnitude): + return ImageOps.solarize(image, magnitude) + + +def posterize(image, magnitude): + return ImageOps.posterize(image, magnitude) + + +def contrast(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Contrast(image).enhance(1 + magnitude) + + +def color(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Color(image).enhance(1 + magnitude) + + 
+def brightness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Brightness(image).enhance(1 + magnitude) + + +def sharpness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Sharpness(image).enhance(1 + magnitude) + diff --git a/image_classification/XCiT/config.py b/image_classification/XCiT/config.py new file mode 100644 index 00000000..2be81d34 --- /dev/null +++ b/image_classification/XCiT/config.py @@ -0,0 +1,180 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Configuration + +Configuration for data, model archtecture, and training, etc. +Config can be set by .yaml file or by argparser(limited usage) + + +""" + +import os +from yacs.config import CfgNode as CN +import yaml + +_C = CN() +_C.BASE = [''] + +# data settings +_C.DATA = CN() +_C.DATA.BATCH_SIZE = 8 #1024 batch_size for single GPU +_C.DATA.BATCH_SIZE_EVAL = 8 #1024 batch_size for single GPU +_C.DATA.DATA_PATH = '/dataset/imagenet/' # path to dataset +_C.DATA.DATASET = 'imagenet2012' # dataset name +_C.DATA.IMAGE_SIZE = 224 # input image size +_C.DATA.CROP_PCT = 0.9 # input image scale ratio, scale is applied before centercrop in eval mode +_C.DATA.NUM_WORKERS = 2 # number of data loading threads +_C.DATA.IMAGENET_MEAN = [0.485, 0.456, 0.406] # [0.5, 0.5, 0.5] +_C.DATA.IMAGENET_STD = [0.229, 0.224, 0.225] # [0.5, 0.5, 0.5] + +# model settings +_C.MODEL = CN() +_C.MODEL.TYPE = 'XCiT' +_C.MODEL.NAME = 'XCiT' +_C.MODEL.RESUME = None +_C.MODEL.PRETRAINED = None +_C.MODEL.NUM_CLASSES = 1000 +_C.MODEL.DROPOUT = 0.0 +_C.MODEL.ATTENTION_DROPOUT = 0.0 +_C.MODEL.DROP_PATH = 0.1 + +# transformer settings +_C.MODEL.TRANS = CN() +_C.MODEL.TRANS.PATCH_SIZE = 16 +_C.MODEL.TRANS.EMBED_DIM = 128 +_C.MODEL.TRANS.DEPTH = 12 +_C.MODEL.TRANS.NUM_HEADS = 4 +_C.MODEL.TRANS.ETA = 1.0 +_C.MODEL.TRANS.TOKENS_NORM = False + +# training settings +_C.TRAIN = CN() +_C.TRAIN.LAST_EPOCH = 0 +_C.TRAIN.NUM_EPOCHS = 400 +_C.TRAIN.WARMUP_EPOCHS = 20 +_C.TRAIN.WEIGHT_DECAY = 0.05 +_C.TRAIN.BASE_LR = 0.001 +_C.TRAIN.WARMUP_START_LR = 0.0 +_C.TRAIN.END_LR = 0.0 +_C.TRAIN.GRAD_CLIP = 1.0 +_C.TRAIN.ACCUM_ITER = 1 +_C.TRAIN.LINEAR_SCALED_LR = None + +_C.TRAIN.LR_SCHEDULER = CN() +_C.TRAIN.LR_SCHEDULER.NAME = 'warmupcosine' +_C.TRAIN.LR_SCHEDULER.MILESTONES = "30, 60, 90" # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_EPOCHS = 30 # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_RATE = 0.1 # only used in StepLRScheduler + +_C.TRAIN.OPTIMIZER = CN() +_C.TRAIN.OPTIMIZER.NAME = 'AdamW' +_C.TRAIN.OPTIMIZER.EPS = 1e-8 +_C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999) # for adamW +_C.TRAIN.OPTIMIZER.MOMENTUM = 0.9 + +# train augmentation +_C.TRAIN.MIXUP_ALPHA = 0.8 +_C.TRAIN.CUTMIX_ALPHA = 1.0 +_C.TRAIN.CUTMIX_MINMAX = None +_C.TRAIN.MIXUP_PROB = 1.0 +_C.TRAIN.MIXUP_SWITCH_PROB = 0.5 +_C.TRAIN.MIXUP_MODE = 'batch' + +_C.TRAIN.SMOOTHING = 0.1 +_C.TRAIN.COLOR_JITTER = 0.4 # color jitter 
factor +_C.TRAIN.AUTO_AUGMENT = False #'rand-m9-mstd0.5-inc1' +_C.TRAIN.RAND_AUGMENT = True + +_C.TRAIN.RANDOM_ERASE_PROB = 0.25 # random erase prob +_C.TRAIN.RANDOM_ERASE_MODE = 'pixel' # random erase mode +_C.TRAIN.RANDOM_ERASE_COUNT = 1 # random erase count +_C.TRAIN.RANDOM_ERASE_SPLIT = False + + +# misc +_C.SAVE = "./output" +_C.TAG = "default" +_C.SAVE_FREQ = 1 # freq to save chpt +_C.REPORT_FREQ = 50 # freq to logging info +_C.VALIDATE_FREQ = 10 # freq to do validation +_C.SEED = 0 +_C.EVAL = False # run evaluation only +_C.AMP = False # mix precision training +_C.LOCAL_RANK = 0 +_C.NGPUS = -1 + + +def _update_config_from_file(config, cfg_file): + config.defrost() + with open(cfg_file, 'r') as infile: + yaml_cfg = yaml.load(infile, Loader=yaml.FullLoader) + for cfg in yaml_cfg.setdefault('BASE', ['']): + if cfg: + _update_config_from_file( + config, os.path.join(os.path.dirname(cfg_file), cfg) + ) + print('merging config from {}'.format(cfg_file)) + config.merge_from_file(cfg_file) + config.freeze() + +def update_config(config, args): + """Update config by ArgumentParser + Args: + args: ArgumentParser contains options + Return: + config: updated config + """ + if args.cfg: + _update_config_from_file(config, args.cfg) + config.defrost() + if args.dataset: + config.DATA.DATASET = args.dataset + if args.batch_size: + config.DATA.BATCH_SIZE = args.batch_size + if args.image_size: + config.DATA.IMAGE_SIZE = args.image_size + if args.num_classes: + config.MODEL.NUM_CLASSES = args.num_classes + if args.data_path: + config.DATA.DATA_PATH = args.data_path + if args.output is not None: + config.SAVE = args.output + if args.ngpus: + config.NGPUS = args.ngpus + if args.eval: + config.EVAL = True + config.DATA.BATCH_SIZE_EVAL = args.batch_size + if args.pretrained: + config.MODEL.PRETRAINED = args.pretrained + if args.resume: + config.MODEL.RESUME = args.resume + if args.last_epoch: + config.TRAIN.LAST_EPOCH = args.last_epoch + if args.amp: # only during training + if config.EVAL is True: + config.AMP = False + else: + config.AMP = True + + #config.freeze() + return config + + +def get_config(cfg_file=None): + """Return a clone of config or load from yaml file""" + config = _C.clone() + if cfg_file: + _update_config_from_file(config, cfg_file) + return config diff --git a/image_classification/XCiT/configs/xcit_large_24_p16_224.yaml b/image_classification/XCiT/configs/xcit_large_24_p16_224.yaml new file mode 100644 index 00000000..fb61d896 --- /dev/null +++ b/image_classification/XCiT/configs/xcit_large_24_p16_224.yaml @@ -0,0 +1,14 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 1.0 +MODEL: + TYPE: xcit + NAME: xcit_large_24_p16_224 + TRANS: + PATCH_SIZE: 16 + EMBED_DIM: 768 + DEPTH: 24 + NUM_HEADS: 16 + ETA: 1e-5 + TOKENS_NORM: True + diff --git a/image_classification/XCiT/configs/xcit_large_24_p16_384.yaml b/image_classification/XCiT/configs/xcit_large_24_p16_384.yaml new file mode 100644 index 00000000..9e10a99f --- /dev/null +++ b/image_classification/XCiT/configs/xcit_large_24_p16_384.yaml @@ -0,0 +1,14 @@ +DATA: + IMAGE_SIZE: 384 + CROP_PCT: 1.0 +MODEL: + TYPE: xcit + NAME: xcit_large_24_p16_224 + TRANS: + PATCH_SIZE: 16 + EMBED_DIM: 768 + DEPTH: 24 + NUM_HEADS: 16 + ETA: 1e-5 + TOKENS_NORM: True + diff --git a/image_classification/XCiT/configs/xcit_large_24_p8_224.yaml b/image_classification/XCiT/configs/xcit_large_24_p8_224.yaml new file mode 100644 index 00000000..7c19c13a --- /dev/null +++ b/image_classification/XCiT/configs/xcit_large_24_p8_224.yaml @@ -0,0 +1,14 @@ +DATA: + IMAGE_SIZE: 224 + 
CROP_PCT: 1.0 +MODEL: + TYPE: xcit + NAME: xcit_large_24_p16_224 + TRANS: + PATCH_SIZE: 8 + EMBED_DIM: 768 + DEPTH: 24 + NUM_HEADS: 16 + ETA: 1e-5 + TOKENS_NORM: True + diff --git a/image_classification/XCiT/configs/xcit_large_24_p8_384.yaml b/image_classification/XCiT/configs/xcit_large_24_p8_384.yaml new file mode 100644 index 00000000..b885d2f6 --- /dev/null +++ b/image_classification/XCiT/configs/xcit_large_24_p8_384.yaml @@ -0,0 +1,14 @@ +DATA: + IMAGE_SIZE: 384 + CROP_PCT: 1.0 +MODEL: + TYPE: xcit + NAME: xcit_large_24_p16_224 + TRANS: + PATCH_SIZE: 8 + EMBED_DIM: 768 + DEPTH: 24 + NUM_HEADS: 16 + ETA: 1e-5 + TOKENS_NORM: True + diff --git a/image_classification/XCiT/configs/xcit_medium_24_p16_224.yaml b/image_classification/XCiT/configs/xcit_medium_24_p16_224.yaml new file mode 100644 index 00000000..39c5aaf9 --- /dev/null +++ b/image_classification/XCiT/configs/xcit_medium_24_p16_224.yaml @@ -0,0 +1,14 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 1.0 +MODEL: + TYPE: xcit + NAME: xcit_medium_24_p16_224 + TRANS: + PATCH_SIZE: 16 + EMBED_DIM: 512 + DEPTH: 24 + NUM_HEADS: 8 + ETA: 1e-5 + TOKENS_NORM: True + diff --git a/image_classification/XCiT/configs/xcit_nano_12_p16_224.yaml b/image_classification/XCiT/configs/xcit_nano_12_p16_224.yaml new file mode 100644 index 00000000..1d8c0890 --- /dev/null +++ b/image_classification/XCiT/configs/xcit_nano_12_p16_224.yaml @@ -0,0 +1,14 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 1.0 +MODEL: + TYPE: xcit + NAME: xcit_nano_12_p16_224 + TRANS: + PATCH_SIZE: 16 + EMBED_DIM: 128 + DEPTH: 12 + NUM_HEADS: 4 + ETA: 1.0 + TOKENS_NORM: False + diff --git a/image_classification/XCiT/configs/xcit_nano_12_p16_384.yaml b/image_classification/XCiT/configs/xcit_nano_12_p16_384.yaml new file mode 100644 index 00000000..24b80169 --- /dev/null +++ b/image_classification/XCiT/configs/xcit_nano_12_p16_384.yaml @@ -0,0 +1,14 @@ +DATA: + IMAGE_SIZE: 384 + CROP_PCT: 1.0 +MODEL: + TYPE: xcit + NAME: xcit_nano_12_p16_224 + TRANS: + PATCH_SIZE: 16 + EMBED_DIM: 128 + DEPTH: 12 + NUM_HEADS: 4 + ETA: 1.0 + TOKENS_NORM: False + diff --git a/image_classification/XCiT/configs/xcit_nano_12_p8_224.yaml b/image_classification/XCiT/configs/xcit_nano_12_p8_224.yaml new file mode 100644 index 00000000..8bb77071 --- /dev/null +++ b/image_classification/XCiT/configs/xcit_nano_12_p8_224.yaml @@ -0,0 +1,14 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 1.0 +MODEL: + TYPE: xcit + NAME: xcit_nano_12_p16_224 + TRANS: + PATCH_SIZE: 8 + EMBED_DIM: 128 + DEPTH: 12 + NUM_HEADS: 4 + ETA: 1.0 + TOKENS_NORM: False + diff --git a/image_classification/XCiT/configs/xcit_nano_12_p8_384.yaml b/image_classification/XCiT/configs/xcit_nano_12_p8_384.yaml new file mode 100644 index 00000000..ce70fc69 --- /dev/null +++ b/image_classification/XCiT/configs/xcit_nano_12_p8_384.yaml @@ -0,0 +1,14 @@ +DATA: + IMAGE_SIZE: 384 + CROP_PCT: 1.0 +MODEL: + TYPE: xcit + NAME: xcit_nano_12_p16_224 + TRANS: + PATCH_SIZE: 8 + EMBED_DIM: 128 + DEPTH: 12 + NUM_HEADS: 4 + ETA: 1.0 + TOKENS_NORM: False + diff --git a/image_classification/XCiT/configs/xcit_small_12_p16_224.yaml b/image_classification/XCiT/configs/xcit_small_12_p16_224.yaml new file mode 100644 index 00000000..8e5f59b0 --- /dev/null +++ b/image_classification/XCiT/configs/xcit_small_12_p16_224.yaml @@ -0,0 +1,14 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 1.0 +MODEL: + TYPE: xcit + NAME: xcit_small_12_p16_224 + TRANS: + PATCH_SIZE: 16 + EMBED_DIM: 384 + DEPTH: 12 + NUM_HEADS: 8 + ETA: 1.0 + TOKENS_NORM: True + diff --git 
a/image_classification/XCiT/configs/xcit_small_24_p16_224.yaml b/image_classification/XCiT/configs/xcit_small_24_p16_224.yaml new file mode 100644 index 00000000..92ed1017 --- /dev/null +++ b/image_classification/XCiT/configs/xcit_small_24_p16_224.yaml @@ -0,0 +1,14 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 1.0 +MODEL: + TYPE: xcit + NAME: xcit_small_24_p16_224 + TRANS: + PATCH_SIZE: 16 + EMBED_DIM: 384 + DEPTH: 24 + NUM_HEADS: 8 + ETA: 1e-5 + TOKENS_NORM: True + diff --git a/image_classification/XCiT/configs/xcit_tiny_12_p16_224.yaml b/image_classification/XCiT/configs/xcit_tiny_12_p16_224.yaml new file mode 100644 index 00000000..e66b3799 --- /dev/null +++ b/image_classification/XCiT/configs/xcit_tiny_12_p16_224.yaml @@ -0,0 +1,14 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 1.0 +MODEL: + TYPE: xcit + NAME: xcit_tiny_12_p16_224 + TRANS: + PATCH_SIZE: 16 + EMBED_DIM: 192 + DEPTH: 12 + NUM_HEADS: 4 + ETA: 1.0 + TOKENS_NORM: True + diff --git a/image_classification/XCiT/configs/xcit_tiny_24_p16_224.yaml b/image_classification/XCiT/configs/xcit_tiny_24_p16_224.yaml new file mode 100644 index 00000000..e94c8d2e --- /dev/null +++ b/image_classification/XCiT/configs/xcit_tiny_24_p16_224.yaml @@ -0,0 +1,14 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 1.0 +MODEL: + TYPE: xcit + NAME: xcit_tiny_24_p16_224 + TRANS: + PATCH_SIZE: 16 + EMBED_DIM: 192 + DEPTH: 24 + NUM_HEADS: 4 + ETA: 1e-5 + TOKENS_NORM: True + diff --git a/image_classification/XCiT/datasets.py b/image_classification/XCiT/datasets.py new file mode 100644 index 00000000..052de4ef --- /dev/null +++ b/image_classification/XCiT/datasets.py @@ -0,0 +1,220 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Dataset related classes and methods for ViT training and validation +Cifar10, Cifar100 and ImageNet2012 are supported +""" + +import os +import math +from PIL import Image +from paddle.io import Dataset +from paddle.io import DataLoader +from paddle.io import DistributedBatchSampler +from paddle.vision import transforms +from paddle.vision import datasets +from paddle.vision import image_load +from augment import auto_augment_policy_original +from augment import AutoAugment +from augment import rand_augment_policy_original +from augment import RandAugment +from random_erasing import RandomErasing + + +class ImageNet2012Dataset(Dataset): + """Build ImageNet2012 dataset + + This class gets train/val imagenet datasets, which loads transfomed data and labels. 
+ + Attributes: + file_folder: path where imagenet images are stored + transform: preprocessing ops to apply on image + img_path_list: list of full path of images in whole dataset + label_list: list of labels of whole dataset + """ + + def __init__(self, file_folder, mode="train", transform=None): + """Init ImageNet2012 Dataset with dataset file path, mode(train/val), and transform""" + super(ImageNet2012Dataset, self).__init__() + assert mode in ["train", "val"] + self.file_folder = file_folder + self.transform = transform + self.img_path_list = [] + self.label_list = [] + + if mode == "train": + self.list_file = os.path.join(self.file_folder, "train_list.txt") + else: + self.list_file = os.path.join(self.file_folder, "val_list.txt") + + with open(self.list_file, 'r') as infile: + for line in infile: + img_path = line.strip().split()[0] + img_label = int(line.strip().split()[1]) + self.img_path_list.append(os.path.join(self.file_folder, img_path)) + self.label_list.append(img_label) + print(f'----- Imagenet2012 image {mode} list len = {len(self.label_list)}') + + def __len__(self): + return len(self.label_list) + + def __getitem__(self, index): + data = Image.open(self.img_path_list[index]).convert('RGB') + data = self.transform(data) + label = self.label_list[index] + + return data, label + + +def get_train_transforms(config): + """ Get training transforms + + For training, a RandomResizedCrop is applied, then normalization is applied with + [0.5, 0.5, 0.5] mean and std. The input pixel values must be rescaled to [0, 1.] + Outputs is converted to tensor + + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + aug_op_list = [] + # random crop and resize + aug_op_list.append( + transforms.RandomResizedCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE), + scale=(0.05, 1.0))) + # auto_augment / color jitter + if config.TRAIN.AUTO_AUGMENT: + policy = auto_augment_policy_original() + auto_augment = AutoAugment(policy) + aug_op_list.append(auto_augment) + elif config.TRAIN.RAND_AUGMENT: + policy = rand_augment_policy_original() + rand_augment = RandAugment(policy) + aug_op_list.append(rand_augment) + else: + jitter = (float(config.TRAIN.COLOR_JITTER),) * 3 + aug_op_list.append(transforms.ColorJitter(*jitter)) + # other ops + aug_op_list.append(transforms.ToTensor()) + aug_op_list.append(transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, + std=config.DATA.IMAGENET_STD)) + # random erasing + if config.TRAIN.RANDOM_ERASE_PROB > 0.: + random_erasing = RandomErasing(prob=config.TRAIN.RANDOM_ERASE_PROB, + mode=config.TRAIN.RANDOM_ERASE_MODE, + max_count=config.TRAIN.RANDOM_ERASE_COUNT, + num_splits=config.TRAIN.RANDOM_ERASE_SPLIT) + aug_op_list.append(random_erasing) + + transforms_train = transforms.Compose(aug_op_list) + return transforms_train + + +def get_val_transforms(config): + """ Get training transforms + + For validation, image is first Resize then CenterCrop to image_size. + Then normalization is applied with [0.5, 0.5, 0.5] mean and std. + The input pixel values must be rescaled to [0, 1.] 
+ Outputs is converted to tensor + + Args: + config: configs contains IMAGE_SIZE, see config.py for details + Returns: + transforms_train: training transforms + """ + + scale_size = int(math.floor(config.DATA.IMAGE_SIZE / config.DATA.CROP_PCT)) + transforms_val = transforms.Compose([ + transforms.Resize(scale_size, interpolation='bicubic'), + transforms.CenterCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE)), + transforms.ToTensor(), + transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD), + ]) + return transforms_val + + +def get_dataset(config, mode='train'): + """ Get dataset from config and mode (train/val) + + Returns the related dataset object according to configs and mode(train/val) + + Args: + config: configs contains dataset related settings. see config.py for details + Returns: + dataset: dataset object + """ + + assert mode in ['train', 'val'] + if config.DATA.DATASET == "cifar10": + if mode == 'train': + dataset = datasets.Cifar10(mode=mode, transform=get_train_transforms(config)) + else: + mode = 'test' + dataset = datasets.Cifar10(mode=mode, transform=get_val_transforms(config)) + elif config.DATA.DATASET == "cifar100": + if mode == 'train': + dataset = datasets.Cifar100(mode=mode, transform=get_train_transforms(config)) + else: + mode = 'test' + dataset = datasets.Cifar100(mode=mode, transform=get_val_transforms(config)) + elif config.DATA.DATASET == "imagenet2012": + if mode == 'train': + dataset = ImageNet2012Dataset(config.DATA.DATA_PATH, + mode=mode, + transform=get_train_transforms(config)) + else: + dataset = ImageNet2012Dataset(config.DATA.DATA_PATH, + mode=mode, + transform=get_val_transforms(config)) + else: + raise NotImplementedError( + "[{config.DATA.DATASET}] Only cifar10, cifar100, imagenet2012 are supported now") + return dataset + + +def get_dataloader(config, dataset, mode='train', multi_process=False): + """Get dataloader with config, dataset, mode as input, allows multiGPU settings. + + Multi-GPU loader is implements as distributedBatchSampler. + + Args: + config: see config.py for details + dataset: paddle.io.dataset object + mode: train/val + multi_process: if True, use DistributedBatchSampler to support multi-processing + Returns: + dataloader: paddle.io.DataLoader object. + """ + + if mode == 'train': + batch_size = config.DATA.BATCH_SIZE + else: + batch_size = config.DATA.BATCH_SIZE_EVAL + + if multi_process is True: + sampler = DistributedBatchSampler(dataset, + batch_size=batch_size, + shuffle=(mode == 'train')) + dataloader = DataLoader(dataset, + batch_sampler=sampler, + num_workers=config.DATA.NUM_WORKERS) + else: + dataloader = DataLoader(dataset, + batch_size=batch_size, + num_workers=config.DATA.NUM_WORKERS, + shuffle=(mode == 'train')) + return dataloader diff --git a/image_classification/SwinTransformer/drop.py b/image_classification/XCiT/drop.py similarity index 100% rename from image_classification/SwinTransformer/drop.py rename to image_classification/XCiT/drop.py diff --git a/image_classification/XCiT/losses.py b/image_classification/XCiT/losses.py new file mode 100644 index 00000000..f67780a2 --- /dev/null +++ b/image_classification/XCiT/losses.py @@ -0,0 +1,121 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" Implement Loss functions """ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class LabelSmoothingCrossEntropyLoss(nn.Layer): + """ cross entropy loss for label smoothing + Args: + smoothing: float, smoothing rate + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, target label with shape [N] + Return: + loss: float, cross entropy loss value + """ + def __init__(self, smoothing=0.1): + super().__init__() + assert 0 <= smoothing < 1.0 + self.smoothing = smoothing + self.confidence = 1 - smoothing + + def forward(self, x, target): + log_probs = F.log_softmax(x) # [N, num_classes] + # target_index is used to get prob for each of the N samples + target_index = paddle.zeros([x.shape[0], 2], dtype='int64') # [N, 2] + target_index[:, 0] = paddle.arange(x.shape[0]) + target_index[:, 1] = target + + nll_loss = -log_probs.gather_nd(index=target_index) # index: [N] + smooth_loss = -log_probs.mean(axis=-1) + loss = self.confidence * nll_loss + self.smoothing * smooth_loss + return loss.mean() + + +class SoftTargetCrossEntropyLoss(nn.Layer): + """ cross entropy loss for soft target + Args: + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, soft target with shape [N, num_classes] + Returns: + loss: float, the mean loss value + """ + def __init__(self): + super().__init__() + + def forward(self, x, target): + loss = paddle.sum(-target * F.log_softmax(x, axis=-1), axis=-1) + return loss.mean() + + +class DistillationLoss(nn.Layer): + """Distillation loss function + This layer includes the orginal loss (criterion) and a extra + distillation loss (criterion), which computes the loss with + different type options, between current model and + a teacher model as its supervision. 
+ + Args: + base_criterion: nn.Layer, the original criterion + teacher_model: nn.Layer, the teacher model as supervision + distillation_type: str, one of ['none', 'soft', 'hard'] + alpha: float, ratio of base loss (* (1-alpha)) + and distillation loss( * alpha) + tao: float, temperature in distillation + """ + def __init__(self, + base_criterion, + teacher_model, + distillation_type, + alpha, + tau): + super().__init__() + assert distillation_type in ['none', 'soft', 'hard'] + self.base_criterion = base_criterion + self.teacher_model = teacher_model + self.type = distillation_type + self.alpha = alpha + self.tau = tau + + def forward(self, inputs, outputs, targets): + """ + Args: + inputs: tensor, the orginal model inputs + outputs: tensor, the outputs of the model + outputds_kd: tensor, the distillation outputs of the model, + this is usually obtained by a separate branch + in the last layer of the model + targets: tensor, the labels for the base criterion + """ + outputs, outputs_kd = outputs[0], outputs[1] + base_loss = self.base_criterion(outputs, targets) + if self.type == 'none': + return base_loss + + with paddle.no_grad(): + teacher_outputs = self.teacher_model(inputs) + + if self.type == 'soft': + distillation_loss = F.kl_div( + F.log_softmax(outputs_kd / self.tau, axis=1), + F.log_softmax(teacher_outputs / self.tau, axis=1), + reduction='sum') * (self.tau * self.tau) / outputs_kd.numel() + elif self.type == 'hard': + distillation_loss = F.cross_entropy(outputs_kd, teacher_outputs.argmax(axis=1)) + + loss = base_loss * (1 - self.alpha) + distillation_loss * self.alpha + return loss diff --git a/image_classification/XCiT/main_multi_gpu.py b/image_classification/XCiT/main_multi_gpu.py new file mode 100644 index 00000000..7e27131f --- /dev/null +++ b/image_classification/XCiT/main_multi_gpu.py @@ -0,0 +1,584 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +"""XCiT training/validation using multiple GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +import paddle.distributed as dist +from datasets import get_dataloader +from datasets import get_dataset +from utils import AverageMeter +from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn +from config import get_config +from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from xcit import build_xcit as build_model + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('XCiT') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + mixup_fn=None, + amp=False, + local_logger=None, + master_logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + master_train_loss_meter.avg: float, average loss on all processes/gpus + master_train_acc_meter.avg: float, average top1 accuracy on all processes/gpus + train_time: float, training time + """ + model.train() + train_loss_meter = 
AverageMeter() + train_acc_meter = AverageMeter() + master_train_loss_meter = AverageMeter() + master_train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() + + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = paddle.to_tensor(image.shape[0]) + + # sync from other gpus for overall loss and acc + master_loss = loss.clone() + master_acc = acc.clone() + master_batch_size = batch_size.clone() + dist.all_reduce(master_loss) + dist.all_reduce(master_acc) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc = master_acc / dist.get_world_size() + master_train_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_train_acc_meter.update(master_acc.numpy()[0], master_batch_size.numpy()[0]) + + train_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + train_acc_meter.update(acc.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_train_loss_meter.avg:.4f}, " + + f"Avg Acc: {master_train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return (train_loss_meter.avg, + train_acc_meter.avg, + master_train_loss_meter.avg, + master_train_acc_meter.avg, + train_time) + + +def validate(dataloader, + model, + criterion, + total_batch, + debug_steps=100, + local_logger=None, + master_logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + total_epoch: int, total num of epoch, for logging + debug_steps: int, num of iters to log info, default: 100 + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None + Returns: + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + master_val_loss_meter.avg: float, average loss on all processes/gpus + 
master_val_acc1_meter.avg: float, average top1 accuracy on all processes/gpus + master_val_acc5_meter.avg: float, average top5 accuracy on all processes/gpus + val_time: float, validation time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = AverageMeter() + master_val_loss_meter = AverageMeter() + master_val_acc1_meter = AverageMeter() + master_val_acc5_meter = AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = paddle.to_tensor(image.shape[0]) + + master_loss = loss.clone() + master_acc1 = acc1.clone() + master_acc5 = acc5.clone() + master_batch_size = batch_size.clone() + + dist.all_reduce(master_loss) + dist.all_reduce(master_acc1) + dist.all_reduce(master_acc5) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc1 = master_acc1 / dist.get_world_size() + master_acc5 = master_acc5 / dist.get_world_size() + + master_val_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc1_meter.update(master_acc1.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc5_meter.update(master_acc5.numpy()[0], master_batch_size.numpy()[0]) + + val_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + val_acc1_meter.update(acc1.numpy()[0], batch_size.numpy()[0]) + val_acc5_meter.update(acc5.numpy()[0], batch_size.numpy()[0]) + + if batch_id % debug_steps == 0: + if local_logger: + local_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {master_val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {master_val_acc5_meter.avg:.4f}") + val_time = time.time() - time_st + return (val_loss_meter.avg, + val_acc1_meter.avg, + val_acc5_meter.avg, + master_val_loss_meter.avg, + master_val_acc1_meter.avg, + master_val_acc5_meter.avg, + val_time) + + +def main_worker(*args): + # STEP 0: Preparation + config = args[0] + dist.init_parallel_env() + last_epoch = config.TRAIN.LAST_EPOCH + world_size = dist.get_world_size() + local_rank = dist.get_rank() + seed = config.SEED + local_rank + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + # logger for each process/gpu + local_logger = get_logger( + filename=os.path.join(config.SAVE, 'log_{}.txt'.format(local_rank)), + logger_name='local_logger') + # overall logger + if local_rank == 0: + master_logger = get_logger( + filename=os.path.join(config.SAVE, 'log.txt'), + logger_name='master_logger') + master_logger.info(f'\n{config}') + else: + master_logger = None + local_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + if local_rank == 0: + master_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + + # STEP 1: Create model + model = build_model(config) + model = paddle.DataParallel(model) + + # STEP 2: Create train and val dataloader + dataset_train, dataset_val = args[1], args[2] + # Create training dataloader + if not config.EVAL: + 
dataloader_train = get_dataloader(config, dataset_train, 'train', True) + total_batch_train = len(dataloader_train) + local_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + if local_rank == 0: + master_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + # Create validation dataloader + dataloader_val = get_dataloader(config, dataset_val, 'test', True) + total_batch_val = len(dataloader_val) + local_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + if local_rank == 0: + master_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH, + ) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + local_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise 
NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip, + apply_decay_param_fun=get_exclude_from_weight_decay_fn([ + 'absolute_pos_embed', 'relative_position_bias_table']), + ) + else: + local_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 6: Load pretrained model / load resumt model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + model.set_dict(model_state) + local_logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + if local_rank == 0: + master_logger.info( + f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') + optimizer.set_state_dict(opt_state) + local_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + if local_rank == 0: + master_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + + # STEP 7: Validation (eval mode) + if config.EVAL: + local_logger.info('----- Start Validating') + if local_rank == 0: + master_logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 8: Start training and validation (train mode) + local_logger.info(f"Start training from epoch {last_epoch+1}.") + if local_rank == 0: + 
master_logger.info(f"Start training from epoch {last_epoch+1}.") + for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + local_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + if local_rank == 0: + master_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + train_loss, train_acc, avg_loss, avg_acc, train_time = train( + dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=total_batch_train, + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + mixup_fn=mixup_fn, + amp=config.AMP, + local_logger=local_logger, + master_logger=master_logger) + + scheduler.step() + + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {avg_loss:.4f}, " + + f"Train Acc: {avg_acc:.4f}, " + + f"time: {train_time:.2f}") + + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + local_logger.info(f'----- Validation after Epoch: {epoch}') + if local_rank == 0: + master_logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if local_rank == 0: + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + master_logger.info(f"----- Save model: {model_path}.pdparams") + master_logger.info(f"----- Save optim: {model_path}.pdopt") + + +def main(): + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + + # get dataset and start DDP + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + else: + dataset_train = None + dataset_val = get_dataset(config, mode='val') + config.NGPUS = len(paddle.static.cuda_places()) if config.NGPUS == -1 else config.NGPUS + dist.spawn(main_worker, args=(config, dataset_train, dataset_val, ), nprocs=config.NGPUS) + + +if __name__ == "__main__": + main() diff --git 
a/image_classification/XCiT/main_single_gpu.py b/image_classification/XCiT/main_single_gpu.py new file mode 100644 index 00000000..625ed202 --- /dev/null +++ b/image_classification/XCiT/main_single_gpu.py @@ -0,0 +1,426 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""XCiT training/validation using single GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from datasets import get_dataloader +from datasets import get_dataset +from utils import AverageMeter +from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn +from config import get_config +from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from xcit import build_xcit as build_model + + +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('XCiT') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger + + +def train(dataloader, + model, + criterion, + optimizer, + epoch, + total_epochs, + total_batch, + debug_steps=100, + accum_iter=1, + mixup_fn=None, + amp=False, + logger=None): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + epoch: int, current epoch + total_epochs: int, total num of epochs + total_batch: int, total num of 
batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + logger: logger for logging, default: None + Returns: + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + train_time: float, training time + """ + model.train() + train_loss_meter = AverageMeter() + train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) + time_st = time.time() + + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() + + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) + + batch_size = image.shape[0] + train_loss_meter.update(loss.numpy()[0], batch_size) + train_acc_meter.update(acc.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + + train_time = time.time() - time_st + return train_loss_meter.avg, train_acc_meter.avg, train_time + + +def validate(dataloader, model, criterion, total_batch, debug_steps=100, logger=None): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: nn.criterion + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + logger: logger for logging, default: None + Returns: + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + val_time: float, valitaion time + """ + model.eval() + val_loss_meter = AverageMeter() + val_acc1_meter = AverageMeter() + val_acc5_meter = AverageMeter() + time_st = time.time() + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + image = data[0] + label = data[1] + + output = model(image) + loss = criterion(output, label) + + pred = F.softmax(output) + acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) + acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) + + batch_size = image.shape[0] + val_loss_meter.update(loss.numpy()[0], batch_size) + val_acc1_meter.update(acc1.numpy()[0], batch_size) + 
val_acc5_meter.update(acc5.numpy()[0], batch_size) + + if logger and batch_id % debug_steps == 0: + logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + + val_time = time.time() - time_st + return val_loss_meter.avg, val_acc1_meter.avg, val_acc5_meter.avg, val_time + + +def main(): + # STEP 0: Preparation + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + last_epoch = config.TRAIN.LAST_EPOCH + seed = config.SEED + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + logger = get_logger(filename=os.path.join(config.SAVE, 'log.txt')) + logger.info(f'\n{config}') + + # STEP 1: Create model + model = build_model(config) + + # STEP 2: Create train and val dataloader + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + dataloader_train = get_dataloader(config, dataset_train, 'train', False) + dataset_val = get_dataset(config, mode='val') + dataloader_val = get_dataloader(config, dataset_val, 'val', False) + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + 
warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip, + apply_decay_param_fun=get_exclude_from_weight_decay_fn([ + 'absolute_pos_embed', 'relative_position_bias_table']), + ) + else: + logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # STEP 6: Load pretrained model or load resume model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + model.set_dict(model_state) + logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') + optimizer.set_state_dict(opt_state) + logger.info( + f"----- Resume: Load model and optmizer from {config.MODEL.RESUME}") + + # STEP 7: Validation (eval mode) + if config.EVAL: + logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + return + + # STEP 8: Start training and validation (train mode) + logger.info(f"Start training from epoch {last_epoch+1}.") + for epoch in range(last_epoch+1, 
config.TRAIN.NUM_EPOCHS+1): + # train + logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + train_loss, train_acc, train_time = train(dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=len(dataloader_train), + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + mixup_fn=mixup_fn, + amp=config.AMP, + logger=logger) + scheduler.step() + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, val_time = validate( + dataloader=dataloader_val, + model=model, + criterion=criterion_val, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ, + logger=logger) + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + # model save + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + logger.info(f"----- Save model: {model_path}.pdparams") + logger.info(f"----- Save optim: {model_path}.pdopt") + + +if __name__ == "__main__": + main() diff --git a/image_classification/XCiT/mixup.py b/image_classification/XCiT/mixup.py new file mode 100644 index 00000000..1d2db493 --- /dev/null +++ b/image_classification/XCiT/mixup.py @@ -0,0 +1,225 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""mixup and cutmix for batch data""" +import numpy as np +import paddle + + +def rand_bbox(image_shape, lam, count=None): + """ CutMix bbox by lam value + Generate 1 random bbox by value lam. lam is the cut size rate. + The cut_size is computed by sqrt(1-lam) * image_size. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + count: int, number of bbox to generate + """ + image_h, image_w = image_shape[-2:] + cut_rate = np.sqrt(1. 
- lam) + cut_h = int(cut_rate * image_h) + cut_w = int(cut_rate * image_w) + + # get random bbox center + cy = np.random.randint(0, image_h, size=count) + cx = np.random.randint(0, image_w, size=count) + + # get bbox coords + bbox_x1 = np.clip(cx - cut_w // 2, 0, image_w) + bbox_y1 = np.clip(cy - cut_h // 2, 0, image_h) + bbox_x2 = np.clip(cx + cut_w // 2, 0, image_w) + bbox_y2 = np.clip(cy + cut_h // 2, 0, image_h) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise a ValueError, + # while in pytorch, it will return [] tensor + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def rand_bbox_minmax(image_shape, minmax, count=None): + """ CutMix bbox by min and max value + Generate 1 random bbox by min and max percentage values. + Minmax is a tuple/list of min and max percentage values + applied to the image width and height. + + Args: + image_shape: tuple/list, image height and width + minmax: tuple/list, min and max percentage values of image size + count: int, number of bbox to generate + """ + assert len(minmax) == 2 + image_h, image_w = image_shape[-2:] + min_ratio = minmax[0] + max_ratio = minmax[1] + cut_h = np.random.randint(int(image_h * min_ratio), int(image_h * max_ratio), size=count) + cut_w = np.random.randint(int(image_w * min_ratio), int(image_w * max_ratio), size=count) + + bbox_x1 = np.random.randint(0, image_w - cut_w, size=count) + bbox_y1 = np.random.randint(0, image_h - cut_h, size=count) + bbox_x2 = bbox_x1 + cut_w + bbox_y2 = bbox_y1 + cut_h + + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def cutmix_generate_bbox_adjust_lam(image_shape, lam, minmax=None, correct_lam=True, count=None): + """Generate bbox and apply correction for lambda + If the minmax is None, apply the standard cutmix by lam value, + If the minmax is set, apply the cutmix by min and max percentage values. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + minmax: tuple/list, min and max percentage values of image size + correct_lam: bool, if True, correct the lam value by the generated bbox + count: int, number of bbox to generate + """ + if minmax is not None: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox_minmax(image_shape, minmax, count) + else: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox(image_shape, lam, count) + + if correct_lam or minmax is not None: + image_h, image_w = image_shape[-2:] + bbox_area = (bbox_y2 - bbox_y1) * (bbox_x2 - bbox_x1) + lam = 1. - bbox_area / float(image_h * image_w) + return (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam + + +def one_hot(x, num_classes, on_value=1., off_value=0.): + """ Generate one-hot vector for label smoothing + Args: + x: tensor, contains label/class indices + num_classes: int, num of classes (len of the one-hot vector) + on_value: float, the vector value at label index, default=1. + off_value: float, the vector value at non-label indices, default=0. + Returns: + one_hot: tensor, tensor with on value at label index and off value + at non-label indices. + """ + x = x.reshape_([-1, 1]) + x_smoothed = paddle.full((x.shape[0], num_classes), fill_value=off_value) + for i in range(x.shape[0]): + x_smoothed[i, x[i]] = on_value + return x_smoothed + + +def mixup_one_hot(label, num_classes, lam=1., smoothing=0.): + """ mixup and label smoothing in batch + label smoothing is applied first, then + mixup is applied by mixing the batch and its flip, + with a mixup rate.
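+ For example, with lam=0.7 and smoothing=0. the returned target is 0.7 * one_hot(label) + 0.3 * one_hot(label.flip(0)) for each sample in the batch.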
+ + Args: + label: tensor, label tensor with shape [N], contains the class indices + num_classes: int, num of all classes + lam: float, mixup rate, default=1.0 + smoothing: float, label smoothing rate + """ + off_value = smoothing / num_classes + on_value = 1. - smoothing + off_value + y1 = one_hot(label, num_classes, on_value, off_value) + y2 = one_hot(label.flip(axis=[0]), num_classes, on_value, off_value) + return y2 * (1 - lam) + y1 * lam + + +class Mixup: + """Mixup class + Args: + mixup_alpha: float, mixup alpha for beta distribution, default=1.0, + cutmix_alpha: float, cutmix alpha for beta distribution, default=0.0, + cutmix_minmax: list/tuple, min and max value for cutmix ratio, default=None, + prob: float, if random prob < prob, do not use mixup, default=1.0, + switch_prob: float, prob of switching mixup and cutmix, default=0.5, + mode: string, mixup up, now only 'batch' is supported, default='batch', + correct_lam: bool, if True, apply correction of lam, default=True, + label_smoothing: float, label smoothing rate, default=0.1, + num_classes: int, num of classes, default=1000 + """ + def __init__(self, + mixup_alpha=1.0, + cutmix_alpha=0.0, + cutmix_minmax=None, + prob=1.0, + switch_prob=0.5, + mode='batch', + correct_lam=True, + label_smoothing=0.1, + num_classes=1000): + self.mixup_alpha = mixup_alpha + self.cutmix_alpha = cutmix_alpha + self.cutmix_minmax = cutmix_minmax + if cutmix_minmax is not None: + assert len(cutmix_minmax) == 2 + self.cutmix_alpha = 1.0 + self.mix_prob = prob + self.switch_prob = switch_prob + self.label_smoothing = label_smoothing + self.num_classes = num_classes + self.mode = mode + self.correct_lam = correct_lam + assert mode == 'batch', 'Now only batch mode is supported!' + + def __call__(self, x, target): + assert x.shape[0] % 2 == 0, "Batch size should be even" + lam = self._mix_batch(x) + target = mixup_one_hot(target, self.num_classes, lam, self.label_smoothing) + return x, target + + def get_params(self): + """Decide to use cutmix or regular mixup by sampling and + sample lambda for mixup + """ + lam = 1. + use_cutmix = False + use_mixup = np.random.rand() < self.mix_prob + if use_mixup: + if self.mixup_alpha > 0. and self.cutmix_alpha > 0.: + use_cutmix = np.random.rand() < self.switch_prob + alpha = self.cutmix_alpha if use_cutmix else self.mixup_alpha + lam_mix = np.random.beta(alpha, alpha) + elif self.mixup_alpha == 0. and self.cutmix_alpha > 0.: + use_cutmix=True + lam_mix = np.random.beta(self.cutmix_alpha, self.cutmix_alpha) + elif self.mixup_alpha > 0. 
and self.cutmix_alpha == 0.: + lam_mix = np.random.beta(self.mixup_alpha, self.mixup_alpha) + else: + raise ValueError('mixup_alpha and cutmix_alpha cannot both be 0') + lam = float(lam_mix) + return lam, use_cutmix + + def _mix_batch(self, x): + """mixup/cutmix by adding batch data and its flipped version""" + lam, use_cutmix = self.get_params() + if lam == 1.: + return lam + if use_cutmix: + (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam = cutmix_generate_bbox_adjust_lam( + x.shape, + lam, + minmax=self.cutmix_minmax, + correct_lam=self.correct_lam) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise a ValueError, + # but in pytorch, it will return [] tensor without errors + if int(bbox_x1) != int(bbox_x2) and int(bbox_y1) != int(bbox_y2): + x[:, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] = x.flip(axis=[0])[ + :, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] + else: + x_flipped = x.flip(axis=[0]) + x_flipped = x_flipped * (1 - lam) + x.set_value(x * (lam) + x_flipped) + return lam diff --git a/image_classification/XCiT/random_erasing.py b/image_classification/XCiT/random_erasing.py new file mode 100644 index 00000000..80d31dd8 --- /dev/null +++ b/image_classification/XCiT/random_erasing.py @@ -0,0 +1,118 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Random Erasing for image tensor""" + +import random +import math +import paddle + + +def _get_pixels(per_pixel, rand_color, patch_size, dtype="float32"): + if per_pixel: + return paddle.normal(shape=patch_size).astype(dtype) + if rand_color: + return paddle.normal(shape=(patch_size[0], 1, 1)).astype(dtype) + return paddle.zeros((patch_size[0], 1, 1)).astype(dtype) + + +class RandomErasing(object): + """ + Args: + prob: probability of performing random erasing + min_area: Minimum percentage of erased area wrt input image area + max_area: Maximum percentage of erased area wrt input image area + min_aspect: Minimum aspect ratio of erased area + max_aspect: Maximum aspect ratio of erased area + mode: pixel color mode, in ['const', 'rand', 'pixel'] + 'const' - erase block is constant valued 0 for all channels + 'rand' - erase block is valued random color (same per-channel) + 'pixel' - erase block is valued random color per pixel + min_count: Minimum # of erasing blocks per image. + max_count: Maximum # of erasing blocks per image.
Area per box is scaled by count + per-image count is randomly chosen between min_count to max_count + """ + def __init__(self, prob=0.5, min_area=0.02, max_area=1/3, min_aspect=0.3, max_aspect=None, + mode='const', min_count=1, max_count=None, num_splits=0): + self.prob = prob + self.min_area = min_area + self.max_area = max_area + max_aspect = max_aspect or 1 / min_aspect + self.log_aspect_ratio = (math.log(min_aspect), math.log(max_aspect)) + self.min_count = min_count + self.max_count = max_count or min_count + self.num_splits = num_splits + mode = mode.lower() + self.rand_color = False + self.per_pixel = False + if mode == "rand": + self.rand_color = True + elif mode == "pixel": + self.per_pixel = True + else: + assert not mode or mode == "const" + + def _erase(self, img, chan, img_h, img_w, dtype): + if random.random() > self.prob: + return + area = img_h * img_w + count = self.min_count if self.min_count == self.max_count else \ + random.randint(self.min_count, self.max_count) + for _ in range(count): + for attempt in range(10): + target_area = random.uniform(self.min_area, self.max_area) * area / count + aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio)) + h = int(round(math.sqrt(target_area * aspect_ratio))) + w = int(round(math.sqrt(target_area / aspect_ratio))) + if w < img_w and h < img_h: + top = random.randint(0, img_h - h) + left = random.randint(0, img_w - w) + img[:, top:top+h, left:left+w] = _get_pixels( + self.per_pixel, self.rand_color, (chan, h, w), + dtype=dtype) + break + + def __call__(self, inputs): + if len(inputs.shape) == 3: + self._erase(inputs, *inputs.shape, inputs.dtype) + else: + batch_size, chan, img_h, img_w = inputs.shape + batch_start = batch_size // self.num_splits if self.num_splits > 1 else 0 + for i in range(batch_start, batch_size): + self._erase(inputs[i], chan, img_h, img_w, inputs.dtype) + return inputs + + + +#def main(): +# re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='rand') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='const') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='pixel') +# import PIL.Image as Image +# import numpy as np +# paddle.set_device('cpu') +# img = paddle.to_tensor(np.asarray(Image.open('./lenna.png'))).astype('float32') +# img = img / 255.0 +# img = paddle.transpose(img, [2, 0, 1]) +# new_img = re(img) +# new_img = new_img * 255.0 +# new_img = paddle.transpose(new_img, [1, 2, 0]) +# new_img = new_img.cpu().numpy() +# new_img = Image.fromarray(new_img.astype('uint8')) +# new_img.save('./res.png') +# +# +# +#if __name__ == "__main__": +# main() diff --git a/image_classification/XCiT/run_eval.sh b/image_classification/XCiT/run_eval.sh new file mode 100644 index 00000000..8548f5db --- /dev/null +++ b/image_classification/XCiT/run_eval.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ +python main_multi_gpu.py \ +-cfg='./configs/xcit_large_24_p8_384.yaml' \ +-dataset='imagenet2012' \ +-batch_size=64 \ +-data_path='/dataset/imagenet' \ +-eval \ +-pretrained='./xcit_large_24_p8_384_dist' \ diff --git a/image_classification/XCiT/run_train.sh b/image_classification/XCiT/run_train.sh new file mode 100644 index 00000000..b6badd06 --- /dev/null +++ b/image_classification/XCiT/run_train.sh @@ -0,0 +1,6 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ +-cfg='./configs/xcit_nano_12_p8_224.yaml' \ +-dataset='imagenet2012' \ +-batch_size=8 \ +-data_path='/dataset/imagenet' \ diff --git a/image_classification/XCiT/run_train_multi.sh 
b/image_classification/XCiT/run_train_multi.sh new file mode 100644 index 00000000..ae378344 --- /dev/null +++ b/image_classification/XCiT/run_train_multi.sh @@ -0,0 +1,6 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ +-cfg='./configs/xcit_nano_12_p8_224.yaml' \ +-dataset='imagenet2012' \ +-batch_size=8 \ +-data_path='/dataset/imagenet' \ diff --git a/image_classification/XCiT/stat_define.py b/image_classification/XCiT/stat_define.py new file mode 100644 index 00000000..e45be956 --- /dev/null +++ b/image_classification/XCiT/stat_define.py @@ -0,0 +1,60 @@ +import os +import glob +import paddle +from config import get_config +from swin_transformer import build_swin as build_model + +def count_gelu(layer, inputs, output): + activation_flops = 8 + x = inputs[0] + num = x.numel() + layer.total_ops += num * activation_flops + + +def count_softmax(layer, inputs, output): + softmax_flops = 5 # max/substract, exp, sum, divide + x = inputs[0] + num = x.numel() + layer.total_ops += num * softmax_flops + + +def count_layernorm(layer, inputs, output): + layer_norm_flops = 5 # get mean (sum), get variance (square and sum), scale(multiply) + x = inputs[0] + num = x.numel() + layer.total_ops += num * layer_norm_flops + + +cfg = './configs/swin_tiny_patch4_window7_224.yaml' +input_size = (1, 3, 224, 224) +config = get_config(cfg) +model = build_model(config) + +custom_ops = {paddle.nn.GELU: count_gelu, + paddle.nn.LayerNorm: count_layernorm, + paddle.nn.Softmax: count_softmax, + } +print(os.path.basename(cfg)) +paddle.flops(model, + input_size=input_size, + custom_ops=custom_ops, + print_detail=False) + + +#for cfg in glob.glob('./configs/*.yaml'): +# #cfg = './configs/swin_base_patch4_window7_224.yaml' +# input_size = (1, 3, int(cfg[-8:-5]), int(cfg[-8:-5])) +# config = get_config(cfg) +# model = build_model(config) +# +# +# custom_ops = {paddle.nn.GELU: count_gelu, +# paddle.nn.LayerNorm: count_layernorm, +# paddle.nn.Softmax: count_softmax, +# } +# print(os.path.basename(cfg)) +# paddle.flops(model, +# input_size=input_size, +# custom_ops=custom_ops, +# print_detail=False) +# print('-----------') diff --git a/image_classification/XCiT/utils.py b/image_classification/XCiT/utils.py new file mode 100644 index 00000000..44800527 --- /dev/null +++ b/image_classification/XCiT/utils.py @@ -0,0 +1,120 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +"""utils for ViT + +Contains AverageMeter for monitoring, get_exclude_from_decay_fn for training +and WarmupCosineScheduler for training + +""" + +import math +from paddle.optimizer.lr import LRScheduler + + +class AverageMeter(): + """ Meter for monitoring losses""" + def __init__(self): + self.avg = 0 + self.sum = 0 + self.cnt = 0 + self.reset() + + def reset(self): + """reset all values to zeros""" + self.avg = 0 + self.sum = 0 + self.cnt = 0 + + def update(self, val, n=1): + """update avg by val and n, where val is the avg of n values""" + self.sum += val * n + self.cnt += n + self.avg = self.sum / self.cnt + + + +def get_exclude_from_weight_decay_fn(exclude_list=[]): + """ Set params with no weight decay during the training + + For certain params, e.g., positional encoding in ViT, weight decay + may not needed during the learning, this method is used to find + these params. + + Args: + exclude_list: a list of params names which need to exclude + from weight decay. + Returns: + exclude_from_weight_decay_fn: a function returns True if param + will be excluded from weight decay + """ + if len(exclude_list) == 0: + exclude_from_weight_decay_fn = None + else: + def exclude_fn(param): + for name in exclude_list: + if param.endswith(name): + return False + return True + exclude_from_weight_decay_fn = exclude_fn + return exclude_from_weight_decay_fn + + +class WarmupCosineScheduler(LRScheduler): + """Warmup Cosine Scheduler + + First apply linear warmup, then apply cosine decay schedule. + Linearly increase learning rate from "warmup_start_lr" to "start_lr" over "warmup_epochs" + Cosinely decrease learning rate from "start_lr" to "end_lr" over remaining + "total_epochs - warmup_epochs" + + Attributes: + learning_rate: the starting learning rate (without warmup), not used here! + warmup_start_lr: warmup starting learning rate + start_lr: the starting learning rate (without warmup) + end_lr: the ending learning rate after whole loop + warmup_epochs: # of epochs for warmup + total_epochs: # of total epochs (include warmup) + """ + def __init__(self, + learning_rate, + warmup_start_lr, + start_lr, + end_lr, + warmup_epochs, + total_epochs, + cycles=0.5, + last_epoch=-1, + verbose=False): + """init WarmupCosineScheduler """ + self.warmup_epochs = warmup_epochs + self.total_epochs = total_epochs + self.warmup_start_lr = warmup_start_lr + self.start_lr = start_lr + self.end_lr = end_lr + self.cycles = cycles + super(WarmupCosineScheduler, self).__init__(learning_rate, last_epoch, verbose) + + def get_lr(self): + """ return lr value """ + if self.last_epoch < self.warmup_epochs: + val = (self.start_lr - self.warmup_start_lr) * float( + self.last_epoch)/float(self.warmup_epochs) + self.warmup_start_lr + return val + + progress = float(self.last_epoch - self.warmup_epochs) / float( + max(1, self.total_epochs - self.warmup_epochs)) + val = max(0.0, 0.5 * (1. + math.cos(math.pi * float(self.cycles) * 2.0 * progress))) + val = max(0.0, val * (self.start_lr - self.end_lr) + self.end_lr) + return val diff --git a/image_classification/XCiT/xcit.png b/image_classification/XCiT/xcit.png new file mode 100644 index 00000000..510f9397 Binary files /dev/null and b/image_classification/XCiT/xcit.png differ diff --git a/image_classification/XCiT/xcit.py b/image_classification/XCiT/xcit.py new file mode 100644 index 00000000..8936d6ef --- /dev/null +++ b/image_classification/XCiT/xcit.py @@ -0,0 +1,596 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Implement Transformer Class for XCiT +""" + +import math +from functools import partial +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from drop import DropPath + + +trunc_normal_ = nn.initializer.TruncatedNormal(std=0.02) +zeros_ = nn.initializer.Constant(value=0.0) +ones_ = nn.initializer.Constant(value=1.0) + + +class Mlp(nn.Layer): + """MLP module + MLP using nn.Linear layers with GELU activation; dropout is applied. + Ops: fc1 -> act -> dropout -> fc2 -> dropout + """ + def __init__(self, + in_features, + hidden_features=None, + out_features=None, + act_layer=nn.GELU, + drop=0.0): + super().__init__() + out_features = out_features or in_features + hidden_features = hidden_features or in_features + self.fc1 = nn.Linear(in_features, hidden_features) + self.act = act_layer() + self.fc2 = nn.Linear(hidden_features, out_features) + self.drop = nn.Dropout(drop) + + def forward(self, x): + x = self.fc1(x) + x = self.act(x) + x = self.drop(x) + x = self.fc2(x) + x = self.drop(x) + return x + + +class Identity(nn.Layer): + """Identity layer + The output of this layer is the input without any change. + Use this layer to avoid an if condition in some forward methods + """ + def __init__(self): + super().__init__() + + def forward(self, inputs): + return inputs + + +class PositionalEncodingFourier(nn.Layer): + """ + Positional encoding relying on a Fourier kernel matching the one used in the + "Attention Is All You Need" paper.
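+ The sine/cosine encoding is computed on the fly for the B x H x W grid of patch positions and + projected to dim channels by a 1x1 convolution (token_projection).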
+ """ + def __init__(self, hidden_dim=32, dim=768, temperature=10000): + super().__init__() + self.token_projection = nn.Conv2D(hidden_dim * 2, dim, kernel_size=1) + self.scale = 2 * math.pi + self.temperature = temperature + self.hidden_dim = hidden_dim + self.dim = dim + + def forward(self, B, H, W): + mask = paddle.zeros([B, H, W]).astype("bool") + not_mask = paddle.logical_not(mask) + y_embed = not_mask.cumsum(1, dtype="float32") + x_embed = not_mask.cumsum(2, dtype="float32") + eps = 1e-6 + y_embed = y_embed / (y_embed[:, -1:, :] + eps) * self.scale + x_embed = x_embed / (x_embed[:, :, -1:] + eps) * self.scale + + dim_t = paddle.arange(self.hidden_dim, dtype="int64") + dim_t = self.temperature ** (2 * (dim_t // 2) / self.hidden_dim) + + pos_x = x_embed.unsqueeze(3) / dim_t + pos_y = y_embed.unsqueeze(3) / dim_t + pos_x = paddle.stack( + (pos_x[:, :, :, 0::2].sin(), pos_x[:, :, :, 1::2].cos()), axis=4).flatten(3) + pos_y = paddle.stack( + (pos_y[:, :, :, 0::2].sin(), pos_y[:, :, :, 1::2].cos()), axis=4).flatten(3) + pos = paddle.concat((pos_y, pos_x), axis=3).transpose([0, 3, 1, 2]) + pos = self.token_projection(pos) + return pos + + +def conv3x3(in_planes, out_planes, stride=1): + """3x3 convolution with padding""" + return paddle.nn.Sequential( + nn.Conv2D(in_planes, + out_planes, + kernel_size=3, + stride=stride, + padding=1, + bias_attr=False), + nn.BatchNorm2D(out_planes)) + + +class ConvPatchEmbed(nn.Layer): + """ Image to Patch Embedding using multiple convolutional layers + """ + def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768): + super().__init__() + img_size = (img_size, img_size) + patch_size = (patch_size, patch_size) + num_patches = (img_size[1] // patch_size[1]) * (img_size[0] // patch_size[0]) + self.img_size = img_size + self.patch_size = patch_size + self.num_patches = num_patches + + if patch_size[0] == 16: + self.proj = paddle.nn.Sequential( + conv3x3(3, embed_dim // 8, 2), + nn.GELU(), + conv3x3(embed_dim // 8, embed_dim // 4, 2), + nn.GELU(), + conv3x3(embed_dim // 4, embed_dim // 2, 2), + nn.GELU(), + conv3x3(embed_dim // 2, embed_dim, 2), + ) + elif patch_size[0] == 8: + self.proj = paddle.nn.Sequential( + conv3x3(3, embed_dim // 4, 2), + nn.GELU(), + conv3x3(embed_dim // 4, embed_dim // 2, 2), + nn.GELU(), + conv3x3(embed_dim // 2, embed_dim, 2), + ) + else: + raise ValueError("For convolutional projection, patch size has to be in [8, 16]") + + def forward(self, x, padding_size=None): + B, C, H, W = x.shape + x = self.proj(x) + Hp, Wp = x.shape[2], x.shape[3] + x = x.flatten(2).transpose([0, 2, 1]) + + return x, (Hp, Wp) + + +class LPI(nn.Layer): + """ + Local Patch Interaction module that allows explicit communication between tokens in 3x3 windows + to augment the implicit communcation performed by the block diagonal scatter attention. 
+ Implemented using 2 layers of separable 3x3 convolutions with GeLU and BatchNorm2d + """ + def __init__(self, + in_features, + hidden_features=None, + out_features=None, + act_layer=nn.GELU, + drop=0.0, + kernel_size=3): + super().__init__() + out_features = out_features or in_features + + padding = kernel_size // 2 + + self.conv1 = paddle.nn.Conv2D( + in_features, + out_features, + kernel_size=kernel_size, + padding=padding, + groups=out_features, + ) + self.act = act_layer() + self.bn = nn.BatchNorm2D(in_features) + self.conv2 = paddle.nn.Conv2D( + in_features, + out_features, + kernel_size=kernel_size, + padding=padding, + groups=out_features, + ) + + def forward(self, x, H, W): + B, N, C = x.shape + x = x.transpose([0, 2, 1]).reshape([B, C, H, W]) + x = self.conv1(x) + x = self.act(x) + x = self.bn(x) + x = self.conv2(x) + x = x.reshape([B, C, N]).transpose([0, 2, 1]) + + return x + + +class ClassAttention(nn.Layer): + """Class Attention Layer as in CaiT https://arxiv.org/abs/2103.17239 + """ + def __init__(self, + dim, + num_heads=8, + qkv_bias=False, + qk_scale=None, + attn_drop=0.0, + proj_drop=0.0): + super().__init__() + self.num_heads = num_heads + head_dim = dim // num_heads + self.scale = qk_scale or head_dim ** -0.5 + + self.qkv = nn.Linear(dim, dim * 3, bias_attr=qkv_bias) + self.attn_drop = nn.Dropout(attn_drop) + self.proj = nn.Linear(dim, dim) + self.proj_drop = nn.Dropout(proj_drop) + + def forward(self, x): + B, N, C = x.shape + qkv = self.qkv(x).reshape([B, N, 3, self.num_heads, C // self.num_heads]) + qkv = qkv.transpose([2, 0, 3, 1, 4]) + # make torchscript happy (cannot use tensor as tuple) + q, k, v = qkv[0], qkv[1], qkv[2] + + qc = q[:, :, 0:1] # CLS token + attn_cls = (qc * k).sum(axis=-1) * self.scale + attn_cls = F.softmax(attn_cls, axis=-1) + attn_cls = self.attn_drop(attn_cls) + + cls_tkn = (attn_cls.unsqueeze(2) @ v).transpose([0, 1, 3, 2]).reshape([B, 1, C]) + cls_tkn = self.proj(cls_tkn) + x = paddle.concat([self.proj_drop(cls_tkn), x[:, 1:]], axis=1) + return x + + +class ClassAttentionBlock(nn.Layer): + """Class Attention Layer as in CaiT https://arxiv.org/abs/2103.17239 + """ + def __init__(self, + dim, + num_heads, + mlp_ratio=4.0, + qkv_bias=False, + qk_scale=None, + drop=0.0, + attn_drop=0.0, + drop_path=0.0, + act_layer=nn.GELU, + norm_layer=nn.LayerNorm, + eta=None, + tokens_norm=False): + super().__init__() + self.norm1 = norm_layer(dim) + + self.attn = ClassAttention( + dim, + num_heads=num_heads, + qkv_bias=qkv_bias, + qk_scale=qk_scale, + attn_drop=attn_drop, + proj_drop=drop, + ) + + self.drop_path = DropPath(drop_path) if drop_path > 0.0 else Identity() + self.norm2 = norm_layer(dim) + mlp_hidden_dim = int(dim * mlp_ratio) + self.mlp = Mlp( + in_features=dim, + hidden_features=mlp_hidden_dim, + act_layer=act_layer, + drop=drop, + ) + + # LayerScale Initialization (no layerscale when None) + if eta is not None: + self.gamma1 = paddle.create_parameter( + shape=[dim], + dtype="float32", + default_initializer=nn.initializer.Constant(value=eta), + ) + self.gamma2 = paddle.create_parameter( + shape=[dim], + dtype="float32", + default_initializer=nn.initializer.Constant(value=eta), + ) + else: + self.gamma1, self.gamma2 = 1.0, 1.0 + + # A hack for models pre-trained with layernorm over all the tokens not just the CLS + self.tokens_norm = tokens_norm + + def forward(self, x, H, W, mask=None): + x = x + self.drop_path(self.gamma1 * self.attn(self.norm1(x))) + if self.tokens_norm: + x = self.norm2(x) + else: + x[:, 0:1] = self.norm2(x[:, 0:1]) + + 
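+ # MLP branch: only the CLS token goes through the gamma2-scaled MLP; + # the patch tokens are re-attached unchanged and the result is added back to x_res through drop_path as a residual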
x_res = x + cls_token = x[:, 0:1] + cls_token = self.gamma2 * self.mlp(cls_token) + x = paddle.concat([cls_token, x[:, 1:]], axis=1) + x = x_res + self.drop_path(x) + return x + + +class XCA(nn.Layer): + """ Cross-Covariance Attention (XCA) operation where the channels are updated using a weighted + sum. The weights are obtained from the (softmax normalized) Cross-covariance + matrix (Q^T K \\in d_h \\times d_h) + """ + def __init__(self, + dim, + num_heads=8, + qkv_bias=False, + qk_scale=None, + attn_drop=0.0, + proj_drop=0.0): + super().__init__() + self.num_heads = num_heads + # self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1)) + self.temperature = paddle.create_parameter( + shape=[num_heads, 1, 1], dtype="float32", default_initializer=ones_ + ) + + self.qkv = nn.Linear(dim, dim * 3, bias_attr=qkv_bias) + self.attn_drop = nn.Dropout(attn_drop) + self.proj = nn.Linear(dim, dim) + self.proj_drop = nn.Dropout(proj_drop) + + def forward(self, x): + B, N, C = x.shape + qkv = self.qkv(x).reshape([B, N, 3, self.num_heads, C // self.num_heads]) + qkv = qkv.transpose([2, 0, 3, 1, 4]) + # make torchscript happy (cannot use tensor as tuple) + q, k, v = qkv[0], qkv[1], qkv[2] + + q = q.transpose([0, 1, 3, 2]) + k = k.transpose([0, 1, 3, 2]) + v = v.transpose([0, 1, 3, 2]) + + q = paddle.nn.functional.normalize(q, axis=-1) + k = paddle.nn.functional.normalize(k, axis=-1) + + attn = (q @ k.transpose([0, 1, 3, 2])) * self.temperature + attn = F.softmax(attn, axis=-1) + attn = self.attn_drop(attn) + + x = (attn @ v).transpose([0, 3, 1, 2]).reshape([B, N, C]) + x = self.proj(x) + x = self.proj_drop(x) + return x + + +class XCABlock(nn.Layer): + def __init__(self, + dim, + num_heads, + mlp_ratio=4.0, + qkv_bias=False, + qk_scale=None, + drop=0.0, + attn_drop=0.0, + drop_path=0.0, + act_layer=nn.GELU, + norm_layer=nn.LayerNorm, + num_tokens=196, + eta=None): + super().__init__() + self.norm1 = norm_layer(dim) + self.attn = XCA( + dim, + num_heads=num_heads, + qkv_bias=qkv_bias, + qk_scale=qk_scale, + attn_drop=attn_drop, + proj_drop=drop, + ) + self.drop_path = DropPath(drop_path) if drop_path > 0.0 else Identity() + self.norm2 = norm_layer(dim) + + mlp_hidden_dim = int(dim * mlp_ratio) + self.mlp = Mlp( + in_features=dim, + hidden_features=mlp_hidden_dim, + act_layer=act_layer, + drop=drop, + ) + + self.norm3 = norm_layer(dim) + self.local_mp = LPI(in_features=dim, act_layer=act_layer) + + self.gamma1 = paddle.create_parameter( + shape=[dim], + dtype="float32", + default_initializer=nn.initializer.Constant(value=eta), + ) + self.gamma2 = paddle.create_parameter( + shape=[dim], + dtype="float32", + default_initializer=nn.initializer.Constant(value=eta), + ) + self.gamma3 = paddle.create_parameter( + shape=[dim], + dtype="float32", + default_initializer=nn.initializer.Constant(value=eta), + ) + + # self.gamma1 = nn.Parameter(eta * torch.ones(dim), requires_grad=True) + # self.gamma2 = nn.Parameter(eta * torch.ones(dim), requires_grad=True) + # self.gamma3 = nn.Parameter(eta * torch.ones(dim), requires_grad=True) + + def forward(self, x, H, W): + x = x + self.drop_path(self.gamma1 * self.attn(self.norm1(x))) + x = x + self.drop_path(self.gamma3 * self.local_mp(self.norm3(x), H, W)) + x = x + self.drop_path(self.gamma2 * self.mlp(self.norm2(x))) + return x + + +class XCiT(nn.Layer): + """ + Based on timm and DeiT code bases + https://github.com/rwightman/pytorch-image-models/tree/master/timm + https://github.com/facebookresearch/deit/ + """ + def __init__(self, + img_size=224, + patch_size=16, + 
in_chans=3, + num_classes=1000, + embed_dim=768, + depth=12, + num_heads=12, + mlp_ratio=4.0, + qkv_bias=True, + qk_scale=None, + drop_rate=0.0, + attn_drop_rate=0.0, + drop_path_rate=0.0, + norm_layer=partial(nn.LayerNorm, epsilon=1e-6), + cls_attn_layers=2, + use_pos=True, + patch_proj="linear", + eta=None, + tokens_norm=False): + """ + Args: + img_size (int, tuple): input image size + patch_size (int, tuple): patch size + in_chans (int): number of input channels + num_classes (int): number of classes for classification head + embed_dim (int): embedding dimension + depth (int): depth of transformer + num_heads (int): number of attention heads + mlp_ratio (int): ratio of mlp hidden dim to embedding dim + qkv_bias (bool): enable bias for qkv if True + qk_scale (float): override default qk scale of head_dim ** -0.5 if set + drop_rate (float): dropout rate + attn_drop_rate (float): attention dropout rate + drop_path_rate (float): stochastic depth rate + norm_layer: (nn.Module): normalization layer + cls_attn_layers: (int) Depth of Class attention layers + use_pos: (bool) whether to use positional encoding + eta: (float) layerscale initialization value + tokens_norm: (bool) Whether to normalize all tokens or just the cls_token in the CA + """ + super().__init__() + self.num_classes = num_classes + self.num_features = self.embed_dim = embed_dim + norm_layer = norm_layer or partial(nn.LayerNorm, epsilson=1e-6) + + self.patch_embed = ConvPatchEmbed( + img_size=img_size, embed_dim=embed_dim, patch_size=patch_size + ) + + num_patches = self.patch_embed.num_patches + + # self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim)) + self.cls_token = paddle.create_parameter( + shape=[1, 1, embed_dim], dtype="float32", default_initializer=trunc_normal_ + ) + self.pos_drop = nn.Dropout(p=drop_rate) + + dpr = [drop_path_rate for i in range(depth)] + self.blocks = nn.LayerList( + [ + XCABlock( + dim=embed_dim, + num_heads=num_heads, + mlp_ratio=mlp_ratio, + qkv_bias=qkv_bias, + qk_scale=qk_scale, + drop=drop_rate, + attn_drop=attn_drop_rate, + drop_path=dpr[i], + norm_layer=norm_layer, + num_tokens=num_patches, + eta=eta, + ) + for i in range(depth) + ] + ) + + self.cls_attn_blocks = nn.LayerList( + [ + ClassAttentionBlock( + dim=embed_dim, + num_heads=num_heads, + mlp_ratio=mlp_ratio, + qkv_bias=qkv_bias, + qk_scale=qk_scale, + drop=drop_rate, + attn_drop=attn_drop_rate, + norm_layer=norm_layer, + eta=eta, + tokens_norm=tokens_norm, + ) + for i in range(cls_attn_layers) + ] + ) + self.norm = norm_layer(embed_dim) + self.pos_embeder = PositionalEncodingFourier(dim=embed_dim) + self.use_pos = use_pos + self.head = ( + nn.Linear(self.num_features, num_classes) if num_classes > 0 else Identity() + ) + + # Classifier head + self.apply(self._init_weights) + + def _init_weights(self, m): + if isinstance(m, nn.Linear): + trunc_normal_(m.weight) + if isinstance(m, nn.Linear) and m.bias is not None: + zeros_(m.bias) + elif isinstance(m, nn.LayerNorm): + zeros_(m.bias) + ones_(m.weight) + + def forward_features(self, x): + B = x.shape[0] + + x, (Hp, Wp) = self.patch_embed(x) + + if self.use_pos: + pos_encoding = ( + self.pos_embeder(B, Hp, Wp) + .reshape([B, -1, x.shape[1]]) + .transpose([0, 2, 1]) + ) + x = x + pos_encoding + + x = self.pos_drop(x) + + for blk in self.blocks: + x = blk(x, Hp, Wp) + + cls_tokens = self.cls_token.expand([B, -1, -1]) + x = paddle.concat((cls_tokens, x), axis=1) + + for blk in self.cls_attn_blocks: + x = blk(x, Hp, Wp) + + x = self.norm(x)[:, 0] + return x + + def forward(self, x): + 
x = self.forward_features(x) + x = self.head(x) + + return x + + +def build_xcit(config): + model = XCiT( + img_size=config.DATA.IMAGE_SIZE, + patch_size=config.MODEL.TRANS.PATCH_SIZE, + embed_dim=config.MODEL.TRANS.EMBED_DIM, + num_classes=config.MODEL.NUM_CLASSES, + depth=config.MODEL.TRANS.DEPTH, + num_heads=config.MODEL.TRANS.NUM_HEADS, + eta=config.MODEL.TRANS.ETA, + tokens_norm=config.MODEL.TRANS.TOKENS_NORM, + ) + return model diff --git a/image_classification/__init__.py b/image_classification/__init__.py new file mode 100644 index 00000000..e2cbd538 --- /dev/null +++ b/image_classification/__init__.py @@ -0,0 +1 @@ +#init diff --git a/image_classification/gMLP/README.md b/image_classification/gMLP/README.md index 7c759ebb..90559de6 100644 --- a/image_classification/gMLP/README.md +++ b/image_classification/gMLP/README.md @@ -14,12 +14,13 @@ This implementation is developed by [PaddleViT](https://github.com/BR-IDL/Paddle ### Update -Update (2021-08-11): Code is released and ported weights are uploaded. +- Update (2021-09-27): Model FLOPs and # params are uploaded. +- Update (2021-08-11): Code is released and ported weights are uploaded. ## Models Zoo -| Model | Acc@1 | Acc@5 | Image Size | Crop_pct | Interpolation | Link | -|--------------------------------|-------|-------|------------|----------|---------------|--------------| -| gmlp_s16_224 | 79.64 | 94.63 | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1TLypFly7aW0oXzEHfeDSz2Va4RHPRqe5/view?usp=sharing)/[baidu](https://pan.baidu.com/s/13UUz1eGIKyqyhtwedKLUMA)(bcth) | +| Model | Acc@1 | Acc@5 | #Params | FLOPs | Image Size | Crop_pct | Interpolation | Link | +|-------------------------------|-------|-------|---------|--------|------------|----------|---------------|--------------| +| gmlp_s16_224 | 79.64 | 94.63 | 19.4M | 4.5G | 224 | 0.875 | bicubic | [google](https://drive.google.com/file/d/1TLypFly7aW0oXzEHfeDSz2Va4RHPRqe5/view?usp=sharing)/[baidu](https://pan.baidu.com/s/13UUz1eGIKyqyhtwedKLUMA)(bcth) | > *The results are evaluated on ImageNet2012 validation set. > @@ -65,8 +66,8 @@ from gmlp import build_gated_mlp as build_model config = get_config('./configs/gmlp_s16_224.yaml') # build model model = build_model(config) -# load pretrained weights, .pdparams is NOT needed -model_state_dict = paddle.load('./gmlp_s16_224') +# load pretrained weights +model_state_dict = paddle.load('./gmlp_s16_224.pdparams') model.set_dict(model_state_dict) ``` @@ -79,12 +80,12 @@ or ```shell CUDA_VISIBLE_DEVICES=0 \ python main_single_gpu.py \ - -cfg='./configs/gmlp_s16_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/gmlp_s16_224.yaml \ + -dataset=imagenet2012 \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/val \ -eval \ - -pretrained='./gmlp_s16_224' + -pretrained=/path/to/pretrained/model/gmlp_s16_224 # .pdparams is NOT needed ```
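As a quick sanity check after loading the pretrained weights with the snippet above, a minimal forward pass can be run before launching full evaluation; the input shape below assumes the default 224x224 ImageNet-1k config and is purely illustrative:

```python
import paddle

model.eval()
dummy = paddle.randn([1, 3, 224, 224])  # one fake RGB image at the 224x224 input size
with paddle.no_grad():
    logits = model(dummy)
print(logits.shape)  # expected: [1, 1000] for the 1000-class ImageNet head
```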
@@ -101,12 +102,12 @@ or ```shell CUDA_VISIBLE_DEVICES=0,1,2,3 \ python main_multi_gpu.py \ - -cfg='./configs/gmlp_s16_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/gmlp_s16_224.yaml \ + -dataset=imagenet2012 \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/val \ -eval \ - -pretrained='./gmlp_s16_224' + -pretrained=/path/to/pretrained/model/gmlp_s16_224 # .pdparams is NOT needed ```
@@ -120,10 +121,10 @@ or ```shell CUDA_VISIBLE_DEVICES=0 \ python main_single_gpu.py \ - -cfg='./configs/gmlp_s16_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/gmlp_s16_224.yaml \ + -dataset=imagenet2012 \ -batch_size=32 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/train ```
@@ -140,10 +141,10 @@ or ```shell CUDA_VISIBLE_DEVICES=0,1,2,3 \ python main_multi_gpu.py \ - -cfg='./configs/gmlp_s16_224.yaml' \ - -dataset='imagenet2012' \ + -cfg=./configs/gmlp_s16_224.yaml \ + -dataset=imagenet2012 \ -batch_size=16 \ - -data_path='/dataset/imagenet' \ + -data_path=/path/to/dataset/imagenet/train ```
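If a multi-GPU run is interrupted, training can usually be resumed from a saved checkpoint prefix via the `-resume` and `-last_epoch` options added in `main_multi_gpu.py`; a sketch with placeholder paths (as in the commands above, the `.pdparams`/`.pdopt` suffixes are NOT needed):

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python main_multi_gpu.py \
    -cfg=./configs/gmlp_s16_224.yaml \
    -dataset=imagenet2012 \
    -batch_size=16 \
    -data_path=/path/to/dataset/imagenet/train \
    -resume=/path/to/saved/checkpoint/prefix \
    -last_epoch=100  # epoch index of the loaded checkpoint, placeholder value
```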
diff --git a/image_classification/gMLP/__init__.py b/image_classification/gMLP/__init__.py new file mode 100644 index 00000000..e2cbd538 --- /dev/null +++ b/image_classification/gMLP/__init__.py @@ -0,0 +1 @@ +#init diff --git a/image_classification/gMLP/augment.py b/image_classification/gMLP/augment.py new file mode 100644 index 00000000..7a7f081c --- /dev/null +++ b/image_classification/gMLP/augment.py @@ -0,0 +1,285 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Augmentation""" +""" Rand Augmentation """ +# reference: RandAugment: Practical automated data augmentation with a reduced search space +# https://arxiv.org/abs/1909.13719 + +""" Auto Augmentation """ +# reference: AutoAugment: Learning Augmentation Policies from Data +# https://arxiv.org/abs/1805.09501 + +import random +import numpy as np +from PIL import Image, ImageEnhance, ImageOps + + +def auto_augment_policy_original(): + """25 types of augment policies in original paper""" + policy = [ + [('Posterize', 0.4, 8), ('Rotate', 0.6, 9)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + [('Posterize', 0.6, 7), ('Posterize', 0.6, 6)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Equalize', 0.4, 4), ('Rotate', 0.8, 8)], + [('Solarize', 0.6, 3), ('Equalize', 0.6, 7)], + [('Posterize', 0.8, 5), ('Equalize', 1.0, 2)], + [('Rotate', 0.2, 3), ('Solarize', 0.6, 8)], + [('Equalize', 0.6, 8), ('Posterize', 0.4, 6)], + [('Rotate', 0.8, 8), ('Color', 0.4, 0)], + [('Rotate', 0.4, 9), ('Equalize', 0.6, 2)], + [('Equalize', 0.0, 7), ('Equalize', 0.8, 8)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Rotate', 0.8, 8), ('Color', 1.0, 2)], + [('Color', 0.8, 8), ('Solarize', 0.8, 7)], + [('Sharpness', 0.4, 7), ('Invert', 0.6, 8)], + [('ShearX', 0.6, 5), ('Equalize', 1.0, 9)], + [('Color', 0.4, 0), ('Equalize', 0.6, 3)], + [('Equalize', 0.4, 7), ('Solarize', 0.2, 4)], + [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], + [('Invert', 0.6, 4), ('Equalize', 1.0, 8)], + [('Color', 0.6, 4), ('Contrast', 1.0, 8)], + [('Equalize', 0.8, 8), ('Equalize', 0.6, 3)], + ] + policy = [[SubPolicy(*args) for args in subpolicy] for subpolicy in policy] + return policy + + +def rand_augment_policy_original(magnitude_idx=9): + """ + 14 types of augment policies in original paper + Args: + magnitude_idx: M + """ + policy = [ + ('Posterize', 1, magnitude_idx), ('Rotate', 1, magnitude_idx), + ('Solarize', 1, magnitude_idx), ('AutoContrast', 1, magnitude_idx), + ('Equalize', 1, magnitude_idx), ('Contrast', 1, magnitude_idx), + ('Color', 1, magnitude_idx), ('Invert', 1, magnitude_idx), + ('Sharpness', 1, magnitude_idx), ('Brightness', 1, magnitude_idx), + ('ShearX', 1, magnitude_idx), ('ShearY', 1, magnitude_idx), + ('TranslateX', 1, magnitude_idx), ('TranslateY', 1, magnitude_idx), + ] + policy = [SubPolicy(*args) for args in policy] + return policy + + +class AutoAugment(): + """Auto Augment + 
Randomly choose a tuple of augment ops from a list of policy + Then apply the tuple of augment ops to input image + + Examples: + policy = auto_augment_policy_original() + augment = AutoAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy): + self.policy = policy + + def __call__(self, image, policy_idx=None): + if policy_idx is None: + policy_idx = random.randint(0, len(self.policy) - 1) + + sub_policy = self.policy[policy_idx] + for op in sub_policy: + image = op(image) + return image + + +class RandAugment(): + """Rand Augment + Randomly choose N augment ops from a list of K policies + Then apply the N ops to input image + + Examples: + policy = rand_augment_policy_original(magnitude_idx) + augment = RandAugment(policy) + transformed_image = augment(image) + """ + + def __init__(self, policy, num_layers=2): + """ + Args: + policy: list of SubPolicy + num_layers: int + """ + self.policy = policy + self.num_layers = num_layers + + def __call__(self, image): + selected_idx = np.random.choice(len(self.policy), self.num_layers) + + for policy_idx in selected_idx: + sub_policy = self.policy[policy_idx] + image = sub_policy(image) + return image + + +class SubPolicy: + """Subpolicy + Read augment name and magnitude, apply augment with probability + Args: + op_name: str, augment operation name + prob: float, if prob > random prob, apply augment + magnitude_idx: int, index of magnitude in preset magnitude ranges + """ + + def __init__(self, op_name, prob, magnitude_idx): + # ranges of operations' magnitude + ranges = { + 'ShearX': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'ShearY': np.linspace(0, 0.3, 10), # [-0.3, 0.3] (by random negative) + 'TranslateX': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'TranslateY': np.linspace(0, 150 / 331, 10), # [-0.45, 0.45] (by random negative) + 'Rotate': np.linspace(0, 30, 10), # [-30, 30] (by random negative) + 'Color': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Posterize': np.round(np.linspace(8, 4, 10), 0).astype(np.int), # [0, 4] + 'Solarize': np.linspace(256, 0, 10), # [0, 256] + 'Contrast': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Sharpness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'Brightness': np.linspace(0, 0.9, 10), # [-0.9, 0.9] (by random negative) + 'AutoContrast': [0] * 10, # no range + 'Equalize': [0] * 10, # no range + 'Invert': [0] * 10, # no range + } + + # augmentation operations + # Lambda is not pickleable for DDP + # image_ops = { + # 'ShearX': lambda image, magnitude: shear_x(image, magnitude), + # 'ShearY': lambda image, magnitude: shear_y(image, magnitude), + # 'TranslateX': lambda image, magnitude: translate_x(image, magnitude), + # 'TranslateY': lambda image, magnitude: translate_y(image, magnitude), + # 'Rotate': lambda image, magnitude: rotate(image, magnitude), + # 'AutoContrast': lambda image, magnitude: auto_contrast(image, magnitude), + # 'Invert': lambda image, magnitude: invert(image, magnitude), + # 'Equalize': lambda image, magnitude: equalize(image, magnitude), + # 'Solarize': lambda image, magnitude: solarize(image, magnitude), + # 'Posterize': lambda image, magnitude: posterize(image, magnitude), + # 'Contrast': lambda image, magnitude: contrast(image, magnitude), + # 'Color': lambda image, magnitude: color(image, magnitude), + # 'Brightness': lambda image, magnitude: brightness(image, magnitude), + # 'Sharpness': lambda image, magnitude: sharpness(image, magnitude), + # } + 
image_ops = { + 'ShearX': shear_x, + 'ShearY': shear_y, + 'TranslateX': translate_x_relative, + 'TranslateY': translate_y_relative, + 'Rotate': rotate, + 'AutoContrast': auto_contrast, + 'Invert': invert, + 'Equalize': equalize, + 'Solarize': solarize, + 'Posterize': posterize, + 'Contrast': contrast, + 'Color': color, + 'Brightness': brightness, + 'Sharpness': sharpness, + } + + self.prob = prob + self.magnitude = ranges[op_name][magnitude_idx] + self.op = image_ops[op_name] + + def __call__(self, image): + if self.prob > random.random(): + image = self.op(image, self.magnitude) + return image + + +# PIL Image transforms +# https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.transform +def shear_x(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, factor, 0, 0, 1, 0), fillcolor=fillcolor) + + +def shear_y(image, magnitude, fillcolor=(128, 128, 128)): + factor = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, factor, 1, 0), fillcolor=fillcolor) + + +def translate_x_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, pixels, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_relative(image, magnitude, fillcolor=(128, 128, 128)): + pixels = magnitude * image.size[0] + pixels = pixels * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, pixels), fillcolor=fillcolor) + + +def translate_x_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, magnitude, 0, 1, 0), fillcolor=fillcolor) + + +def translate_y_absolute(image, magnitude, fillcolor=(128, 128, 128)): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return image.transform(image.size, Image.AFFINE, (1, 0, 0, 0, 1, magnitude), fillcolor=fillcolor) + + +def rotate(image, magnitude): + rot = image.convert("RGBA").rotate(magnitude) + return Image.composite(rot, + Image.new('RGBA', rot.size, (128,) * 4), + rot).convert(image.mode) + + +def auto_contrast(image, magnitude=None): + return ImageOps.autocontrast(image) + + +def invert(image, magnitude=None): + return ImageOps.invert(image) + + +def equalize(image, magnitude=None): + return ImageOps.equalize(image) + + +def solarize(image, magnitude): + return ImageOps.solarize(image, magnitude) + + +def posterize(image, magnitude): + return ImageOps.posterize(image, magnitude) + + +def contrast(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Contrast(image).enhance(1 + magnitude) + + +def color(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Color(image).enhance(1 + magnitude) + + +def brightness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Brightness(image).enhance(1 + magnitude) + + +def sharpness(image, magnitude): + magnitude = magnitude * random.choice([-1, 1]) # random negative + return ImageEnhance.Sharpness(image).enhance(1 + magnitude) + diff --git a/image_classification/gMLP/config.py b/image_classification/gMLP/config.py index b6db78e0..4c6e755d 100644 
--- a/image_classification/gMLP/config.py +++ b/image_classification/gMLP/config.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -34,7 +34,9 @@ _C.DATA.DATASET = 'imagenet2012' # dataset name _C.DATA.IMAGE_SIZE = 224 # input image size: 224 for pretrain, 384 for finetune _C.DATA.CROP_PCT = 1.0 # input image scale ratio, scale is applied before centercrop in eval mode -_C.DATA.NUM_WORKERS = 4 # number of data loading threads +_C.DATA.NUM_WORKERS = 2 # number of data loading threads +_C.DATA.IMAGENET_MEAN = [0.485, 0.456, 0.406] # [0.5, 0.5, 0.5] +_C.DATA.IMAGENET_STD = [0.229, 0.224, 0.225] # [0.5, 0.5, 0.5] # model settings _C.MODEL = CN() @@ -43,26 +45,29 @@ _C.MODEL.RESUME = None _C.MODEL.PRETRAINED = None _C.MODEL.NUM_CLASSES = 1000 -_C.MODEL.DROPOUT = 0.1 -_C.MODEL.DROPPATH = 0.1 +_C.MODEL.DROPOUT = 0.0 +_C.MODEL.ATTENTION_DROPOUT = 0.0 +_C.MODEL.DROP_PATH = 0.1 # transformer settings _C.MODEL.MIXER = CN() _C.MODEL.MIXER.PATCH_SIZE = 16 _C.MODEL.MIXER.HIDDEN_SIZE = 256 _C.MODEL.MIXER.NUM_LAYERS = 30 +_C.MODEL.MIXER.MLP_RATIO = 6.0 # training settings _C.TRAIN = CN() _C.TRAIN.LAST_EPOCH = 0 _C.TRAIN.NUM_EPOCHS = 300 -_C.TRAIN.WARMUP_EPOCHS = 3 #34 # ~ 10k steps for 4096 batch size -_C.TRAIN.WEIGHT_DECAY = 0.01 #0.3 # 0.0 for finetune -_C.TRAIN.BASE_LR = 0.001 #0.003 for pretrain # 0.03 for finetune -_C.TRAIN.WARMUP_START_LR = 1e-6 #0.0 -_C.TRAIN.END_LR = 1e-5 -_C.TRAIN.GRAD_CLIP = 1.0 -_C.TRAIN.ACCUM_ITER = 2 #1 +_C.TRAIN.WARMUP_EPOCHS = 20 +_C.TRAIN.WEIGHT_DECAY = 0.05 +_C.TRAIN.BASE_LR = 0.001 +_C.TRAIN.WARMUP_START_LR = 5e-7 +_C.TRAIN.END_LR = 5e-6 +_C.TRAIN.GRAD_CLIP = 5.0 +_C.TRAIN.ACCUM_ITER = 1 +_C.TRAIN.LINEAR_SCALED_LR = None _C.TRAIN.LR_SCHEDULER = CN() _C.TRAIN.LR_SCHEDULER.NAME = 'warmupcosine' @@ -76,6 +81,24 @@ _C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999) # for adamW _C.TRAIN.OPTIMIZER.MOMENTUM = 0.9 +# train augmentation +_C.TRAIN.MIXUP_ALPHA = 0.8 +_C.TRAIN.CUTMIX_ALPHA = 1.0 +_C.TRAIN.CUTMIX_MINMAX = None +_C.TRAIN.MIXUP_PROB = 1.0 +_C.TRAIN.MIXUP_SWITCH_PROB = 0.5 +_C.TRAIN.MIXUP_MODE = 'batch' + +_C.TRAIN.SMOOTHING = 0.1 +_C.TRAIN.COLOR_JITTER = 0.4 +_C.TRAIN.AUTO_AUGMENT = False #'rand-m9-mstd0.5-inc1' +_C.TRAIN.RAND_AUGMENT = False + +_C.TRAIN.RANDOM_ERASE_PROB = 0.25 +_C.TRAIN.RANDOM_ERASE_MODE = 'pixel' +_C.TRAIN.RANDOM_ERASE_COUNT = 1 +_C.TRAIN.RANDOM_ERASE_SPLIT = False + # misc _C.SAVE = "./output" _C.TAG = "default" @@ -84,8 +107,9 @@ _C.VALIDATE_FREQ = 20 # freq to do validation _C.SEED = 0 _C.EVAL = False # run evaluation only +_C.AMP = False # mix precision training _C.LOCAL_RANK = 0 -_C.NGPUS = 1 +_C.NGPUS = -1 def _update_config_from_file(config, cfg_file): @@ -117,8 +141,12 @@ def update_config(config, args): config.DATA.BATCH_SIZE = args.batch_size if args.image_size: config.DATA.IMAGE_SIZE = args.image_size + if args.num_classes: + config.MODEL.NUM_CLASSES = args.num_classes if args.data_path: config.DATA.DATA_PATH = args.data_path + if args.output is not None: + config.SAVE = args.output if args.ngpus: config.NGPUS = args.ngpus if args.eval: @@ -130,6 +158,11 @@ def update_config(config, args): config.MODEL.RESUME = args.resume if args.last_epoch: config.TRAIN.LAST_EPOCH = args.last_epoch + if args.amp: # only during training + if config.EVAL is True: + config.AMP = False + else: + config.AMP = True #config.freeze() return config diff 
--git a/image_classification/gMLP/datasets.py b/image_classification/gMLP/datasets.py index e207f9ba..304df9a3 100644 --- a/image_classification/gMLP/datasets.py +++ b/image_classification/gMLP/datasets.py @@ -19,8 +19,20 @@ import os import math -from paddle.io import Dataset, DataLoader, DistributedBatchSampler -from paddle.vision import transforms, datasets, image_load +from PIL import Image +from paddle.io import Dataset +from paddle.io import DataLoader +from paddle.io import DistributedBatchSampler +from paddle.vision import transforms +from paddle.vision import datasets +from paddle.vision import image_load +from augment import auto_augment_policy_original +from augment import AutoAugment +from augment import rand_augment_policy_original +from augment import RandAugment +from transforms import RandomHorizontalFlip +from random_erasing import RandomErasing + class ImageNet2012Dataset(Dataset): """Build ImageNet2012 dataset @@ -80,13 +92,36 @@ def get_train_transforms(config): transforms_train: training transforms """ - transforms_train = transforms.Compose([ + aug_op_list = [] + # STEP1: random crop and resize + aug_op_list.append( transforms.RandomResizedCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE), - scale=(0.05, 1.0)), - transforms.ToTensor(), - transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]), - #transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]), - ]) + scale=(0.05, 1.0), interpolation='bicubic')) + # STEP2: auto_augment or color jitter + if config.TRAIN.AUTO_AUGMENT: + policy = auto_augment_policy_original() + auto_augment = AutoAugment(policy) + aug_op_list.append(auto_augment) + elif config.TRAIN.RAND_AUGMENT: + policy = rand_augment_policy_original() + rand_augment = RandAugment(policy) + aug_op_list.append(rand_augment) + else: + jitter = (float(config.TRAIN.COLOR_JITTER), ) * 3 + aug_op_list.append(transforms.ColorJitter(*jitter)) + # STEP3: other ops + aug_op_list.append(transforms.ToTensor()) + aug_op_list.append(transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, + std=config.DATA.IMAGENET_STD)) + # STEP4: random erasing + if config.TRAIN.RANDOM_ERASE_PROB > 0.: + random_erasing = RandomErasing(prob=config.TRAIN.RANDOM_ERASE_PROB, + mode=config.TRAIN.RANDOM_ERASE_MODE, + max_count=config.TRAIN.RANDOM_ERASE_COUNT, + num_splits=config.TRAIN.RANDOM_ERASE_SPLIT) + aug_op_list.append(random_erasing) + # Final: compose transforms and return + transforms_train = transforms.Compose(aug_op_list) return transforms_train @@ -106,11 +141,10 @@ def get_val_transforms(config): scale_size = int(math.floor(config.DATA.IMAGE_SIZE / config.DATA.CROP_PCT)) transforms_val = transforms.Compose([ - transforms.Resize(scale_size, 'bicubic'), # single int for resize shorter side of image + transforms.Resize(scale_size, interpolation='bicubic'), transforms.CenterCrop((config.DATA.IMAGE_SIZE, config.DATA.IMAGE_SIZE)), transforms.ToTensor(), - transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]), - #transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]), + transforms.Normalize(mean=config.DATA.IMAGENET_MEAN, std=config.DATA.IMAGENET_STD), ]) return transforms_val @@ -125,6 +159,7 @@ def get_dataset(config, mode='train'): Returns: dataset: dataset object """ + assert mode in ['train', 'val'] if config.DATA.DATASET == "cifar10": if mode == 'train': diff --git a/image_classification/gMLP/droppath.py b/image_classification/gMLP/droppath.py index fcff05e9..c8fe8048 100644 --- a/image_classification/gMLP/droppath.py +++ 
b/image_classification/gMLP/droppath.py @@ -32,6 +32,7 @@ def drop_path(inputs, drop_prob=0., training=False): if drop_prob == 0. or not training: return inputs keep_prob = 1 - drop_prob + keep_prob = paddle.to_tensor(keep_prob) shape = (inputs.shape[0], ) + (1, ) * (inputs.ndim - 1) # shape=(N, 1, 1, 1) random_tensor = keep_prob + paddle.rand(shape, dtype=inputs.dtype) random_tensor = random_tensor.floor() # mask diff --git a/image_classification/gMLP/gmlp.py b/image_classification/gMLP/gmlp.py index 47686075..25d8c5d8 100644 --- a/image_classification/gMLP/gmlp.py +++ b/image_classification/gMLP/gmlp.py @@ -198,7 +198,7 @@ def build_gated_mlp(config): in_channels=3, num_mixer_layers=config.MODEL.MIXER.NUM_LAYERS, embed_dim=config.MODEL.MIXER.HIDDEN_SIZE, - mlp_ratio=6, + mlp_ratio=config.MODEL.MIXER.MLP_RATIO, dropout=config.MODEL.DROPOUT, - droppath=config.MODEL.DROPPATH) + droppath=config.MODEL.DROP_PATH) return model diff --git a/image_classification/gMLP/losses.py b/image_classification/gMLP/losses.py new file mode 100644 index 00000000..082467a3 --- /dev/null +++ b/image_classification/gMLP/losses.py @@ -0,0 +1,123 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" Implement Loss functions """ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class LabelSmoothingCrossEntropyLoss(nn.Layer): + """ cross entropy loss for label smoothing + Args: + smoothing: float, smoothing rate + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, target label with shape [N] + Return: + loss: float, cross entropy loss value + """ + def __init__(self, smoothing=0.1): + super().__init__() + assert 0 <= smoothing < 1.0 + self.smoothing = smoothing + self.confidence = 1 - smoothing + + def forward(self, x, target): + log_probs = F.log_softmax(x) # [N, num_classes] + # target_index is used to get prob for each of the N samples + target_index = paddle.zeros([x.shape[0], 2], dtype='int64') # [N, 2] + target_index[:, 0] = paddle.arange(x.shape[0]) + target_index[:, 1] = target + + nll_loss = -log_probs.gather_nd(index=target_index) # index: [N] + smooth_loss = -log_probs.mean(axis=-1) + loss = self.confidence * nll_loss + self.smoothing * smooth_loss + return loss.mean() + + +class SoftTargetCrossEntropyLoss(nn.Layer): + """ cross entropy loss for soft target + Args: + x: tensor, predictions (before softmax) with shape [N, num_classes] + target: tensor, soft target with shape [N, num_classes] + Returns: + loss: float, the mean loss value + """ + def __init__(self): + super().__init__() + + def forward(self, x, target): + loss = paddle.sum(-target * F.log_softmax(x, axis=-1), axis=-1) + return loss.mean() + + +class DistillationLoss(nn.Layer): + """Distillation loss function + This layer includes the orginal loss (criterion) and a extra + distillation loss (criterion), which computes the loss with + different type options, between current model and + a teacher model as its supervision. 
+
+    Args:
+        base_criterion: nn.Layer, the original criterion
+        teacher_model: nn.Layer, the teacher model as supervision
+        distillation_type: str, one of ['none', 'soft', 'hard']
+        alpha: float, ratio of base loss (* (1-alpha))
+            and distillation loss (* alpha)
+        tau: float, temperature in distillation
+    """
+    def __init__(self,
+                 base_criterion,
+                 teacher_model,
+                 distillation_type,
+                 alpha,
+                 tau):
+        super().__init__()
+        assert distillation_type in ['none', 'soft', 'hard']
+        self.base_criterion = base_criterion
+        self.teacher_model = teacher_model
+        self.type = distillation_type
+        self.alpha = alpha
+        self.tau = tau
+
+    def forward(self, inputs, outputs, targets):
+        """
+        Args:
+            inputs: tensor, the original model inputs
+            outputs: tensor, the outputs of the model
+            outputs_kd: tensor, the distillation outputs of the model,
+                this is usually obtained by a separate branch
+                in the last layer of the model
+            targets: tensor, the labels for the base criterion
+        """
+        outputs, outputs_kd = outputs[0], outputs[1]
+        base_loss = self.base_criterion(outputs, targets)
+        if self.type == 'none':
+            return base_loss
+
+        with paddle.no_grad():
+            teacher_outputs = self.teacher_model(inputs)
+
+        if self.type == 'soft':
+            distillation_loss = F.kl_div(
+                F.log_softmax(outputs_kd / self.tau, axis=1),
+                F.log_softmax(teacher_outputs / self.tau, axis=1),
+                reduction='sum') * (self.tau * self.tau) / outputs_kd.numel()
+        elif self.type == 'hard':
+            distillation_loss = F.cross_entropy(outputs_kd, teacher_outputs.argmax(axis=1))
+
+        loss = base_loss * (1 - self.alpha) + distillation_loss * self.alpha
+        return loss
+
+
diff --git a/image_classification/gMLP/main_multi_gpu.py b/image_classification/gMLP/main_multi_gpu.py
index 4189e737..436ce98b 100644
--- a/image_classification/gMLP/main_multi_gpu.py
+++ b/image_classification/gMLP/main_multi_gpu.py
@@ -1,4 +1,4 @@
-# Copyright (c) 2021 PPViT Authors. All Rights Reserved. 
+# Copyright (c) 2021 PPViT Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License. 
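For reference, a minimal usage sketch of the `LabelSmoothingCrossEntropyLoss` added in `losses.py` above; the shapes are illustrative only:

```python
import paddle
from losses import LabelSmoothingCrossEntropyLoss

criterion = LabelSmoothingCrossEntropyLoss(smoothing=0.1)
logits = paddle.randn([4, 1000])       # [N, num_classes], raw scores before softmax
labels = paddle.randint(0, 1000, [4])  # [N], integer class ids
loss = criterion(logits, labels)       # scalar, averaged over the batch
```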
@@ -25,54 +25,55 @@ import paddle.nn as nn import paddle.nn.functional as F import paddle.distributed as dist -from datasets import get_dataloader, get_dataset -from gmlp import build_gated_mlp as build_model +from datasets import get_dataloader +from datasets import get_dataset from utils import AverageMeter from utils import WarmupCosineScheduler from config import get_config from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from gmlp import build_gated_mlp as build_model -parser = argparse.ArgumentParser('gMLP') -parser.add_argument('-cfg', type=str, default=None) -parser.add_argument('-dataset', type=str, default=None) -parser.add_argument('-batch_size', type=int, default=None) -parser.add_argument('-image_size', type=int, default=None) -parser.add_argument('-data_path', type=str, default=None) -parser.add_argument('-ngpus', type=int, default=None) -parser.add_argument('-pretrained', type=str, default=None) -parser.add_argument('-resume', type=str, default=None) -parser.add_argument('-last_epoch', type=int, default=None) -parser.add_argument('-eval', action='store_true') -arguments = parser.parse_args() - - -log_format = "%(asctime)s %(message)s" -logging.basicConfig(stream=sys.stdout, level=logging.INFO, - format=log_format, datefmt="%m%d %I:%M:%S %p") - -# get default config -config = get_config() -# update config by arguments -config = update_config(config, arguments) - -# set output folder -if not config.EVAL: - config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) -else: - config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) - -#config.freeze() - -if not os.path.exists(config.SAVE): - os.makedirs(config.SAVE, exist_ok=True) - -# set logging format -logger = logging.getLogger() -fh = logging.FileHandler(os.path.join(config.SAVE, 'log.txt')) -fh.setFormatter(logging.Formatter(log_format)) -logger.addHandler(fh) -logger.info(f'config= {config}') +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('gMLP') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = 
logging.FileHandler(os.path.join(filename)) + fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger def train(dataloader, @@ -80,18 +81,28 @@ def train(dataloader, criterion, optimizer, epoch, + total_epochs, total_batch, debug_steps=100, - accum_iter=1): + accum_iter=1, + mixup_fn=None, + amp=False, + local_logger=None, + master_logger=None): """Training for one epoch Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion epoch: int, current epoch - total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info - accum_iter: int, num of iters for accumulating gradients + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None Returns: train_loss_meter.avg train_acc_meter.avg @@ -100,63 +111,120 @@ def train(dataloader, model.train() train_loss_meter = AverageMeter() train_acc_meter = AverageMeter() + master_train_loss_meter = AverageMeter() + master_train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) time_st = time.time() for batch_id, data in enumerate(dataloader): image = data[0] label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() - output = model(image) - loss = criterion(output, label) - - #NOTE: division may be needed depending on the loss function - # Here no division is needed: - # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' - # - #loss = loss / accum_iter + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() - loss.backward() + pred = F.softmax(output) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) - if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): - optimizer.step() - optimizer.clear_grad() + batch_size = paddle.to_tensor(image.shape[0]) - pred = F.softmax(output) - acc = paddle.metric.accuracy(pred, label.unsqueeze(1)) + # sync from other gpus for overall loss and acc + master_loss = loss.clone() + master_acc = acc.clone() + master_batch_size = batch_size.clone() + dist.all_reduce(master_loss) + dist.all_reduce(master_acc) + dist.all_reduce(master_batch_size) + master_loss = master_loss / dist.get_world_size() + master_acc = master_acc / dist.get_world_size() + master_train_loss_meter.update(master_loss.numpy()[0], 
master_batch_size.numpy()[0]) + master_train_acc_meter.update(master_acc.numpy()[0], master_batch_size.numpy()[0]) - batch_size = image.shape[0] - train_loss_meter.update(loss.numpy()[0], batch_size) - train_acc_meter.update(acc.numpy()[0], batch_size) + train_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) + train_acc_meter.update(acc.numpy()[0], batch_size.numpy()[0]) if batch_id % debug_steps == 0: - logger.info( - f"Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + - f"Step[{batch_id:04d}/{total_batch:04d}], " + - f"Avg Loss: {train_loss_meter.avg:.4f}, " + - f"Avg Acc: {train_acc_meter.avg:.4f}") + if local_logger: + local_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {train_loss_meter.avg:.4f}, " + + f"Avg Acc: {train_acc_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + + f"Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_train_loss_meter.avg:.4f}, " + + f"Avg Acc: {master_train_acc_meter.avg:.4f}") train_time = time.time() - time_st - return train_loss_meter.avg, train_acc_meter.avg, train_time - - -def validate(dataloader, model, criterion, total_batch, debug_steps=100): + return (train_loss_meter.avg, + train_acc_meter.avg, + master_train_loss_meter.avg, + master_train_acc_meter.avg, + train_time) + + +def validate(dataloader, + model, + criterion, + total_batch, + debug_steps=100, + local_logger=None, + master_logger=None): """Validation for whole dataset Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info + debug_steps: int, num of iters to log info, default: 100 + local_logger: logger for local process/gpu, default: None + master_logger: logger for main process, default: None Returns: - val_loss_meter.avg - val_acc1_meter.avg - val_acc5_meter.avg - val_time + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + master_val_loss_meter.avg: float, average loss on all processes/gpus + master_val_acc1_meter.avg: float, average top1 accuracy on all processes/gpus + master_val_acc5_meter.avg: float, average top5 accuracy on all processes/gpus + val_time: float, validation time """ model.eval() val_loss_meter = AverageMeter() val_acc1_meter = AverageMeter() val_acc5_meter = AverageMeter() + master_val_loss_meter = AverageMeter() + master_val_acc1_meter = AverageMeter() + master_val_acc5_meter = AverageMeter() time_st = time.time() with paddle.no_grad(): @@ -171,56 +239,140 @@ def validate(dataloader, model, criterion, total_batch, debug_steps=100): acc1 = paddle.metric.accuracy(pred, label.unsqueeze(1)) acc5 = paddle.metric.accuracy(pred, label.unsqueeze(1), k=5) - dist.all_reduce(loss) - dist.all_reduce(acc1) - dist.all_reduce(acc5) - loss = loss / dist.get_world_size() - acc1 = acc1 / dist.get_world_size() - acc5 = acc5 / dist.get_world_size() - batch_size = paddle.to_tensor(image.shape[0]) - dist.all_reduce(batch_size) + + master_loss = loss.clone() + master_acc1 = acc1.clone() + master_acc5 = acc5.clone() + master_batch_size = batch_size.clone() + + dist.all_reduce(master_loss) + dist.all_reduce(master_acc1) + dist.all_reduce(master_acc5) + dist.all_reduce(master_batch_size) + 
master_loss = master_loss / dist.get_world_size() + master_acc1 = master_acc1 / dist.get_world_size() + master_acc5 = master_acc5 / dist.get_world_size() + + master_val_loss_meter.update(master_loss.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc1_meter.update(master_acc1.numpy()[0], master_batch_size.numpy()[0]) + master_val_acc5_meter.update(master_acc5.numpy()[0], master_batch_size.numpy()[0]) val_loss_meter.update(loss.numpy()[0], batch_size.numpy()[0]) val_acc1_meter.update(acc1.numpy()[0], batch_size.numpy()[0]) val_acc5_meter.update(acc5.numpy()[0], batch_size.numpy()[0]) if batch_id % debug_steps == 0: - logger.info( - f"Val Step[{batch_id:04d}/{total_batch:04d}], " + - f"Avg Loss: {val_loss_meter.avg:.4f}, " + - f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + - f"Avg Acc@5: {val_acc5_meter.avg:.4f}") - + if local_logger: + local_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {val_acc5_meter.avg:.4f}") + if master_logger and dist.get_rank() == 0: + master_logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg Loss: {master_val_loss_meter.avg:.4f}, " + + f"Avg Acc@1: {master_val_acc1_meter.avg:.4f}, " + + f"Avg Acc@5: {master_val_acc5_meter.avg:.4f}") val_time = time.time() - time_st - return val_loss_meter.avg, val_acc1_meter.avg, val_acc5_meter.avg, val_time + return (val_loss_meter.avg, + val_acc1_meter.avg, + val_acc5_meter.avg, + master_val_loss_meter.avg, + master_val_acc1_meter.avg, + master_val_acc5_meter.avg, + val_time) def main_worker(*args): - # 0. Preparation + # STEP 0: Preparation + config = args[0] dist.init_parallel_env() last_epoch = config.TRAIN.LAST_EPOCH - world_size = paddle.distributed.get_world_size() - local_rank = paddle.distributed.get_rank() - logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + world_size = dist.get_world_size() + local_rank = dist.get_rank() seed = config.SEED + local_rank paddle.seed(seed) np.random.seed(seed) random.seed(seed) - # 1. Create model + # logger for each process/gpu + local_logger = get_logger( + filename=os.path.join(config.SAVE, 'log_{}.txt'.format(local_rank)), + logger_name='local_logger') + # overall logger + if local_rank == 0: + master_logger = get_logger( + filename=os.path.join(config.SAVE, 'log.txt'), + logger_name='master_logger') + master_logger.info(f'\n{config}') + else: + master_logger = None + local_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + if local_rank == 0: + master_logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + + # STEP 1: Create model model = build_model(config) model = paddle.DataParallel(model) - # 2. 
Create train and val dataloader - dataset_train, dataset_val = args[0], args[1] - dataloader_train = get_dataloader(config, dataset_train, 'train', True) + + # STEP 2: Create train and val dataloader + dataset_train, dataset_val = args[1], args[2] + # Create training dataloader + if not config.EVAL: + dataloader_train = get_dataloader(config, dataset_train, 'train', True) + total_batch_train = len(dataloader_train) + local_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + if local_rank == 0: + master_logger.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + # Create validation dataloader dataloader_val = get_dataloader(config, dataset_val, 'test', True) - total_batch_train = len(dataloader_train) total_batch_val = len(dataloader_val) - logging.info(f'----- Total # of train batch (single gpu): {total_batch_train}') - logging.info(f'----- Total # of val batch (single gpu): {total_batch_val}') - # 3. Define criterion - criterion = nn.CrossEntropyLoss() - # 4. Define optimizer and lr_scheduler + local_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + if local_rank == 0: + master_logger.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE * world_size) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + scheduler = None if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, @@ -242,7 +394,9 @@ def main_worker(*args): gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, last_epoch=last_epoch) else: - logging.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + local_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") 
raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") if config.TRAIN.OPTIMIZER.NAME == "SGD": @@ -273,76 +427,120 @@ def main_worker(*args): # 'absolute_pos_embed', 'relative_position_bias_table']), ) else: - logging.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + local_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + if local_rank == 0: + master_logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") - # 5. Load pretrained model / load resumt model and optimizer states + # STEP 6: Load pretrained model / load resumt model and optimizer states if config.MODEL.PRETRAINED: if (config.MODEL.PRETRAINED).endswith('.pdparams'): raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') model.set_dict(model_state) - logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + local_logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + if local_rank == 0: + master_logger.info( + f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") if config.MODEL.RESUME: - assert os.path.isfile(config.MODEL.RESUME+'.pdparams') is True - assert os.path.isfile(config.MODEL.RESUME+'.pdopt') is True - model_state = paddle.load(config.MODEL.RESUME+'.pdparams') + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') model.set_dict(model_state) opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') optimizer.set_state_dict(opt_state) - logger.info( - f"----- Resume Training: Load model and optmizer states from {config.MODEL.RESUME}") + local_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") + if local_rank == 0: + master_logger.info( + f"----- Resume Training: Load model and optmizer from {config.MODEL.RESUME}") - # 6. Validation + # STEP 7: Validation (eval mode) if config.EVAL: - logger.info('----- Start Validating') - val_loss, val_acc1, val_acc5, val_time = validate( + local_logger.info('----- Start Validating') + if local_rank == 0: + master_logger.info('----- Start Validating') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=criterion, + criterion=criterion_val, total_batch=total_batch_val, - debug_steps=config.REPORT_FREQ) - logger.info(f"Validation Loss: {val_loss:.4f}, " + - f"Validation Acc@1: {val_acc1:.4f}, " + - f"Validation Acc@5: {val_acc5:.4f}, " + - f"time: {val_time:.2f}") + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") return - # 6. 
Start training and validation - logging.info(f"Start training from epoch {last_epoch+1}.") + # STEP 8: Start training and validation (train mode) + local_logger.info(f"Start training from epoch {last_epoch+1}.") + if local_rank == 0: + master_logger.info(f"Start training from epoch {last_epoch+1}.") for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): # train - logging.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") - train_loss, train_acc, train_time = train(dataloader=dataloader_train, - model=model, - criterion=criterion, - optimizer=optimizer, - epoch=epoch, - total_batch=total_batch_train, - debug_steps=config.REPORT_FREQ, - accum_iter=config.TRAIN.ACCUM_ITER) + local_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + if local_rank == 0: + master_logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + train_loss, train_acc, avg_loss, avg_acc, train_time = train( + dataloader=dataloader_train, + model=model, + criterion=criterion, + optimizer=optimizer, + epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, + total_batch=total_batch_train, + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER, + mixup_fn=mixup_fn, + amp=config.AMP, + local_logger=local_logger, + master_logger=master_logger) + scheduler.step() - logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + - f"Train Loss: {train_loss:.4f}, " + - f"Train Acc: {train_acc:.4f}, " + - f"time: {train_time:.2f}") + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {train_loss:.4f}, " + + f"Train Acc: {train_acc:.4f}, " + + f"time: {train_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss: {avg_loss:.4f}, " + + f"Train Acc: {avg_acc:.4f}, " + + f"time: {train_time:.2f}") + # validation if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: - logger.info(f'----- Validation after Epoch: {epoch}') - val_loss, val_acc1, val_acc5, val_time = validate( + local_logger.info(f'----- Validation after Epoch: {epoch}') + if local_rank == 0: + master_logger.info(f'----- Validation after Epoch: {epoch}') + val_loss, val_acc1, val_acc5, avg_loss, avg_acc1, avg_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=criterion, + criterion=criterion_val, total_batch=total_batch_val, - debug_steps=config.REPORT_FREQ) - logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + - f"Validation Loss: {val_loss:.4f}, " + - f"Validation Acc@1: {val_acc1:.4f}, " + - f"Validation Acc@5: {val_acc5:.4f}, " + - f"time: {val_time:.2f}") + debug_steps=config.REPORT_FREQ, + local_logger=local_logger, + master_logger=master_logger) + local_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {val_loss:.4f}, " + + f"Validation Acc@1: {val_acc1:.4f}, " + + f"Validation Acc@5: {val_acc5:.4f}, " + + f"time: {val_time:.2f}") + if local_rank == 0: + master_logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Validation Loss: {avg_loss:.4f}, " + + f"Validation Acc@1: {avg_acc1:.4f}, " + + f"Validation Acc@5: {avg_acc5:.4f}, " + + f"time: {val_time:.2f}") # model save if local_rank == 0: if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: @@ -350,15 +548,33 @@ def main_worker(*args): config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") paddle.save(model.state_dict(), model_path + '.pdparams') 
paddle.save(optimizer.state_dict(), model_path + '.pdopt') - logger.info(f"----- Save model: {model_path}.pdparams") - logger.info(f"----- Save optim: {model_path}.pdopt") + master_logger.info(f"----- Save model: {model_path}.pdparams") + master_logger.info(f"----- Save optim: {model_path}.pdopt") def main(): - dataset_train = get_dataset(config, mode='train') + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + + # get dataset and start DDP + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + else: + dataset_train = None dataset_val = get_dataset(config, mode='val') config.NGPUS = len(paddle.static.cuda_places()) if config.NGPUS == -1 else config.NGPUS - dist.spawn(main_worker, args=(dataset_train, dataset_val, ), nprocs=config.NGPUS) + dist.spawn(main_worker, args=(config, dataset_train, dataset_val, ), nprocs=config.NGPUS) if __name__ == "__main__": diff --git a/image_classification/gMLP/main_single_gpu.py b/image_classification/gMLP/main_single_gpu.py index fa11a1f4..83e2d8b6 100644 --- a/image_classification/gMLP/main_single_gpu.py +++ b/image_classification/gMLP/main_single_gpu.py @@ -1,5 +1,4 @@ - -# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
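The same argparse and logging refactor is applied to `main_single_gpu.py` below; with the newly added `-amp` and `-output` flags, a single-GPU mixed-precision training run could look like the following (paths are placeholders):

```shell
CUDA_VISIBLE_DEVICES=0 \
python main_single_gpu.py \
    -cfg=./configs/gmlp_s16_224.yaml \
    -dataset=imagenet2012 \
    -batch_size=32 \
    -data_path=/path/to/dataset/imagenet/train \
    -output=./output \
    -amp
```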
@@ -27,53 +26,54 @@ import paddle.nn.functional as F from datasets import get_dataloader from datasets import get_dataset -from gmlp import build_gated_mlp as build_model from utils import AverageMeter from utils import WarmupCosineScheduler +from utils import get_exclude_from_weight_decay_fn from config import get_config from config import update_config +from mixup import Mixup +from losses import LabelSmoothingCrossEntropyLoss +from losses import SoftTargetCrossEntropyLoss +from gmlp import build_gated_mlp as build_model -parser = argparse.ArgumentParser('gMLP') -parser.add_argument('-cfg', type=str, default=None) -parser.add_argument('-dataset', type=str, default=None) -parser.add_argument('-batch_size', type=int, default=None) -parser.add_argument('-image_size', type=int, default=None) -parser.add_argument('-data_path', type=str, default=None) -parser.add_argument('-ngpus', type=int, default=None) -parser.add_argument('-pretrained', type=str, default=None) -parser.add_argument('-resume', type=str, default=None) -parser.add_argument('-last_epoch', type=int, default=None) -parser.add_argument('-eval', action='store_true') -args = parser.parse_args() - - -log_format = "%(asctime)s %(message)s" -logging.basicConfig(stream=sys.stdout, level=logging.INFO, - format=log_format, datefmt="%m%d %I:%M:%S %p") - -# get default config -config = get_config() -# update config by arguments -config = update_config(config, args) - -# set output folder -if not config.EVAL: - config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) -else: - config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) - -#config.freeze() - -if not os.path.exists(config.SAVE): - os.makedirs(config.SAVE, exist_ok=True) - -# set logging format -logger = logging.getLogger() -fh = logging.FileHandler(os.path.join(config.SAVE, 'log.txt')) -fh.setFormatter(logging.Formatter(log_format)) -logger.addHandler(fh) -logger.info(f'config= {config}') +def get_arguments(): + """return argumeents, this will overwrite the config after loading yaml file""" + parser = argparse.ArgumentParser('gMLP') + parser.add_argument('-cfg', type=str, default=None) + parser.add_argument('-dataset', type=str, default=None) + parser.add_argument('-batch_size', type=int, default=None) + parser.add_argument('-image_size', type=int, default=None) + parser.add_argument('-data_path', type=str, default=None) + parser.add_argument('-output', type=str, default=None) + parser.add_argument('-ngpus', type=int, default=None) + parser.add_argument('-num_classes', type=int, default=None) + parser.add_argument('-pretrained', type=str, default=None) + parser.add_argument('-resume', type=str, default=None) + parser.add_argument('-last_epoch', type=int, default=None) + parser.add_argument('-eval', action='store_true') + parser.add_argument('-amp', action='store_true') + arguments = parser.parse_args() + return arguments + + +def get_logger(filename, logger_name=None): + """set logging file and format + Args: + filename: str, full path of the logger file to write + logger_name: str, the logger name, e.g., 'master_logger', 'local_logger' + Return: + logger: python logger + """ + log_format = "%(asctime)s %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + # different name is needed when creating multiple logger in one process + logger = logging.getLogger(logger_name) + fh = logging.FileHandler(os.path.join(filename)) + 
fh.setFormatter(logging.Formatter(log_format)) + logger.addHandler(fh) + return logger def train(dataloader, @@ -81,56 +81,82 @@ def train(dataloader, criterion, optimizer, epoch, + total_epochs, total_batch, debug_steps=100, - accum_iter=1): + accum_iter=1, + mixup_fn=None, + amp=False, + logger=None): """Training for one epoch Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion epoch: int, current epoch - total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info - accum_iter: int, num of iters for accumulating gradients + total_epochs: int, total num of epochs + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + accum_iter: int, num of iters for accumulating gradients, default: 1 + mixup_fn: Mixup, mixup instance, default: None + amp: bool, if True, use mix precision training, default: False + logger: logger for logging, default: None Returns: - train_loss_meter.avg - train_acc_meter.avg - train_time + train_loss_meter.avg: float, average loss on current process/gpu + train_acc_meter.avg: float, average top1 accuracy on current process/gpu + train_time: float, training time """ model.train() train_loss_meter = AverageMeter() train_acc_meter = AverageMeter() + + if amp is True: + scaler = paddle.amp.GradScaler(init_loss_scaling=1024) time_st = time.time() for batch_id, data in enumerate(dataloader): image = data[0] label = data[1] + label_orig = label.clone() + + if mixup_fn is not None: + image, label = mixup_fn(image, label_orig) + + if amp is True: # mixed precision training + with paddle.amp.auto_cast(): + output = model(image) + loss = criterion(output, label) + scaled = scaler.scale(loss) + scaled.backward() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + scaler.minimize(optimizer, scaled) + optimizer.clear_grad() + else: # full precision training + output = model(image) + loss = criterion(output, label) + #NOTE: division may be needed depending on the loss function + # Here no division is needed: + # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' + #loss = loss / accum_iter + loss.backward() - output = model(image) - loss = criterion(output, label) - - #NOTE: division may be needed depending on the loss function - # Here no division is needed: - # default 'reduction' param in nn.CrossEntropyLoss is set to 'mean' - #loss = loss / accum_iter - - loss.backward() - - if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): - optimizer.step() - optimizer.clear_grad() + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() pred = F.softmax(output) - acc = paddle.metric.accuracy(pred, label.unsqueeze(1)) + if mixup_fn: + acc = paddle.metric.accuracy(pred, label_orig) + else: + acc = paddle.metric.accuracy(pred, label_orig.unsqueeze(1)) batch_size = image.shape[0] train_loss_meter.update(loss.numpy()[0], batch_size) train_acc_meter.update(acc.numpy()[0], batch_size) - if batch_id % debug_steps == 0: + if logger and batch_id % debug_steps == 0: logger.info( - f"Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Epoch[{epoch:03d}/{total_epochs:03d}], " + f"Step[{batch_id:04d}/{total_batch:04d}], " + f"Avg Loss: {train_loss_meter.avg:.4f}, " + f"Avg Acc: {train_acc_meter.avg:.4f}") @@ -139,19 +165,20 @@ def train(dataloader, return train_loss_meter.avg, train_acc_meter.avg, train_time -def 
validate(dataloader, model, criterion, total_batch, debug_steps=100): +def validate(dataloader, model, criterion, total_batch, debug_steps=100, logger=None): """Validation for whole dataset Args: dataloader: paddle.io.DataLoader, dataloader instance model: nn.Layer, a ViT model criterion: nn.criterion - total_epoch: int, total num of epoch, for logging - debug_steps: int, num of iters to log info + total_batch: int, total num of batches for one epoch + debug_steps: int, num of iters to log info, default: 100 + logger: logger for logging, default: None Returns: - val_loss_meter.avg - val_acc1_meter.avg - val_acc5_meter.avg - val_time + val_loss_meter.avg: float, average loss on current process/gpu + val_acc1_meter.avg: float, average top1 accuracy on current process/gpu + val_acc5_meter.avg: float, average top5 accuracy on current process/gpu + val_time: float, valitaion time """ model.eval() val_loss_meter = AverageMeter() @@ -176,7 +203,7 @@ def validate(dataloader, model, criterion, total_batch, debug_steps=100): val_acc1_meter.update(acc1.numpy()[0], batch_size) val_acc5_meter.update(acc5.numpy()[0], batch_size) - if batch_id % debug_steps == 0: + if logger and batch_id % debug_steps == 0: logger.info( f"Val Step[{batch_id:04d}/{total_batch:04d}], " + f"Avg Loss: {val_loss_meter.avg:.4f}, " + @@ -188,24 +215,77 @@ def validate(dataloader, model, criterion, total_batch, debug_steps=100): def main(): - # 0. Preparation + # STEP 0: Preparation + # config is updated by: (1) config.py, (2) yaml file, (3) arguments + arguments = get_arguments() + config = get_config() + config = update_config(config, arguments) + # set output folder + if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) last_epoch = config.TRAIN.LAST_EPOCH seed = config.SEED paddle.seed(seed) np.random.seed(seed) random.seed(seed) - #paddle.set_device('gpu:0') - # 1. Create model + logger = get_logger(filename=os.path.join(config.SAVE, 'log.txt')) + logger.info(f'\n{config}') + + # STEP 1: Create model model = build_model(config) - #model = paddle.DataParallel(model) - # 2. Create train and val dataloader - dataset_train = get_dataset(config, mode='train') + + # STEP 2: Create train and val dataloader + if not config.EVAL: + dataset_train = get_dataset(config, mode='train') + dataloader_train = get_dataloader(config, dataset_train, 'train', False) dataset_val = get_dataset(config, mode='val') - dataloader_train = get_dataloader(config, dataset_train, 'train', False) dataloader_val = get_dataloader(config, dataset_val, 'val', False) - # 3. Define criterion - criterion = nn.CrossEntropyLoss() - # 4. 
Define lr_scheduler + + # STEP 3: Define Mixup function + mixup_fn = None + if config.TRAIN.MIXUP_PROB > 0 or config.TRAIN.CUTMIX_ALPHA > 0 or config.TRAIN.CUTMIX_MINMAX is not None: + mixup_fn = Mixup(mixup_alpha=config.TRAIN.MIXUP_ALPHA, + cutmix_alpha=config.TRAIN.CUTMIX_ALPHA, + cutmix_minmax=config.TRAIN.CUTMIX_MINMAX, + prob=config.TRAIN.MIXUP_PROB, + switch_prob=config.TRAIN.MIXUP_SWITCH_PROB, + mode=config.TRAIN.MIXUP_MODE, + label_smoothing=config.TRAIN.SMOOTHING, + num_classes=config.MODEL.NUM_CLASSES) + + # STEP 4: Define criterion + if config.TRAIN.MIXUP_PROB > 0.: + criterion = SoftTargetCrossEntropyLoss() + elif config.TRAIN.SMOOTHING: + criterion = LabelSmoothingCrossEntropyLoss() + else: + criterion = nn.CrossEntropyLoss() + # only use cross entropy for val + criterion_val = nn.CrossEntropyLoss() + + # STEP 5: Define optimizer and lr_scheduler + # set lr according to batch size and world size (hacked from Swin official code and modified for CSwin) + if config.TRAIN.LINEAR_SCALED_LR is not None: + linear_scaled_lr = ( + config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_warmup_start_lr = ( + config.TRAIN.WARMUP_START_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + linear_scaled_end_lr = ( + config.TRAIN.END_LR * config.DATA.BATCH_SIZE) / config.TRAIN.LINEAR_SCALED_LR + + if config.TRAIN.ACCUM_ITER > 1: + linear_scaled_lr = linear_scaled_lr * config.TRAIN.ACCUM_ITER + linear_scaled_warmup_start_lr = linear_scaled_warmup_start_lr * config.TRAIN.ACCUM_ITER + linear_scaled_end_lr = linear_scaled_end_lr * config.TRAIN.ACCUM_ITER + + config.TRAIN.BASE_LR = linear_scaled_lr + config.TRAIN.WARMUP_START_LR = linear_scaled_warmup_start_lr + config.TRAIN.END_LR = linear_scaled_end_lr + scheduler = None if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, @@ -214,8 +294,7 @@ def main(): end_lr=config.TRAIN.END_LR, warmup_epochs=config.TRAIN.WARMUP_EPOCHS, total_epochs=config.TRAIN.NUM_EPOCHS, - last_epoch=config.TRAIN.LAST_EPOCH, - ) + last_epoch=config.TRAIN.LAST_EPOCH) elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, T_max=config.TRAIN.NUM_EPOCHS, @@ -227,9 +306,9 @@ def main(): gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, last_epoch=last_epoch) else: - logging.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + logger.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") - # 5. Define optimizer + if config.TRAIN.OPTIMIZER.NAME == "SGD": if config.TRAIN.GRAD_CLIP: clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) @@ -249,58 +328,67 @@ def main(): optimizer = paddle.optimizer.AdamW( parameters=model.parameters(), learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, - weight_decay=config.TRAIN.WEIGHT_DECAY, beta1=config.TRAIN.OPTIMIZER.BETAS[0], beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, epsilon=config.TRAIN.OPTIMIZER.EPS, grad_clip=clip) else: - logging.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + logger.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") - # 6. 
Load pretrained model or load resume model and optimizer states + + # STEP 6: Load pretrained model or load resume model and optimizer states if config.MODEL.PRETRAINED: - assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') model.set_dict(model_state) logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") if config.MODEL.RESUME: - assert os.path.isfile(config.MODEL.RESUME+'.pdparams') is True - assert os.path.isfile(config.MODEL.RESUME+'.pdopt') is True - model_state = paddle.load(config.MODEL.RESUME+'.pdparams') + assert os.path.isfile(config.MODEL.RESUME + '.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME + '.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME + '.pdparams') model.set_dict(model_state) - opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') + opt_state = paddle.load(config.MODEL.RESUME + '.pdopt') optimizer.set_state_dict(opt_state) logger.info( f"----- Resume: Load model and optmizer from {config.MODEL.RESUME}") - # 7. Validation + + # STEP 7: Validation (eval mode) if config.EVAL: logger.info('----- Start Validating') val_loss, val_acc1, val_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=criterion, + criterion=criterion_val, total_batch=len(dataloader_val), - debug_steps=config.REPORT_FREQ) + debug_steps=config.REPORT_FREQ, + logger=logger) logger.info(f"Validation Loss: {val_loss:.4f}, " + f"Validation Acc@1: {val_acc1:.4f}, " + f"Validation Acc@5: {val_acc5:.4f}, " + f"time: {val_time:.2f}") return - # 8. Start training and validation - logging.info(f"Start training from epoch {last_epoch+1}.") + + # STEP 8: Start training and validation (train mode) + logger.info(f"Start training from epoch {last_epoch+1}.") for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): # train - logging.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") + logger.info(f"Now training epoch {epoch}. LR={optimizer.get_lr():.6f}") train_loss, train_acc, train_time = train(dataloader=dataloader_train, model=model, criterion=criterion, optimizer=optimizer, epoch=epoch, + total_epochs=config.TRAIN.NUM_EPOCHS, total_batch=len(dataloader_train), debug_steps=config.REPORT_FREQ, accum_iter=config.TRAIN.ACCUM_ITER, - ) + mixup_fn=mixup_fn, + amp=config.AMP, + logger=logger) scheduler.step() logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + f"Train Loss: {train_loss:.4f}, " + @@ -312,9 +400,10 @@ def main(): val_loss, val_acc1, val_acc5, val_time = validate( dataloader=dataloader_val, model=model, - criterion=criterion, + criterion=criterion_val, total_batch=len(dataloader_val), - debug_steps=config.REPORT_FREQ) + debug_steps=config.REPORT_FREQ, + logger=logger) logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + f"Validation Loss: {val_loss:.4f}, " + f"Validation Acc@1: {val_acc1:.4f}, " + diff --git a/image_classification/gMLP/mixup.py b/image_classification/gMLP/mixup.py new file mode 100644 index 00000000..1d2db493 --- /dev/null +++ b/image_classification/gMLP/mixup.py @@ -0,0 +1,225 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""mixup and cutmix for batch data""" +import numpy as np +import paddle + + +def rand_bbox(image_shape, lam, count=None): + """ CutMix bbox by lam value + Generate 1 random bbox by value lam. lam is the cut size rate. + The cut_size is computed by sqrt(1-lam) * image_size. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + count: int, number of bbox to generate + """ + image_h, image_w = image_shape[-2:] + cut_rate = np.sqrt(1. - lam) + cut_h = int(cut_rate * image_h) + cut_w = int(cut_rate * image_w) + + # get random bbox center + cy = np.random.randint(0, image_h, size=count) + cx = np.random.randint(0, image_w, size=count) + + # get bbox coords + bbox_x1 = np.clip(cx - cut_w // 2, 0, image_w) + bbox_y1 = np.clip(cy - cut_h // 2, 0, image_h) + bbox_x2 = np.clip(cx + cut_w // 2, 0, image_w) + bbox_y2 = np.clip(cy + cut_h // 2, 0, image_h) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # while in pytorch, it will return [] tensor + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def rand_bbox_minmax(image_shape, minmax, count=None): + """ CutMix bbox by min and max value + Generate 1 random bbox by min and max percentage values. + Minmax is a tuple/list of min and max percentage vlaues + applied to the image width and height. + + Args: + image_shape: tuple/list, image height and width + minmax: tuple/list, min and max percentage values of image size + count: int, number of bbox to generate + """ + assert len(minmax) == 2 + image_h, image_w = image_shape[-2:] + min_ratio = minmax[0] + max_ratio = minmax[1] + cut_h = np.random.randint(int(image_h * min_ratio), int(image_h * max_ratio), size=count) + cut_w = np.random.randint(int(image_w * min_ratio), int(image_w * max_ratio), size=count) + + bbox_x1 = np.random.randint(0, image_w - cut_w, size=count) + bbox_y1 = np.random.randint(0, image_h - cut_h, size=count) + bbox_x2 = bbox_x1 + cut_w + bbox_y2 = bbox_y1 + cut_h + + return bbox_x1, bbox_y1, bbox_x2, bbox_y2 + + +def cutmix_generate_bbox_adjust_lam(image_shape, lam, minmax=None, correct_lam=True, count=None): + """Generate bbox and apply correction for lambda + If the mimmax is None, apply the standard cutmix by lam value, + If the minmax is set, apply the cutmix by min and max percentage values. + + Args: + image_shape: tuple/list, image height and width + lam: float, cutmix lambda value + minmax: tuple/list, min and max percentage values of image size + correct_lam: bool, if True, correct the lam value by the generated bbox + count: int, number of bbox to generate + """ + if minmax is not None: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox_minmax(image_shape, minmax, count) + else: + bbox_x1, bbox_y1, bbox_x2, bbox_y2 = rand_bbox(image_shape, lam, count) + + if correct_lam or minmax is not None: + image_h, image_w = image_shape[-2:] + bbox_area = (bbox_y2 - bbox_y1) * (bbox_x2 - bbox_x1) + lam = 1. 
- bbox_area / float(image_h * image_w) + return (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam + + +def one_hot(x, num_classes, on_value=1., off_value=0.): + """ Generate one-hot vector for label smoothing + Args: + x: tensor, contains label/class indices + num_classes: int, num of classes (len of the one-hot vector) + on_value: float, the vector value at label index, default=1. + off_value: float, the vector value at non-label indices, default=0. + Returns: + one_hot: tensor, tensor with on value at label index and off value + at non-label indices. + """ + x = x.reshape_([-1, 1]) + x_smoothed = paddle.full((x.shape[0], num_classes), fill_value=off_value) + for i in range(x.shape[0]): + x_smoothed[i, x[i]] = on_value + return x_smoothed + + +def mixup_one_hot(label, num_classes, lam=1., smoothing=0.): + """ mixup and label smoothing in batch + label smoothing is firstly applied, then + mixup is applied by mixing the bacth and its flip, + with a mixup rate. + + Args: + label: tensor, label tensor with shape [N], contains the class indices + num_classes: int, num of all classes + lam: float, mixup rate, default=1.0 + smoothing: float, label smoothing rate + """ + off_value = smoothing / num_classes + on_value = 1. - smoothing + off_value + y1 = one_hot(label, num_classes, on_value, off_value) + y2 = one_hot(label.flip(axis=[0]), num_classes, on_value, off_value) + return y2 * (1 - lam) + y1 * lam + + +class Mixup: + """Mixup class + Args: + mixup_alpha: float, mixup alpha for beta distribution, default=1.0, + cutmix_alpha: float, cutmix alpha for beta distribution, default=0.0, + cutmix_minmax: list/tuple, min and max value for cutmix ratio, default=None, + prob: float, if random prob < prob, do not use mixup, default=1.0, + switch_prob: float, prob of switching mixup and cutmix, default=0.5, + mode: string, mixup up, now only 'batch' is supported, default='batch', + correct_lam: bool, if True, apply correction of lam, default=True, + label_smoothing: float, label smoothing rate, default=0.1, + num_classes: int, num of classes, default=1000 + """ + def __init__(self, + mixup_alpha=1.0, + cutmix_alpha=0.0, + cutmix_minmax=None, + prob=1.0, + switch_prob=0.5, + mode='batch', + correct_lam=True, + label_smoothing=0.1, + num_classes=1000): + self.mixup_alpha = mixup_alpha + self.cutmix_alpha = cutmix_alpha + self.cutmix_minmax = cutmix_minmax + if cutmix_minmax is not None: + assert len(cutmix_minmax) == 2 + self.cutmix_alpha = 1.0 + self.mix_prob = prob + self.switch_prob = switch_prob + self.label_smoothing = label_smoothing + self.num_classes = num_classes + self.mode = mode + self.correct_lam = correct_lam + assert mode == 'batch', 'Now only batch mode is supported!' + + def __call__(self, x, target): + assert x.shape[0] % 2 == 0, "Batch size should be even" + lam = self._mix_batch(x) + target = mixup_one_hot(target, self.num_classes, lam, self.label_smoothing) + return x, target + + def get_params(self): + """Decide to use cutmix or regular mixup by sampling and + sample lambda for mixup + """ + lam = 1. + use_cutmix = False + use_mixup = np.random.rand() < self.mix_prob + if use_mixup: + if self.mixup_alpha > 0. and self.cutmix_alpha > 0.: + use_cutmix = np.random.rand() < self.switch_prob + alpha = self.cutmix_alpha if use_cutmix else self.mixup_alpha + lam_mix = np.random.beta(alpha, alpha) + elif self.mixup_alpha == 0. and self.cutmix_alpha > 0.: + use_cutmix=True + lam_mix = np.random.beta(self.cutmix_alpha, self.cutmix_alpha) + elif self.mixup_alpha > 0. 
and self.cutmix_alpha == 0.: + lam_mix = np.random.beta(self.mixup_alpha, self.mixup_alpha) + else: + raise ValueError('mixup_alpha and cutmix_alpha cannot be all 0') + lam = float(lam_mix) + return lam, use_cutmix + + def _mix_batch(self, x): + """mixup/cutmix by adding batch data and its flipped version""" + lam, use_cutmix = self.get_params() + if lam == 1.: + return lam + if use_cutmix: + (bbox_x1, bbox_y1, bbox_x2, bbox_y2), lam = cutmix_generate_bbox_adjust_lam( + x.shape, + lam, + minmax=self.cutmix_minmax, + correct_lam=self.correct_lam) + + # NOTE: in paddle, tensor indexing e.g., a[x1:x2], + # if x1 == x2, paddle will raise ValueErros, + # but in pytorch, it will return [] tensor without errors + if int(bbox_x1) != int(bbox_x2) and int(bbox_y1) != int(bbox_y2): + x[:, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] = x.flip(axis=[0])[ + :, :, int(bbox_x1): int(bbox_x2), int(bbox_y1): int(bbox_y2)] + else: + x_flipped = x.flip(axis=[0]) + x_flipped = x_flipped * (1 - lam) + x.set_value(x * (lam) + x_flipped) + return lam diff --git a/image_classification/gMLP/port_weights/__init__.py b/image_classification/gMLP/port_weights/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/image_classification/gMLP/random_erasing.py b/image_classification/gMLP/random_erasing.py new file mode 100644 index 00000000..31eea465 --- /dev/null +++ b/image_classification/gMLP/random_erasing.py @@ -0,0 +1,118 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Random Erasing for image tensor""" + +import random +import math +import paddle + + +def _get_pixels(per_pixel, rand_color, patch_size, dtype="float32"): + if per_pixel: + return paddle.normal(shape=patch_size).astype(dtype) + if rand_color: + return paddle.normal(shape=(patch_size[0], 1, 1)).astype(dtype) + return paddle.zeros((patch_size[0], 1, 1)).astype(dtype) + + +class RandomErasing(object): + """ + Args: + prob: probability of performing random erasing + min_area: Minimum percentage of erased area wrt input image area + max_area: Maximum percentage of erased area wrt input image area + min_aspect: Minimum aspect ratio of earsed area + max_aspect: Maximum aspect ratio of earsed area + mode: pixel color mode, in ['const', 'rand', 'pixel'] + 'const' - erase block is constant valued 0 for all channels + 'rand' - erase block is valued random color (same per-channel) + 'pixel' - erase block is vauled random color per pixel + min_count: Minimum # of ereasing blocks per image. + max_count: Maximum # of ereasing blocks per image. 
Area per box is scaled by count + per-image count is randomly chosen between min_count to max_count + """ + def __init__(self, prob=0.5, min_area=0.02, max_area=1/3, min_aspect=0.3, max_aspect=None, + mode='const', min_count=1, max_count=None, num_splits=0): + self.prob = prob + self.min_area = min_area + self.max_area = max_area + max_aspect = max_aspect or 1 / min_aspect + self.log_aspect_ratio = (math.log(min_aspect), math.log(max_aspect)) + self.min_count = min_count + self.max_count = max_count or min_count + self.num_splits = num_splits + mode = mode.lower() + self.rand_color = False + self.per_pixel = False + if mode == "rand": + self.rand_color = True + elif mode == "pixel": + self.per_pixel = True + else: + assert not mode or mode == "const" + + def _erase(self, img, chan, img_h, img_w, dtype): + if random.random() > self.prob: + return + area = img_h * img_w + count = self.min_count if self.min_count == self.max_count else \ + random.randint(self.min_count, self.max_count) + for _ in range(count): + for attempt in range(10): + target_area = random.uniform(self.min_area, self.max_area) * area / count + aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio)) + h = int(round(math.sqrt(target_area * aspect_ratio))) + w = int(round(math.sqrt(target_area / aspect_ratio))) + if w < img_w and h < img_h: + top = random.randint(0, img_h - h) + left = random.randint(0, img_w - w) + img[:, top:top+h, left:left+w] = _get_pixels( + self.per_pixel, self.rand_color, (chan, h, w), + dtype=dtype) + break + + def __call__(self, input): + if len(input.shape) == 3: + self._erase(input, *input.shape, input.dtype) + else: + batch_size, chan, img_h, img_w = input.shape + batch_start = batch_size // self.num_splits if self.num_splits > 1 else 0 + for i in range(batch_start, batch_size): + self._erase(input[i], chan, img_h, img_w, input.dtype) + return input + + + +#def main(): +# re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='rand') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='const') +# #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='pixel') +# import PIL.Image as Image +# import numpy as np +# paddle.set_device('cpu') +# img = paddle.to_tensor(np.asarray(Image.open('./lenna.png'))).astype('float32') +# img = img / 255.0 +# img = paddle.transpose(img, [2, 0, 1]) +# new_img = re(img) +# new_img = new_img * 255.0 +# new_img = paddle.transpose(new_img, [1, 2, 0]) +# new_img = new_img.cpu().numpy() +# new_img = Image.fromarray(new_img.astype('uint8')) +# new_img.save('./res.png') +# +# +# +#if __name__ == "__main__": +# main() diff --git a/image_classification/gMLP/run_train.sh b/image_classification/gMLP/run_train.sh index 2c394f3b..f8836109 100644 --- a/image_classification/gMLP/run_train.sh +++ b/image_classification/gMLP/run_train.sh @@ -2,5 +2,6 @@ CUDA_VISIBLE_DEVICES=7 \ python main_single_gpu.py \ -cfg='./configs/gmlp_s16_224.yaml' \ -dataset='imagenet2012' \ --batch_size=32 \ +-batch_size=8 \ -data_path='/dataset/imagenet' \ +-amp diff --git a/image_classification/gMLP/run_train_multi.sh b/image_classification/gMLP/run_train_multi.sh index 2692f218..2cd4c708 100644 --- a/image_classification/gMLP/run_train_multi.sh +++ b/image_classification/gMLP/run_train_multi.sh @@ -2,6 +2,6 @@ CUDA_VISIBLE_DEVICES=4,5,6,7 \ python main_multi_gpu.py \ -cfg='./configs/gmlp_s16_224.yaml' \ -dataset='imagenet2012' \ --batch_size=32 \ +-batch_size=8 \ -data_path='/dataset/imagenet' \ --ngpus=4 +-amp diff --git a/image_classification/gMLP/transforms.py 
b/image_classification/gMLP/transforms.py new file mode 100644 index 00000000..5a046912 --- /dev/null +++ b/image_classification/gMLP/transforms.py @@ -0,0 +1,14 @@ +import random +import paddle +import paddle.nn +import paddle.vision.transforms as T + + +class RandomHorizontalFlip(): + def __init__(self, p=0.5): + self.p = p + + def __call__(self, image): + if random.random() < self.p: + return T.hflip(image) + return image diff --git a/object_detection/DETR/README.md b/object_detection/DETR/README.md new file mode 100644 index 00000000..26592847 --- /dev/null +++ b/object_detection/DETR/README.md @@ -0,0 +1,173 @@ +# End-to-End Object Detection with Transformers, [arxiv](https://arxiv.org/abs/2005.12872) + +PaddlePaddle training/validation code and pretrained models for **DETR**. + +The official pytorch implementation is [here](https://github.com/facebookresearch/detr). + +This implementation is developed by [PaddleViT](https://github.com/BR-IDL/PaddleViT). + + + +drawing +
DETR Model Overview
+ +### Update +Update (2021-09-01): Code is released and ported weights are uploaded. + +## Models Zoo +| Model | backbone | box_mAP | Model | +|-------|-----------|---------|-------------------------------------------------------------------------------------------------------------------------------------------------------------| +| DETR | ResNet50 | 42.0 | [google](https://drive.google.com/file/d/1ruIKCqfh_MMqzq_F4L2Bv-femDMjS_ix/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1J6lB1mezd6_eVW3jnmohZA)(n5gk) | +| DETR | ResNet101 | 43.5 | [google](https://drive.google.com/file/d/11HCyDJKZLX33_fRGp4bCg1I14vrIKYW5/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1_msuuAwFMNbAlMpgUq89Og)(bxz2) | + +> *The results are evaluated on COCO validation set. + +## Notebooks +We provide a few notebooks in aistudio to help you get started: + +**\*(coming soon)\*** + + +## Requirements +- Python>=3.6 +- yaml>=0.2.5 +- [PaddlePaddle](https://www.paddlepaddle.org.cn/documentation/docs/en/install/index_en.html)>=2.1.0 +- [yacs](https://github.com/rbgirshick/yacs)>=0.1.8 + +## Data +COCO2017 dataset is used in the following folder structure: +``` +COCO dataset folder +├── annotations +│ ├── captions_train2017.json +│ ├── captions_val2017.json +│ ├── instances_train2017.json +│ ├── instances_val2017.json +│ ├── person_keypoints_train2017.json +│ └── person_keypoints_val2017.json +├── train2017 +│ ├── 000000000009.jpg +│ ├── 000000000025.jpg +│ ├── 000000000030.jpg +│ ├── 000000000034.jpg +| ... +└── val2017 + ├── 000000000139.jpg + ├── 000000000285.jpg + ├── 000000000632.jpg + ├── 000000000724.jpg + ... +``` + +More details about the COCO dataset can be found [here](../../docs/paddlevit-coco.md) and COCO [official dataset](https://cocodataset.org/#download). + +## Usage +To use the model with pretrained weights, download the `.pdparam` weight file and change related file paths in the following python scripts. The model config files are located in `./configs/`. + +For example, assume the downloaded weight file is stored in `./detr_resnet50.pdparams`, to use the `detr` model in python: +```python +from config import get_config +from detr import build_detr +# config files in ./configs/ +config = get_config('./configs/detr_resnet50.yaml') +# build model +model, critertion, postprocessors = build_detr(config) +# load pretrained weights +model_state_dict = paddle.load('./detr_resnet50.pdparams') +model.set_dict(model_state_dict) +``` + +## Evaluation +To evaluate DETR model performance on COCO2017 with a single GPU, run the following script using command line: +```shell +sh run_eval.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg=./configs/detr_resnet50.yaml \ + -dataset=coco \ + -batch_size=4 \ + -data_path=/path/to/dataset/coco/val \ + -eval \ + -pretrained=/path/to/pretrained/model/detr_resnet50 # .pdparams is NOT needed +``` + +
+ + +Run evaluation using multi-GPUs: + + + +```shell +sh run_eval_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg=./configs/detr_resnet50.yaml \ + -dataset=coco \ + -batch_size=4 \ + -data_path=/path/to/dataset/coco/val \ + -eval \ + -pretrained=/path/to/pretrained/model/detr_resnet50 # .pdparams is NOT needed +``` + +
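The Usage and Evaluation sections above cover weight loading and the launcher scripts; as a rough outline of what happens inside the evaluation loop, a minimal inference sketch might look like the following. Here `images` and `targets` are assumed to be one preprocessed batch from the COCO val dataloader, so treat this as an illustration rather than a standalone script:

```python
import paddle
from config import get_config
from detr import build_detr

config = get_config('./configs/detr_resnet50.yaml')
model, criterion, postprocessors = build_detr(config)
model.set_dict(paddle.load('./detr_resnet50.pdparams'))
model.eval()

with paddle.no_grad():
    # images, targets: one preprocessed batch from the COCO val dataloader (assumed)
    outputs = model(images)
    # recover boxes in the original image coordinates for COCO evaluation
    orig_target_sizes = paddle.stack([t['orig_size'] for t in targets], axis=0)
    results = postprocessors['bbox'](outputs, orig_target_sizes)
```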
+ + +## Training +To train the DETR model on COCO2017 with a single GPU, run the following script using command line: +```shell +sh run_train.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=1 \ +python main_single_gpu.py \ + -cfg=./configs/detr_resnet50.yaml \ + -dataset=coco \ + -batch_size=2 \ + -data_path=/path/to/dataset/coco/train +``` + +
+ + +Run training using multi-GPUs (coming soon): + + + +```shell +sh run_train_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg=./configs/detr_resnet50.yaml \ + -dataset=coco \ + -batch_size=2 \ + -data_path=/path/to/dataset/coco/train +``` + +
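This diff also adds `-resume` and `-last_epoch` options to the DETR entry scripts. Checkpoints follow the same `.pdparams`/`.pdopt` convention used by the gMLP trainer earlier in this diff, so resuming is expected to boil down to a sketch like the one below; the checkpoint path is a placeholder, and `model` and `optimizer` are assumed to be already built as in the training script:

```python
import paddle

# resume path without extension, e.g. the value passed via -resume (placeholder)
resume_path = './output/train-XXXX/checkpoint-epoch-20'
model_state = paddle.load(resume_path + '.pdparams')
model.set_dict(model_state)
opt_state = paddle.load(resume_path + '.pdopt')
optimizer.set_state_dict(opt_state)
```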
+ +## Visualization +coming soon + +## Reference +``` +@inproceedings{carion2020end, + title={End-to-end object detection with transformers}, + author={Carion, Nicolas and Massa, Francisco and Synnaeve, Gabriel and Usunier, Nicolas and Kirillov, Alexander and Zagoruyko, Sergey}, + booktitle={European Conference on Computer Vision}, + pages={213--229}, + year={2020}, + organization={Springer} +} +``` + diff --git a/object_detection/DETR/box_ops.py b/object_detection/DETR/box_ops.py index 921760bc..4f31926f 100644 --- a/object_detection/DETR/box_ops.py +++ b/object_detection/DETR/box_ops.py @@ -1,5 +1,32 @@ +import numpy as np import paddle + +def box_xyxy_to_cxcywh_numpy(box): + """convert box from top-left/bottom-right format: + [x0, y0, x1, y1] + to center-size format: + [center_x, center_y, width, height] + + Args: + box: numpy array, last_dim=4, stop-left/bottom-right format boxes + Return: + numpy array, last_dim=4, center-size format boxes + """ + + #x0, y0, x1, y1 = box.unbind(-1) + x0 = box[:, 0] + y0 = box[:, 1] + x1 = box[:, 2] + y1 = box[:, 3] + xc = x0 + (x1-x0)/2 + yc = y0 + (y1-y0)/2 + w = x1 - x0 + h = y1 - y0 + return np.stack([xc, yc, w, h], axis=-1) + + + def box_cxcywh_to_xyxy(box): """convert box from center-size format: [center_x, center_y, width, height] diff --git a/object_detection/DETR/coco.py b/object_detection/DETR/coco.py index 90a3ff54..f8bc8354 100644 --- a/object_detection/DETR/coco.py +++ b/object_detection/DETR/coco.py @@ -101,16 +101,13 @@ def convert_coco_poly_to_mask(segmentations, height, width): mask = coco_mask.decode(rles) if len(mask.shape) < 3: mask = mask[..., None] - # paddle any only support bool type - mask = paddle.to_tensor(mask, dtype='bool') # w x h x 1 - mask = mask.any(axis=2).squeeze(-1) # w x h - # paddle stack does not support bool type - mask = mask.astype('int32') + + mask = mask.any(axis=2).squeeze(-1) masks.append(mask) if masks: - masks = paddle.stack(masks, axis=0) + masks = np.stack(masks, axis=0) else: - mask = paddle.zeros((0, height, width), dtype='int32') + mask = np.zeros((0, height, width), dtype='int32') return masks @@ -122,27 +119,24 @@ def __init__(self, return_masks=False): def __call__(self, image, target): w, h = image.size image_id = target['image_id'] - image_id = paddle.to_tensor([image_id]) + # Cuda may raise error, use cpu tensor instead + #image_id = paddle.to_tensor([image_id]).cpu() anno = target['annotations'] anno = [obj for obj in anno if 'iscrowd' not in obj or obj['iscrowd'] == 0] boxes = [obj['bbox'] for obj in anno] - # Temp Fix: do it in numpy to skip paddl cuda error - boxes = np.array(boxes) + + ## Temp Fix: do it in numpy to skip paddle cuda error + boxes = np.array(boxes, dtype='float32') boxes = boxes.reshape([-1, 4]) boxes[:, 2:] += boxes[:, :2] # (n, (x1, y1, x2, y2)) - boxes = paddle.to_tensor(boxes, dtype='float32') - # paddle indexing may cause cuda errors - #boxes = boxes.reshape([-1, 4]) # (n, (x1, y1, box_w, box_h)) - #boxes[:, 2:] += boxes[:, :2] # (n, (x1, y1, x2, y2)) - - boxes[:, 0::2].clip_(min=0, max=w) # clip bbox inside image - boxes[:, 1::2].clip_(min=0, max=h) # clip bbox inside image + boxes[:, 0::2].clip(0, w) # clip bbox inside image + boxes[:, 1::2].clip(0, h) # clip bbox inside image classes = [obj['category_id'] for obj in anno] - classes = paddle.to_tensor(classes, dtype='float32') + classes = np.array(classes, dtype='float32') if self.return_masks: segmentations = [obj['segmentation'] for obj in anno] @@ -151,23 +145,23 @@ def __call__(self, image, target): keypoints = 
None if anno and 'keypoints' in anno[0]: keypoints = [obj['keypoints'] for obj in anno] - keypoints = paddle.to_tensor(keypoints, dtype='float32') + keypoints = np.array(keypoints, dtype='float32') num_keypoints = keypoints.shape[0] if num_keypoints: - keypoints = keypoints.reshape_((num_keypoints, -1, 3)) + keypoints = keypoints.reshape((num_keypoints, -1, 3)) #TODO: should be replaced with paddle buildin logical ops in the future - boxes_tmp = boxes.cpu().numpy() + boxes_tmp = boxes keep = (boxes_tmp[:, 3] > boxes_tmp[:, 1]) & (boxes_tmp[:, 2] > boxes_tmp[:, 0]) keep_idx = np.where(keep)[0].astype('int32') - keep = paddle.to_tensor(keep_idx) - boxes = boxes.index_select(keep, axis=0) - classes = classes.index_select(keep, axis=0) + boxes = boxes[keep] + classes = classes[keep] + if self.return_masks: - masks = masks.index_select(keep, axis=0) + masks = masks[keep] if keypoints is not None: - keypoints = keypoints.index_select(keep, axis=0) + keypoints = keypoints[keep] target = {} target['boxes'] = boxes @@ -178,13 +172,13 @@ def __call__(self, image, target): target['keypoints'] = keypoints target['image_id'] = image_id - area = paddle.to_tensor([obj['area'] for obj in anno]) - iscrowd = paddle.to_tensor([obj['iscrowd'] if 'iscrowd' in obj else 0 for obj in anno]) + area = np.array([obj['area'] for obj in anno], dtype='float32') + iscrowd = np.array([obj['iscrowd'] if 'iscrowd' in obj else 0 for obj in anno], dtype='float32') target['area'] = area - target['iscrowd'] = iscrowd.index_select(keep, axis=0) + target['iscrowd'] = iscrowd[keep] - target['orig_size'] = paddle.to_tensor([int(h), int(w)]) - target['size'] = paddle.to_tensor([int(h), int(w)]) + target['orig_size'] = np.array([int(h), int(w)], dtype='float32') + target['size'] = np.array([int(h), int(w)], dtype='float32') return image, target diff --git a/object_detection/DETR/config.py b/object_detection/DETR/config.py index 6a8f2979..06bf4f92 100644 --- a/object_detection/DETR/config.py +++ b/object_detection/DETR/config.py @@ -32,7 +32,7 @@ _C.DATA.BATCH_SIZE_EVAL = 8 #64 # val batch_size for single GPU _C.DATA.DATA_PATH = '/dataset/coco/' # path to dataset _C.DATA.DATASET = 'coco' # dataset name -_C.DATA.NUM_WORKERS = 1 # number of data loading threads +_C.DATA.NUM_WORKERS = 2 # number of data loading threads # model settings _C.MODEL = CN() @@ -66,7 +66,7 @@ _C.TRAIN.WARMUP_START_LR = 1e-6 #0.0 _C.TRAIN.END_LR = 1e-5 _C.TRAIN.GRAD_CLIP = 1.0 -_C.TRAIN.ACCUM_ITER = 2 #1 +_C.TRAIN.ACCUM_ITER = 1 #1 _C.TRAIN.LR_SCHEDULER = CN() _C.TRAIN.LR_SCHEDULER.NAME = 'warmupcosine' @@ -84,9 +84,9 @@ _C.SAVE = "./output" _C.TAG = "default" _C.SAVE_FREQ = 20 # freq to save chpt -_C.REPORT_FREQ = 50 # freq to logging info +_C.REPORT_FREQ = 10 # freq to logging info _C.VALIDATE_FREQ = 20 # freq to do validation -_C.SEED = 0 +_C.SEED = 42 _C.EVAL = False # run evaluation only _C.LOCAL_RANK = 0 _C.NGPUS = -1 @@ -106,6 +106,12 @@ def _update_config_from_file(config, cfg_file): config.freeze() def update_config(config, args): + """Update config by ArgumentParser + Args: + args: ArgumentParser contains options + Return: + config: updated config + """ if args.cfg: _update_config_from_file(config, args.cfg) config.defrost() @@ -115,6 +121,8 @@ def update_config(config, args): config.DATA.BATCH_SIZE = args.batch_size if args.data_path: config.DATA.DATA_PATH = args.data_path + if args.ngpus: + config.NGPUS = args.ngpus if args.eval: config.EVAL = True config.DATA.BATCH_SIZE_EVAL = args.batch_size @@ -122,13 +130,20 @@ def update_config(config, args): 
config.MODEL.PRETRAINED = args.pretrained if args.backbone: config.MODEL.BACKBONE = args.backbone + if args.resume: + config.MODEL.RESUME = args.resume + if args.last_epoch: + config.MODEL.LAST_EPOCH = args.last_epoch #config.freeze() return config -def get_config(): +def get_config(cfg_file=None): + """Return a clone of config or load from yaml file""" config = _C.clone() + if cfg_file: + _update_config_from_file(config, cfg_file) return config diff --git a/object_detection/DETR/configs/detr_resnet50.yaml b/object_detection/DETR/configs/detr_resnet50.yaml index bdcb017c..25cb29ee 100644 --- a/object_detection/DETR/configs/detr_resnet50.yaml +++ b/object_detection/DETR/configs/detr_resnet50.yaml @@ -1,4 +1,21 @@ DATA: - BATCH_SIZE: 8 + BATCH_SIZE: 2 +MODEL: + DROPOUT: 0.1 + BACKBONE: "resnet50" + NUM_QUERIES: 100 + TRANS: + NUM_ENCODER_LAYERS: 6 + NUM_DECODER_LAYERS: 6 + MLP_DIM: 2048 + HIDDEN_SIZE: 256 + NUM_HEADS: 8 + +TRAIN: + BASE_LR: 1e-4 + GRAD_CLIP: 0.1 + WEIGHT_DECAY: 1e-4 + NUM_EPOCHS: 300 + diff --git a/object_detection/DETR/detr.png b/object_detection/DETR/detr.png new file mode 100644 index 00000000..71c9c9f1 Binary files /dev/null and b/object_detection/DETR/detr.png differ diff --git a/object_detection/DETR/main_multi_gpu.py b/object_detection/DETR/main_multi_gpu.py index 43db92e5..ded802f0 100644 --- a/object_detection/DETR/main_multi_gpu.py +++ b/object_detection/DETR/main_multi_gpu.py @@ -43,6 +43,8 @@ parser.add_argument('-backbone', type=str, default=None) parser.add_argument('-ngpus', type=int, default=None) parser.add_argument('-pretrained', type=str, default=None) +parser.add_argument('-resume', type=str, default=None) +parser.add_argument('-last_epoch', type=int, default=None) parser.add_argument('-eval', action='store_true') arguments = parser.parse_args() @@ -205,7 +207,8 @@ def validate(dataloader, model, criterion, postprocessors, base_ds, total_batch, # coco evaluate orig_target_sizes = paddle.stack([t['orig_size'] for t in targets], axis=0) results = postprocessors['bbox'](outputs, orig_target_sizes) - res = {target['image_id'].cpu().numpy()[0]: output for target, output in zip(targets, results)} + res = {target['image_id']: output for target, output in zip(targets, results)} + #res = {target['image_id'].cpu().numpy()[0]: output for target, output in zip(targets, results)} if coco_evaluator is not None: coco_evaluator.update(res) diff --git a/object_detection/DETR/main_single_gpu.py b/object_detection/DETR/main_single_gpu.py index 6330105c..12b20227 100644 --- a/object_detection/DETR/main_single_gpu.py +++ b/object_detection/DETR/main_single_gpu.py @@ -40,6 +40,8 @@ parser.add_argument('-backbone', type=str, default=None) parser.add_argument('-ngpus', type=int, default=None) parser.add_argument('-pretrained', type=str, default=None) +parser.add_argument('-resume', type=str, default=None) +parser.add_argument('-last_epoch', type=int, default=None) parser.add_argument('-eval', action='store_true') args = parser.parse_args() @@ -151,7 +153,8 @@ def validate(dataloader, model, criterion, postprocessors, base_ds, total_batch, # coco evaluate orig_target_sizes = paddle.stack([t['orig_size'] for t in targets], axis=0) results = postprocessors['bbox'](outputs, orig_target_sizes) - res = {target['image_id'].cpu().numpy()[0]: output for target, output in zip(targets, results)} + #res = {target['image_id'].cpu().numpy()[0]: output for target, output in zip(targets, results)} + res = {target['image_id']: output for target, output in zip(targets, results)} if coco_evaluator 
is not None: coco_evaluator.update(res) diff --git a/object_detection/DETR/matcher.py b/object_detection/DETR/matcher.py index 57aa328e..05b8922a 100644 --- a/object_detection/DETR/matcher.py +++ b/object_detection/DETR/matcher.py @@ -58,7 +58,7 @@ def forward(self, outputs, targets): idx_list = [] for v in targets: - if not v['labels'].is_empty(): + if v['labels'].shape[0] != 0: idx_list.append(v['labels']) if len(idx_list) > 0: tgt_idx = paddle.concat(idx_list) @@ -72,7 +72,7 @@ def forward(self, outputs, targets): #tgt_bbox = paddle.concat([v['boxes'] for v in targets]) bbox_list = [] for v in targets: - if not v['boxes'].is_empty(): + if v['boxes'].shape[0] != 0: bbox_list.append(v['boxes']) if len(bbox_list) > 0: tgt_bbox = paddle.concat(bbox_list) @@ -94,6 +94,9 @@ def forward(self, outputs, targets): # conver back to numpy for temp use out_bbox = out_bbox.cpu().numpy() tgt_bbox = tgt_bbox.cpu().numpy() + #print(out_bbox) + #print('----') + #print(tgt_bbox) cost_bbox = distance.cdist(out_bbox, tgt_bbox, 'minkowski', p=1).astype('float32') cost_bbox = paddle.to_tensor(cost_bbox) diff --git a/object_detection/DETR/run_eval_multi.sh b/object_detection/DETR/run_eval_multi.sh index 52b84627..fb4fe368 100644 --- a/object_detection/DETR/run_eval_multi.sh +++ b/object_detection/DETR/run_eval_multi.sh @@ -1,8 +1,8 @@ -CUDA_VISIBLE_DEVICES=4,5,6,7 \ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ python main_multi_gpu.py \ -cfg='./configs/detr_resnet50.yaml' \ -dataset='coco' \ --batch_size=2 \ +-batch_size=4 \ -data_path='/dataset/coco' \ -eval \ -pretrained='./detr_resnet50' diff --git a/object_detection/DETR/run_train.sh b/object_detection/DETR/run_train.sh new file mode 100644 index 00000000..78a42704 --- /dev/null +++ b/object_detection/DETR/run_train.sh @@ -0,0 +1,6 @@ +CUDA_VISIBLE_DEVICES=1 \ +python main_single_gpu.py \ +-cfg='./configs/detr_resnet50.yaml' \ +-dataset='coco' \ +-batch_size=2 \ +-data_path='/dataset/coco' \ diff --git a/object_detection/DETR/run_train_finetune.sh b/object_detection/DETR/run_train_finetune.sh new file mode 100644 index 00000000..75032890 --- /dev/null +++ b/object_detection/DETR/run_train_finetune.sh @@ -0,0 +1,7 @@ +CUDA_VISIBLE_DEVICES=1 \ +python main_single_gpu.py \ +-cfg='./configs/detr_resnet50.yaml' \ +-dataset='coco' \ +-batch_size=2 \ +-data_path='/dataset/coco' \ +-pretrained='./detr_resnet50' diff --git a/object_detection/DETR/run_train_finetune_multi.sh b/object_detection/DETR/run_train_finetune_multi.sh new file mode 100644 index 00000000..65a221de --- /dev/null +++ b/object_detection/DETR/run_train_finetune_multi.sh @@ -0,0 +1,7 @@ +CUDA_VISIBLE_DEVICES=0,1 \ +python main_multi_gpu.py \ +-cfg='./configs/detr_resnet50.yaml' \ +-dataset='coco' \ +-batch_size=2 \ +-data_path='/dataset/coco' \ +-pretrained='./detr_resnet50' diff --git a/object_detection/DETR/run_train_multi.sh b/object_detection/DETR/run_train_multi.sh new file mode 100644 index 00000000..894f9379 --- /dev/null +++ b/object_detection/DETR/run_train_multi.sh @@ -0,0 +1,6 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ +-cfg='./configs/detr_resnet50.yaml' \ +-dataset='coco' \ +-batch_size=2 \ +-data_path='/dataset/coco' \ diff --git a/object_detection/DETR/transforms.py b/object_detection/DETR/transforms.py index 9bf99194..a2fa732e 100644 --- a/object_detection/DETR/transforms.py +++ b/object_detection/DETR/transforms.py @@ -22,20 +22,25 @@ from paddle.vision.transforms import functional as F from random_erasing import RandomErasing from box_ops import box_xyxy_to_cxcywh +from 
box_ops import box_xyxy_to_cxcywh_numpy def crop(image, target, region): cropped_image = T.crop(image, *region) target = target.copy() i, j, h, w = region - target['size'] = paddle.to_tensor([h, w]) + #target['size'] = paddle.to_tensor([h, w]).cpu() + target['size'] = np.array([h, w], dtype='float32') fields = ['labels', 'area', 'iscrowd'] if 'boxes' in target: boxes = target['boxes'] - max_size = paddle.to_tensor([h, w], dtype='float32') - cropped_boxes = boxes - paddle.to_tensor([j, i, j, i], dtype='float32') # box are (x1, y1, x2, y2) - cropped_boxes = paddle.minimum(cropped_boxes.reshape([-1, 2, 2]), max_size) + #max_size = paddle.to_tensor([h, w], dtype='float32').cpu() + max_size = np.array([h, w], dtype='float32') + #cropped_boxes = boxes - paddle.to_tensor([j, i, j, i], dtype='float32').cpu() # box are (x1, y1, x2, y2) + cropped_boxes = boxes - np.array([j, i, j, i], dtype='float32') # box are (x1, y1, x2, y2) + #cropped_boxes = paddle.minimum(cropped_boxes.reshape([-1, 2, 2]), max_size) + cropped_boxes = np.minimum(cropped_boxes.reshape([-1, 2, 2]), max_size) cropped_boxes = cropped_boxes.clip(min=0) area = (cropped_boxes[:, 1, :] - cropped_boxes[:, 0, :]).prod(axis=1) target['boxes'] = cropped_boxes.reshape([-1, 4]) @@ -55,33 +60,35 @@ def crop(image, target, region): # This paddle api will raise error in current env #keep = paddle.all(cropped_boxes[:, 1, :] > cropped_boxes[:, 0, :], axis=1) # Instead we use numpy for temp fix - cropped_boxes = cropped_boxes.cpu().numpy() + #cropped_boxes = cropped_boxes.cpu().numpy() keep = np.all(cropped_boxes[:, 1, :] > cropped_boxes[:, 0, :], axis=1) #keep = keep.cpu().numpy() else: keep = target['masks'].flatten(1).any(1) - keep = keep.cpu().numpy() + #keep = keep.cpu().numpy() keep_idx = np.where(keep)[0].astype('int32') - keep = paddle.to_tensor(keep_idx) + #keep = paddle.to_tensor(keep_idx).cpu() + keep = keep_idx for field in fields: - target[field] = target[field].index_select(keep, axis=0) + #target[field] = target[field].index_select(keep, axis=0) + target[field] = target[field][keep] return cropped_image, target def hflip(image, target): flipped_image = T.hflip(image) - w, h = image.size - target = target.copy() if 'boxes' in target: boxes = target['boxes'] # n x 4 - boxes = boxes.index_select(paddle.to_tensor([2, 1, 0, 3], dtype='int32'), axis=1) - boxes = boxes * paddle.to_tensor( - [-1, 1, -1, 1], dtype='float32') + paddle.to_tensor([w, 0, w, 0], dtype='float32') + #boxes = boxes.index_select(paddle.to_tensor([2, 1, 0, 3], dtype='int32').cpu(), axis=1) + boxes = boxes[:, [2, 1, 0, 3]] + #boxes = boxes * paddle.to_tensor( + # [-1, 1, -1, 1], dtype='float32').cpu() + paddle.to_tensor([w, 0, w, 0], dtype='float32').cpu() + boxes = boxes * np.array([-1, 1, -1, 1], dtype='float32') + np.array([w, 0, w, 0], dtype='float32') target['boxes'] = boxes if 'masks' in target: @@ -156,7 +163,8 @@ def get_size(image_size, size, max_size=None): if boxes.shape[0] == 0: # empty boxes scaled_boxes = boxes else: # this line works well in pytorch, but not in paddle - scaled_boxes = boxes * paddle.to_tensor([ratio_width, ratio_height, ratio_width, ratio_height]) + #scaled_boxes = boxes * paddle.to_tensor([ratio_width, ratio_height, ratio_width, ratio_height]).cpu() + scaled_boxes = boxes * np.array([ratio_width, ratio_height, ratio_width, ratio_height], dtype='float32') target['boxes'] = scaled_boxes if 'area' in target: @@ -165,15 +173,18 @@ def get_size(image_size, size, max_size=None): target['area'] = scaled_area h, w = size - target['size'] = 
paddle.to_tensor([h, w]) + #target['size'] = paddle.to_tensor([h, w]).cpu() + target['size'] = np.array([h, w], dtype='float32') if 'masks' in target: masks = target['masks'] # [N, H, W] masks = masks.unsqueeze(-1).astype('float32') #[N, H, W, 1] + masks = paddle.to_tensor(masks).cpu() masks = paddle.nn.functional.interpolate( masks, size, data_format='NHWC') #[N, H', W', 1] masks = masks[:, :, :, 0] > 0.5 masks = masks.astype('int32') + masks = masks.numpy() target['masks'] = masks return rescaled_image, target @@ -184,7 +195,8 @@ def pad(image, target, padding): if target is None: return padded_image, None target = target.copy() - target['size'] = paddle.to_tensor(padded_image.size[::-1]) + #target['size'] = paddle.to_tensor(padded_image.size[::-1]).cpu() + target['size'] = np.array(padded_image.size[::-1], dtype='float32') if 'masks' in target: target['masks'] = T.pad(target['masks'], (0, padding[0], 0, padding[1])) return padded_image, target @@ -211,8 +223,8 @@ def _get_image_size(img): if w == tw and h == th: return 0, 0, h, w - i = random.randint(0, h - th) - j = random.randint(0, w - tw) + i = random.randint(0, h - th + 1) + j = random.randint(0, w - tw + 1) return i, j, th, tw def __call__(self, image, target): @@ -321,9 +333,11 @@ def __call__(self, image, target=None): h, w = image.shape[-2:] if 'boxes' in target and target['boxes'].shape[0] != 0: boxes = target['boxes'] - boxes = box_xyxy_to_cxcywh(boxes) - boxes = boxes / paddle.to_tensor([w, h, w, h], dtype='float32') + boxes = box_xyxy_to_cxcywh_numpy(boxes) + #boxes = boxes / paddle.to_tensor([w, h, w, h], dtype='float32').cpu() + boxes = boxes / np.array([w, h, w, h], dtype='float32') target['boxes'] = boxes + return image, target diff --git a/object_detection/DETR/utils.py b/object_detection/DETR/utils.py index 9304c319..9045c35f 100644 --- a/object_detection/DETR/utils.py +++ b/object_detection/DETR/utils.py @@ -114,8 +114,8 @@ def nested_tensor_from_tensor_list(tensor_list): s1 = tensor_list[idx].shape[1] s2 = tensor_list[idx].shape[2] # direct set value raise error in current env, we use numpy to bypass - data_tensor[idx, : s0, : s1, : s2] = tensor_list[idx].cpu().numpy() - #data_tensor[idx, : s0, : s1, : s2] = tensor_list[idx] + #data_tensor[idx, : s0, : s1, : s2] = tensor_list[idx].cpu().numpy() + data_tensor[idx, : s0, : s1, : s2] = tensor_list[idx] mask[idx, : s1, : s2] = 0 return NestedTensor(data_tensor, mask) diff --git a/object_detection/PVTv2/README.md b/object_detection/PVTv2/README.md new file mode 100644 index 00000000..b0407205 --- /dev/null +++ b/object_detection/PVTv2/README.md @@ -0,0 +1,179 @@ +# PVTv2: Improved Baselines with Pyramid Vision Transformer, [arxiv](https://arxiv.org/abs/2106.13797) + +PaddlePaddle training/validation code and pretrained models for **PVTv2 Detection**. + +The official pytorch implementation is [here](https://github.com/whai362/PVT/tree/v2/detection). + +This implementation is developed by [PaddleViT](https://github.com/BR-IDL/PaddleViT). + + + +drawing +
PVTv2 Model Overview
+ +### Update +Update (2021-09-15): Code is released and Mask R-CNN ported weights are uploaded. + +## Models Zoo +| Model | backbone | box_mAP | Model | +|-------|-----------|---------|-------------------------------------------------------------------------------------------------------------------------------------------------------------| +| Mask R-CNN | pvtv2_b0 | 38.3 | [google](https://drive.google.com/file/d/1wA324LkFtGezHJovSZ4luVqSxVt9woFc/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1q67ZIDSHn9Y-HU_WoQr8OQ)(3kqb) | +| Mask R-CNN | pvtv2_b1 | 41.8 | [google](https://drive.google.com/file/d/1alNaSmR4TSXsPpGoUZr2QQf5phYQjIzN/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1aSkuDiNpxdnFWE1Wn1SWNw)(k5aq) | +| Mask R-CNN | pvtv2_b2 | 45.2 | [google](https://drive.google.com/file/d/1tg6B5OEV4OWLsDxTCjsWgxgaSgIh4cID/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1DLwxCZVZizb5HKih7RFw2w)(jh8b) | +| Mask R-CNN | pvtv2_b2_linear | 44.1 | [google](https://drive.google.com/file/d/1b26vxK3QVGx5ovqKir77NyY6YPgAWAEj/view?usp=sharing)/[baidu](https://pan.baidu.com/s/16T-Nyo_Jm2yDq4aoXpdnbg)(8ipt) | +| Mask R-CNN | pvtv2_b3 | 46.9 | [google](https://drive.google.com/file/d/1H6ZUCixCaYe1AvlBkuqYoxzz4b-icJ3u/view?usp=sharing)/[baidu](https://pan.baidu.com/s/16QVsjUOXijo5d9cO3FZ39A)(je4y) | +| Mask R-CNN | pvtv2_b4 | 47.5 | [google](https://drive.google.com/file/d/1pXQNpn0BoKqiuVaGtJL18eWG6XmdlBOL/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1yhX7mpmb2wbRvWZFnUloBQ)(n3ay) | +| Mask R-CNN | pvtv2_b5 | 47.4 | [google](https://drive.google.com/file/d/12vOyw6pUfK1NdOWBF758aAZuaf-rZLvx/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1-gasQk9PqLMkrWXw4aX41g)(jzq1) | + +> *The results are evaluated on COCO validation set. + +- Backbone model weights can be found in PVTv2 classification [here](../../image_classification/PVTv2). + +## Notebooks +We provide a few notebooks in aistudio to help you get started: + +**\*(coming soon)\*** + + +## Requirements +- Python>=3.6 +- yaml>=0.2.5 +- [PaddlePaddle](https://www.paddlepaddle.org.cn/documentation/docs/en/install/index_en.html)>=2.1.0 +- [yacs](https://github.com/rbgirshick/yacs)>=0.1.8 + +## Data +COCO2017 dataset is used in the following folder structure: +``` +COCO dataset folder +├── annotations +│ ├── captions_train2017.json +│ ├── captions_val2017.json +│ ├── instances_train2017.json +│ ├── instances_val2017.json +│ ├── person_keypoints_train2017.json +│ └── person_keypoints_val2017.json +├── train2017 +│ ├── 000000000009.jpg +│ ├── 000000000025.jpg +│ ├── 000000000030.jpg +│ ├── 000000000034.jpg +| ... +└── val2017 + ├── 000000000139.jpg + ├── 000000000285.jpg + ├── 000000000632.jpg + ├── 000000000724.jpg + ... +``` + +More details about the COCO dataset can be found [here](../../docs/paddlevit-coco.md) and COCO [official dataset](https://cocodataset.org/#download). + +## Usage +To use the model with pretrained weights, download the `.pdparam` weight file and change related file paths in the following python scripts. The model config files are located in `./configs/`. 
+ +For example, assume the downloaded weight file is stored in `./pvtv2_b0_maskrcnn.pdparams`, to use the `pvtv2` model in python: +```python +import paddle +from config import get_config +from pvtv2_det import build_pvtv2_det +# config files in ./configs/ +config = get_config('./configs/pvtv2_b0.yaml') +# build model +model = build_pvtv2_det(config) +# load pretrained weights +model_state_dict = paddle.load('./pvtv2_b0_maskrcnn.pdparams') +model.set_dict(model_state_dict) +``` + +## Evaluation +To evaluate PVTv2 model performance on COCO2017 with a single GPU, run the following script using command line: +```shell +sh run_eval.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg=./configs/pvtv2_b0.yaml \ + -dataset=coco \ + -batch_size=4 \ + -data_path=/path/to/dataset/coco/val \ + -eval \ + -pretrained=/path/to/pretrained/model/pvtv2_b0_maskrcnn # .pdparams is NOT needed +``` + +
+ + +Run evaluation using multi-GPUs: + + + +```shell +sh run_eval_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg=./configs/pvtv2_b0.yaml \ + -dataset=coco \ + -batch_size=4 \ + -data_path=/path/to/dataset/coco/val \ + -eval \ + -pretrained=/path/to/pretrained/model/pvtv2_b0_maskrcnn # .pdparams is NOT needed +``` + +
+

## Training
To train the PVTv2 model on COCO2017 with a single GPU, run the following script from the command line:
```shell
sh run_train.sh
```
or
```shell
CUDA_VISIBLE_DEVICES=1 \
python main_single_gpu.py \
    -cfg=./configs/pvtv2_b0.yaml \
    -dataset=coco \
    -batch_size=2 \
    -data_path=/path/to/dataset/coco/train \
    -pretrained=/path/to/pretrained/model/pvtv2_b0  # .pdparams is NOT needed
```
The `-pretrained` argument sets the pretrained backbone weights, which can be found in the PVTv2 classification folder [here](../../image_classification/PVTv2).
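The command-line options above are mapped onto the yacs configuration by `update_config` in `config.py`. When experimenting in Python you can set the same fields directly on the config object returned by `get_config`; a minimal sketch with placeholder paths (nothing here is required when using the shell scripts):

```python
from config import get_config

# load the base config and tweak fields that the CLI flags would normally set
config = get_config('./configs/pvtv2_b0.yaml')
config.defrost()
config.DATA.BATCH_SIZE = 2
config.DATA.DATASET = 'coco'
# backbone weight prefix (no .pdparams suffix), as passed via -pretrained
config.MODEL.PRETRAINED = '/path/to/pretrained/model/pvtv2_b0'
# set config.MODEL.RESUME to a saved checkpoint prefix instead to continue an interrupted run
config.freeze()
print(config.MODEL)
```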
+

Run training with multiple GPUs (coming soon):

```shell
sh run_train_multi.sh
```
or
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python main_multi_gpu.py \
    -cfg=./configs/pvtv2_b0.yaml \
    -dataset=coco \
    -batch_size=2 \
    -data_path=/path/to/dataset/coco/train \
    -pretrained=/path/to/pretrained/model/pvtv2_b0  # .pdparams is NOT needed
```
The `-pretrained` argument sets the pretrained backbone weights, which can be found in the PVTv2 classification folder [here](../../image_classification/PVTv2).
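The COCO data pipeline driven by `-data_path` is implemented in `coco.py` in this folder. To inspect the batches fed to the detector, `build_coco` and `get_dataloader` can be used directly; a minimal sketch, assuming the dataset path (a placeholder below) follows the layout shown in the Data section:

```python
from coco import build_coco, get_dataloader

# build the train split; return_masks currently must be False (see CocoDetection)
train_dataset = build_coco('train', '/path/to/dataset/coco', masks=False)
train_loader = get_dataloader(train_dataset, batch_size=2, mode='train', multi_gpu=False)

for images, targets in train_loader:
    # images: padded image batch produced by nested_tensor_from_tensor_list (sizes divisible by 32)
    # targets: dict mapping gt_boxes, gt_classes, imgs_shape, ... to per-image paddle tensors
    print(type(images), sorted(targets.keys()))
    break
```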
+ +## Visualization +coming soon + +## Reference +``` +@article{wang2021pvtv2, + title={Pvtv2: Improved baselines with pyramid vision transformer}, + author={Wang, Wenhai and Xie, Enze and Li, Xiang and Fan, Deng-Ping and Song, Kaitao and Liang, Ding and Lu, Tong and Luo, Ping and Shao, Ling}, + journal={arXiv preprint arXiv:2106.13797}, + year={2021} +} +``` diff --git a/object_detection/PVTv2/box_ops.py b/object_detection/PVTv2/box_ops.py new file mode 100644 index 00000000..64040a46 --- /dev/null +++ b/object_detection/PVTv2/box_ops.py @@ -0,0 +1,181 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" box related operations """ + +import numpy as np +import paddle + + +def box_xyxy_to_cxcywh_numpy(box): + """convert box from top-left/bottom-right format: + [x0, y0, x1, y1] + to center-size format: + [center_x, center_y, width, height] + + Args: + box: numpy array, last_dim=4, stop-left/bottom-right format boxes + Return: + numpy array, last_dim=4, center-size format boxes + """ + + #x0, y0, x1, y1 = box.unbind(-1) + x0 = box[:, 0] + y0 = box[:, 1] + x1 = box[:, 2] + y1 = box[:, 3] + xc = x0 + (x1-x0)/2 + yc = y0 + (y1-y0)/2 + w = x1 - x0 + h = y1 - y0 + return np.stack([xc, yc, w, h], axis=-1) + + +def box_cxcywh_to_xyxy(box): + """convert box from center-size format: + [center_x, center_y, width, height] + to top-left/bottom-right format: + [x0, y0, x1, y1] + + Args: + box: paddle.Tensor, last_dim=4, stores center-size format boxes + Return: + paddle.Tensor, last_dim=4, top-left/bottom-right format boxes + """ + + x_c, y_c, w, h = box.unbind(-1) + x0 = x_c - 0.5 * w + y0 = y_c - 0.5 * h + x1 = x_c + 0.5 * w + y1 = y_c + 0.5 * h + return paddle.stack([x0, y0, x1, y1], axis=-1) + + +def box_xyxy_to_cxcywh(box): + """convert box from top-left/bottom-right format: + [x0, y0, x1, y1] + to center-size format: + [center_x, center_y, width, height] + + Args: + box: paddle.Tensor, last_dim=4, stop-left/bottom-right format boxes + Return: + paddle.Tensor, last_dim=4, center-size format boxes + """ + + x0, y0, x1, y1 = box.unbind(-1) + xc = x0 + (x1-x0)/2 + yc = y0 + (y1-y0)/2 + w = x1 - x0 + h = y1 - y0 + return paddle.stack([xc, yc, w, h], axis=-1) + + +def box_area(boxes): + """ compute area of a set of boxes in (x1, y1, x2, y2) format + Args: + boxes: paddle.Tensor, shape = Nx4, must in (x1, y1, x2, y2) format + Return: + areas: paddle.Tensor, N, areas of each box + """ + + return (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]) + + +def box_iou(boxes1, boxes2): + """compute iou of 2 sets of boxes in (x1, y1, x2, y2) format + + This method returns the iou between every pair of boxes + in two sets of boxes. 
+ + Args: + boxes1: paddle.Tensor, shape=N x 4, boxes are stored in (x1, y1, x2, y2) format + boxes2: paddle.Tensor, shape=N x 4, boxes are stored in (x1, y1, x2, y2) format + Return: + iou: iou ratios between each pair of boxes in boxes1 and boxes2 + union: union areas between each pair of boxes in boxes1 and boxes2 + """ + + area1 = box_area(boxes1) + area2 = box_area(boxes2) + + boxes1 = boxes1.unsqueeze(1) # N x 1 x 4 + lt = paddle.maximum(boxes1[:, :, :2], boxes2[:, :2]) + rb = paddle.minimum(boxes1[:, :, 2:], boxes2[:, 2:]) + + wh = (rb - lt).clip(min=0) + inter = wh[:, :, 0] * wh[:, :, 1] + + union = area1.unsqueeze(1) + area2 - inter # broadcast + + iou = inter / union + return iou, union + + +def generalized_box_iou(boxes1, boxes2): + """Compute GIoU of each pais in boxes1 and boxes2 + + GIoU = IoU - |A_c - U| / |A_c| + where A_c is the smallest convex hull that encloses both boxes, U is the union of boxes + Details illustrations can be found in https://giou.stanford.edu/ + + Args: + boxes1: paddle.Tensor, shape=N x 4, boxes are stored in (x1, y1, x2, y2) format + boxes2: paddle.Tensor, shape=N x 4, boxes are stored in (x1, y1, x2, y2) format + Return: + giou: giou ratios between each pair of boxes in boxes1 and boxes2 + """ + + iou, union = box_iou(boxes1, boxes2) + + boxes1 = boxes1.unsqueeze(1) # N x 1 x 4 + lt = paddle.minimum(boxes1[:, :, :2], boxes2[:, :2]) + rb = paddle.maximum(boxes1[:, :, 2:], boxes2[:, 2:]) + + wh = (rb - lt).clip(min=0) + area = wh[:, :, 0] * wh[:, :, 1] + + return iou - (area-union) / area + + +def masks_to_boxes(masks): + """convert masks to bboxes + + Args: + masks: paddle.Tensor, NxHxW + Return: + boxes: paddle.Tensor, Nx4 + """ + + if masks.numel() == 0: + return paddle.zeros((0, 4)) + h, w = masks.shape[-2:] + y = paddle.arange(0, h, dtype='float32') + x = paddle.arange(0, w, dtype='float32') + y, x = paddle.meshgrid(y, x) + + x_mask = (masks * x.unsqueeze(0)) + x_max = x_mask.flatten(1).max(-1)[0] + + #x_min = x_mask.masked_fill(~(masks.bool()), 1e8).flatten(1).min(-1) + x_min = paddle.where(masks == 0, paddle.ones_like(x_mask)*float(1e8), x_mask) + x_min = x_min.flatten(1).min(-1)[0] + + y_mask = (masks * y.unsqueeze(0)) + y_max = y_mask.flatten(1).max(-1)[0] + #y_min = y_mask.masked_fill(~(masks.bool()), 1e8).flatten(1).min(-1)[0] + y_min = paddle.where(masks == 0, paddle.ones_like(y_mask) * float(1e8), y_mask) + y_min = y_min.flatten(1).min(-1)[0] + + return paddle.stack([x_min, y_min, x_max, y_max], 1) diff --git a/object_detection/PVTv2/coco.py b/object_detection/PVTv2/coco.py new file mode 100644 index 00000000..5015ca7a --- /dev/null +++ b/object_detection/PVTv2/coco.py @@ -0,0 +1,329 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +""" +Dataset(COCO2017) related classes and methods for DETR training and validation +""" + +import os +import copy +import numpy as np +from PIL import Image +import paddle +from pycocotools.coco import COCO +from pycocotools import mask as coco_mask +import transforms as T +from utils import nested_tensor_from_tensor_list + + +class CocoDetection(paddle.io.Dataset): + """ COCO Detection dataset + + This class gets images and annotations for paddle training and validation. + Transform(preprocessing) can be applied in __getitem__ method. + + Attributes: + img_folder: path where coco images is stored, e.g.{COCO_PATH}/train2017 + anno_file: path where annotation json file is stored + transforms: transforms applied on data, see make_coco_transform for details + return_masks: if true, return coco masks, default: False (now only support False) + """ + + def __init__(self, img_folder, anno_file, transforms, return_masks): + super().__init__() + self.coco = COCO(anno_file) + # coco all image ids + ids = list(sorted(self.coco.imgs.keys())) + # remove ids where anno has no bboxes + self.ids = self._remove_images_without_annotations(ids) + self._transforms = transforms + # prepare filters labels and put image and label to paddle tensors + self.prepare = ConvertCocoPolysToMasks(return_masks) + self.root = img_folder + self.ids2cats = {id: cat for id, cat in enumerate(self.coco.getCatIds())} + self.cats2ids = {cat: id for id, cat in enumerate(self.coco.getCatIds())} + + def _remove_images_without_annotations(self, ids): + new_ids = [] + rm_cnt = 0 + for idx in ids: + annos = self._load_target(idx) + boxes = [] + for anno in annos: + if 'bbox' in anno: + boxes.append(anno['bbox']) + if len(boxes) == 0: + rm_cnt += 1 + continue + new_ids.append(idx) + print(f'loading coco data, {rm_cnt} imgs without annos are removed') + return new_ids + + def _load_image(self, idx): + """ Return PIL Image (RGB) according to COCO image id""" + path = self.coco.loadImgs(idx)[0]['file_name'] + return Image.open(os.path.join(self.root, path)).convert('RGB') + + def _load_target(self, idx): + """ Return image annos according to COCO image id""" + return self.coco.loadAnns(self.coco.getAnnIds(idx)) + + def _tgt2rcnn(self, target): + target['gt_boxes'] = target['boxes'] + # target['gt_classes'] = target['labels'] + gt_cats = target['labels'] + target['gt_classes'] = np.array( + [self.cats2ids[int(gt_cats[i])] for i in range(len(gt_cats))], dtype='float32') + + target['imgs_shape'] = target['size'].astype("float32") + target['scale_factor_wh'] = np.array( + [float(target['size'][1]) / float(target['orig_size'][1]), + float(target['size'][0]) / float(target['orig_size'][0])], dtype='float32') + + target.pop("boxes") + target.pop("labels") + target.pop("size") + + return target + + def __len__(self): + return len(self.ids) + + def __getitem__(self, idx): + """idx is for training image id, not COCO image id""" + image_id = self.ids[idx] + image = self._load_image(image_id) + target = self._load_target(image_id) + target = {'image_id': image_id, 'annotations': target} + + image, target = self.prepare(image, target) + if self._transforms is not None: + image, target = self._transforms(image, target) + + target = self._tgt2rcnn(target) + + return image, target + + +def convert_coco_poly_to_mask(segmentations, height, width): + """ Convert coco anno from polygons to image masks""" + masks = [] + for polygons in segmentations: + rles = coco_mask.frPyObjects(polygons, height, width) + mask = coco_mask.decode(rles) + if 
len(mask.shape) < 3: + mask = mask[..., None] + mask = mask.any(axis=2).squeeze(-1) # w x h + masks.append(mask) + if masks: + masks = np.stack(masks, axis=0) + else: + mask = np.zeros((0, height, width), dtype='int32') + return masks + + +class ConvertCocoPolysToMasks(): + """ Prepare coco annotations to paddle tensors""" + def __init__(self, return_masks=False): + self.return_masks = return_masks + + def __call__(self, image, target): + w, h = image.size + image_id = target['image_id'] + + anno = target['annotations'] + anno = [obj for obj in anno if 'iscrowd' not in obj or obj['iscrowd'] == 0] + + boxes = [obj['bbox'] for obj in anno] + boxes = np.array(boxes, dtype='float32') + boxes = boxes.reshape([-1, 4]) + boxes[:, 2:] += boxes[:, :2] + boxes[:, 0::2].clip(0, w) + boxes[:, 1::2].clip(0, h) + + classes = [obj['category_id'] for obj in anno] + classes = np.array(classes, dtype='float32') + + if self.return_masks: + segmentations = [obj['segmentation'] for obj in anno] + masks = convert_coco_poly_to_mask(segmentations, h, w) # [N, H, W] int32 array + + keypoints = None + if anno and 'keypoints' in anno[0]: + keypoints = [obj['keypoints'] for obj in anno] + keypoints = np.array(keypoints, dtype='float32') + num_keypoints = keypoints.shape[0] + if num_keypoints: + keypoints = keypoints.reshape((num_keypoints, -1, 3)) + + boxes_tmp = boxes + keep = (boxes_tmp[:, 3] > boxes_tmp[:, 1]) & (boxes_tmp[:, 2] > boxes_tmp[:, 0]) + #keep_idx = np.where(keep)[0].astype('int32') + + boxes = boxes[keep] + classes = classes[keep] + + if self.return_masks: + masks = masks[keep] + if keypoints is not None: + keypoints = keypoints[keep] + + target = {} + target['boxes'] = boxes + target['labels'] = classes + if self.return_masks: + target['masks'] = masks + if keypoints is not None: + target['keypoints'] = keypoints + target['image_id'] = image_id + + area = np.array([obj['area'] for obj in anno]) + iscrowd = np.array([obj['iscrowd'] if 'iscrowd' in obj else 0 for obj in anno]) + target['area'] = area + target['iscrowd'] = iscrowd[keep] + + target['orig_size'] = np.array([int(h), int(w)], dtype='float32') + target['size'] = np.array([int(h), int(w)], dtype='float32') + + return image, target + + +def make_coco_transforms(image_set): + """ return transforms(class defined in ./transforms.py) for coco train and val""" + normalize = T.Compose([ + T.ToTensor(), + T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]) + ]) + + scales = [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800] + + if image_set == 'train': + return T.Compose([ + T.RandomHorizontalFlip(), + T.RandomSelect( + T.RandomResize(scales, max_size=1333), + T.Compose([ + T.RandomResize([400, 500, 600]), + T.RandomSizeCrop(384, 600), + T.RandomResize(scales, max_size=1333), + ]) + ), + normalize, + ]) + + if image_set == 'val': + return T.Compose([ + T.RandomResize([800], max_size=1333), + #T.Pad(size_divisor=32), + normalize, + ]) + + raise ValueError(f'Unknown {image_set}') + + +def build_coco(image_set, coco_path, masks=False): + """Return CocoDetection dataset according to image_set: ['train', 'val']""" + assert image_set in ['train', 'val'], f'image_set {image_set} not supported' + assert os.path.exists(coco_path), f'provided COCO path {coco_path} does not exist' + mode = 'instances' + paths = { + 'train': (os.path.join(coco_path, 'train2017'), + os.path.join(coco_path, 'annotations', f'{mode}_train2017.json')), + 'val': (os.path.join(coco_path, 'val2017'), + os.path.join(coco_path, 'annotations', f'{mode}_val2017.json')), + } + 
img_folder, anno_file = paths[image_set] + dataset = CocoDetection(img_folder, + anno_file, + transforms=make_coco_transforms(image_set), + return_masks=masks) + return dataset + + +def get_dataloader(dataset, batch_size, mode='train', multi_gpu=False): + """ return dataloader on train/val set for single/multi gpu + Arguments: + dataset: paddle.io.Dataset, coco dataset + batch_size: int, num of samples in one batch + mode: str, ['train', 'val'], dataset to use + multi_gpu: bool, if True, DistributedBatchSampler is used for DDP + """ + if multi_gpu: + sampler = paddle.io.DistributedBatchSampler( + dataset, + batch_size=batch_size, + shuffle=(mode == 'train'), + drop_last=True) + #TODO: may need to fix this drop_last of multi-gpu dataloading error + # currently, val may drop several samples, which will lower the performance + # an idea is to pad the last batch in collate_fn + dataloader = paddle.io.DataLoader(dataset, + batch_sampler=sampler, + collate_fn=collate_fn) + else: + dataloader = paddle.io.DataLoader(dataset, + batch_size=batch_size, + shuffle=(mode == 'train'), + collate_fn=collate_fn) + return dataloader + + +def collate_fn(batch): + """Collate function for batching samples + Samples varies in sizes, here convert samples to NestedTensor which pads the tensor, + and generate the corresponding mask, so that the whole batch is of the same size. + """ + # eliminate invalid data (where boxes is [] tensor) + old_batch_len = len(batch) + batch = [x for x in batch if x[1]['gt_boxes'].shape[0] != 0] + # try refill empty sample by other sample in current batch + + new_batch_len = len(batch) + for i in range(new_batch_len, old_batch_len): + batch.append(copy.deepcopy(batch[i%new_batch_len])) + + batch = list(zip(*batch)) # batch[0]: data tensor, batch[1]: targets dict + + # size divisibility pad the image size which is divisible to i.e. 32 + batch[0] = nested_tensor_from_tensor_list(batch[0], size_divisibility=32) + + val_batch = [list(x.values()) for x in batch[1]] + key_batch = list(batch[1][0].keys()) + tgt_batch = {} + + for k, data in zip(key_batch, zip(*val_batch)): + if isinstance(data, (list, tuple)): + res = [] + for item in data: + res.append(paddle.to_tensor(item)) + tgt_batch[k] = res + else: + tgt_batch[k] = paddle.to_tensor(data) + + #batch_target = [] + #for single_target in batch[1]: + # target_tensor_dict = {} + # for key, val in single_target.items(): + # if isinstance(val, (list, tuple)): + # res = [] + # for item in val: + # res.append(paddle.to_tensor(item)) + # target_tensor_dict[key] = res + # else: + # target_tensor_dict[key] = paddle.to_tensor(val) + # batch_target.append(target_tensor_dict) + + + batch[1] = tgt_batch + return tuple(batch) diff --git a/object_detection/PVTv2/coco_eval.py b/object_detection/PVTv2/coco_eval.py new file mode 100644 index 00000000..7c9d8e91 --- /dev/null +++ b/object_detection/PVTv2/coco_eval.py @@ -0,0 +1,267 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import os +import contextlib +import copy +import numpy as np +import paddle + +from pycocotools.cocoeval import COCOeval +from pycocotools.coco import COCO +import pycocotools.mask as mask_util + +from utils import all_gather + +class CocoEvaluator(): + def __init__(self, coco_gt, iou_types): + assert isinstance(iou_types, (list, tuple)) + coco_gt = copy.deepcopy(coco_gt) + self.coco_gt = coco_gt + self.iou_types = iou_types + self.coco_eval = {} + for iou_type in iou_types: + self.coco_eval[iou_type] = COCOeval(coco_gt, iouType=iou_type) + self.img_ids = [] + self.eval_imgs = {k: [] for k in iou_types} + + self.ids2cats = {id:cat for id, cat in enumerate(self.coco_gt.getCatIds())} + self.cats2ids = {cat:id for id, cat in enumerate(self.coco_gt.getCatIds())} + + def update(self, predictions): + img_ids = list(np.unique(list(predictions.keys()))) + self.img_ids.extend(img_ids) + + for iou_type in self.iou_types: + results = self.prepare(predictions, iou_type) + + with open(os.devnull, 'w') as devnull: + with contextlib.redirect_stdout(devnull): + coco_dt = COCO.loadRes(self.coco_gt, results) if results else COCO() + coco_eval = self.coco_eval[iou_type] + + coco_eval.cocoDt = coco_dt + coco_eval.params.imgIds = list(img_ids) + img_ids, eval_imgs = evaluate(coco_eval) + #print('eval_imgs shape: ', eval_imgs.shape) + + self.eval_imgs[iou_type].append(eval_imgs) + + def synchronize_between_processes(self): + for iou_type in self.iou_types: + self.eval_imgs[iou_type] = np.concatenate(self.eval_imgs[iou_type], 2) + create_common_coco_eval(self.coco_eval[iou_type], + self.img_ids, + self.eval_imgs[iou_type]) + + def accumulate(self): + for coco_eval in self.coco_eval.values(): + coco_eval.accumulate() + + def summarize(self): + stats_dict = {} + for iou_type, coco_eval in self.coco_eval.items(): + print(f'IoU metric: {iou_type}') + coco_eval.summarize() + stats_dict[iou_type] = coco_eval.stats + return stats_dict + + def prepare(self, predictions, iou_type): + if iou_type == 'bbox': + return self.prepare_for_coco_detection(predictions) + elif iou_type == 'segm': + return self.prepare_for_coco_segmentation(predictions) + elif iou_type == 'keypoints': + return self.prepare_for_coco_keypoint(predictions) + else: + raise ValueError(f'Unknown iou type {iou_type}') + + def prepare_for_coco_detection(self, predictions): + coco_results = [] + for original_id, prediction in predictions.items(): + if len(prediction) == 0: + continue + boxes = prediction['boxes'] + boxes = convert_to_xywh(boxes).tolist() + scores = prediction['scores'].tolist() + labels = prediction['labels'].tolist() + labels = [self.ids2cats[i] for i in labels] + + coco_results.extend( + [ + { + 'image_id': original_id, + 'category_id': labels[k], + 'bbox': box, + 'score': scores[k], + } + for k , box in enumerate(boxes) + ] + ) + return coco_results + + def prepare_for_coco_segmentation(self, predictions): + coco_results = [] + for original_id, prediction in predictions.items(): + if len(prediction) == 0: + continue + scores = prediction['scores'].tolist() + labels = prediction['labels'].tolist() + masks = prediction['masks'] + masks = masks > 0.5 + + rles = [ + mask_util.encode(np.array(mask[0, :, :, np.newaxis], dtype=np.uint8, order='F'))[0] + for mask in masks + ] + for rle in rles: + rle['counts'] = rle['counts'].decode('utf-8') + + coco_results.extend( + [ + { + 'image_id': original_id, + 'category_id': labels[k], + 'segmentation': rle, + 'score': scores[k], + } + for k , rle in enumerate(rles) + ] + ) + return coco_results + + 
+ def prepare_for_coco_keypoint(self, predictions): + coco_results = [] + for original_id, prediction in predictions.items(): + if len(prediction) == 0: + continue + boxes = prediction['boxes'] + boxes = convert_to_xywh(boxes).tolist() + scores = prediction['scores'].tolist() + labels = prediction['labels'].tolist() + keypoints = prediction['keypoints'] + keypoints = keypoints.flatten(start_dim=1).tolist() + + coco_results.extend( + [ + { + 'image_id': original_id, + 'category_id': labels[k], + 'keypoints': keypoint, + 'score': scores[k], + } + for k , keypoint in enumerate(keypoints) + ] + ) + return coco_results + + +def convert_to_xywh(boxes): + #xmin, ymin, xmax, ymax = boxes.unbind(1) + #return paddle.stack((xmin, ymin, xmax - xmin, ymax - ymin), axis=1) + xmin, ymin, xmax, ymax = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3] + return np.stack((xmin, ymin, xmax-xmin, ymax-ymin), axis=1) + + +def merge(img_ids, eval_imgs): + #all_img_ids = [img_ids] + #all_eval_imgs = [eval_imgs] + all_img_ids = all_gather(img_ids) + all_eval_imgs = all_gather(eval_imgs) + + merged_img_ids = [] + for p in all_img_ids: + merged_img_ids.extend(p) + + merged_eval_imgs = [] + for p in all_eval_imgs: + merged_eval_imgs.append(p) + + merged_img_ids = np.array(merged_img_ids) + merged_eval_imgs = np.concatenate(merged_eval_imgs, 2) + + merged_img_ids, idx = np.unique(merged_img_ids, return_index=True) + merged_eval_imgs = merged_eval_imgs[..., idx] + + return merged_img_ids, merged_eval_imgs + + +def create_common_coco_eval(coco_eval, img_ids, eval_imgs): + img_ids, eval_imgs = merge(img_ids, eval_imgs) + img_ids = list(img_ids) + eval_imgs = list(eval_imgs.flatten()) + + coco_eval.evalImgs = eval_imgs + coco_eval.params.imgIds = img_ids + coco_eval._paramsEval = copy.deepcopy(coco_eval.params) + + +################################################################# +# From pycocotools, just removed the prints and fixed +# a Python3 bug about unicode not defined +################################################################# + + +def evaluate(self): + ''' + Run per image evaluation on given images and store results (a list of dict) in self.evalImgs + :return: None + ''' + # tic = time.time() + # print('Running per image evaluation...') + p = self.params + # add backward compatibility if useSegm is specified in params + if p.useSegm is not None: + p.iouType = 'segm' if p.useSegm == 1 else 'bbox' + print('useSegm (deprecated) is not None. 
Running {} evaluation'.format(p.iouType)) + # print('Evaluate annotation type *{}*'.format(p.iouType)) + p.imgIds = list(np.unique(p.imgIds)) + if p.useCats: + p.catIds = list(np.unique(p.catIds)) + p.maxDets = sorted(p.maxDets) + self.params = p + + + self._prepare() + # loop through images, area range, max detection number + catIds = p.catIds if p.useCats else [-1] + + if p.iouType == 'segm' or p.iouType == 'bbox': + computeIoU = self.computeIoU + elif p.iouType == 'keypoints': + computeIoU = self.computeOks + self.ious = { + (imgId, catId): computeIoU(imgId, catId) + for imgId in p.imgIds + for catId in catIds} + + evaluateImg = self.evaluateImg + maxDet = p.maxDets[-1] + evalImgs = [ + evaluateImg(imgId, catId, areaRng, maxDet) + for catId in catIds + for areaRng in p.areaRng + for imgId in p.imgIds + ] + # this is NOT in the pycocotools code, but could be done outside + evalImgs = np.asarray(evalImgs).reshape(len(catIds), len(p.areaRng), len(p.imgIds)) + self._paramsEval = copy.deepcopy(self.params) + # toc = time.time() + # print('DONE (t={:0.2f}s).'.format(toc-tic)) + return p.imgIds, evalImgs + +################################################################# +# end of straight copy from pycocotools, just removing the prints +################################################################# diff --git a/object_detection/PVTv2/config.py b/object_detection/PVTv2/config.py new file mode 100644 index 00000000..f482458e --- /dev/null +++ b/object_detection/PVTv2/config.py @@ -0,0 +1,223 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Configuration + +Configuration for data, model archtecture, and training, etc. 
+Config can be set by .yaml file or by argparser(limited usage) + + +""" + +import os +from yacs.config import CfgNode as CN +import yaml + +_C = CN() +_C.BASE = [''] + +# data settings +_C.DATA = CN() +_C.DATA.BATCH_SIZE = 8 #1024 batch_size for single GPU +_C.DATA.BATCH_SIZE_EVAL = 1 #1024 batch_size for single GPU +_C.DATA.WEIGHT_PATH = './weights/pvtv2_b0_maskrcnn.pdparams' #"./weights/mask_rcnn_swin_small_patch4_window7.pdparams" +_C.DATA.VAL_DATA_PATH = "/dataset/coco/" # path to dataset +_C.DATA.DATASET = 'coco' # dataset name +_C.DATA.IMAGE_SIZE = 640 # input image size +_C.DATA.CROP_PCT = 0.9 # input image scale ratio, scale is applied before centercrop in eval mode +_C.DATA.NUM_WORKERS = 1 # number of data loading threads + +# model settings +_C.MODEL = CN() +_C.MODEL.TYPE = 'PVTv2_Det' +_C.MODEL.NAME = 'pvtv2_maskrcnn_b0' +_C.MODEL.RESUME = None +_C.MODEL.PRETRAINED = None +_C.MODEL.NUM_CLASSES = 1000 +_C.MODEL.DROPOUT = 0.0 +_C.MODEL.ATTENTION_DROPOUT = 0.0 +_C.MODEL.DROP_PATH = 0.0 # TODO: droppath may raise cuda error on paddle.rand method + +# transformer settings +_C.MODEL.TRANS = CN() +_C.MODEL.TRANS.PRETRAIN_IMAGE_SIZE = 224 +_C.MODEL.TRANS.PATCH_SIZE = 4 # image_size = patch_size x window_size x num_windows +_C.MODEL.TRANS.IN_CHANNELS = 3 +_C.MODEL.TRANS.EMBED_DIMS = [32, 64, 160, 256] +_C.MODEL.TRANS.STAGE_DEPTHS = [2, 2, 2, 2] +_C.MODEL.TRANS.NUM_HEADS = [1, 2, 5, 8] +_C.MODEL.TRANS.MLP_RATIO = [8, 8, 4, 4] +_C.MODEL.TRANS.SR_RATIO = [8, 4, 2, 1] +_C.MODEL.TRANS.QKV_BIAS = True +_C.MODEL.TRANS.QK_SCALE = None +_C.MODEL.TRANS.LINEAR = False +_C.MODEL.TRANS.OUT_INDICES = (0, 1, 2, 3) +_C.MODEL.TRANS.FROZEN_STAGES = -1 + + +# fpn settings +_C.FPN = CN() +_C.FPN.OUT_CHANNELS = 256 +_C.FPN.IN_CHANNELS = [32, 64, 160, 256] # [256, 512, 1024, 2048] +_C.FPN.USE_C5 = False +_C.FPN.STRIDES = [4, 8, 16, 32] + +# maskrcnn_head settings +_C.RPN = CN() +_C.ROI = CN() +_C.ROI.BOX_HEAD = CN() + +_C.RPN.ANCHOR_SIZE = [[32], [64], [128], [256], [512]] +_C.RPN.ASPECT_RATIOS = [0.5, 1.0, 2.0] +_C.RPN.STRIDES = [4, 8, 16, 32, 64] +_C.RPN.OFFSET = 0.0 +_C.RPN.PRE_NMS_TOP_N_TRAIN = 2000 +_C.RPN.POST_NMS_TOP_N_TRAIN = 1000 +_C.RPN.PRE_NMS_TOP_N_TEST = 1000 +_C.RPN.POST_NMS_TOP_N_TEST = 1000 +_C.RPN.NMS_THRESH = 0.7 +_C.RPN.MIN_SIZE = 0.0 +_C.RPN.TOPK_AFTER_COLLECT = True +_C.RPN.POSITIVE_THRESH = 0.7 +_C.RPN.NEGATIVE_THRESH = 0.3 +_C.RPN.BATCH_SIZE_PER_IMG = 256 +_C.RPN.POSITIVE_FRACTION = 0.5 +_C.RPN.LOW_QUALITY_MATCHES = True + +_C.ROI.SCORE_THRESH_INFER = 0.05 +_C.ROI.NMS_THRESH_INFER = 0.5 +_C.ROI.NMS_KEEP_TOPK_INFER = 100 +_C.ROI.NUM_ClASSES = 80 +_C.ROI.POSITIVE_THRESH = 0.5 +_C.ROI.NEGATIVE_THRESH = 0.5 +_C.ROI.BATCH_SIZE_PER_IMG = 512 +_C.ROI.POSITIVE_FRACTION = 0.25 +_C.ROI.LOW_QUALITY_MATCHES = False +_C.ROI.BOX_HEAD.REG_WEIGHTS = [10.0, 10.0, 5.0, 5.0] +_C.ROI.BOX_HEAD.NUM_CONV = 0 +_C.ROI.BOX_HEAD.CONV_DIM = 256 +_C.ROI.BOX_HEAD.NUM_FC = 2 +_C.ROI.BOX_HEAD.FC_DIM = 1024 +_C.ROI.SCALES = [1./4., 1./8., 1./16., 1./32, 1./64.] 
+_C.ROI.ALIGN_OUTPUT_SIZE = 7 +_C.ROI.SAMPLING_RATIO = 0 +_C.ROI.CANONICAL_BOX_SIZE = 224 +_C.ROI.CANONICAL_LEVEL = 4 +_C.ROI.MIN_LEVEL = 0 +_C.ROI.MAX_LEVEL = 3 +_C.ROI.ALIGNED = True +_C.ROI.PAT_GT_AS_PRO = True # when eval, set to False + +# training settings +_C.TRAIN = CN() +_C.TRAIN.LAST_EPOCH = 0 +_C.TRAIN.NUM_EPOCHS = 300 +_C.TRAIN.WARMUP_EPOCHS = 20 +_C.TRAIN.WEIGHT_DECAY = 0.05 +_C.TRAIN.BASE_LR = 0.001 +_C.TRAIN.WARMUP_START_LR = 0.0 +_C.TRAIN.END_LR = 0.0 +_C.TRAIN.GRAD_CLIP = 1.0 +_C.TRAIN.ACCUM_ITER = 2 + +_C.TRAIN.LR_SCHEDULER = CN() +_C.TRAIN.LR_SCHEDULER.NAME = 'warmupcosine' +_C.TRAIN.LR_SCHEDULER.MILESTONES = "30, 60, 90" # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_EPOCHS = 30 # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_RATE = 0.1 # only used in StepLRScheduler + +_C.TRAIN.OPTIMIZER = CN() +_C.TRAIN.OPTIMIZER.NAME = 'SGD' +_C.TRAIN.OPTIMIZER.EPS = 1e-8 +_C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999) +_C.TRAIN.OPTIMIZER.MOMENTUM = 0.9 + +# augmentation +_C.AUG = CN() +_C.AUG.COLOR_JITTER = 0.4 # color jitter factor +_C.AUG.AUTO_AUGMENT = 'rand-m9-mstd0.5-inc1' +_C.AUG.RE_PROB = 0.25 # random earse prob +_C.AUG.RE_MODE = 'pixel' # random earse mode +_C.AUG.RE_COUNT = 1 # random earse count +_C.AUG.MIXUP = 0.8 # mixup alpha, enabled if >0 +_C.AUG.CUTMIX = 1.0 # cutmix alpha, enabled if >0 +_C.AUG.CUTMIX_MINMAX = None # cutmix min/max ratio, overrides alpha +_C.AUG.MIXUP_PROB = 1.0 # prob of mixup or cutmix when either/both is enabled +_C.AUG.MIXUP_SWITCH_PROB = 0.5 # prob of switching cutmix when both mixup and cutmix enabled +_C.AUG.MIXUP_MODE = 'batch' #how to apply mixup/curmix params, per 'batch', 'pair', or 'elem' + +# misc +_C.SAVE = "./output" +_C.TAG = "default" +_C.SAVE_FREQ = 20 # freq to save chpt +_C.REPORT_FREQ = 50 # freq to logging info +_C.VALIDATE_FREQ = 20 # freq to do validation +_C.SEED = 0 +_C.EVAL = False # run evaluation only +_C.LOCAL_RANK = 0 +_C.NGPUS = -1 + + +def _update_config_from_file(config, cfg_file): + config.defrost() + with open(cfg_file, 'r') as infile: + yaml_cfg = yaml.load(infile, Loader=yaml.FullLoader) + for cfg in yaml_cfg.setdefault('BASE', ['']): + if cfg: + _update_config_from_file( + config, os.path.join(os.path.dirname(cfg_file), cfg) + ) + print('merging config from {}'.format(cfg_file)) + config.merge_from_file(cfg_file) + config.freeze() + + +def update_config(config, args): + """Update config by ArgumentParser + Args: + args: ArgumentParser contains options + Return: + config: updated config + """ + if args.cfg: + _update_config_from_file(config, args.cfg) + config.defrost() + if args.dataset: + config.DATA.DATASET = args.dataset + if args.batch_size: + config.DATA.BATCH_SIZE = args.batch_size + if args.data_path: + config.DATA.DATA_PATH = args.data_path + if args.ngpus: + config.NGPUS = args.ngpus + if args.eval: + config.EVAL = True + config.DATA.BATCH_SIZE_EVAL = args.batch_size + if args.pretrained: + config.MODEL.PRETRAINED = args.pretrained + if args.resume: + config.MODEL.RESUME = args.resume + if args.last_epoch: + config.MODEL.LAST_EPOCH = args.last_epoch + + #config.freeze() + return config + + +def get_config(cfg_file=None): + """Return a clone config or load from yaml file""" + config = _C.clone() + if cfg_file: + _update_config_from_file(config, cfg_file) + return config diff --git a/object_detection/PVTv2/configs/pvtv2_b0.yaml b/object_detection/PVTv2/configs/pvtv2_b0.yaml new file mode 100644 index 00000000..b5042bcb --- /dev/null +++ b/object_detection/PVTv2/configs/pvtv2_b0.yaml @@ 
-0,0 +1,20 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: PVTv2 + NAME: pvtv2_b0 + TRANS: + PATCH_SIZE: 4 + EMBED_DIMS: [32, 64, 160, 256] + STAGE_DEPTHS: [2, 2, 2, 2] + NUM_HEADS: [1, 2, 5, 8] + MLP_RATIO: [8, 8, 4, 4] + SR_RATIO: [8, 4, 2, 1] + QKV_BIAS: True + DROP_PATH: 0.0 #0.1 +FPN: + IN_CHANNELS: [32, 64, 160, 256] +TRAIN: + GRAD_CLIP: None + diff --git a/object_detection/PVTv2/configs/pvtv2_b1.yaml b/object_detection/PVTv2/configs/pvtv2_b1.yaml new file mode 100644 index 00000000..b99fd37c --- /dev/null +++ b/object_detection/PVTv2/configs/pvtv2_b1.yaml @@ -0,0 +1,20 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: PVTv2 + NAME: pvtv2_b1 + TRANS: + PATCH_SIZE: 4 + EMBED_DIMS: [64, 128, 320, 512] + STAGE_DEPTHS: [2, 2, 2, 2] + NUM_HEADS: [1, 2, 5, 8] + MLP_RATIO: [8, 8, 4, 4] + SR_RATIO: [8, 4, 2, 1] + QKV_BIAS: True + DROP_PATH: 0.0 #0.1 +FPN: + IN_CHANNELS: [64, 128, 320, 512] +TRAIN: + GRAD_CLIP: None + diff --git a/object_detection/PVTv2/configs/pvtv2_b2.yaml b/object_detection/PVTv2/configs/pvtv2_b2.yaml new file mode 100644 index 00000000..bcbce330 --- /dev/null +++ b/object_detection/PVTv2/configs/pvtv2_b2.yaml @@ -0,0 +1,20 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: PVTv2 + NAME: pvtv2_b2 + TRANS: + PATCH_SIZE: 4 + EMBED_DIMS: [64, 128, 320, 512] + STAGE_DEPTHS: [3, 4, 6, 3] + NUM_HEADS: [1, 2, 5, 8] + MLP_RATIO: [8, 8, 4, 4] + SR_RATIO: [8, 4, 2, 1] + QKV_BIAS: True + DROP_PATH: 0.0 #0.1 +FPN: + IN_CHANNELS: [64, 128, 320, 512] +TRAIN: + GRAD_CLIP: None + diff --git a/object_detection/PVTv2/configs/pvtv2_b2_linear.yaml b/object_detection/PVTv2/configs/pvtv2_b2_linear.yaml new file mode 100644 index 00000000..cba6cb4f --- /dev/null +++ b/object_detection/PVTv2/configs/pvtv2_b2_linear.yaml @@ -0,0 +1,21 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: PVTv2 + NAME: pvtv2_b2_linear + TRANS: + PATCH_SIZE: 4 + EMBED_DIMS: [64, 128, 320, 512] + STAGE_DEPTHS: [3, 4, 6, 3] + NUM_HEADS: [1, 2, 5, 8] + MLP_RATIO: [8, 8, 4, 4] + SR_RATIO: [8, 4, 2, 1] + LINEAR: True + QKV_BIAS: True + DROP_PATH: 0.0 #0.1 +FPN: + IN_CHANNELS: [64, 128, 320, 512] +TRAIN: + GRAD_CLIP: None + diff --git a/object_detection/PVTv2/configs/pvtv2_b3.yaml b/object_detection/PVTv2/configs/pvtv2_b3.yaml new file mode 100644 index 00000000..f9fb848d --- /dev/null +++ b/object_detection/PVTv2/configs/pvtv2_b3.yaml @@ -0,0 +1,20 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: PVTv2 + NAME: pvtv2_b3 + TRANS: + PATCH_SIZE: 4 + EMBED_DIMS: [64, 128, 320, 512] + STAGE_DEPTHS: [3, 4, 18, 3] + NUM_HEADS: [1, 2, 5, 8] + MLP_RATIO: [8, 8, 4, 4] + SR_RATIO: [8, 4, 2, 1] + QKV_BIAS: True + DROP_PATH: 0.0 #0.3 +FPN: + IN_CHANNELS: [64, 128, 320, 512] +TRAIN: + GRAD_CLIP: 1.0 + diff --git a/object_detection/PVTv2/configs/pvtv2_b4.yaml b/object_detection/PVTv2/configs/pvtv2_b4.yaml new file mode 100644 index 00000000..3909fa38 --- /dev/null +++ b/object_detection/PVTv2/configs/pvtv2_b4.yaml @@ -0,0 +1,20 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: PVTv2 + NAME: pvtv2_b4 + TRANS: + PATCH_SIZE: 4 + EMBED_DIMS: [64, 128, 320, 512] + STAGE_DEPTHS: [3, 8, 27, 3] + NUM_HEADS: [1, 2, 5, 8] + MLP_RATIO: [8, 8, 4, 4] + SR_RATIO: [8, 4, 2, 1] + QKV_BIAS: True + DROP_PATH: 0.0 #0.3 +FPN: + IN_CHANNELS: [64, 128, 320, 512] +TRAIN: + GRAD_CLIP: 1.0 + diff --git a/object_detection/PVTv2/configs/pvtv2_b5.yaml b/object_detection/PVTv2/configs/pvtv2_b5.yaml new file mode 100644 index 00000000..70e57d2b --- /dev/null +++ 
b/object_detection/PVTv2/configs/pvtv2_b5.yaml @@ -0,0 +1,20 @@ +DATA: + IMAGE_SIZE: 224 + CROP_PCT: 0.875 +MODEL: + TYPE: PVTv2 + NAME: pvtv2_b5 + TRANS: + PATCH_SIZE: 4 + EMBED_DIMS: [64, 128, 320, 512] + STAGE_DEPTHS: [3, 6, 40, 3] + NUM_HEADS: [1, 2, 5, 8] + MLP_RATIO: [4, 4, 4, 4] + SR_RATIO: [8, 4, 2, 1] + QKV_BIAS: True + DROP_PATH: 0.0 #0.3 +FPN: + IN_CHANNELS: [64, 128, 320, 512] +TRAIN: + GRAD_CLIP: 1.0 + diff --git a/object_detection/PVTv2/det_heads/__init__.py b/object_detection/PVTv2/det_heads/__init__.py new file mode 100644 index 00000000..16a69b52 --- /dev/null +++ b/object_detection/PVTv2/det_heads/__init__.py @@ -0,0 +1,3 @@ +from . import maskrcnn_head +from . import retinanet_head +from . import det_utils diff --git a/object_detection/PVTv2/det_heads/det_utils/box_utils.py b/object_detection/PVTv2/det_heads/det_utils/box_utils.py new file mode 100644 index 00000000..4d97829f --- /dev/null +++ b/object_detection/PVTv2/det_heads/det_utils/box_utils.py @@ -0,0 +1,325 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import math + +import paddle +from paddle.fluid.framework import in_dygraph_mode +from paddle.fluid import core +from paddle.fluid.layer_helper import LayerHelper + +def bbox2delta(src_boxes, tgt_boxes, weights=[1.0, 1.0, 1.0, 1.0]): + ''' + The function is used to compute two tensor boxes difference among (x, y, w, h). + + Args: + src_boxes (tensor): shape [N, 4]. + tgt_boxes (tensor): shape [N, 4]. + weights (list[float]): balance the dx, dy, dw, dh. + + Returns: + deltas (tensor): shape[N, 4]. + ''' + src_w = src_boxes[:, 2] - src_boxes[:, 0] + src_h = src_boxes[:, 3] - src_boxes[:, 1] + src_ctr_x = src_boxes[:, 0] + 0.5 * src_w + src_ctr_y = src_boxes[:, 1] + 0.5 * src_h + + tgt_w = tgt_boxes[:, 2] - tgt_boxes[:, 0] + tgt_h = tgt_boxes[:, 3] - tgt_boxes[:, 1] + tgt_ctr_x = tgt_boxes[:, 0] + 0.5 * tgt_w + tgt_ctr_y = tgt_boxes[:, 1] + 0.5 * tgt_h + + wx, wy, ww, wh = weights + dx = wx * (tgt_ctr_x - src_ctr_x) / src_w + dy = wy * (tgt_ctr_y - src_ctr_y) / src_h + dw = ww * paddle.log(tgt_w / src_w) + dh = wh * paddle.log(tgt_h / src_h) + + deltas = paddle.stack((dx, dy, dw, dh), axis=1) + return deltas + + +def delta2bbox(deltas, boxes, weights=[1.0, 1.0, 1.0, 1.0]): + ''' + The inverse process of bbox2delta. 
+ ''' + clip_scale = math.log(1000.0 / 16) + + widths = boxes[:, 2] - boxes[:, 0] + heights = boxes[:, 3] - boxes[:, 1] + ctr_x = boxes[:, 0] + 0.5 * widths + ctr_y = boxes[:, 1] + 0.5 * heights + + wx, wy, ww, wh = weights + dx = deltas[:, 0::4] / wx + dy = deltas[:, 1::4] / wy + dw = deltas[:, 2::4] / ww + dh = deltas[:, 3::4] / wh + # Prevent sending too large values into paddle.exp() + dw = paddle.clip(dw, max=clip_scale) + dh = paddle.clip(dh, max=clip_scale) + + pred_ctr_x = dx * widths.unsqueeze(1) + ctr_x.unsqueeze(1) + pred_ctr_y = dy * heights.unsqueeze(1) + ctr_y.unsqueeze(1) + pred_w = paddle.exp(dw) * widths.unsqueeze(1) + pred_h = paddle.exp(dh) * heights.unsqueeze(1) + + pred_boxes = [] + pred_boxes.append(pred_ctr_x - 0.5 * pred_w) + pred_boxes.append(pred_ctr_y - 0.5 * pred_h) + pred_boxes.append(pred_ctr_x + 0.5 * pred_w) + pred_boxes.append(pred_ctr_y + 0.5 * pred_h) + pred_boxes = paddle.stack(pred_boxes, axis=-1) + + return pred_boxes + + +def boxes_area(boxes): + ''' + Compute boxes area. + + Args: + boxes (tensor): shape [M, 4] | [N, M, 4]. + + Returns: + areas (tensor): shape [M] | [N, M]. + ''' + assert boxes.shape[-1] == 4 + if boxes.dim() == 2: + boxes_wh = boxes[:, 2:] - boxes[:, :2] + return (boxes_wh[:, 0] * boxes_wh[:, 1]).clip(min=0) + + elif boxes.dim() == 3: + boxes_wh = boxes[:, :, 2:] - boxes[:, :, :2] + return (boxes_wh[:, :, 0] * boxes_wh[:, :, 1]).clip(min=0) + + else: + raise ValueError("The dim of boxes must be 2 or 3!") + + +def boxes_iou(boxes1, boxes2, mode='a'): + ''' + Compute the ious of two boxes tensor and the coordinate format of boxes is xyxy. + + Args: + boxes1 (tensor): when mode == 'a': shape [M, 4]; when mode == 'b': shape [M, 4] + boxes2 (tensor): when mode == 'a': shape [R, 4]; when mode == 'b': shape [M, 4] + mode (string | 'a' or 'b'): when mode == 'a': compute one to many; + when mode == 'b': compute one to one. + + Returns: + ious (tensor): when mode == 'a': shape [M, R]; when mode == 'b': shape [M] + ''' + area1 = boxes_area(boxes1) + area2 = boxes_area(boxes2) + + if mode == 'a': + lt = paddle.maximum(boxes1.unsqueeze(-2)[:, :, :2], boxes2.unsqueeze(0)[:, :, :2]) + rb = paddle.minimum(boxes1.unsqueeze(-2)[:, :, 2:], boxes2.unsqueeze(0)[:, :, 2:]) + + inter_wh = (rb - lt).astype("float32").clip(min=0) + inter_area = inter_wh[:, :, 0] * inter_wh[:, :, 1] + + union_area = area1.unsqueeze(-1) + area2 - inter_area + 1e-6 + + ious = paddle.where(inter_area > 0, + inter_area / union_area, + paddle.zeros_like(inter_area, dtype="float32")) + + return ious, union_area + + elif mode == 'b': + assert boxes1.shape[0] == boxes2.shape[0] + + lt = paddle.maximum(boxes1[:, :2], boxes2[:, :2]) + rb = paddle.minimum(boxes1[:, 2:], boxes2[:, 2:]) + + inter_wh = (rb - lt).astype("float32").clip(min=0) + inter_area = inter_wh[:, 0] * inter_wh[:, 1] + + union_area = area1 + area2 - inter_area + 1e-6 + + ious = paddle.where(inter_area > 0, + inter_area / union_area, + paddle.zeros_like(inter_area, dtype="float32")) + + return ious, union_area + + else: + raise ValueError("Only support mode 'a' or 'b'") + + +def batch_iou(boxes1, boxes2, mode='a'): + ''' + Compute the ious of two boxes tensor and the coordinate format of boxes is xyxy. 
+ + Args: + boxes1 (tensor): when mode == 'a': shape [N, M, 4]; when mode == 'b': shape [N, M, 4] + boxes2 (tensor): when mode == 'a': shape [N, R, 4]; when mode == 'b': shape [N, M, 4] + mode (string | 'a' or 'b'): when mode == 'a': compute one to many; + when mode == 'b': compute one to one + + Returns: + ious (tensor): when mode == 'a': shape [N, M, R]; when mode == 'b': shape [N, M] + ''' + area1 = boxes_area(boxes1) + area2 = boxes_area(boxes2) + + if mode == 'a': + lt = paddle.maximum(boxes1.unsqueeze(-2)[:, :, :, :2], boxes2.unsqueeze(1)[:, :, :, :2]) + rb = paddle.minimum(boxes1.unsqueeze(-2)[:, :, :, 2:], boxes2.unsqueeze(1)[:, :, :, 2:]) + + inter_wh = (rb - lt).astype("float32").clip(min=0) + inter_area = inter_wh[:, :, :, 0] * inter_wh[:, :, :, 1] + + union_area = area1.unsqueeze(-1) + area2.unsqueeze(-2) - inter_area + 1e-6 + + ious = paddle.where(inter_area > 0, + inter_area / union_area, + paddle.zeros_like(inter_area, dtype="float32")) + + return ious, union_area + + elif mode == 'b': + assert boxes1.shape[0] == boxes2.shape[0] + + lt = paddle.maximum(boxes1[:, :, :2], boxes2[:, :, :2]) + rb = paddle.minimum(boxes1[:, :, 2:], boxes2[:, :, 2:]) + + inter_wh = (rb - lt).astype("float32").clip(min=0) + inter_area = inter_wh[:, :, 0] * inter_wh[:, :, 1] + + union_area = area1 + area2 - inter_area + 1e-6 + + ious = paddle.where(inter_area > 0, + inter_area / union_area, + paddle.zeros_like(inter_area, dtype="float32")) + + return ious, union_area + else: + raise ValueError("Only support mode 'a' or 'b'") + + +def nonempty_bbox(boxes, min_size=0, return_mask=False): + w = boxes[:, 2] - boxes[:, 0] + h = boxes[:, 3] - boxes[:, 1] + mask = paddle.logical_and(h > min_size, w > min_size) + if return_mask: + return mask + keep = paddle.nonzero(mask).flatten() + return keep + + +def multiclass_nms(bboxes, + scores, + score_threshold, + keep_top_k, + nms_top_k=-1, + nms_threshold=0.3, + normalized=True, + nms_eta=1., + background_label=-1, + return_index=False, + return_rois_num=True, + rois_num=None, + name=None): + """ + This operator is to do multi-class non maximum suppression (NMS) on + boxes and scores. + In the NMS step, this operator greedily selects a subset of detection bounding + boxes that have high scores larger than score_threshold, if providing this + threshold, then selects the largest nms_top_k confidences scores if nms_top_k + is larger than -1. Then this operator pruns away boxes that have high IOU + (intersection over union) overlap with already selected boxes by adaptive + threshold NMS based on parameters of nms_threshold and nms_eta. + Aftern NMS step, at most keep_top_k number of total bboxes are to be kept + per image if keep_top_k is larger than -1. + Args: + bboxes (tensor): Two types of bboxes are supported: + 1. (tensor) A 3-D Tensor with shape + [N, M, 4 or 8 16 24 32] represents the + predicted locations of M bounding bboxes, + N is the batch size. Each bounding box has four + coordinate values and the layout is + [xmin, ymin, xmax, ymax], when box size equals to 4. + 2. (tensor) A 3-D Tensor with shape [M, C, 4] + M is the number of bounding boxes, C is the + class number + scores (tensor): Two types of scores are supported: + 1. (tensor) A 3-D Tensor with shape [N, C, M] + represents the predicted confidence predictions. + N is the batch size, C is the class number, M is + number of bounding boxes. For each category there + are total M scores which corresponding M bounding + boxes. Please note, M is equal to the 2nd dimension + of BBoxes. + 2. 
(LoDTensor) A 2-D LoDTensor with shape [M, C]. + M is the number of bbox, C is the class number. + In this case, input BBoxes should be the second + case with shape [M, C, 4]. + background_label (int): The index of background label, the background + label will be ignored. If set to -1, then all + categories will be considered. Default: 0 + score_threshold (float): Threshold to filter out bounding boxes with + low confidence score. If not provided, + consider all boxes. + nms_top_k (int): Maximum number of detections to be kept according to + the confidences after the filtering detections based + on score_threshold. + nms_threshold (float): The threshold to be used in NMS. Default: 0.3 + nms_eta (float): The threshold to be used in NMS. Default: 1.0 + keep_top_k (int): Number of total bboxes to be kept per image after NMS + step. -1 means keeping all bboxes after NMS step. + normalized (bool): Whether detections are normalized. Default: True + return_index(bool): Whether return selected index. Default: False + rois_num(Tensor): 1-D Tensor contains the number of RoIs in each image. + The shape is [B] and data type is int32. B is the number of images. + If it is not None then return a list of 1-D Tensor. Each element + is the output RoIs' number of each image on the corresponding level + and the shape is [B]. None by default. + name(str): Name of the multiclass nms op. Default: None. + + Returns: + A tuple with two Variables: (Out, Index) if return_index is True, + otherwise, a tuple with one Variable(Out) is returned. + Out: A 2-D LoDTensor with shape [No, 6] represents the detections. + Each row has 6 values: [label, confidence, xmin, ymin, xmax, ymax] + or A 2-D LoDTensor with shape [No, 10] represents the detections. + Each row has 10 values: [label, confidence, x1, y1, x2, y2, x3, y3, + x4, y4]. No is the total number of detections. + If all images have not detected results, all elements in LoD will be + 0, and output tensor is empty (None). + Index: Only return when return_index is True. A 2-D LoDTensor with + shape [No, 1] represents the selected index which type is Integer. + The index is the absolute value cross batches. No is the same number + as Out. If the index is used to gather other attribute such as age, + one needs to reshape the input(N, M, 1) to (N * M, 1) as first, where + N is the batch size and M is the number of boxes. + """ + helper = LayerHelper('multiclass_nms3', **locals()) + + if in_dygraph_mode(): + attrs = ('background_label', background_label, 'score_threshold', + score_threshold, 'nms_top_k', nms_top_k, 'nms_threshold', + nms_threshold, 'keep_top_k', keep_top_k, 'nms_eta', nms_eta, + 'normalized', normalized) + + output, index, nms_rois_num = core.ops.multiclass_nms3(bboxes, scores, + rois_num, *attrs) + if not return_index: + index = None + + return output, nms_rois_num, index \ No newline at end of file diff --git a/object_detection/PVTv2/det_heads/det_utils/generator_utils.py b/object_detection/PVTv2/det_heads/det_utils/generator_utils.py new file mode 100644 index 00000000..092c620a --- /dev/null +++ b/object_detection/PVTv2/det_heads/det_utils/generator_utils.py @@ -0,0 +1,500 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import math + +import paddle +import paddle.nn as nn +from paddle.fluid.framework import Variable, in_dygraph_mode +from paddle.fluid import core + +class AnchorGenerator(nn.Layer): + """ + Compute anchors in the standard ways described in + "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks". + + Attributes: + anchor_size (list[list[float]] | list[float]): + If ``anchor_size`` is list[list[float]], ``anchor_size[i]`` is the list of anchor sizes + (i.e. sqrt of anchor area) to use for the i-th feature map. + If ``anchor_size`` is list[float], ``anchor_size`` is used for all feature maps. + Anchor anchor_size are given in absolute lengths in units of + the input image; they do not dynamically scale if the input image size changes. + aspect_ratios (list[list[float]] or list[float]): list of aspect ratios + (i.e. height / width) to use for anchors. Same "broadcast" rule for `sizes` applies. + strides (list[int]): stride of each input feature. + offset (float): Relative offset between the center of the first anchor and the top-left + corner of the image. Value has to be in [0, 1). + Recommend to use 0.5, which means half stride. + """ + + def __init__(self, + anchor_sizes = [[32], [64], [128], [256], [512]], + aspect_ratios = [0.5, 1.0, 2.0], + strides = [4, 8, 16, 32, 64], + offset = 0.5): + super(AnchorGenerator, self).__init__() + + self.anchor_sizes = anchor_sizes + self.aspect_ratios = aspect_ratios + self.strides = strides + self.offset = offset + self.base_anchors = self._compute_anchors() + + assert 0. <= self.offset <= 1.0 + + def generate_anchors(self, + sizes = [32, 64, 128, 256, 512], + aspect_ratios = [0.5, 1.0, 2.0]): + """ + Generate a tensor storing canonical anchor boxes, which are all anchor + boxes of different sizes and aspect_ratios centered at (0, 0). + We can later build the set of anchors for a full feature map by + shifting and tiling these tensors (see `meth:_grid_anchors`). + Args: + sizes (list[float] | tuple[float]): + aspect_ratios (list[float] | tuple[float]]): + Returns: + Tensor of shape (len(sizes) * len(aspect_ratios), 4) storing anchor boxes + in xyxy format. 
+ """ + anchors = [] + + for size in sizes: + area = size ** 2.0 + for ratio in aspect_ratios: + w = math.sqrt(area / ratio) + h = ratio * w + x0, y0, x1, y1 = -w / 2.0, -h / 2.0, w / 2.0, h / 2.0 + anchors.append([x0, y0, x1, y1]) + + return paddle.to_tensor(anchors, dtype='float32') + + def _broadcast_params(self, params, num_features): + if not isinstance(params[0], (list, tuple)): + return [params] * num_features + if len(params) == 1: + return params * num_features + return params + + def _compute_anchors(self): + sizes = self._broadcast_params(self.anchor_sizes, len(self.strides)) + aspect_ratios = self._broadcast_params(self.aspect_ratios, len(self.strides)) + + base_anchors = [self.generate_anchors(s, a) for s, a in zip(sizes, aspect_ratios)] + + [self.register_buffer(t.name, t, persistable=False) for t in base_anchors] + + return base_anchors + + def _grid_anchors(self, grid_sizes): + anchors = [] + + for grid_size, stride, base_anchor in zip(grid_sizes, self.strides, self.base_anchors): + grid_h, grid_w = grid_size + + grid_x = paddle.arange( + self.offset * stride, grid_w * stride, step = stride, dtype='float32' + ) + grid_y = paddle.arange( + self.offset * stride, grid_h * stride, step = stride, dtype='float32' + ) + + grid_y, grid_x = paddle.meshgrid(grid_y, grid_x) + grid_x = grid_x.reshape([-1]) + grid_y = grid_y.reshape([-1]) + + grid_coord = paddle.stack([grid_x, grid_y, grid_x, grid_y], axis=1) + + anchors.append((grid_coord.unsqueeze(1) + base_anchor.unsqueeze(0)).reshape([-1, 4])) + + return anchors + + def forward(self, feats): + grid_sizes = [feat.shape[-2:] for feat in feats] + anchor_over_all_feat_maps = self._grid_anchors(grid_sizes) + + return anchor_over_all_feat_maps + + @property + def num_anchors(self): + return [len(num_a) for num_a in self.base_anchors][0] + +# feats = [] +# h, w = 800., 800 +# for i in range(4): +# feats.append(paddle.rand([4, 256, h / (2 ** (i + 2)), w / (2 ** (i + 2))])) + +# anchorgenerator = AnchorGenerator() +# res = anchorgenerator(feats) +# print(anchorgenerator.num_anchors) +# print(res) +def generate_proposals(scores, + bbox_deltas, + im_shape, + anchors, + variances, + pre_nms_top_n=6000, + post_nms_top_n=1000, + nms_thresh=0.5, + min_size=0.1, + eta=1.0, + pixel_offset=False, + return_rois_num=False, + name=None): + """ + **Generate proposal Faster-RCNN** + This operation proposes RoIs according to each box with their + probability to be a foreground object and + the box can be calculated by anchors. Bbox_deltais and scores + to be an object are the output of RPN. Final proposals + could be used to train detection net. + For generating proposals, this operation performs following steps: + 1. Transposes and resizes scores and bbox_deltas in size of + (H*W*A, 1) and (H*W*A, 4) + 2. Calculate box locations as proposals candidates. + 3. Clip boxes to image + 4. Remove predicted boxes with small area. + 5. Apply NMS to get final proposals as output. + + Args: + scores (tensor): A 4-D Tensor with shape [N, A, H, W] represents + the probability for each box to be an object. + N is batch size, A is number of anchors, H and W are height and + width of the feature map. The data type must be float32. + bbox_deltas (tensor): A 4-D Tensor with shape [N, 4*A, H, W] + represents the difference between predicted box location and + anchor location. The data type must be float32. + im_shape (tensor): A 2-D Tensor with shape [N, 2] represents H, W, the + origin image size or input size. The data type can be float32 or + float64. 
+ anchors (tensor): A 4-D Tensor represents the anchors with a layout + of [H, W, A, 4] or [H * W * A, 4]. H and W are height and width of the feature map, + num_anchors is the box count of each position. Each anchor is + in (xmin, ymin, xmax, ymax) format an unnormalized. The data type must be float32. + variances (tensor): A 4-D Tensor. The expanded variances of anchors with a layout of + [H, W, num_priors, 4]. Each variance is in (xcenter, ycenter, w, h) format. + The data type must be float32. + pre_nms_top_n (float): Number of total bboxes to be kept per image before NMS. + The data type must be float32. `6000` by default. + post_nms_top_n (float): Number of total bboxes to be kept per image after NMS. The data type must be float32. + `1000` by default. + nms_thresh (float): Threshold in NMS. The data type must be float32. `0.5` by default. + min_size (float): Remove predicted boxes with either height or + width < min_size. The data type must be float32. `0.1` by default. + eta (float): Apply in adaptive NMS, if adaptive `threshold > 0.5`, + `adaptive_threshold = adaptive_threshold * eta` in each iteration. + return_rois_num (bool): When setting True, it will return a 1D Tensor with shape [N, ] that includes Rois's + num of each image in one batch. The N is the image's num. For example, the tensor has values [4,5] that represents + the first image has 4 Rois, the second image has 5 Rois. It only used in rcnn model. + 'False' by default. + name(str, optional): For detailed information, please refer + to :ref:`api_guide_Name`. Usually name is no need to set and + None by default. + Returns: + tuple: + A tuple with format ``(rpn_rois, rpn_roi_probs)``. + - **rpn_rois**: The generated RoIs. 2-D Tensor with shape ``[N, 4]`` while ``N`` is the number of RoIs. + The data type is the same as ``scores``. + - **rpn_roi_probs**: The scores of generated RoIs. 2-D Tensor with shape ``[N, 1]`` while ``N`` is the number of RoIs. + The data type is the same as ``scores``. + """ + assert in_dygraph_mode() + assert return_rois_num, "return_rois_num should be True in dygraph mode." + attrs = ('pre_nms_topN', pre_nms_top_n, 'post_nms_topN', post_nms_top_n, + 'nms_thresh', nms_thresh, 'min_size', min_size, 'eta', eta, + 'pixel_offset', pixel_offset) + rpn_rois, rpn_roi_probs, rpn_rois_num = core.ops.generate_proposals_v2( + scores, bbox_deltas, im_shape, anchors, variances, *attrs) + + return rpn_rois, rpn_roi_probs, rpn_rois_num + + +class ProposalGenerator(object): + """ + For each feature map, select the `pre_nms_topk` highest scoring proposals, + apply NMS, clip proposals, and remove small boxes. Return the `post_nms_topk` + highest scoring proposals among all the feature maps for each image. + + Attributes: + pre_nms_top_n (int): number of top k scoring proposals to keep before applying NMS. + When RPN is run on multiple feature maps (as in FPN) this number is per + feature map.Default 6000 + post_nms_top_n (int): number of top k scoring proposals to keep after applying NMS. + When RPN is run on multiple feature maps (as in FPN) this number is total, + over all feature maps.Default 1000 + nms_thresh (float): Threshold in NMS. default 0.5 + min_size (float): minimum proposal box side length in pixels (absolute units + wrt input images). + eta (float): Apply in adaptive NMS, if adaptive `threshold > 0.5`, + `adaptive_threshold = adaptive_threshold * eta` in each iteration. + default 1. + topk_after_collect (bool): whether to adopt topk after batch + collection. 
If topk_after_collect is true, box filter will not be + used after NMS at each image in proposal generation. default false + """ + + def __init__(self, + pre_nms_top_n = 6000, + post_nms_top_n = 1000, + nms_thresh = .5, + min_size = .1, + eta = 1., + topk_after_collect = False): + super(ProposalGenerator, self).__init__() + self.pre_nms_top_n = pre_nms_top_n + self.post_nms_top_n = post_nms_top_n + self.nms_thresh = nms_thresh + self.min_size = min_size + self.eta = eta + self.topk_after_collect = topk_after_collect + + def __call__(self, scores, bbox_deltas, anchors, imgs_shape): + top_n = self.pre_nms_top_n if self.topk_after_collect else self.post_nms_top_n + variances = paddle.ones_like(anchors) + rpn_rois, rpn_rois_prob, rpn_rois_num = generate_proposals( + scores, + bbox_deltas, + imgs_shape, + anchors, + variances, + pre_nms_top_n=self.pre_nms_top_n, + post_nms_top_n=top_n, + nms_thresh=self.nms_thresh, + min_size=self.min_size, + eta=self.eta, + return_rois_num=True + ) + + return rpn_rois, rpn_rois_prob, rpn_rois_num, self.post_nms_top_n + + +def roi_align(input, + rois, + output_size, + spatial_scale=1.0, + sampling_ratio=-1, + rois_num=None, + aligned=True): + """ + Region of interest align (also known as RoI align) is to perform + bilinear interpolation on inputs of nonuniform sizes to obtain + fixed-size feature maps (e.g. 7*7). + + Args: + input (Tensor): Input feature, 4D-Tensor with the shape of [N,C,H,W], + where N is the batch size, C is the input channel, H is Height, W is weight. + The data type is float32 or float64. + rois (Tensor): ROIs (Regions of Interest) to pool over.It should be + a 2-D Tensor or 2-D LoDTensor of shape (num_rois, 4), the lod level is 1. + The data type is float32 or float64. Given as [[x1, y1, x2, y2], ...], + (x1, y1) is the top left coordinates, and (x2, y2) is the bottom right coordinates. + output_size (list[int, int] | tuple[int, int]): The pooled output size(h, w), data type is int32. + spatial_scale (list[float32], optional): Multiplicative spatial scale factor to translate ROI coords + from their input scale to the scale used when pooling. Default: 1.0 + sampling_ratio(int32, optional): number of sampling points in the interpolation grid. + If <=0, then grid points are adaptive to roi_width and pooled_w, likewise for height. Default: -1 + rois_num (Tensor): The number of RoIs in each image. Default: None + name(str, optional): For detailed information, please refer + to :ref:`api_guide_Name`. Usually name is no need to set and + None by default. + + Returns: + Tensor: + Output: The output of ROIAlignOp is a 4-D tensor with shape (num_rois, channels, pooled_h, pooled_w). + The data type is float32 or float64. + """ + + if isinstance(output_size, int): + output_size = (output_size, output_size) + + pooled_height, pooled_width = output_size + + if in_dygraph_mode(): + assert rois_num is not None, "rois_num should not be None in dygraph mode." 
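+        # Note (descriptive comment, not in the original op docs above): the op
+        # returns a tensor of shape [sum(rois_num), channels, pooled_height,
+        # pooled_width]; with aligned=True the half-pixel offset is applied to
+        # the RoI coordinates before sampling.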
+ align_out = core.ops.roi_align( + input, rois, rois_num, "pooled_height", pooled_height, + "pooled_width", pooled_width, "spatial_scale", spatial_scale, + "sampling_ratio", sampling_ratio, "aligned", aligned) + + return align_out + + +def distribute_fpn_proposals(fpn_rois, + min_level, + max_level, + refer_level, + refer_scale, + pixel_offset=False, + rois_num=None): + """ + + **This op only takes LoDTensor as input.** In Feature Pyramid Networks + (FPN) models, it is needed to distribute all proposals into different FPN + level, with respect to scale of the proposals, the referring scale and the + referring level. Besides, to restore the order of proposals, we return an + array which indicates the original index of rois in current proposals. + + Args: + fpn_rois(tensor): 2-D Tensor with shape [N, 4] and data type is + float32 or float64. The input fpn_rois. + min_level(int32): The lowest level of FPN layer where the proposals come + from. + max_level(int32): The highest level of FPN layer where the proposals + come from. + refer_level(int32): The referring level of FPN layer with specified scale. + refer_scale(int32): The referring scale of FPN layer with specified level. + rois_num(tensor): 1-D Tensor contains the number of RoIs in each image. + The shape is [B] and data type is int32. B is the number of images. + If it is not None then return a list of 1-D Tensor. Each element + is the output RoIs' number of each image on the corresponding level + and the shape is [B]. None by default. + + Returns: + Tuple: + multi_rois(list[tensor]) : A list of 2-D LoDTensor with shape [M, 4] + and data type of float32 and float64. The length is + max_level-min_level+1. The proposals in each FPN level. + restore_ind(tensor): A 2-D Tensor with shape [N, 1], N is + the number of total rois. The data type is int32. It is + used to restore the order of fpn_rois. + rois_num_per_level(list(tensor)): A list of 1-D Tensor and each Tensor is + the RoIs' number in each image on the corresponding level. The shape + is [B] and data type of int32. B is the number of images. + + """ + num_lvl = max_level - min_level + 1 + + if in_dygraph_mode(): + assert rois_num is not None, "rois_num should not be None in dygraph mode." + attrs = ('min_level', min_level, 'max_level', max_level, 'refer_level', + refer_level, 'refer_scale', refer_scale, 'pixel_offset', + pixel_offset) + multi_rois, restore_ind, rois_num_per_level = core.ops.distribute_fpn_proposals( + fpn_rois, rois_num, num_lvl, num_lvl, *attrs) + + return multi_rois, restore_ind, rois_num_per_level + + +class RoIAlign(object): + ''' + Region of interest feature map pooler that supports pooling from + one or more feature maps. + ''' + def __init__( + self, + output_size, + scales, + sampling_ratio, + canonical_box_size=224, + canonical_level=4, + min_level=0, + max_level=3, + aligned=True + ): + ''' + Attributes: + output_size (int): output size of the pooled region. + scales (list[float]): The scale for each low-level pooling op relative to + the input image. For a feature map with stride s relative to the input + image, scale is defined as 1/s. The stride must be power of 2. + When there are multiple scales, they must form a pyramid, i.e. they must be + a monotically decreasing geometric sequence with a factor of 1/2. + sampling_ratio (int): The `sampling_ratio` parameter for the ROIAlign op. + canonical_box_size (int): A canonical box size in pixels (sqrt(box area)). 
The default + is heuristically defined as 224 pixels in the FPN paper (based on ImageNet + pre-training). + canonical_level (int): The feature map level index from which a canonically-sized box + should be placed. The default is defined as level 4 (stride=16) in the FPN paper, + i.e., a box of size 224x224 will be placed on the feature with stride=16. + The box placement for all boxes will be determined from their sizes w.r.t + canonical_box_size. For example, a box whose area is 4x that of a canonical box + should be used to pool features from feature level ``canonical_level+1``. + Note that the actual input feature maps given to this module may not have + sufficiently many levels for the input boxes. If the boxes are too large or too + small for the input feature maps, the closest level will be used. + start_level (int): The start level of FPN layer to extract RoI feature, default 0. + end_level (int): The end level of FPN layer to extract RoI feature, default 3. + aligned (bool): Whether to add offset to rois' coord in roi_align. default True. + ''' + super(RoIAlign, self).__init__() + self.output_size = output_size + self.scales = scales + self.sampling_ratio = sampling_ratio + self.canonical_box_size = canonical_box_size + self.canonical_level = canonical_level + self.min_level = min_level + self.max_level = max_level + self.aligned = aligned + + def __call__(self, feats, rois, rois_num): + ''' + Args: + feats (list[tensor]): features from fpn. + rois (list[tensor]): proposals from rpn. + rois_num (list[int]): the number of each img's proposals. + + Returns: + roi_features (tensor): A tensor of shape (M, C, output_size, output_size) + where M is the total number of boxes aggregated over all N batch images + and C is the number of channels in `x`. + ''' + if isinstance(rois_num, list): + rois_num = paddle.to_tensor(rois_num).astype("int32") + rois = paddle.concat(rois) + + if len(feats) == 1: + roi_features = roi_align( + feats[self.min_level], + rois, + self.output_size, + self.scales[0], + self.sampling_ratio, + rois_num=rois_num, + aligned=self.aligned + ) + + else: + rois_per_level, original_ind, rois_num_per_level = distribute_fpn_proposals( + rois, + self.min_level + 2, + self.max_level + 2, + self.canonical_level, + self.canonical_box_size, + rois_num=rois_num + ) + + roi_features_per_level = [] + + for l in range(self.min_level, self.max_level + 1): + roi_feats = roi_align( + feats[l], + rois_per_level[l], + self.output_size, + self.scales[l], + self.sampling_ratio, + rois_num=rois_num_per_level[l], + aligned = self.aligned + ) + + roi_features_per_level.append(roi_feats) + + roi_features = paddle.gather( + paddle.concat(roi_features_per_level), + original_ind + ) + + return roi_features + diff --git a/object_detection/PVTv2/det_heads/det_utils/target_assign.py b/object_detection/PVTv2/det_heads/det_utils/target_assign.py new file mode 100644 index 00000000..05f52019 --- /dev/null +++ b/object_detection/PVTv2/det_heads/det_utils/target_assign.py @@ -0,0 +1,304 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + + +import paddle +from .box_utils import boxes_iou, bbox2delta + +def anchor_target_matcher(match_quality_matrix, + positive_thresh, + negative_thresh, + allow_low_quality_matches, + low_thresh = -float("inf")): + ''' + This class assigns to each predicted "element" (e.g., a box) a ground-truth + element. Each predicted element will have exactly zero or one matches; each + ground-truth element may be matched to zero or more predicted elements. + + Args: + match_quality_matrix (tensor): an MxN tensor, containing the pairwise quality + between M ground-truth elements and N predicted elements. + positive_thresh (float): the positive class threshold of iou between anchors and gt. + negative_thresh (float): the negative class threshold of iou between anchors and gt. + allow_low_quality_matches (bool): if True, produce additional matches + for predictions with maximum match quality lower than high_threshold. + + Returns: + matches (tensor): a vector of length M, where matches[i] is a matched + ground-truth index in [0, M). + match_labels (tensor): a vector of length M, where pred_labels[i] indicates + whether a prediction is a true or false positive or ignored. + + ''' + # matches is 1 x M, the index of anchors matching gt + matched_vals, matches = paddle.topk(match_quality_matrix, k = 1, axis = 0) + match_labels = paddle.full(matches.shape, -1, dtype = "int32") + neg_idx = paddle.logical_and(matched_vals > low_thresh, + matched_vals < negative_thresh) + + match_labels = paddle.where(matched_vals >= positive_thresh, + paddle.ones_like(match_labels), + match_labels) + match_labels = paddle.where(neg_idx, + paddle.zeros_like(match_labels), + match_labels) + + # highest_quality_foreach_gt is N x 1 + # For each gt, find the prediction with which it has highest quality + if allow_low_quality_matches: + highest_quality_foreach_gt = match_quality_matrix.max(axis=1, keepdim=True) + pred_inds_with_highest_quality = paddle.logical_and( + match_quality_matrix > 0, match_quality_matrix == highest_quality_foreach_gt).cast('int32').sum( + 0, keepdim=True) + match_labels = paddle.where(pred_inds_with_highest_quality > 0, + paddle.ones_like(match_labels), + match_labels) + + matches = matches.flatten() + match_labels = match_labels.flatten() + + return matches, match_labels + + +# reference: https://github.com/facebookresearch/detectron2/blob/master/detectron2/modeling/sampling.py +def subsample_labels(labels, + num_samples, + positive_fraction, + bg_label=0): + """ + Return `num_samples` (or fewer, if not enough found) + random samples from `labels` which is a mixture of positives & negatives. + It will try to return as many positives as possible without + exceeding `positive_fraction * num_samples`, and then try to + fill the remaining slots with negatives. + + Args: + labels (tensor): shape (N, ) label vector with values: + * -1: ignore + * bg_label: background ("negative") class + * otherwise: one or more foreground ("positive") classes + num_samples (int): The total number of labels with value >= 0 to return. + Values that are not sampled will be filled with -1 (ignore). + positive_fraction (float): The number of subsampled labels with values > 0 + is `min(num_positives, int(positive_fraction * num_samples))`. The number + of negatives sampled is `min(num_negatives, num_samples - num_positives_sampled)`. + In order words, if there are not enough positives, the sample is filled with + negatives. 
If there are also not enough negatives, then as many elements are + sampled as is possible. + bg_label (int): label index of background ("negative") class. + + Returns: + pos_idx, neg_idx (tensor): + 1D vector of indices. The total length of both is `num_samples` or fewer. + """ + positive = paddle.nonzero(paddle.logical_and(labels != -1, labels != bg_label)) + negative = paddle.nonzero(labels == bg_label) + + num_pos = int(num_samples * positive_fraction) + # protect against not enough positive examples + num_pos = min(positive.numel(), num_pos) + num_neg = num_samples - num_pos + # protect against not enough negative examples + num_neg = min(negative.numel(), num_neg) + + if num_pos == 0 and num_neg == 0: + pos_idx = paddle.zeros([0], dtype='int32') + neg_idx = paddle.zeros([0], dtype='int32') + return pos_idx, neg_idx + + # randomly select positive and negative examples + negative = negative.cast('int32').flatten() + neg_perm = paddle.randperm(negative.numel(), dtype='int32')[:int(num_neg)] + neg_idx = paddle.gather(negative, neg_perm) + + if num_pos == 0: + pos_idx = paddle.zeros([0], dtype='int32') + return pos_idx, neg_idx + + positive = positive.cast('int32').flatten() + pos_perm = paddle.randperm(positive.numel(), dtype='int32')[:int(num_pos)] + pos_idx = paddle.gather(positive, pos_perm) + + return pos_idx, neg_idx + + +def anchor_target_assign(anchors, + gt_boxes, + positive_thresh, + negative_thresh, + batch_size_per_image, + positive_fraction, + allow_low_quality_matches=False, + is_crowd=None, + weights=[1., 1., 1., 1.]): + ''' + Args: + anchors (tensor): shape [-1, 4] the sum of muti-level anchors. + gt_boxes (list): gt_boxes[i] is the i-th img's gt_boxes. + positive_thresh (float): the positive class threshold of iou between anchors and gt. + negative_thresh (float): the negative class threshold of iou between anchors and gt. + batch_size_per_image (int): number of anchors per image to sample for training. + positive_fraction (float): fraction of foreground anchors to sample for training. + allow_low_quality_matches (bool): if True, produce additional matches + for predictions with maximum match quality lower than high_threshold. + is_crowd (list | None): is_crowd[i] is is_crowd label of the i-th img's gt_boxes. + weights (list): more detail please see bbox2delta. + + Returns: + tgt_labels (list[tensor]): tgt_labels[i].shape is [Ni], the label(positive or negative) of anchors. + tgt_bboxes (list[tensor]): tgt_bboxes[i].shape is [Ni, 4], the matched gt_boxes. + tgt_deltas (list[tensor]): tgt_deltas[i].shape is [Ni, 4], the deltas between anchors and gt_boxes. 
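+        All three returned lists contain one entry per image in the batch.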
+ ''' + tgt_labels = [] + tgt_bboxes = [] + tgt_deltas = [] + + low_thresh = -float("inf") + for i in range(len(gt_boxes)): + gt_bbox = gt_boxes[i] + n_gt = gt_bbox.shape[0] + + if n_gt == 0 or is_crowd is None: + n_is_crowd = 0 + else: + is_crowd_i = is_crowd[i] + n_is_crowd = paddle.nonzero(is_crowd_i).shape[0] + + match_quality_matrix, _ = boxes_iou(gt_bbox, anchors) + assert match_quality_matrix.dim() == 2 + + # ignore the iou between anchor and crowded ground-truth + if n_is_crowd > 0: + n_a = anchors.shape[0] + ones = paddle.ones([n_a]) + mask = is_crowd_i * ones + match_quality_matrix = match_quality_matrix * (1 - mask) - mask + low_thresh = -1 + # match_quality_matrix is N (gt) x M (predicted) + # assert (match_quality_matrix >= 0).all() + if match_quality_matrix.shape[0] == 0 or n_gt == n_is_crowd: + matches = paddle.full((match_quality_matrix.shape[1], ), 0, dtype='int64') + match_labels = paddle.full((match_quality_matrix.shape[1], ), 0, dtype='int32') + else: + matches, match_labels = anchor_target_matcher(match_quality_matrix, + positive_thresh, + negative_thresh, + allow_low_quality_matches, + low_thresh) + + pos_idx, neg_idx = subsample_labels(match_labels, + batch_size_per_image, + positive_fraction) + + # Fill with the ignore label (-1), then set positive and negative labels + labels = paddle.full(match_labels.shape, -1, dtype='int32') + if neg_idx.shape[0] > 0: + labels = paddle.scatter(labels, neg_idx, paddle.zeros_like(neg_idx)) + if pos_idx.shape[0] > 0: + labels = paddle.scatter(labels, pos_idx, paddle.ones_like(pos_idx)) + + if n_gt == 0: + matched_gt_boxes = paddle.zeros([0, 4]) + tgt_delta = paddle.zeros([0, 4]) + else: + matched_gt_boxes = paddle.gather(gt_bbox, matches) + tgt_delta = bbox2delta(anchors, matched_gt_boxes, weights) + matched_gt_boxes.stop_gradient = True + tgt_delta.stop_gradient = True + + labels.stop_gradient = True + tgt_labels.append(labels) + tgt_bboxes.append(matched_gt_boxes) + tgt_deltas.append(tgt_delta) + + return tgt_labels, tgt_bboxes, tgt_deltas + + +def roi_target_assign(proposals, + gt_boxes, + gt_classes, + num_classes, + positive_thresh, + negative_thresh, + batch_size_per_image, + positive_fraction, + allow_low_quality_matches=False): + ''' + It performs box matching between "roi" and "target",and assigns training labels + to the proposals. + + Args: + proposals (list[tensor]): the batch RoIs from rpn_head. + gt_boxes (list[tensor]): gt_boxes[i] is the i'th img's gt_boxes. + gt_classes (list[tensor]): gt_classes[i] is the i'th img's gt_classes. + num_classes (int): the number of class. + + Returns: + proposals_info (dict): a dict contains the information of proposals. 
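+            The dict holds, per image in the batch, the keys "num_fg",
+            "proposals", "num_proposals", "gt_boxes" and "gt_classes"
+            (see the end of this function).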
+ ''' + + proposals_info = {} + num_fg_samples = [] + proposals_samples = [] + num_proposals = [] + gt_boxes_samples = [] + gt_cls_samples = [] + + for proposals_single_img, bbox_single_img, label_single_img in zip(proposals, gt_boxes, gt_classes): + match_quality_matrix, _ = boxes_iou(bbox_single_img, proposals_single_img) + matched_idxs, matched_labels = anchor_target_matcher(match_quality_matrix, + positive_thresh, + negative_thresh, + allow_low_quality_matches) + + if label_single_img.numel() > 0: + label_single_img = label_single_img.flatten() # squeeze may get scalar + label_single_img = paddle.gather(label_single_img, matched_idxs) + label_single_img = paddle.where(matched_labels == 0, + paddle.full_like(label_single_img, num_classes), + label_single_img) + + label_single_img = paddle.where(matched_labels == -1, + paddle.full_like(label_single_img, -1), + label_single_img) + else: + label_single_img = paddle.zeros_like(matched_idxs) + num_classes + sample_gt_box = paddle.zeros_like(bbox_single_img) + + sampled_fg_idxs, sampled_bg_idxs = subsample_labels(label_single_img, + batch_size_per_image, + positive_fraction, + num_classes) + + sampled_idxs = paddle.concat([sampled_fg_idxs, sampled_bg_idxs]) + sample_proposal = paddle.gather(proposals_single_img, sampled_idxs) + sample_gt_cls = paddle.gather(label_single_img, sampled_idxs) + + if label_single_img.numel() > 0: + sample_box_idx = paddle.gather(matched_idxs, sampled_idxs) + sample_gt_box = paddle.gather(bbox_single_img, sample_box_idx) + + num_fg_samples.append(sampled_fg_idxs.shape[0]) + proposals_samples.append(sample_proposal) + num_proposals.append(sampled_idxs.shape[0]) + gt_boxes_samples.append(sample_gt_box) + gt_cls_samples.append(sample_gt_cls) + + proposals_info["num_fg"] = num_fg_samples + proposals_info["proposals"] = proposals_samples + proposals_info["num_proposals"] = num_proposals + proposals_info["gt_boxes"] = gt_boxes_samples + proposals_info["gt_classes"] = gt_cls_samples + + return proposals_info diff --git a/object_detection/PVTv2/det_heads/maskrcnn_head/config.py b/object_detection/PVTv2/det_heads/maskrcnn_head/config.py new file mode 100644 index 00000000..5293c9ec --- /dev/null +++ b/object_detection/PVTv2/det_heads/maskrcnn_head/config.py @@ -0,0 +1,51 @@ +import sys +import numpy as np +import paddle +from yacs.config import CfgNode as CN + +config = CN() +config.FPN = CN() +config.RPN = CN() +config.ROI = CN() +config.ROI.BOX_HEAD = CN() + +config.FPN.OUT_CHANNELS = 256 +config.RPN.ANCHOR_SIZE = [[32], [64], [128], [256], [512]] +config.RPN.ASPECT_RATIOS = [0.5, 1.0, 2.0] +config.RPN.STRIDES = [4, 8, 16, 32, 64] +config.RPN.OFFSET = 0.0 +config.RPN.PRE_NMS_TOP_N_TRAIN = 2000 +config.RPN.POST_NMS_TOP_N_TRAIN = 1000 +config.RPN.PRE_NMS_TOP_N_TEST = 1000 +config.RPN.POST_NMS_TOP_N_TEST = 1000 +config.RPN.NMS_THRESH = 0.7 +config.RPN.MIN_SIZE = 0.0 +config.RPN.TOPK_AFTER_COLLECT = True +config.RPN.POSITIVE_THRESH = 0.7 +config.RPN.NEGATIVE_THRESH = 0.3 +config.RPN.BATCH_SIZE_PER_IMG = 256 +config.RPN.POSITIVE_FRACTION = 0.5 +config.RPN.LOW_QUALITY_MATCHES = True + +config.ROI.SCORE_THRESH_INFER = 0.05 +config.ROI.NMS_THRESH_INFER = 0.5 +config.ROI.NMS_KEEP_TOPK_INFER =100 +config.ROI.NUM_ClASSES = 80 +config.ROI.POSITIVE_THRESH = 0.5 +config.ROI.NEGATIVE_THRESH = 0.5 +config.ROI.BATCH_SIZE_PER_IMG = 512 +config.ROI.POSITIVE_FRACTION = 0.25 +config.ROI.LOW_QUALITY_MATCHES = True +config.ROI.BOX_HEAD.REG_WEIGHTS = [10.0, 10.0, 5.0, 5.0] +config.ROI.BOX_HEAD.NUM_CONV = 0 +config.ROI.BOX_HEAD.CONV_DIM = 256 
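+# NUM_CONV = 0 (above) means BoxHead builds no conv layers; only the NUM_FC
+# fully-connected layers below (each of width FC_DIM) are used.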
+config.ROI.BOX_HEAD.NUM_FC = 2 +config.ROI.BOX_HEAD.FC_DIM = 1024 +config.ROI.SCALES = [1./4., 1./8., 1./16., 1./32., 1./64.] +config.ROI.ALIGN_OUTPUT_SIZE = 7 +config.ROI.SAMPLING_RATIO = 0 +config.ROI.CANONICAL_BOX_SIZE = 224 +config.ROI.CANONICAL_LEVEL = 4 +config.ROI.MIN_LEVEL = 0 +config.ROI.MAX_LEVEL = 3 +config.ROI.ALIGNED = True diff --git a/object_detection/PVTv2/det_heads/maskrcnn_head/roi_head.py b/object_detection/PVTv2/det_heads/maskrcnn_head/roi_head.py new file mode 100644 index 00000000..7d0a74e3 --- /dev/null +++ b/object_detection/PVTv2/det_heads/maskrcnn_head/roi_head.py @@ -0,0 +1,312 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import numpy as np + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from paddle.nn.initializer import XavierNormal, XavierUniform, Normal + +from ..det_utils.target_assign import roi_target_assign +from ..det_utils.generator_utils import RoIAlign +from ..det_utils.box_utils import bbox2delta, delta2bbox, multiclass_nms + + +class BoxHead(nn.Layer): + """ + A head with several 3x3 conv layers (each followed by norm & relu), then + several fc layers (each followed by relu) and followed by two linear layers + for predicting Fast R-CNN outputs. + """ + + def __init__( + self, + num_classes, + in_channels, + output_size, + num_conv, + conv_dim, + num_fc, + fc_dim, + ): + ''' + Attributes: + num_classes (int): the number of class. + in_channels (int): the channels of inputs. + output_size (int): the size of output from pooler. + num_conv (int): the number of conv. + conv_dim (int): the output channels of each conv. + num_fc (int): the number of fc. + fc_dim (int): the output channels of each fc. 
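+
+        For example, with the defaults in config.py (num_conv=0, num_fc=2,
+        fc_dim=1024) the head is: flatten -> fc(1024) -> relu -> fc(1024) -> relu,
+        followed by the cls_fc / reg_fc output layers.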
+ ''' + + super(BoxHead, self).__init__() + conv_dims = [conv_dim] * num_conv + fc_dims = [fc_dim] * num_fc + self.forward_net = nn.Sequential() + + for i, channel in enumerate(conv_dims): + conv = nn.Conv2D( + in_channels=in_channels, + out_channels=channel, + kernel_size=3, + padding=1, + weight_attr=paddle.ParamAttr(initializer=XavierNormal(fan_in=0.0)), + bias_attr=True + ) + + self.forward_net.add_sublayer("conv{}".format(i), conv) + self.forward_net.add_sublayer("act_c{}".format(i), nn.ReLU()) + in_channels = channel + + in_dim = output_size * output_size *in_channels + for i, out_dim in enumerate(fc_dims): + if i == 0: + self.forward_net.add_sublayer("flatten", nn.Flatten()) + + fc = nn.Linear(in_dim, + out_dim, + weight_attr=paddle.ParamAttr(initializer=XavierUniform(fan_in=in_dim, fan_out=in_dim))) + + self.forward_net.add_sublayer("linear{}".format(i), fc) + self.forward_net.add_sublayer("act_f{}".format(i), nn.ReLU()) + in_dim = out_dim + + self.cls_fc = nn.Linear(in_dim, + num_classes + 1, + weight_attr=paddle.ParamAttr(initializer=Normal(mean=0., std=0.01))) + + self.reg_fc = nn.Linear(in_dim, + num_classes * 4, + weight_attr=paddle.ParamAttr(initializer=Normal(mean=0., std=0.001))) + + def forward(self, inputs): + feats = self.forward_net(inputs) + pred_scores = self.cls_fc(feats) + pred_deltas = self.reg_fc(feats) + + return [pred_scores, pred_deltas] + + +class RoIHead(nn.Layer): + ''' + RoIHead will match proposals from RPNHead with gt (when training), + crop the regions and extract per-region features using proposals, + and make per-region predictions. + ''' + def __init__(self, config): + super(RoIHead, self).__init__() + self.config = config + + self.pooler = RoIAlign( + output_size=config.ROI.ALIGN_OUTPUT_SIZE, + scales=config.ROI.SCALES, + sampling_ratio=config.ROI.SAMPLING_RATIO, + canonical_box_size=config.ROI.CANONICAL_BOX_SIZE, + canonical_level=config.ROI.CANONICAL_LEVEL, + min_level=config.ROI.MIN_LEVEL, + max_level=config.ROI.MAX_LEVEL, + aligned=config.ROI.ALIGNED + ) + + self.predictor = BoxHead( + num_classes=config.ROI.NUM_ClASSES, + in_channels=config.FPN.OUT_CHANNELS, + output_size=config.ROI.ALIGN_OUTPUT_SIZE, + num_conv=config.ROI.BOX_HEAD.NUM_CONV, + conv_dim=config.ROI.BOX_HEAD.CONV_DIM, + num_fc=config.ROI.BOX_HEAD.NUM_FC, + fc_dim=config.ROI.BOX_HEAD.FC_DIM + ) + + def _det_forward(self, feats, proposals_info): + roi = proposals_info["proposals"] + rois_num = paddle.to_tensor(proposals_info["num_proposals"]).astype("int32") + roi_feats = self.pooler(feats, roi, rois_num) + predictions = self.predictor(roi_feats) + + return predictions + + def _get_loss(self, preds, proposals_info): + ''' + Args: + preds (list[tensor]): + pred_scores (tensor) shape is (num_proposals, num_cls + 1), The pred class score. + pred_deltas (tensor) shape is (num_proposals, num_cls * 4), The pred location. 
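+        Returns:
+            losses (dict): contains "loss_cls" (cross-entropy over num_cls + 1
+                classes) and "loss_reg" (L1 loss on the foreground box deltas).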
+ ''' + pred_scores, pred_deltas = preds + n_s = pred_deltas.shape[0] + + proposals = proposals_info["proposals"] + gt_classes = paddle.concat(proposals_info["gt_classes"]).reshape([-1]) + gt_boxes = paddle.concat(proposals_info["gt_boxes"]) + + if len(proposals) == 0: + proposals = paddle.zeros(shape=[n_s, 4], dtype="float32") + tgt_scores = paddle.full(shape=[n_s,], fill_value=-1, dtype="float32") + tgt_boxes = paddle.zeros(shape=[n_s, 4], dtype="float32") + else: + proposals = paddle.concat(proposals) + tgt_scores = gt_classes.reshape([-1, 1]) + tgt_boxes = gt_boxes.reshape([-1, 4]) + + losses = { + "loss_cls": F.cross_entropy(pred_scores, tgt_scores.astype("int64"), reduction='mean') + } + + fg_idx = paddle.nonzero( + paddle.logical_and(gt_classes >= 0, gt_classes < self.config.ROI.NUM_ClASSES) + ).flatten() + + #TODO: errors raised when fg_idx is [] tensor, when train from scratch + fg_cls_base = paddle.gather(x=gt_classes, index=fg_idx) + fg_cls_start = paddle.arange(0, self.config.ROI.NUM_ClASSES * fg_idx.shape[0], self.config.ROI.NUM_ClASSES) + fg_cls_idx = fg_cls_base + fg_cls_start + fg_cls_idx = fg_cls_idx.astype('int64') + + fg_idx.stop_gradient = True + tgt_boxes.stop_gradient = True + proposals.stop_gradient = True + tgt_scores.stop_gradient = True + fg_cls_base.stop_gradient = True + fg_cls_start.stop_gradient = True + + pred_deltas = pred_deltas.reshape([-1, self.config.ROI.NUM_ClASSES, 4]) + pred_deltas = paddle.gather(pred_deltas, fg_idx, axis=0).reshape([-1, 4]) + + pred_deltas = paddle.gather(pred_deltas, fg_cls_idx) + + tgt_boxes = paddle.gather(tgt_boxes, fg_idx) + proposals = paddle.gather(proposals, fg_idx) + + tgt_deltas = bbox2delta(proposals, tgt_boxes, self.config.ROI.BOX_HEAD.REG_WEIGHTS) + + loss_reg = F.l1_loss(pred_deltas, tgt_deltas, reduction="sum") / max(gt_classes.numel(), 1.0) + + losses["loss_reg"] = loss_reg + + return losses + + def _inference(self, preds, proposals_info, inputs): + num_proposals = proposals_info["num_proposals"] + proposals = proposals_info["proposals"] + proposals = paddle.concat(proposals) + + if not len(num_proposals): + return None + + pred_scores, pred_deltas = preds + + # pred_bbox shape [num_proposals_all, num_classes, 4] + pred_bbox = delta2bbox(pred_deltas, + proposals, + self.config.ROI.BOX_HEAD.REG_WEIGHTS) + + pred_bbox_list = paddle.split(pred_bbox, num_proposals) + pred_bbox_list = paddle.split(pred_bbox, num_proposals) + pred_scores = F.softmax(pred_scores) + pred_scores_list = paddle.split(pred_scores, num_proposals) + + post_pred = [] + for i in range(len(pred_bbox_list)): + num_p = num_proposals[i] + img_pred_boxes = pred_bbox_list[i] + img_pred_scores = pred_scores_list[i] + img_hw = inputs["imgs_shape"][i] + img_scale_factor = inputs["scale_factor_wh"][i] + + img_pred_boxes[:, :, 0::2] = paddle.clip( + img_pred_boxes[:, :, 0::2], min=0, max=img_hw[1] + ) / img_scale_factor[0] + + img_pred_boxes[:, :, 1::2] = paddle.clip( + img_pred_boxes[:, :, 1::2], min=0, max=img_hw[0] + ) / img_scale_factor[1] + + + output = multiclass_nms(bboxes=img_pred_boxes, + scores=img_pred_scores[:, :-1], + score_threshold=self.config.ROI.SCORE_THRESH_INFER, + keep_top_k=self.config.ROI.NMS_KEEP_TOPK_INFER, + nms_threshold=self.config.ROI.NMS_THRESH_INFER, + background_label=self.config.ROI.NUM_ClASSES, + rois_num=paddle.to_tensor([num_p]).astype("int32")) + + + if output[1][0] == 0: + post_pred.append(paddle.to_tensor([])) + continue + + post_label = output[0][:, 0:1] + post_score = output[0][:, 1:2] + post_boxes = output[0][:, 2:] + + 
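+            # drop degenerate boxes whose width or height is non-positive before
+            # assembling the final [label, score, x0, y0, x1, y1] rows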
boxes_w = post_boxes[:, 2] - post_boxes[:, 0] + boxes_h = post_boxes[:, 3] - post_boxes[:, 1] + + keep = paddle.nonzero(paddle.logical_and(boxes_w > 0., boxes_h > 0.)).flatten() + + post_label = paddle.gather(post_label, keep) + post_score = paddle.gather(post_score, keep) + post_boxes = paddle.gather(post_boxes, keep) + + final_output = paddle.concat([post_label, post_score, post_boxes], axis=-1) + post_pred.append(final_output) + + return post_pred + + def forward(self, feats, proposals, inputs): + ''' + Args: + feats (list[tensor]): the outputs of fpn. + proposals (list[tensor]): list[i] denotes the proposals of the i'th imgs + from rpn head. + inputs (dict): the gt info, eg. gt_boxes, gt_classes, imgs_wh and so on. + + Returns: + losses (dict) | outputs (list[tensor]): + losses contains cls_losses and reg_losses. + the shape of outputs[i] is [M, 6], M is the number of final preds, + Each row has 6 values: [label, score, xmin, ymin, xmax, ymax] + ''' + + if self.training: + proposals_info = roi_target_assign( + proposals, + inputs["gt_boxes"], + inputs["gt_classes"], + self.config.ROI.NUM_ClASSES, + self.config.ROI.POSITIVE_THRESH, + self.config.ROI.NEGATIVE_THRESH, + self.config.ROI.BATCH_SIZE_PER_IMG, + self.config.ROI.POSITIVE_FRACTION, + self.config.ROI.LOW_QUALITY_MATCHES + ) + + predictions = self._det_forward(feats, proposals_info) + losses = self._get_loss(predictions, proposals_info) + + return losses + + else: + proposals_info = {"num_proposals": [len(proposal) for proposal in proposals]} + proposals_info["proposals"] = proposals + + + predictions = self._det_forward(feats, proposals_info) + outputs = self._inference(predictions, proposals_info, inputs) + + return outputs diff --git a/object_detection/PVTv2/det_heads/maskrcnn_head/rpn_head.py b/object_detection/PVTv2/det_heads/maskrcnn_head/rpn_head.py new file mode 100644 index 00000000..57937ce3 --- /dev/null +++ b/object_detection/PVTv2/det_heads/maskrcnn_head/rpn_head.py @@ -0,0 +1,238 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from paddle.nn.initializer import Normal + +import sys +from ..det_utils.generator_utils import AnchorGenerator, ProposalGenerator +from ..det_utils.target_assign import anchor_target_assign + + +class RPNHead(nn.Layer): + """ + Region Proposal Network uses a 3x3 conv to produce a shared hidden state from which one 1x1 conv + predicts objectness logits for each anchor and a second 1x1 conv predicts bounding-box deltas. + + Attributes: + anchor_generator (class): the generator of anchor. + train_proposal (class): configure of proposals generation at the stage of training. + test_proposal (class): configure of proposals generation at the stage of prediction. + in_channels (int): channel of input feature maps which can be derived by from_config. 
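+
+    Example call (names illustrative): ``rois, rois_num, rpn_losses =
+    rpn_head(fpn_feats, inputs)``; ``rpn_losses`` is ``None`` in eval mode.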
+ """ + def __init__(self, config): + super(RPNHead, self).__init__() + self.anchor_generator = AnchorGenerator(anchor_sizes=config.RPN.ANCHOR_SIZE, + aspect_ratios=config.RPN.ASPECT_RATIOS, + strides=config.RPN.STRIDES, + offset=config.RPN.OFFSET) + self.train_proposal = ProposalGenerator(pre_nms_top_n=config.RPN.PRE_NMS_TOP_N_TRAIN, + post_nms_top_n=config.RPN.POST_NMS_TOP_N_TRAIN, + nms_thresh=config.RPN.NMS_THRESH, + min_size=config.RPN.MIN_SIZE, + topk_after_collect=config.RPN.TOPK_AFTER_COLLECT) + self.test_proposal = ProposalGenerator(pre_nms_top_n=config.RPN.PRE_NMS_TOP_N_TEST, + post_nms_top_n=config.RPN.POST_NMS_TOP_N_TEST, + nms_thresh=config.RPN.NMS_THRESH, + min_size=config.RPN.MIN_SIZE, + topk_after_collect=config.RPN.TOPK_AFTER_COLLECT) + + self.num_anchors = self.anchor_generator.num_anchors + + num_channels = config.FPN.OUT_CHANNELS + self.conv = nn.Conv2D(num_channels, + num_channels, + kernel_size=3, + padding=1, + weight_attr=paddle.ParamAttr(initializer=Normal(mean=0., std=0.01))) + + self.objectness_logits = nn.Conv2D(num_channels, + self.num_anchors, + kernel_size=1, + padding=0, + weight_attr=paddle.ParamAttr(initializer=Normal(mean=0., std=0.01))) + + self.anchor_deltas = nn.Conv2D(num_channels, + self.num_anchors * 4, + kernel_size=1, + padding=0, + weight_attr=paddle.ParamAttr(initializer=Normal(mean=0., std=0.01))) + + self.config = config + + def predict(self, feats): + ''' + Predict the logits of each feature and the deltas of the anchors in each feature. + + Args: + feats (list[tensor]): Mutil-level feature from fpn. + + Returns: + pred_objectness_logits (list[tensor]): A list of L elements.Element i is a tensor of shape (N, A, Hi, Wi) representing + the predicted objectness logits for all anchors. A is the number of cell anchors. + pred_anchor_deltas (list[tensor]): A list of L elements. Element i is a tensor of shape (N, A * 4, Hi, Wi) + representing the predicted "deltas" used to transform anchors to proposals. + ''' + + pred_objectness_logits = [] + pred_anchor_deltas = [] + for idx, feat in enumerate(feats): + out = F.relu(self.conv(feat)) + pred_objectness_logits.append(self.objectness_logits(out)) + pred_anchor_deltas.append(self.anchor_deltas(out)) + + return pred_objectness_logits, pred_anchor_deltas + + def _get_proposals(self, scores, bbox_deltas, anchors, inputs): + ''' + Args: + scores (list[tensor]): the prediction logits of the mutil-level features. + scores[i].shape is [N, A, Hi, Wi] + bbox_deltas (list[tensor]): the prediction anchor deltas of the mutil-level features. + bbox_deltas[i].shape is [N, 4 * A, Hi, Wi] + anchors (list[tensor]): the prediction anchor of the mutil-level features. + anchors[i].shape is [Hi * Wi * A, 4] + inputs (dict): ground truth info. 
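+
+        Returns:
+            batch_proposal_rois (list[tensor]): per-image proposals kept after NMS
+                and top-k selection.
+            batch_proposal_rois_num (list[int]): number of kept proposals per image.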
+ ''' + proposal_gen = self.train_proposal if self.training else self.test_proposal + + # TODO: fix error when inputs is None (eval mode) + imgs_shape = inputs["imgs_shape"] + if isinstance(imgs_shape, list): + imgs_shape = paddle.stack(imgs_shape).astype("float32") + + batch_size = len(imgs_shape) + + batch_proposal_rois = [] + batch_proposal_rois_num = [] + for i in range(batch_size): + single_img_rois_list = [] + single_img_prob_list = [] + + for level_scores, level_deltas, level_anchors in zip(scores, bbox_deltas, anchors): + level_rois, level_rois_prob, _, post_nms_top_n = proposal_gen( + scores = level_scores[i:i+1], + bbox_deltas = level_deltas[i:i+1], + anchors = level_anchors, + imgs_shape = imgs_shape[i:i+1] + ) + if level_rois.shape[0] > 0: + single_img_rois_list.append(level_rois) + single_img_prob_list.append(level_rois_prob) + + + if len(single_img_rois_list) == 0: + single_img_rois = paddle.zeros(shape=[0, 4]).astype("float32") + else: + single_img_rois = paddle.concat(single_img_rois_list) + single_img_prob = paddle.concat(single_img_prob_list).flatten() + + if single_img_prob.shape[0] > post_nms_top_n: + single_img_topk_prob, topk_inds = paddle.topk(single_img_prob, post_nms_top_n) + single_img_topk_rois = paddle.gather(single_img_rois, topk_inds) + else: + single_img_topk_rois = single_img_rois + + batch_proposal_rois.append(single_img_topk_rois) + batch_proposal_rois_num.append(single_img_topk_rois.shape[0]) + + return batch_proposal_rois, batch_proposal_rois_num + + def _get_losses(self, pred_logits, pred_loc, anchors, inputs): + anchors = paddle.concat(anchors) + gt_boxes = inputs["gt_boxes"] + is_crowd = inputs.get("is_crowd", None) + + tgt_scores, tgt_bboxes, tgt_deltas = anchor_target_assign( + anchors, + gt_boxes, + positive_thresh = self.config.RPN.POSITIVE_THRESH, + negative_thresh = self.config.RPN.NEGATIVE_THRESH, + batch_size_per_image = self.config.RPN.BATCH_SIZE_PER_IMG, + positive_fraction = self.config.RPN.POSITIVE_FRACTION, + allow_low_quality_matches = self.config.RPN.LOW_QUALITY_MATCHES, + is_crowd = is_crowd + ) + + # reshape to [N, Hi * Wi * A, 1] for compute loss + pred_scores = [ + s.transpose([0, 2, 3, 1]).reshape([s.shape[0], -1, 1]) for s in pred_logits + ] + + pred_deltas = [ + d.transpose([0, 2, 3, 1]).reshape([d.shape[0], -1, 4]) for d in pred_loc + ] + + pred_scores = paddle.concat(pred_scores, axis = 1).reshape([-1]) + pred_deltas = paddle.concat(pred_deltas, axis = 1).reshape([-1, 4]) + + tgt_scores = paddle.concat(tgt_scores).astype("float32") + tgt_deltas = paddle.concat(tgt_deltas).astype("float32") + tgt_scores.stop_gradient = True + tgt_deltas.stop_gradient = True + + pos_idx = paddle.nonzero(tgt_scores == 1) + valid_idx = paddle.nonzero(tgt_scores >= 0) + + if valid_idx.shape[0] == 0: + loss_rpn_cls = paddle.zeros([1]).astype("float32") + else: + pred_scores = paddle.gather(pred_scores, valid_idx) + tgt_scores = paddle.gather(tgt_scores, valid_idx).astype("float32") + tgt_scores.stop_gradient = True + loss_rpn_cls = F.binary_cross_entropy_with_logits( + logit=pred_scores, + label=tgt_scores, + reduction="sum" + ) + + if pos_idx.shape[0] == 0: + loss_rpn_reg = paddle.zeros([1]).astype("float32") + else: + pred_deltas = paddle.gather(pred_deltas, pos_idx) + tgt_deltas = paddle.gather(tgt_deltas, pos_idx) + loss_rpn_reg = paddle.abs(pred_deltas - tgt_deltas).sum() + + norm = self.config.RPN.BATCH_SIZE_PER_IMG * len(gt_boxes) + + return { + 'loss_rpn_cls': loss_rpn_cls / norm, + 'loss_rpn_reg': loss_rpn_reg / norm + } + + def 
forward(self, feats, inputs): + ''' + Args: + feats (list[tensor]): Mutil-level feature from fpn. + inputs (dict): ground truth info. + + Returns: + rois (list[tensor]): rois[i] is proposals of the i'th img. + rois_num (list[int]): rois[i] is number of the i'th img's proposals. + losses_dict (dict | None): when training is dict contains loss_rpn_cls and loss_rpn_reg. + ''' + pred_objectness_logits, pred_anchor_deltas = self.predict(feats) + anchors = self.anchor_generator(feats) + + rois, rois_num = self._get_proposals(pred_objectness_logits, pred_anchor_deltas, anchors, inputs) + + if self.training: + losses_dict = self._get_losses(pred_objectness_logits, pred_anchor_deltas, anchors, inputs) + + return rois, rois_num, losses_dict + else: + return rois, rois_num, None diff --git a/object_detection/PVTv2/det_heads/retinanet_head/config.py b/object_detection/PVTv2/det_heads/retinanet_head/config.py new file mode 100644 index 00000000..8799956c --- /dev/null +++ b/object_detection/PVTv2/det_heads/retinanet_head/config.py @@ -0,0 +1,27 @@ +import numpy as np +import paddle +from yacs.config import CfgNode as CN + +config = CN() +config.RETINANET = CN() + +config.RETINANET.NUM_CONVS = 4 +config.RETINANET.INPUT_CHANNELS = 256 +config.RETINANET.NORM = "" +config.RETINANET.PRIOR_PROB = 0.01 +config.RETINANET.NUM_CLASSES = 80 +config.RETINANET.FOCAL_LOSS_ALPHA = 0.25 +config.RETINANET.FOCAL_LOSS_GAMMA = 2 +config.RETINANET.SMOOTHL1_LOSS_DELTA = 0 +config.RETINANET.POSITIVE_THRESH = 0.5 +config.RETINANET.NEGATIVE_THRESH = 0.4 +config.RETINANET.ALLOW_LOW_QUALITY = True +config.RETINANET.WEIGHTS = [1.0, 1.0, 1.0, 1.0] +config.RETINANET.SCORE_THRESH = 0.05 +config.RETINANET.KEEP_TOPK = 100 +config.RETINANET.NMS_TOPK = 1000 +config.RETINANET.NMS_THRESH = 0.5 +config.RETINANET.ANCHOR_SIZE = [[x, x * 2**(1.0/3), x * 2**(2.0/3)] for x in [32, 64, 128, 256, 512 ]] +config.RETINANET.ASPECT_RATIOS = [0.5, 1.0, 2.0] +config.RETINANET.STRIDES = [8.0, 16.0, 32.0, 64.0, 128.0] +config.RETINANET.OFFSET = 0 \ No newline at end of file diff --git a/object_detection/PVTv2/det_heads/retinanet_head/post_process.py b/object_detection/PVTv2/det_heads/retinanet_head/post_process.py new file mode 100644 index 00000000..79a5def8 --- /dev/null +++ b/object_detection/PVTv2/det_heads/retinanet_head/post_process.py @@ -0,0 +1,121 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn.functional as F + +from det_utils.box_utils import nonempty_bbox, delta2bbox, multiclass_nms + +class RetinaNetPostProcess(object): + ''' + This class used to post_process the RetianNet-Head's output. 
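+    It decodes the per-level box deltas against the anchors, rescales the boxes
+    back to the original image size, and applies multi-class NMS.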
+ ''' + def __init__(self, + score_threshold, + keep_top_k, + nms_top_k, + nms_threshold, + bbox_reg_weights=[1.0, 1.0, 1.0, 1.0]): + super(RetinaNetPostProcess, self).__init__() + self.score_threshold=score_threshold + self.keep_topk=keep_top_k + self.topk_candidates=nms_top_k + self.num_thresh=nms_threshold + self.bbox_reg_weights = bbox_reg_weights + + def _process_single_level_pred(self, box_lvl, score_lvl, anchors, scale_factor_wh, img_whwh): + if isinstance(scale_factor_wh, list): + scale_factor_wh = paddle.concat(scale_factor_wh) + if isinstance(img_whwh, list): + img_whwh = paddle.concat(img_whwh) + + score_lvl = paddle.transpose(score_lvl, [0, 2, 1]) + score_lvl = F.sigmoid(score_lvl) + + batch_lvl = [] + for i in range(len(img_whwh)): + box_lvl_i = delta2bbox(box_lvl[i], + anchors, + self.bbox_reg_weights).reshape(anchors.shape) + + box_lvl_i[:, 0::2] = paddle.clip( + box_lvl_i[:, 0::2], min=0, max=img_whwh[i][0] + ) / scale_factor_wh[i][0] + box_lvl_i[:, 1::2] = paddle.clip( + box_lvl_i[:, 1::2], min=0, max=img_whwh[i][1] + ) / scale_factor_wh[i][1] + + batch_lvl.append(box_lvl_i) + + box_lvl = paddle.stack(batch_lvl) + + return box_lvl, score_lvl + + def __call__(self, pred_scores_list, pred_boxes_list, anchors, scale_factor_wh, img_whwh): + """ + Args: + pred_scores_list (list[Tensor]): tensor of shape (batch_size, R, num_classes). + The tensor predicts the classification probability for each proposal. + pred_boxes_list (list[Tensor]): tensors of shape (batch_size, R, 4). + The tensor predicts anchor's delta + anchors (list[Tensor]): mutil-level anchors. + scale_factor_wh (Tensor): tensors of shape [batch_size, 2] the scalor of per img + img_whwh (Tensor): tensors of shape [batch_size, 4] + Returns: + bbox_pred (Tensor): tensors of shape [num_boxes, 6] Each row has 6 values: + [label, confidence, xmin, ymin, xmax, ymax] + bbox_num (Tensor): tensors of shape [batch_size] the number of RoIs in each image. + """ + assert len(pred_boxes_list[0]) == len(scale_factor_wh) == len(img_whwh) + assert len(pred_boxes_list) == len(anchors) + + mutil_level_bbox = [] + mutil_level_score = [] + + for i in range(len(pred_boxes_list)): + lvl_res_b, lvl_res_s = self._process_single_level_pred( + pred_boxes_list[i], + pred_scores_list[i], + anchors[i], + scale_factor_wh, + img_whwh) + + mutil_level_bbox.append(lvl_res_b) + mutil_level_score.append(lvl_res_s) + + pred_boxes = paddle.concat(mutil_level_bbox, axis=1) # [N, \sum_{i=0}^{n} (Hi * Wi), 4] + pred_scores = paddle.concat(mutil_level_score, axis=2) + + assert pred_boxes.shape[1] == pred_scores.shape[2] + + bbox_pred, bbox_num, _ = multiclass_nms( + pred_boxes, + pred_scores, + score_threshold=self.score_threshold, + keep_top_k=self.keep_topk, + nms_top_k=self.topk_candidates, + nms_threshold=self.num_thresh, + ) + + pred_label = bbox_pred[:, 0:1] + pred_score = bbox_pred[:, 1:2] + pred_bbox = bbox_pred[:, 2:] + keep_mask = nonempty_bbox(pred_bbox, return_mask=True) + keep_mask = paddle.unsqueeze(keep_mask, [1]) + pred_label = paddle.where(keep_mask, pred_label, + paddle.ones_like(pred_label) * -1) + + pred_result = paddle.concat([pred_label, pred_score, pred_bbox], axis=1) + + return pred_result, bbox_num diff --git a/object_detection/PVTv2/det_heads/retinanet_head/retinanet_head.py b/object_detection/PVTv2/det_heads/retinanet_head/retinanet_head.py new file mode 100644 index 00000000..2230323f --- /dev/null +++ b/object_detection/PVTv2/det_heads/retinanet_head/retinanet_head.py @@ -0,0 +1,166 @@ +# Copyright (c) 2021 PPViT Authors. 
All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import math + +import paddle +import paddle.nn as nn + +from paddle.nn.initializer import Normal, Constant + +from retinanet_loss import RetinaNetLoss +from post_process import RetinaNetPostProcess +from det_utils.generator_utils import AnchorGenerator + +class RetinaNetHead(nn.Layer): + ''' + The head used in RetinaNet for object classification and box regression. + It has two subnets for the two tasks, with a common structure but separate parameters. + ''' + def __init__(self, config): + ''' + Args: + input_shape (List[ShapeSpec]): input shape. + num_classes (int): number of classes. Used to label background proposals. + num_anchors (int): number of generated anchors. + conv_dims (List[int]): dimensions for each convolution layer. + norm (str or callable): + Normalization for conv layers except for the two output layers. + See :func:`detectron2.layers.get_norm` for supported types. + loss_func (class): the class is used to compute loss. + prior_prob (float): Prior weight for computing bias. + ''' + super(RetinaNetHead, self).__init__() + + num_convs = config.RETINANET.NUM_CONVS + input_channels = config.RETINANET.INPUT_CHANNELS + norm = config.RETINANET.NORM + prior_prob = config.RETINANET.PRIOR_PROB + + self.num_classes = config.RETINANET.NUM_CLASSES + self.get_loss = RetinaNetLoss( + focal_loss_alpha=config.RETINANET.FOCAL_LOSS_ALPHA, + focal_loss_gamma=config.RETINANET.FOCAL_LOSS_GAMMA, + smoothl1_loss_delta=config.RETINANET.SMOOTHL1_LOSS_DELTA, + positive_thresh=config.RETINANET.POSITIVE_THRESH, + negative_thresh=config.RETINANET.NEGATIVE_THRESH, + allow_low_quality=config.RETINANET.ALLOW_LOW_QUALITY, + num_classes=config.RETINANET.NUM_CLASSES, + weights=config.RETINANET.WEIGHTS + ) + self.postprocess = RetinaNetPostProcess( + score_threshold=config.RETINANET.SCORE_THRESH, + keep_top_k=config.RETINANET.KEEP_TOPK, + nms_top_k=config.RETINANET.NMS_TOPK, + nms_threshold=config.RETINANET.NMS_THRESH, + bbox_reg_weights=config.RETINANET.WEIGHTS + ) + self.anchor_generator = AnchorGenerator(anchor_sizes=config.RETINANET.ANCHOR_SIZE, + aspect_ratios=config.RETINANET.ASPECT_RATIOS, + strides=config.RETINANET.STRIDES, + offset=config.RETINANET.OFFSET) + + num_anchors = self.anchor_generator.num_anchors + conv_dims = [input_channels] * num_convs + + cls_net = [] + reg_net = [] + + for in_channels, out_channels in zip( + [input_channels] + list(conv_dims), conv_dims + ): + cls_net.append( + nn.Conv2D(in_channels, out_channels, kernel_size=3, stride=1, padding=1, + weight_attr=paddle.ParamAttr(initializer=Normal(mean=0., std=0.01))) + ) + if norm == "bn": + cls_net.append(nn.BatchNorm2D(out_channels)) + cls_net.append(nn.ReLU()) + + reg_net.append( + nn.Conv2D(in_channels, out_channels, kernel_size=3, stride=1, padding=1, + weight_attr=paddle.ParamAttr(initializer=Normal(mean=0., std=0.01))) + ) + if norm == "bn": + reg_net.append(nn.BatchNorm2D(out_channels)) + reg_net.append(nn.ReLU()) + + self.cls_net = 
nn.Sequential(*cls_net) + self.reg_net = nn.Sequential(*reg_net) + + bias_value = -math.log((1 - prior_prob) / prior_prob) + self.cls_score = nn.Conv2D( + conv_dims[-1], num_anchors * self.num_classes, kernel_size=3, stride=1, padding=1, + weight_attr=paddle.ParamAttr(initializer=Normal(mean=0., std=0.01)), + bias_attr=paddle.ParamAttr(initializer=Constant(bias_value)) + ) + self.bbox_pred = nn.Conv2D( + conv_dims[-1], num_anchors * 4, kernel_size=3, stride=1, padding=1, + weight_attr=paddle.ParamAttr(initializer=Normal(mean=0., std=0.01)) + ) + + def forward(self, feats, inputs): + ''' + Returns: + loss_dict (dict) | pred_result(tensor), bbox_num(tensor): + loss_dict: contains cls_losses and reg_losses. + pred_result: the shape is [M, 6], M is the number of final preds, + Each row has 6 values: [label, score, xmin, ymin, xmax, ymax] + bbox_num: the shape is [N], N is the num of batch_size, + bbox_num[i] means the i'th img have bbox_num[i] boxes. + ''' + anchors = self.anchor_generator(feats) + + pred_scores = [] + pred_boxes = [] + + for feat in feats: + pred_scores.append(self.cls_score(self.cls_net(feat))) + pred_boxes.append(self.bbox_pred(self.reg_net(feat))) + + pred_scores_list = [ + transpose_to_bs_hwa_k(s, self.num_classes) for s in pred_scores + ] + pred_boxes_list = [ + transpose_to_bs_hwa_k(s, 4) for s in pred_boxes + ] + + if self.training: + anchors = paddle.concat(anchors) + loss_dict = self.get_loss(anchors, [pred_scores_list, pred_boxes_list], inputs) + + return loss_dict + + else: + img_whwh = paddle.concat([inputs["imgs_shape"][:, 1:2], + inputs["imgs_shape"][:, 0:1]], axis=-1) + pred_result, bbox_num = self.postprocess( + pred_scores_list, + pred_boxes_list, + anchors, + inputs["scale_factor_wh"], + img_whwh + ) + + return pred_result, bbox_num + + +def transpose_to_bs_hwa_k(tensor, k): + assert tensor.dim() == 4 + bs, _, h, w = tensor.shape + tensor = tensor.reshape([bs, -1, k, h, w]) + tensor = tensor.transpose([0, 3, 4, 1, 2]) + + return tensor.reshape([bs, -1, k]) diff --git a/object_detection/PVTv2/det_heads/retinanet_head/retinanet_loss.py b/object_detection/PVTv2/det_heads/retinanet_head/retinanet_loss.py new file mode 100644 index 00000000..53cf722b --- /dev/null +++ b/object_detection/PVTv2/det_heads/retinanet_head/retinanet_loss.py @@ -0,0 +1,142 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
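+
+# RetinaNet training losses: sigmoid focal loss for classification and a
+# (smooth-)L1 loss for box regression, normalized by a running estimate of the
+# number of positive anchors (self.loss_normalizer).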
+ + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + +import sys +sys.path.append("PPViT-od_head/object_detection/head") +from det_utils.box_utils import bbox2delta, boxes_iou +from det_utils.target_assign import anchor_target_matcher + +class RetinaNetLoss(nn.Layer): + def __init__( + self, + focal_loss_alpha, + focal_loss_gamma, + smoothl1_loss_delta, + positive_thresh, + negative_thresh, + allow_low_quality=True, + num_classes=80, + weights=[1.0, 1.0, 1.0, 1.0] + ): + super(RetinaNetLoss, self).__init__() + + self.num_classes = num_classes + self.focal_loss_alpha = focal_loss_alpha + self.focal_loss_gamma = focal_loss_gamma + self.smoothl1_loss_delta = smoothl1_loss_delta + self.positive_thresh = positive_thresh + self.negative_thresh = negative_thresh + self.allow_low_quality = allow_low_quality + self.weights = weights + + self.loss_normalizer = 100 + self.loss_normalizer_momentum = 0.9 + + def label_anchors(self, anchors, gt): + batch_gt_box = gt["gt_boxes"] + batch_gt_class = gt["gt_classes"] + + gt_labels_list = [] + gt_boxes_list = [] + + for i in range(len(batch_gt_box)): + gt_boxes = batch_gt_box[i] + gt_classes = batch_gt_class[i].flatten() + + match_quality_matrix, _ = boxes_iou(gt_boxes, anchors) + matches_idxs, match_labels = anchor_target_matcher( + match_quality_matrix, + self.positive_thresh, + self.negative_thresh, + self.allow_low_quality, + low_thresh = -float("inf") + ) + + if len(gt_boxes) > 0: + matched_boxes_i = paddle.gather(gt_boxes, matches_idxs) + matched_classes_i = paddle.gather(gt_classes, matches_idxs) + matched_classes_i = paddle.where(match_labels == 0, + paddle.full_like(matched_classes_i, self.num_classes), + matched_classes_i) + matched_classes_i = paddle.where(match_labels == -1, + paddle.full_like(matched_classes_i, -1), + matched_classes_i) + else: + matched_boxes_i = paddle.zeros_like(anchors) + matched_classes_i = paddle.zeros_like(matches_idxs) + self.num_classes + + gt_boxes_list.append(matched_boxes_i) + gt_labels_list.append(matched_classes_i) + + return gt_boxes_list, gt_labels_list + + def forward(self, anchors, preds, inputs): + + pred_scores_list, pred_boxes_list = preds + + p_s = paddle.concat(pred_scores_list, axis=1) + p_b = paddle.concat(pred_boxes_list, axis=1) # [N, R, 4] + + gt_boxes, gt_classes = self.label_anchors(anchors, inputs) + gt_labels = paddle.stack(gt_classes).reshape([-1]) # [N * R] + + valid_idx = paddle.nonzero(gt_labels >= 0) + pos_mask = paddle.logical_and(gt_labels >= 0, gt_labels != self.num_classes) + pos_idx = paddle.nonzero(pos_mask).flatten() + num_pos = pos_idx.shape[0] + + self.loss_normalizer = self.loss_normalizer_momentum * self.loss_normalizer + ( + 1 - self.loss_normalizer_momentum + ) * max(num_pos, 1) + + p_s = paddle.reshape(p_s, [-1, self.num_classes]) + pred_logits = paddle.gather(p_s, valid_idx) + + gt_labels = F.one_hot(paddle.gather(gt_labels, valid_idx), num_classes=self.num_classes + 1)[ + :, :-1 + ] + + gt_labels.stop_gradient = True + + cls_loss = F.sigmoid_focal_loss(pred_logits, + gt_labels, + alpha=self.focal_loss_alpha, + gamma=self.focal_loss_gamma, + reduction='sum') + + gt_deltas_list = [ + bbox2delta(anchors, gt_boxes[i], self.weights) for i in range(len(gt_boxes)) + ] + + gt_deltas = paddle.concat(gt_deltas_list) + gt_deltas = paddle.gather(gt_deltas, pos_idx) + gt_deltas.stop_gradient = True + + p_b = paddle.reshape(p_b, [-1, 4]) + pred_deltas = paddle.gather(p_b, pos_idx) + + if self.smoothl1_loss_delta > 0: + reg_loss = F.smooth_l1_loss(pred_deltas, 
gt_deltas, reduction="sum", delta=self.smoothl1_loss_delta) + else: + reg_loss = F.l1_loss(pred_deltas, gt_deltas, reduction="sum") + + return { + "cls_loss": cls_loss / self.loss_normalizer, + "reg_loss": reg_loss / self.loss_normalizer + } diff --git a/object_detection/PVTv2/det_necks/__init__.py b/object_detection/PVTv2/det_necks/__init__.py new file mode 100644 index 00000000..e0a8f9c1 --- /dev/null +++ b/object_detection/PVTv2/det_necks/__init__.py @@ -0,0 +1 @@ +from . import fpn diff --git a/object_detection/PVTv2/det_necks/fpn.py b/object_detection/PVTv2/det_necks/fpn.py new file mode 100644 index 00000000..648b99f9 --- /dev/null +++ b/object_detection/PVTv2/det_necks/fpn.py @@ -0,0 +1,212 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +"""FPN Lyaer for object detection""" +import math +import paddle +import paddle.nn as nn +from paddle.nn.initializer import XavierUniform +import paddle.nn.functional as F + + +class ConvNorm(nn.Layer): + """ Conv + BatchNorm (optional) layers + Args: + in_channels: int, num of input channels + out_channels: int, num of output channels + kernel_size: int, conv kernel size + stride: int, stride in conv layer, default: 1 + padding: int, padding in conv layer, default: 0 + dilation: int, dilation in conv layer, default: 1 + groups: int, groups in conv layer, default: 1 + padding_mode: str, padding mode, default: 'zeros' + weight_attr: ParamAttr, paddle param setting for weight, default: None + bias_attr: ParamAttr, paddle param setting for bias, default: None + norm: string, type of norm layer, default: bn + """ + def __init__(self, + in_channels, + out_channels, + kernel_size, + stride=1, + padding=0, + dilation=1, + groups=1, + padding_mode='zeros', + weight_attr=None, + bias_attr=None, + norm="bn", + use_bias=False): + super(ConvNorm, self).__init__() + + if norm is None: + use_bias = None + + self.conv = nn.Conv2D( + in_channels=in_channels, + out_channels=out_channels, + kernel_size=kernel_size, + stride=stride, + padding=padding, + dilation=dilation, + groups=groups, + padding_mode=padding_mode, + weight_attr=weight_attr, + bias_attr=use_bias + ) + + if norm == "bn": + self.norm = nn.BatchNorm2D(out_channels) + else: + self.norm = None + + def forward(self, x): + out = self.conv(x) + + if self.norm is not None: + out = self.norm(out) + + return out + + +class FPN(nn.Layer): + """Feature Pyramid Network (FPN) Layer + Args: + in_channels: list of int, num of input channels for each output layer + out_channels: list of int, num of output channels for each output layer + stride: list, spatial strides between each feature layer to the original image size + fuse_type: str, how to fuse current and prev feature in FPN, avg or sum, default: sum + use_c5: bool, if True, use C5 as the input of extra stage, default: True + top_block: nn.Layer, if use a downsample after output (see LastLevelMaxPool), default: None + norm: str, type of norm layer, default: None + """ + def __init__(self, + 
in_channels, + out_channel, + strides, + fuse_type="sum", + use_c5=True, + top_block=None, + norm=None, + use_bias=False): + super(FPN, self).__init__() + assert len(strides) == len(in_channels) + + self.fuse_type = fuse_type + self.top_block = top_block + self.use_c5 = use_c5 + + lateral_convs = [] + output_convs = [] + + name_idx = [int(math.log2(s)) for s in strides] + + for idx, in_channel in enumerate(in_channels): + # 1x1 conv + lateral_conv = ConvNorm( + in_channels=in_channel, + out_channels=out_channel, + kernel_size=1, + weight_attr=paddle.ParamAttr(initializer=XavierUniform(fan_out=in_channel)), + norm=norm, + use_bias=use_bias) + # 3x3 conv after upsampling + output_conv = ConvNorm( + in_channels=out_channel, + out_channels=out_channel, + kernel_size=3, + padding=1, + weight_attr=paddle.ParamAttr(initializer=XavierUniform(fan_out=9*out_channel)), + norm=norm, + use_bias=use_bias) + + self.add_sublayer("fpn_lateral{}".format(name_idx[idx]), lateral_conv) + self.add_sublayer("fpn_output{}".format(name_idx[idx]), output_conv) + + lateral_convs.append(lateral_conv) + output_convs.append(output_conv) + + self.lateral_convs = lateral_convs[::-1] # Now from small feature map to large feature map + self.output_convs = output_convs[::-1] + + def forward(self, feats): + res = [] + lateral_out = self.lateral_convs[0](feats[-1]) # feats is from large to small feature map + res.append(self.output_convs[0](lateral_out)) + + for idx, (lateral_conv, output_conv) in enumerate( + zip(self.lateral_convs, self.output_convs)): + if idx > 0: # not include lateral_convs[0] + top2down_feat = F.interpolate(lateral_out, scale_factor=2.0, mode="nearest") + prev_out = lateral_conv(feats[-1-idx]) + #top2down_feat = F.interpolate(lateral_out, size=prev_out.shape[-2::], mode="nearest") + lateral_out = prev_out + top2down_feat # fuse == 'sum' + if self.fuse_type == "avg": + lateral_out /= 2 + res.insert(0, output_conv(lateral_out)) + + if self.top_block is not None: + if self.use_c5: + top_block_out = self.top_block(feats[-1]) + else: + top_block_out = self.top_block(res[-1]) + + res.extend(top_block_out) + + return res + + +class LastLevelMaxPool(nn.Layer): + """ + This module is used in the original FPN to generate a downsampled + P6 feature from P5. + """ + + def __init__(self): + super().__init__() + + def forward(self, x): + return [F.max_pool2d(x, kernel_size=1, stride=2)] + + +class TopFeatP6P7(nn.Layer): + """ + This module is used in RetinaNet to generate extra layers, P6 and P7 from + C5 feature. + """ + def __init__(self, in_channel, out_channel): + + self.p6 = nn.Conv2D( + in_channels=in_channel, + out_channels=out_channel, + kernel_size=3, + stride=2, + padding=1, + weight_attr=paddle.ParamAttr(initializer=XavierUniform(fan_out=9*in_channel)) + ) + self.p7 = nn.Conv2D( + in_channels=in_channel, + out_channels=out_channel, + kernel_size=3, + stride=2, + padding=1, + weight_attr=paddle.ParamAttr(initializer=XavierUniform(fan_out=9*out_channel)) + ) + + def forward(self, feat): + p6 = self.p6(feat) + p7 = self.p7(F.relu(p6)) + + return [p6, p7] diff --git a/object_detection/PVTv2/main_multi_gpu.py b/object_detection/PVTv2/main_multi_gpu.py new file mode 100644 index 00000000..dd74795e --- /dev/null +++ b/object_detection/PVTv2/main_multi_gpu.py @@ -0,0 +1,421 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""PVTv2 Det training/validation using multiple GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +import paddle.distributed as dist +from coco import build_coco +from coco import get_dataloader +from coco_eval import CocoEvaluator +from pvtv2_det import build_pvtv2_det as build_det_model +from utils import AverageMeter +from utils import WarmupCosineScheduler +from config import get_config +from config import update_config + + +parser = argparse.ArgumentParser('PVTv2-Det') +parser.add_argument('-cfg', type=str, default=None) +parser.add_argument('-dataset', type=str, default=None) +parser.add_argument('-batch_size', type=int, default=None) +parser.add_argument('-data_path', type=str, default=None) +parser.add_argument('-backbone', type=str, default=None) +parser.add_argument('-ngpus', type=int, default=None) +parser.add_argument('-pretrained', type=str, default=None) +parser.add_argument('-resume', type=str, default=None) +parser.add_argument('-last_epoch', type=int, default=None) +parser.add_argument('-eval', action='store_true') +arguments = parser.parse_args() + +log_format = "%(asctime)s %(message)s" +logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + +# get default config +config = get_config() +# update config by arguments +config = update_config(config, arguments) + +# set output folder +if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) +else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + +#config.freeze() + +if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + +# set logging format +logger = logging.getLogger() +fh = logging.FileHandler(os.path.join(config.SAVE, 'log.txt')) +fh.setFormatter(logging.Formatter(log_format)) +logger.addHandler(fh) +logger.info(f'config= {config}') + + +def train(dataloader, + model, + base_ds, + optimizer, + epoch, + total_batch, + debug_steps=100, + accum_iter=1): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, DETR model + criterion: nn.Layer + postprocessors: nn.Layer + base_ds: coco api instance + train_loss_rpn_cls_meter.avg + epoch: int, current epoch + total_epoch: int, total num of epoch, for logging + debug_steps: int, num of iters to log info + accum_iter: int, num of iters for accumulating gradients + Returns: + train_loss_cls_meter.avg + train_loss_reg_meter.avg + train_loss_rpn_cls_meter.avg + train_loss_rpn_reg_meter.avg + train_time + """ + model.train() + + train_loss_cls_meter = AverageMeter() + train_loss_reg_meter = AverageMeter() + train_loss_rpn_cls_meter = AverageMeter() + train_loss_rpn_reg_meter = AverageMeter() + + time_st = time.time() + + #iou_types = ('bbox', ) + #coco_evaluator = CocoEvaluator(base_ds, iou_types) + + for batch_id, data in enumerate(dataloader): + samples = data[0] + targets = data[1] + + loss_dict = model(samples, targets) 
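        # loss_dict holds the per-branch detection losses (classification and
        # regression terms for the RPN and the detection head). They are summed
        # into one scalar for backward, and gradients are accumulated over
        # `accum_iter` micro-batches before each optimizer.step() below.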
+ losses = sum(loss for loss in loss_dict.values()) + losses.backward() + + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + # logging losses + batch_size = samples.tensors.shape[0] + train_loss_cls_meter.update(loss_dict['loss_cls'].numpy()[0], batch_size) + train_loss_reg_meter.update(loss_dict['loss_reg'].numpy()[0], batch_size) + train_loss_rpn_cls_meter.update(loss_dict['loss_rpn_cls'].numpy()[0], batch_size) + train_loss_rpn_reg_meter.update(loss_dict['loss_rpn_reg'].numpy()[0], batch_size) + + if batch_id > 0 and batch_id % debug_steps == 0: + logger.info( + f"Train Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg loss_cls: {train_loss_cls_meter.avg:.4f}, " + + f"Avg loss_reg: {train_loss_reg_meter.avg:.4f}, " + + f"Avg loss_rpn_cls: {train_loss_rpn_cls_meter.avg:.4f}, " + + f"Avg loss_rpn_reg: {train_loss_rpn_reg_meter.avg:.4f}") + + train_time = time.time() - time_st + return (train_loss_cls_meter.avg, + train_loss_reg_meter.avg, + train_loss_rpn_cls_meter.avg, + train_loss_rpn_reg_meter.avg, + train_time) + + +def validate(dataloader, model, base_ds, total_batch, debug_steps=100): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: criterion + postprocessors: postprocessor for generating bboxes + base_ds: COCO instance + total_epoch: int, total num of epoch, for logging + debug_steps: int, num of iters to log info + Returns: + val_loss_meter.avg + val_acc_meter.avg + val_time + """ + model.eval() + time_st = time.time() + + iou_types = ('bbox', ) + coco_evaluator = CocoEvaluator(base_ds, iou_types) + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + samples = data[0] + targets = data[1] + + prediction = model(samples, targets) + + if batch_id > 0 and batch_id % debug_steps == 0: + logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], done") + + res = {} + for target_id, output in zip(targets['image_id'], prediction): + target_id = target_id.cpu().numpy()[0] + output = output.cpu().numpy() + if output.shape[0] != 0: + pred_dict = {'boxes': output[:, 2::], + 'scores': output[:, 1], + 'labels': output[:, 0]} + res[int(target_id)] = pred_dict + else: + res[int(target_id)] = {} + + if coco_evaluator is not None: + coco_evaluator.update(res) + + if coco_evaluator is not None: + coco_evaluator.synchronize_between_processes() + coco_evaluator.accumulate() + stats_dict = coco_evaluator.summarize() + # for det only + all_eval_result = stats_dict['bbox'] + + val_time = time.time() - time_st + return val_time, all_eval_result + + +def main_worker(*args): + # 0. Preparation + dist.init_parallel_env() + last_epoch = config.TRAIN.LAST_EPOCH + world_size = paddle.distributed.get_world_size() + local_rank = paddle.distributed.get_rank() + logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + seed = config.SEED + local_rank + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + # 1. Create model + model = build_det_model(config) + model = paddle.DataParallel(model) + # 2. 
Create train and val dataloader + dataset_train, dataset_val = args[0], args[1] + total_batch_train = 0 + if not config.EVAL: + dataloader_train = get_dataloader(dataset_train, + batch_size=config.DATA.BATCH_SIZE, + mode='train', + multi_gpu=True) + total_batch_train = len(dataloader_train) + + dataloader_val = get_dataloader(dataset_val, + batch_size=config.DATA.BATCH_SIZE_EVAL, + mode='val', + multi_gpu=True) + total_batch_val = len(dataloader_val) + base_ds = dataset_val.coco # pycocotools.coco.COCO(anno_file) + + logging.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + logging.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + # 4. Define optimizer and lr_scheduler + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH, + ) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + logging.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip, + #apply_decay_param_fun=get_exclude_from_weight_decay_fn(['pos_embed', 'cls_token']), + ) + else: + logging.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # 5. 
Load pretrained model / load resumt model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + + # if from classification weights, add prefix 'backbone' and set state dict + if sum(['backbone' in key for key in model_state.keys()]) == 0: + logger.info(f"----- Pretrained: Load backbone from {config.MODEL.PRETRAINED}") + new_model_state = dict() + for key, val in model_state.items(): + new_model_state['backbone.' + key] = val + model.set_state_dict(new_model_state) + else: + logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + model.set_state_dict(model_state) + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME+'.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME+'.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME+'.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') + optimizer.set_dict(opt_state) + logger.info( + f"----- Resume Training: Load model and optmizer states from {config.MODEL.RESUME}") + + # 6. Validation + if config.EVAL: + logger.info('----- Start Validating') + val_time, all_eval_result = validate( + dataloader=dataloader_val, + model=model, + base_ds=base_ds, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ) + + logger.info('IoU metric: bbox') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50:0.95":<9} | area={"all":>6s} | maxDets={100:>3d} ] = {all_eval_result[0]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50":<9} | area={"all":>6s} | maxDets={100:>3d} ] = {all_eval_result[1]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.75":<9} | area={"all":>6s} | maxDets={100:>3d} ] = {all_eval_result[2]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50:0.95":<9} | area={" small":>6s} | maxDets={100:>3d} ] = {all_eval_result[3]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50:0.95":<9} | area={"medium":>6s} | maxDets={100:>3d} ] = {all_eval_result[4]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50:0.95":<9} | area={" large":>6s} | maxDets={100:>3d} ] = {all_eval_result[5]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"all":>6s} | maxDets={1:>3d} ] = {all_eval_result[6]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"all":>6s} | maxDets={10:>3d} ] = {all_eval_result[7]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"all":>6s} | maxDets={100:>3d} ] = {all_eval_result[8]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"small":>6s} | maxDets={100:>3d} ] = {all_eval_result[9]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"medium":>6s} | maxDets={100:>3d} ] = {all_eval_result[10]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"large":>6s} | maxDets={100:>3d} ] = {all_eval_result[11]:0.3f}') + logger.info(f"Val time: {val_time:.2f}") + return + + # 6. Start training and validation + logging.info(f"Start training from epoch {last_epoch+1}.") + for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + logging.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") + train_loss_cls, train_loss_reg, train_loss_rpn_cls, train_loss_rpn_reg, train_time = train( + dataloader=dataloader_train, + model=model, + base_ds=base_ds, + optimizer=optimizer, + epoch=epoch, + total_batch=total_batch_train, + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER) + scheduler.step() + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss cls: {train_loss_cls:.4f}, " + + f"Train Loss reg: {train_loss_reg:.4f}, " + + f"Train Loss rpn cls: {train_loss_rpn_cls:.4f}, " + + f"Train Loss rpn reg: {train_loss_rpn_reg:.4f}, " + + f"time: {train_time:.2f}") + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + logger.info(f'----- Validation after Epoch: {epoch}') + val_time, all_eval_result = validate( + dataloader=dataloader_val, + model=model, + base_ds=base_ds, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ) + + logger.info('IoU metric: bbox') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50:0.95":<9} | area={"all":>6s} | maxDets={100:>3d} ] = {all_eval_result[0]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50":<9} | area={"all":>6s} | maxDets={100:>3d} ] = {all_eval_result[1]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.75":<9} | area={"all":>6s} | maxDets={100:>3d} ] = {all_eval_result[2]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50:0.95":<9} | area={" small":>6s} | maxDets={100:>3d} ] = {all_eval_result[3]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50:0.95":<9} | area={"medium":>6s} | maxDets={100:>3d} ] = {all_eval_result[4]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50:0.95":<9} | area={" large":>6s} | maxDets={100:>3d} ] = {all_eval_result[5]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"all":>6s} | maxDets={1:>3d} ] = {all_eval_result[6]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"all":>6s} | maxDets={10:>3d} ] = {all_eval_result[7]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"all":>6s} | maxDets={100:>3d} ] = {all_eval_result[8]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"small":>6s} | maxDets={100:>3d} ] = {all_eval_result[9]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"medium":>6s} | maxDets={100:>3d} ] = {all_eval_result[10]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"large":>6s} | maxDets={100:>3d} ] = {all_eval_result[11]:0.3f}') + logger.info(f"Val time: {val_time:.2f}") + + # model save + if local_rank == 0: + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + logger.info(f"----- Save model: {model_path}.pdparams") + logger.info(f"----- Save optim: {model_path}.pdopt") + + +def main(): + if not config.EVAL: + dataset_train = build_coco('train', config.DATA.DATA_PATH) + else: + dataset_train = None + dataset_val = build_coco('val', config.DATA.DATA_PATH) + config.NGPUS = len(paddle.static.cuda_places()) if config.NGPUS == -1 else config.NGPUS + dist.spawn(main_worker, args=(dataset_train, dataset_val, ), nprocs=config.NGPUS) + + 
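# Example launch (paths and values are placeholders; only flags defined in the
# argparse section above are used):
#   python main_multi_gpu.py \
#       -cfg ./configs/pvtv2_b0.yaml \
#       -dataset coco \
#       -batch_size 2 \
#       -data_path /path/to/coco \
#       -ngpus 8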
+if __name__ == "__main__": + main() diff --git a/object_detection/PVTv2/main_single_gpu.py b/object_detection/PVTv2/main_single_gpu.py new file mode 100644 index 00000000..715fb0ef --- /dev/null +++ b/object_detection/PVTv2/main_single_gpu.py @@ -0,0 +1,401 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""PVTv2 Det training/validation using multiple GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +import paddle.distributed as dist +from coco import build_coco +from coco import get_dataloader +from coco_eval import CocoEvaluator +from pvtv2_det import build_pvtv2_det as build_det_model +from utils import AverageMeter +from utils import WarmupCosineScheduler +from config import get_config +from config import update_config + + +parser = argparse.ArgumentParser('PVTv2-Det') +parser.add_argument('-cfg', type=str, default=None) +parser.add_argument('-dataset', type=str, default=None) +parser.add_argument('-batch_size', type=int, default=None) +parser.add_argument('-data_path', type=str, default=None) +parser.add_argument('-backbone', type=str, default=None) +parser.add_argument('-ngpus', type=int, default=None) +parser.add_argument('-pretrained', type=str, default=None) +parser.add_argument('-resume', type=str, default=None) +parser.add_argument('-last_epoch', type=int, default=None) +parser.add_argument('-eval', action='store_true') +arguments = parser.parse_args() + +log_format = "%(asctime)s %(message)s" +logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + +# get default config +config = get_config() +# update config by arguments +config = update_config(config, arguments) + +# set output folder +if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) +else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + +#config.freeze() + +if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + +# set logging format +logger = logging.getLogger() +fh = logging.FileHandler(os.path.join(config.SAVE, 'log.txt')) +fh.setFormatter(logging.Formatter(log_format)) +logger.addHandler(fh) +logger.info(f'config= {config}') + + +def train(dataloader, + model, + base_ds, + optimizer, + epoch, + total_batch, + debug_steps=100, + accum_iter=1): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, DETR model + criterion: nn.Layer + postprocessors: nn.Layer + base_ds: coco api instance + train_loss_rpn_cls_meter.avg + epoch: int, current epoch + total_epoch: int, total num of epoch, for logging + debug_steps: int, num of iters to log info + accum_iter: int, num of iters for accumulating gradients + Returns: + train_loss_cls_meter.avg + train_loss_reg_meter.avg + train_loss_rpn_cls_meter.avg + 
train_loss_rpn_reg_meter.avg + train_time + """ + model.train() + + train_loss_cls_meter = AverageMeter() + train_loss_reg_meter = AverageMeter() + train_loss_rpn_cls_meter = AverageMeter() + train_loss_rpn_reg_meter = AverageMeter() + + time_st = time.time() + + #iou_types = ('bbox', ) + #coco_evaluator = CocoEvaluator(base_ds, iou_types) + + for batch_id, data in enumerate(dataloader): + samples = data[0] + targets = data[1] + + loss_dict = model(samples, targets) + losses = sum(loss for loss in loss_dict.values()) + losses.backward() + + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + # logging losses + batch_size = samples.tensors.shape[0] + train_loss_cls_meter.update(loss_dict['loss_cls'].numpy()[0], batch_size) + train_loss_reg_meter.update(loss_dict['loss_reg'].numpy()[0], batch_size) + train_loss_rpn_cls_meter.update(loss_dict['loss_rpn_cls'].numpy()[0], batch_size) + train_loss_rpn_reg_meter.update(loss_dict['loss_rpn_reg'].numpy()[0], batch_size) + + if batch_id > 0 and batch_id % debug_steps == 0: + logger.info( + f"Train Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg loss_cls: {train_loss_cls_meter.avg:.4f}, " + + f"Avg loss_reg: {train_loss_reg_meter.avg:.4f}, " + + f"Avg loss_rpn_cls: {train_loss_rpn_cls_meter.avg:.4f}, " + + f"Avg loss_rpn_reg: {train_loss_rpn_reg_meter.avg:.4f}") + + train_time = time.time() - time_st + return (train_loss_cls_meter.avg, + train_loss_reg_meter.avg, + train_loss_rpn_cls_meter.avg, + train_loss_rpn_reg_meter.avg, + train_time) + + +def validate(dataloader, model, base_ds, total_batch, debug_steps=100): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: criterion + postprocessors: postprocessor for generating bboxes + base_ds: COCO instance + total_epoch: int, total num of epoch, for logging + debug_steps: int, num of iters to log info + Returns: + val_loss_meter.avg + val_acc_meter.avg + val_time + """ + model.eval() + time_st = time.time() + + iou_types = ('bbox', ) + coco_evaluator = CocoEvaluator(base_ds, iou_types) + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + samples = data[0] + targets = data[1] + + prediction = model(samples, targets) + + if batch_id > 0 and batch_id % debug_steps == 0: + logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], done") + + res = {} + for target_id, output in zip(targets['image_id'], prediction): + target_id = target_id.cpu().numpy()[0] + output = output.cpu().numpy() + if output.shape[0] != 0: + pred_dict = {'boxes': output[:, 2::], + 'scores': output[:, 1], + 'labels': output[:, 0]} + res[int(target_id)] = pred_dict + else: + res[int(target_id)] = {} + + if coco_evaluator is not None: + coco_evaluator.update(res) + + if coco_evaluator is not None: + coco_evaluator.synchronize_between_processes() + coco_evaluator.accumulate() + stats_dict = coco_evaluator.summarize() + # for det only + all_eval_result = stats_dict['bbox'] + + val_time = time.time() - time_st + return val_time, all_eval_result + + +def main(): + # 0. Preparation + last_epoch = config.TRAIN.LAST_EPOCH + seed = config.SEED + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + # 1. Create model and criterion + model = build_det_model(config) + # 2. 
Create train and val dataloader + if not config.EVAL: + dataset_train = build_coco('train', config.DATA.DATA_PATH) + dataloader_train = get_dataloader(dataset_train, + batch_size=config.DATA.BATCH_SIZE, + mode='train', + multi_gpu=False) + + dataset_val = build_coco('val', config.DATA.DATA_PATH) + dataloader_val = get_dataloader(dataset_val, + batch_size=config.DATA.BATCH_SIZE_EVAL, + mode='val', + multi_gpu=False) + + base_ds = dataset_val.coco # pycocotools.coco.COCO(anno_file) + # 3. Define lr_scheduler + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH, + ) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestons, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + logging.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + # 5. Define optimizer + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip, + #apply_decay_param_fun=get_exclude_from_weight_decay_fn(['pos_embed', 'cls_token']), + ) + else: + logging.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # 5. Load pretrained model / load resumt model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + + # if from classification weights, add prefix 'backbone' and set state dict + if sum(['backbone' in key for key in model_state.keys()]) == 0: + logger.info(f"----- Pretrained: Load backbone from {config.MODEL.PRETRAINED}") + new_model_state = dict() + for key, val in model_state.items(): + new_model_state['backbone.' 
+ key] = val + model.set_state_dict(new_model_state) + else: + logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + model.set_state_dict(model_state) + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME+'.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME+'.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME+'.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') + optimizer.set_dict(opt_state) + logger.info( + f"----- Resume Training: Load model and optmizer states from {config.MODEL.RESUME}") + + # 6. Validation + if config.EVAL: + logger.info('----- Start Validating') + val_time, all_eval_result = validate( + dataloader=dataloader_val, + model=model, + base_ds=base_ds, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ) + + logger.info('IoU metric: bbox') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50:0.95":<9} | area={"all":>6s} | maxDets={100:>3d} ] = {all_eval_result[0]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50":<9} | area={"all":>6s} | maxDets={100:>3d} ] = {all_eval_result[1]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.75":<9} | area={"all":>6s} | maxDets={100:>3d} ] = {all_eval_result[2]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50:0.95":<9} | area={" small":>6s} | maxDets={100:>3d} ] = {all_eval_result[3]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50:0.95":<9} | area={"medium":>6s} | maxDets={100:>3d} ] = {all_eval_result[4]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50:0.95":<9} | area={" large":>6s} | maxDets={100:>3d} ] = {all_eval_result[5]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"all":>6s} | maxDets={1:>3d} ] = {all_eval_result[6]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"all":>6s} | maxDets={10:>3d} ] = {all_eval_result[7]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"all":>6s} | maxDets={100:>3d} ] = {all_eval_result[8]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"small":>6s} | maxDets={100:>3d} ] = {all_eval_result[9]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"medium":>6s} | maxDets={100:>3d} ] = {all_eval_result[10]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"large":>6s} | maxDets={100:>3d} ] = {all_eval_result[11]:0.3f}') + logger.info(f"Val time: {val_time:.2f}") + return + + # 8. Start training and validation + logging.info(f"Start training from epoch {last_epoch+1}.") + for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + logging.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") + train_loss_cls, train_loss_reg, train_loss_rpn_cls, train_loss_rpn_reg, train_time = train( + dataloader=dataloader_train, + model=model, + base_ds=base_ds, + optimizer=optimizer, + epoch=epoch, + total_batch=len(dataloader_train), + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER) + scheduler.step() + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss cls: {train_loss_cls:.4f}, " + + f"Train Loss reg: {train_loss_reg:.4f}, " + + f"Train Loss rpn cls: {train_loss_rpn_cls:.4f}, " + + f"Train Loss rpn reg: {train_loss_rpn_reg:.4f}, " + + f"time: {train_time:.2f}") + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + logger.info(f'----- Validation after Epoch: {epoch}') + val_time, all_eval_result = validate( + dataloader=dataloader_val, + model=model, + base_ds=base_ds, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ) + + logger.info('IoU metric: bbox') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50:0.95":<9} | area={"all":>6s} | maxDets={100:>3d} ] = {all_eval_result[0]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50":<9} | area={"all":>6s} | maxDets={100:>3d} ] = {all_eval_result[1]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.75":<9} | area={"all":>6s} | maxDets={100:>3d} ] = {all_eval_result[2]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50:0.95":<9} | area={" small":>6s} | maxDets={100:>3d} ] = {all_eval_result[3]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50:0.95":<9} | area={"medium":>6s} | maxDets={100:>3d} ] = {all_eval_result[4]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50:0.95":<9} | area={" large":>6s} | maxDets={100:>3d} ] = {all_eval_result[5]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"all":>6s} | maxDets={1:>3d} ] = {all_eval_result[6]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"all":>6s} | maxDets={10:>3d} ] = {all_eval_result[7]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"all":>6s} | maxDets={100:>3d} ] = {all_eval_result[8]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"small":>6s} | maxDets={100:>3d} ] = {all_eval_result[9]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"medium":>6s} | maxDets={100:>3d} ] = {all_eval_result[10]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"large":>6s} | maxDets={100:>3d} ] = {all_eval_result[11]:0.3f}') + logger.info(f"Val time: {val_time:.2f}") + + # model save + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join(config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + logger.info(f"----- Save model: {model_path}.pdparams") + logger.info(f"----- Save optim: {model_path}.pdopt") + + +if __name__ == "__main__": + main() diff --git a/object_detection/PVTv2/model_utils.py b/object_detection/PVTv2/model_utils.py new file mode 100644 index 00000000..f8461d92 --- /dev/null +++ b/object_detection/PVTv2/model_utils.py @@ -0,0 +1,60 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Droppath, reimplement from https://github.com/yueatsprograms/Stochastic_Depth +""" +from itertools import repeat +import collections.abc +import numpy as np +import paddle +import paddle.nn as nn + +def _ntuple(n): + def parse(x): + if isinstance(x, collections.abc.Iterable): + return x + return tuple(repeat(x, n)) + return parse + + +class DropPath(nn.Layer): + """DropPath class""" + def __init__(self, drop_prob=None): + super(DropPath, self).__init__() + self.drop_prob = drop_prob + + def drop_path(self, inputs): + """drop path op + Args: + input: tensor with arbitrary shape + drop_prob: float number of drop path probability, default: 0.0 + training: bool, if current mode is training, default: False + Returns: + output: output tensor after drop path + """ + # if prob is 0 or eval mode, return original input + if self.drop_prob == 0. or not self.training: + return inputs + keep_prob = 1 - self.drop_prob + keep_prob = paddle.to_tensor(keep_prob, dtype='float32') + shape = (inputs.shape[0], ) + (1, ) * (inputs.ndim - 1) # shape=(N, 1, 1, 1) + random_tensor = keep_prob + paddle.rand(shape, dtype=inputs.dtype) + random_tensor = random_tensor.floor() # mask + output = inputs.divide(keep_prob) * random_tensor # divide is to keep same output expectation + return output + + def forward(self, inputs): + return self.drop_path(inputs) + diff --git a/object_detection/PVTv2/ported_weights/load_pytorch_weights.py b/object_detection/PVTv2/ported_weights/load_pytorch_weights.py new file mode 100644 index 00000000..5aa2ffa2 --- /dev/null +++ b/object_detection/PVTv2/ported_weights/load_pytorch_weights.py @@ -0,0 +1,356 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
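# Porting note: torch nn.Linear stores its weight as [out_features, in_features]
# while paddle nn.Linear stores [in_features, out_features], so _set_value()
# below transposes every 2-D weight it copies; conv kernels (4-D) and biases
# (1-D) are copied unchanged. The name pairs produced by torch_to_paddle_mapping()
# only need to match at the module-prefix level, since '.weight' / '.bias'
# suffixes are appended automatically during conversion.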
+ +import sys +sys.path.append('/root/.cache/torch/hub/facebookresearch_detr_master/util/') + +#from misc import NestedTensor as ThNestedTensor +import os +import argparse +import numpy as np +import paddle +import torch +from config import get_config +from pvtv2_det import build_pvtv2_det +from model_utils import DropPath + +#from pvt_det_pth.PVT.detection + +#import timm +#from transformer import * +#from config import * +#from detr import build_detr +from utils import NestedTensor +from misc import NestedTensor as ThNestedTensor + +import misc as th_utils +#config = get_config() +#parser = argparse.ArgumentParser('') +#parser.add_argument('-cfg', type=str, default='./configs/vit_large_patch16_224.yaml') +##parser.add_argument('-dataset', type=str, default="imagenet2012") +#parser.add_argument('-dataset', type=str, default="cifar10") +#parser.add_argument('-batch_size', type=int, default=4) +#parser.add_argument('-image_size', type=int, default=224) +#parser.add_argument('-data_path', type=str, default='/dataset/imagenet/') +#parser.add_argument('-eval', action="store_true") +#parser.add_argument('-pretrained', type=str, default=None) +#args = parser.parse_args() +# +#config = get_config() +#config = update_config(config, args) +#print(config) + + +config = get_config('./configs/pvtv2_b0.yaml') + + +def print_model_named_params(model): + for name, param in model.named_parameters(): + print(name, param.shape) + + +def print_model_named_buffers(model): + for name, buff in model.named_buffers(): + print(name, buff.shape) + + +def torch_to_paddle_mapping(): + map1 = torch_to_paddle_mapping_backbone() + map2 = torch_to_paddle_mapping_neck() + map3 = torch_to_paddle_mapping_head() + map1.extend(map2) + map1.extend(map3) + return map1 + + +def torch_to_paddle_mapping_neck(): + mapping = [] + for i in range(len(config.MODEL.TRANS.OUT_INDICES)): + th_prefix = f'neck.lateral_convs.{i}.conv' + pp_prefix = f'neck.fpn_lateral{i+2}.conv' + mapping.append((th_prefix, pp_prefix)) + + th_prefix = f'neck.fpn_convs.{i}.conv' + pp_prefix = f'neck.fpn_output{i+2}.conv' + mapping.append((th_prefix, pp_prefix)) + + return mapping + + +def torch_to_paddle_mapping_head(): + mapping = [ + ('rpn_head.rpn_conv', 'rpnhead.conv'), + ('rpn_head.rpn_cls', 'rpnhead.objectness_logits'), + ('rpn_head.rpn_reg', 'rpnhead.anchor_deltas'), + ('roi_head.bbox_head.fc_cls', 'roihead.predictor.cls_fc'), + ('roi_head.bbox_head.fc_reg', 'roihead.predictor.reg_fc'), + ('roi_head.bbox_head.shared_fcs.0', 'roihead.predictor.forward_net.linear0'), + ('roi_head.bbox_head.shared_fcs.1', 'roihead.predictor.forward_net.linear1'), + ] + # Add mask head + + return mapping + + +def torch_to_paddle_mapping_backbone(): + mapping = [] + + for embed_idx in range(1, 5): + th_embed_prefix = f'backbone.patch_embed{embed_idx}' + pp_embed_prefix = f'backbone.patch_embedding{embed_idx}' + + mapping.append((f'{th_embed_prefix}.proj', + f'{pp_embed_prefix}.patch_embed')) + mapping.append((f'{th_embed_prefix}.norm', + f'{pp_embed_prefix}.norm')) + + for i in range(5): + mapping.append((f'backbone.norm{i}', + f'backbone.norm{i}')) + + block_depth = config.MODEL.TRANS.STAGE_DEPTHS # [2, 2, 2, 2] + + for block_idx in range(1, len(block_depth) + 1): + th_block_prefix = f'backbone.block{block_idx}' + pp_block_prefix = f'backbone.block{block_idx}' + + for layer_idx in range(block_depth[block_idx-1]): + th_prefix = f'{th_block_prefix}.{layer_idx}' + pp_prefix = f'{pp_block_prefix}.{layer_idx}' + layer_mapping = [ + (f'{th_prefix}.norm1', f'{pp_prefix}.norm1'), + 
(f'{th_prefix}.attn.q', f'{pp_prefix}.attn.q'), + (f'{th_prefix}.attn.kv', f'{pp_prefix}.attn.kv'), + (f'{th_prefix}.attn.proj', f'{pp_prefix}.attn.proj'), + (f'{th_prefix}.attn.sr', f'{pp_prefix}.attn.sr'), + (f'{th_prefix}.attn.norm', f'{pp_prefix}.attn.norm'), + (f'{th_prefix}.norm2', f'{pp_prefix}.norm2'), + (f'{th_prefix}.mlp.fc1', f'{pp_prefix}.mlp.fc1'), + (f'{th_prefix}.mlp.fc2', f'{pp_prefix}.mlp.fc2'), + (f'{th_prefix}.mlp.dwconv.dwconv', f'{pp_prefix}.mlp.dwconv.dwconv'), + ] + mapping.extend(layer_mapping) + return mapping + + +def convert_from_torch_state_dict(torch_model_state_dict, paddle_model): + def _set_value(th_name, pd_name, transpose=True): + th_shape = th_params[th_name].shape + pd_shape = tuple(pd_params[pd_name].shape) # paddle shape default type is list + #assert th_shape == pd_shape, f'{th_shape} != {pd_shape}' + print(f'***SET*** {th_name} {th_shape} ***TO*** {pd_name} {pd_shape}') + if isinstance(th_params[th_name], torch.nn.parameter.Parameter): + value = th_params[th_name].data.numpy() + else: + value = th_params[th_name].numpy() + if len(value.shape) == 2 and transpose: + value = value.transpose((1, 0)) + pd_params[pd_name].set_value(value) + + # 1. get paddle and torch model parameters + pd_params = {} + for name, param in paddle_model.named_parameters(): + pd_params[name] = param + for name, buff in paddle_model.named_buffers(): + pd_params[name] = buff + + th_params = torch_model_state_dict + + # 2. get name mapping pairs + mapping = torch_to_paddle_mapping() + + # 3. set torch param values to paddle params: may needs transpose on weights + for th_name, pd_name in mapping: + if th_name in th_params.keys(): # nn.Parameters + _set_value(th_name, pd_name) + else: # weight & bias + if f'{th_name}.weight' in th_params.keys(): + th_name_w = f'{th_name}.weight' + pd_name_w = f'{pd_name}.weight' + _set_value(th_name_w, pd_name_w) + + if f'{th_name}.bias' in th_params.keys(): + th_name_b = f'{th_name}.bias' + pd_name_b = f'{pd_name}.bias' + _set_value(th_name_b, pd_name_b) + + return paddle_model + + +def get_nested_tensors(): + with open('./t.npy', 'rb') as infile: + t = np.load(infile) + m = np.load(infile) + gts = np.load(infile, allow_pickle=True).item() + + #print(t.shape) + #print(m.shape) + + tt = torch.Tensor(t) + mm = torch.Tensor(m) + th_in = th_utils.NestedTensor(tt, mm) + + ttt = paddle.to_tensor(t) + mmm = paddle.to_tensor(m) + pp_in = NestedTensor(ttt, mmm) + + #print(th_in, th_in.tensors.shape) + #print(pp_in, pp_in.tensors.shape) + + targets = {} + for key, gt in gts.items(): + targets[key] = [] + for val in gt: + targets[key].append(paddle.to_tensor(val)) + pp_gt = targets + + + return pp_in, th_in, pp_gt + + + + +#def get_nested_tensors(): +# samples = paddle.load(path='./batch_samples_01.pdtensor') +# pp_in = NestedTensor(samples['tensors'], samples['mask']) +# pp_target = paddle.load(path='./batch_targets_01.pdtensor') +# +# samples_tensor = samples['tensors'].cpu().numpy() +# samples_mask = samples['mask'].cpu().numpy() +# th_tensor = torch.Tensor(samples_tensor) +# th_mask = torch.Tensor(samples_mask) +# th_in = ThNestedTensor(th_tensor, th_mask) +# th_target = [] +# for item in pp_target: +# sample_gt = dict() +# for key, val in item.items(): +# th_tensor = torch.Tensor(val.cpu().numpy()) +# sample_gt[key] = th_tensor +# th_target.append(sample_gt) +# +# return th_in, th_target, pp_in, pp_target + + +def get_nested_tensors_random(): + x = np.random.randn(1, 3, 224, 224).astype('float32') + mask = np.ones([1, 224, 224]) + + pp_x = 
paddle.to_tensor(x) + pp_mask = paddle.to_tensor(mask) + pp_in = NestedTensor(pp_x, pp_mask) + th_tensor = torch.Tensor(x) + th_mask = torch.Tensor(mask) + th_in = ThNestedTensor(th_tensor, th_mask) + th_target = [] + pp_target = [] + + return th_in, th_target, pp_in, pp_target + + +def main(): + + paddle.set_device('cpu') + + #th_in, th_target, pp_in, pp_target = get_nested_tensors() + + paddle_model = build_pvtv2_det(config) + paddle_model.eval() + + print_model_named_params(paddle_model) + print_model_named_buffers(paddle_model) + print('------------paddle model finish ----------------------') + + + #device = torch.device('cpu') + #torch_model = + #torch_model = torch_model.to(device) + #torch_model.eval() + + #print_model_named_params(torch_model) + #print_model_named_buffers(torch_model) + #print('----------torch model finish------------------------') + + torch_state_dict = torch.load('./pth_weights/mask_rcnn_pvt_v2_b0_fpn_1x_coco.pth') + # dict_keys(['meta', 'state_dict', 'optimizer']) + for key, val in torch_state_dict['state_dict'].items(): + print(key, val.shape) + print('----------torch model finish------------------------') + torch_model_state_dict = torch_state_dict['state_dict'] + + # convert weights + paddle_model = convert_from_torch_state_dict(torch_model_state_dict, paddle_model) + + + # check correctness + #th_in, th_target, pp_in, pp_target = get_nested_tensors() + #th_in, th_target, pp_in, pp_target = get_nested_tensors_random() + #x = np.random.randn(1, 3, 224, 224).astype('float32') + #x_paddle = paddle.to_tensor(x) + #x_torch = torch.Tensor(x).to(device) + + + + #print(pp_in.tensors) + #print(pp_in.mask) + #print('-------- pp in finish ------------------') + + + #print(th_in.tensors, th_in.tensors.shape) + #print(th_in.mask, th_in.mask.shape) + #print('-------- th in finish ------------------') + + # save weights for paddle model + model_path = os.path.join('./pvtv2_b0_maskrcnn.pdparams') + paddle.save(paddle_model.state_dict(), model_path) + + + + # pp_in, th_in, pp_gt = get_nested_tensors() + # print('pp_in: ', pp_in.tensors.shape) + + # out_paddle = paddle_model(pp_in, pp_gt) + # print('paddle_out = ', out_paddle) + + + + + #loss = paddle_criterion(out_paddle, pp_gt) + #print('=============== loss =============') + #for key, val in loss.items(): + # print(key, val.cpu().numpy()) + + #print(out_paddle['pred_logits'], out_paddle['pred_logits'].shape) + #print(out_paddle['pred_boxes'], out_paddle['pred_boxes'].shape) + #print('---------- paddle out finish ------------------------') + + #out_torch = torch_model(th_in) + #print(out_torch['pred_logits'], out_torch['pred_logits'].shape) + #print(out_torch['pred_boxes'], out_torch['pred_boxes'].shape) + #print('---------- torch out finish ------------------------') + + #out_torch = out_torch.data.cpu().numpy() + #out_paddle = out_paddle.cpu().numpy() + + #print(out_torch.shape, out_paddle.shape) + #print(out_torch[0:100]) + #print(out_paddle[0:100]) + #assert np.allclose(out_torch, out_paddle, atol = 1e-5) +# + # save weights for paddle model + #model_path = os.path.join('./detr_resnet50.pdparams') + #paddle.save(paddle_model.state_dict(), model_path) + + +if __name__ == "__main__": + main() diff --git a/object_detection/PVTv2/ported_weights/load_pytorch_weights_b1.py b/object_detection/PVTv2/ported_weights/load_pytorch_weights_b1.py new file mode 100644 index 00000000..fc8d37e4 --- /dev/null +++ b/object_detection/PVTv2/ported_weights/load_pytorch_weights_b1.py @@ -0,0 +1,356 @@ +# Copyright (c) 2021 PPViT Authors. 
All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import sys +sys.path.append('/root/.cache/torch/hub/facebookresearch_detr_master/util/') + +#from misc import NestedTensor as ThNestedTensor +import os +import argparse +import numpy as np +import paddle +import torch +from config import get_config +from pvtv2_det import build_pvtv2_det +from model_utils import DropPath + +#from pvt_det_pth.PVT.detection + +#import timm +#from transformer import * +#from config import * +#from detr import build_detr +from utils import NestedTensor +from misc import NestedTensor as ThNestedTensor + +import misc as th_utils +#config = get_config() +#parser = argparse.ArgumentParser('') +#parser.add_argument('-cfg', type=str, default='./configs/vit_large_patch16_224.yaml') +##parser.add_argument('-dataset', type=str, default="imagenet2012") +#parser.add_argument('-dataset', type=str, default="cifar10") +#parser.add_argument('-batch_size', type=int, default=4) +#parser.add_argument('-image_size', type=int, default=224) +#parser.add_argument('-data_path', type=str, default='/dataset/imagenet/') +#parser.add_argument('-eval', action="store_true") +#parser.add_argument('-pretrained', type=str, default=None) +#args = parser.parse_args() +# +#config = get_config() +#config = update_config(config, args) +#print(config) + + +config = get_config('./configs/pvtv2_b1.yaml') + + +def print_model_named_params(model): + for name, param in model.named_parameters(): + print(name, param.shape) + + +def print_model_named_buffers(model): + for name, buff in model.named_buffers(): + print(name, buff.shape) + + +def torch_to_paddle_mapping(): + map1 = torch_to_paddle_mapping_backbone() + map2 = torch_to_paddle_mapping_neck() + map3 = torch_to_paddle_mapping_head() + map1.extend(map2) + map1.extend(map3) + return map1 + + +def torch_to_paddle_mapping_neck(): + mapping = [] + for i in range(len(config.MODEL.TRANS.OUT_INDICES)): + th_prefix = f'neck.lateral_convs.{i}.conv' + pp_prefix = f'neck.fpn_lateral{i+2}.conv' + mapping.append((th_prefix, pp_prefix)) + + th_prefix = f'neck.fpn_convs.{i}.conv' + pp_prefix = f'neck.fpn_output{i+2}.conv' + mapping.append((th_prefix, pp_prefix)) + + return mapping + + +def torch_to_paddle_mapping_head(): + mapping = [ + ('rpn_head.rpn_conv', 'rpnhead.conv'), + ('rpn_head.rpn_cls', 'rpnhead.objectness_logits'), + ('rpn_head.rpn_reg', 'rpnhead.anchor_deltas'), + ('roi_head.bbox_head.fc_cls', 'roihead.predictor.cls_fc'), + ('roi_head.bbox_head.fc_reg', 'roihead.predictor.reg_fc'), + ('roi_head.bbox_head.shared_fcs.0', 'roihead.predictor.forward_net.linear0'), + ('roi_head.bbox_head.shared_fcs.1', 'roihead.predictor.forward_net.linear1'), + ] + # Add mask head + + return mapping + + +def torch_to_paddle_mapping_backbone(): + mapping = [] + + for embed_idx in range(1, 5): + th_embed_prefix = f'backbone.patch_embed{embed_idx}' + pp_embed_prefix = f'backbone.patch_embedding{embed_idx}' + + mapping.append((f'{th_embed_prefix}.proj', + f'{pp_embed_prefix}.patch_embed')) + 
mapping.append((f'{th_embed_prefix}.norm', + f'{pp_embed_prefix}.norm')) + + for i in range(5): + mapping.append((f'backbone.norm{i}', + f'backbone.norm{i}')) + + block_depth = config.MODEL.TRANS.STAGE_DEPTHS # [2, 2, 2, 2] + + for block_idx in range(1, len(block_depth) + 1): + th_block_prefix = f'backbone.block{block_idx}' + pp_block_prefix = f'backbone.block{block_idx}' + + for layer_idx in range(block_depth[block_idx-1]): + th_prefix = f'{th_block_prefix}.{layer_idx}' + pp_prefix = f'{pp_block_prefix}.{layer_idx}' + layer_mapping = [ + (f'{th_prefix}.norm1', f'{pp_prefix}.norm1'), + (f'{th_prefix}.attn.q', f'{pp_prefix}.attn.q'), + (f'{th_prefix}.attn.kv', f'{pp_prefix}.attn.kv'), + (f'{th_prefix}.attn.proj', f'{pp_prefix}.attn.proj'), + (f'{th_prefix}.attn.sr', f'{pp_prefix}.attn.sr'), + (f'{th_prefix}.attn.norm', f'{pp_prefix}.attn.norm'), + (f'{th_prefix}.norm2', f'{pp_prefix}.norm2'), + (f'{th_prefix}.mlp.fc1', f'{pp_prefix}.mlp.fc1'), + (f'{th_prefix}.mlp.fc2', f'{pp_prefix}.mlp.fc2'), + (f'{th_prefix}.mlp.dwconv.dwconv', f'{pp_prefix}.mlp.dwconv.dwconv'), + ] + mapping.extend(layer_mapping) + return mapping + + +def convert_from_torch_state_dict(torch_model_state_dict, paddle_model): + def _set_value(th_name, pd_name, transpose=True): + th_shape = th_params[th_name].shape + pd_shape = tuple(pd_params[pd_name].shape) # paddle shape default type is list + #assert th_shape == pd_shape, f'{th_shape} != {pd_shape}' + print(f'***SET*** {th_name} {th_shape} ***TO*** {pd_name} {pd_shape}') + if isinstance(th_params[th_name], torch.nn.parameter.Parameter): + value = th_params[th_name].data.numpy() + else: + value = th_params[th_name].numpy() + if len(value.shape) == 2 and transpose: + value = value.transpose((1, 0)) + pd_params[pd_name].set_value(value) + + # 1. get paddle and torch model parameters + pd_params = {} + for name, param in paddle_model.named_parameters(): + pd_params[name] = param + for name, buff in paddle_model.named_buffers(): + pd_params[name] = buff + + th_params = torch_model_state_dict + + # 2. get name mapping pairs + mapping = torch_to_paddle_mapping() + + # 3. 
set torch param values to paddle params: may needs transpose on weights + for th_name, pd_name in mapping: + if th_name in th_params.keys(): # nn.Parameters + _set_value(th_name, pd_name) + else: # weight & bias + if f'{th_name}.weight' in th_params.keys(): + th_name_w = f'{th_name}.weight' + pd_name_w = f'{pd_name}.weight' + _set_value(th_name_w, pd_name_w) + + if f'{th_name}.bias' in th_params.keys(): + th_name_b = f'{th_name}.bias' + pd_name_b = f'{pd_name}.bias' + _set_value(th_name_b, pd_name_b) + + return paddle_model + + +def get_nested_tensors(): + with open('./t.npy', 'rb') as infile: + t = np.load(infile) + m = np.load(infile) + gts = np.load(infile, allow_pickle=True).item() + + #print(t.shape) + #print(m.shape) + + tt = torch.Tensor(t) + mm = torch.Tensor(m) + th_in = th_utils.NestedTensor(tt, mm) + + ttt = paddle.to_tensor(t) + mmm = paddle.to_tensor(m) + pp_in = NestedTensor(ttt, mmm) + + #print(th_in, th_in.tensors.shape) + #print(pp_in, pp_in.tensors.shape) + + targets = {} + for key, gt in gts.items(): + targets[key] = [] + for val in gt: + targets[key].append(paddle.to_tensor(val)) + pp_gt = targets + + + return pp_in, th_in, pp_gt + + + + +#def get_nested_tensors(): +# samples = paddle.load(path='./batch_samples_01.pdtensor') +# pp_in = NestedTensor(samples['tensors'], samples['mask']) +# pp_target = paddle.load(path='./batch_targets_01.pdtensor') +# +# samples_tensor = samples['tensors'].cpu().numpy() +# samples_mask = samples['mask'].cpu().numpy() +# th_tensor = torch.Tensor(samples_tensor) +# th_mask = torch.Tensor(samples_mask) +# th_in = ThNestedTensor(th_tensor, th_mask) +# th_target = [] +# for item in pp_target: +# sample_gt = dict() +# for key, val in item.items(): +# th_tensor = torch.Tensor(val.cpu().numpy()) +# sample_gt[key] = th_tensor +# th_target.append(sample_gt) +# +# return th_in, th_target, pp_in, pp_target + + +def get_nested_tensors_random(): + x = np.random.randn(1, 3, 224, 224).astype('float32') + mask = np.ones([1, 224, 224]) + + pp_x = paddle.to_tensor(x) + pp_mask = paddle.to_tensor(mask) + pp_in = NestedTensor(pp_x, pp_mask) + th_tensor = torch.Tensor(x) + th_mask = torch.Tensor(mask) + th_in = ThNestedTensor(th_tensor, th_mask) + th_target = [] + pp_target = [] + + return th_in, th_target, pp_in, pp_target + + +def main(): + + paddle.set_device('cpu') + + #th_in, th_target, pp_in, pp_target = get_nested_tensors() + + paddle_model = build_pvtv2_det(config) + paddle_model.eval() + + print_model_named_params(paddle_model) + print_model_named_buffers(paddle_model) + print('------------paddle model finish ----------------------') + + + #device = torch.device('cpu') + #torch_model = + #torch_model = torch_model.to(device) + #torch_model.eval() + + #print_model_named_params(torch_model) + #print_model_named_buffers(torch_model) + #print('----------torch model finish------------------------') + + torch_state_dict = torch.load('./pth_weights/mask_rcnn_pvt_v2_b1_fpn_1x_coco.pth') + # dict_keys(['meta', 'state_dict', 'optimizer']) + for key, val in torch_state_dict['state_dict'].items(): + print(key, val.shape) + print('----------torch model finish------------------------') + torch_model_state_dict = torch_state_dict['state_dict'] + + # convert weights + paddle_model = convert_from_torch_state_dict(torch_model_state_dict, paddle_model) + + + # check correctness + #th_in, th_target, pp_in, pp_target = get_nested_tensors() + #th_in, th_target, pp_in, pp_target = get_nested_tensors_random() + #x = np.random.randn(1, 3, 224, 224).astype('float32') + 
#x_paddle = paddle.to_tensor(x) + #x_torch = torch.Tensor(x).to(device) + + + + #print(pp_in.tensors) + #print(pp_in.mask) + #print('-------- pp in finish ------------------') + + + #print(th_in.tensors, th_in.tensors.shape) + #print(th_in.mask, th_in.mask.shape) + #print('-------- th in finish ------------------') + + # save weights for paddle model + model_path = os.path.join('./pvtv2_b1_maskrcnn.pdparams') + paddle.save(paddle_model.state_dict(), model_path) + + + + # pp_in, th_in, pp_gt = get_nested_tensors() + # print('pp_in: ', pp_in.tensors.shape) + + # out_paddle = paddle_model(pp_in, pp_gt) + # print('paddle_out = ', out_paddle) + + + + + #loss = paddle_criterion(out_paddle, pp_gt) + #print('=============== loss =============') + #for key, val in loss.items(): + # print(key, val.cpu().numpy()) + + #print(out_paddle['pred_logits'], out_paddle['pred_logits'].shape) + #print(out_paddle['pred_boxes'], out_paddle['pred_boxes'].shape) + #print('---------- paddle out finish ------------------------') + + #out_torch = torch_model(th_in) + #print(out_torch['pred_logits'], out_torch['pred_logits'].shape) + #print(out_torch['pred_boxes'], out_torch['pred_boxes'].shape) + #print('---------- torch out finish ------------------------') + + #out_torch = out_torch.data.cpu().numpy() + #out_paddle = out_paddle.cpu().numpy() + + #print(out_torch.shape, out_paddle.shape) + #print(out_torch[0:100]) + #print(out_paddle[0:100]) + #assert np.allclose(out_torch, out_paddle, atol = 1e-5) +# + # save weights for paddle model + #model_path = os.path.join('./detr_resnet50.pdparams') + #paddle.save(paddle_model.state_dict(), model_path) + + +if __name__ == "__main__": + main() diff --git a/object_detection/PVTv2/ported_weights/load_pytorch_weights_b2.py b/object_detection/PVTv2/ported_weights/load_pytorch_weights_b2.py new file mode 100644 index 00000000..a7a5b0a9 --- /dev/null +++ b/object_detection/PVTv2/ported_weights/load_pytorch_weights_b2.py @@ -0,0 +1,356 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
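+
+# Port the PyTorch Mask R-CNN + PVTv2-B2 detection weights to Paddle:
+# build the Paddle model from ./configs/pvtv2_b2.yaml, load the torch
+# checkpoint ./pth_weights/mask_rcnn_pvt_v2_b2_fpn_1x_coco.pth, copy each
+# parameter through the torch-to-paddle name mapping defined below
+# (2-D linear weights are transposed), and save the converted state dict
+# to ./pvtv2_b2_maskrcnn.pdparams.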
+ +import sys +sys.path.append('/root/.cache/torch/hub/facebookresearch_detr_master/util/') + +#from misc import NestedTensor as ThNestedTensor +import os +import argparse +import numpy as np +import paddle +import torch +from config import get_config +from pvtv2_det import build_pvtv2_det +from model_utils import DropPath + +#from pvt_det_pth.PVT.detection + +#import timm +#from transformer import * +#from config import * +#from detr import build_detr +from utils import NestedTensor +from misc import NestedTensor as ThNestedTensor + +import misc as th_utils +#config = get_config() +#parser = argparse.ArgumentParser('') +#parser.add_argument('-cfg', type=str, default='./configs/vit_large_patch16_224.yaml') +##parser.add_argument('-dataset', type=str, default="imagenet2012") +#parser.add_argument('-dataset', type=str, default="cifar10") +#parser.add_argument('-batch_size', type=int, default=4) +#parser.add_argument('-image_size', type=int, default=224) +#parser.add_argument('-data_path', type=str, default='/dataset/imagenet/') +#parser.add_argument('-eval', action="store_true") +#parser.add_argument('-pretrained', type=str, default=None) +#args = parser.parse_args() +# +#config = get_config() +#config = update_config(config, args) +#print(config) + + +config = get_config('./configs/pvtv2_b2.yaml') + + +def print_model_named_params(model): + for name, param in model.named_parameters(): + print(name, param.shape) + + +def print_model_named_buffers(model): + for name, buff in model.named_buffers(): + print(name, buff.shape) + + +def torch_to_paddle_mapping(): + map1 = torch_to_paddle_mapping_backbone() + map2 = torch_to_paddle_mapping_neck() + map3 = torch_to_paddle_mapping_head() + map1.extend(map2) + map1.extend(map3) + return map1 + + +def torch_to_paddle_mapping_neck(): + mapping = [] + for i in range(len(config.MODEL.TRANS.OUT_INDICES)): + th_prefix = f'neck.lateral_convs.{i}.conv' + pp_prefix = f'neck.fpn_lateral{i+2}.conv' + mapping.append((th_prefix, pp_prefix)) + + th_prefix = f'neck.fpn_convs.{i}.conv' + pp_prefix = f'neck.fpn_output{i+2}.conv' + mapping.append((th_prefix, pp_prefix)) + + return mapping + + +def torch_to_paddle_mapping_head(): + mapping = [ + ('rpn_head.rpn_conv', 'rpnhead.conv'), + ('rpn_head.rpn_cls', 'rpnhead.objectness_logits'), + ('rpn_head.rpn_reg', 'rpnhead.anchor_deltas'), + ('roi_head.bbox_head.fc_cls', 'roihead.predictor.cls_fc'), + ('roi_head.bbox_head.fc_reg', 'roihead.predictor.reg_fc'), + ('roi_head.bbox_head.shared_fcs.0', 'roihead.predictor.forward_net.linear0'), + ('roi_head.bbox_head.shared_fcs.1', 'roihead.predictor.forward_net.linear1'), + ] + # Add mask head + + return mapping + + +def torch_to_paddle_mapping_backbone(): + mapping = [] + + for embed_idx in range(1, 5): + th_embed_prefix = f'backbone.patch_embed{embed_idx}' + pp_embed_prefix = f'backbone.patch_embedding{embed_idx}' + + mapping.append((f'{th_embed_prefix}.proj', + f'{pp_embed_prefix}.patch_embed')) + mapping.append((f'{th_embed_prefix}.norm', + f'{pp_embed_prefix}.norm')) + + for i in range(5): + mapping.append((f'backbone.norm{i}', + f'backbone.norm{i}')) + + block_depth = config.MODEL.TRANS.STAGE_DEPTHS # [2, 2, 2, 2] + + for block_idx in range(1, len(block_depth) + 1): + th_block_prefix = f'backbone.block{block_idx}' + pp_block_prefix = f'backbone.block{block_idx}' + + for layer_idx in range(block_depth[block_idx-1]): + th_prefix = f'{th_block_prefix}.{layer_idx}' + pp_prefix = f'{pp_block_prefix}.{layer_idx}' + layer_mapping = [ + (f'{th_prefix}.norm1', f'{pp_prefix}.norm1'), + 
(f'{th_prefix}.attn.q', f'{pp_prefix}.attn.q'), + (f'{th_prefix}.attn.kv', f'{pp_prefix}.attn.kv'), + (f'{th_prefix}.attn.proj', f'{pp_prefix}.attn.proj'), + (f'{th_prefix}.attn.sr', f'{pp_prefix}.attn.sr'), + (f'{th_prefix}.attn.norm', f'{pp_prefix}.attn.norm'), + (f'{th_prefix}.norm2', f'{pp_prefix}.norm2'), + (f'{th_prefix}.mlp.fc1', f'{pp_prefix}.mlp.fc1'), + (f'{th_prefix}.mlp.fc2', f'{pp_prefix}.mlp.fc2'), + (f'{th_prefix}.mlp.dwconv.dwconv', f'{pp_prefix}.mlp.dwconv.dwconv'), + ] + mapping.extend(layer_mapping) + return mapping + + +def convert_from_torch_state_dict(torch_model_state_dict, paddle_model): + def _set_value(th_name, pd_name, transpose=True): + th_shape = th_params[th_name].shape + pd_shape = tuple(pd_params[pd_name].shape) # paddle shape default type is list + #assert th_shape == pd_shape, f'{th_shape} != {pd_shape}' + print(f'***SET*** {th_name} {th_shape} ***TO*** {pd_name} {pd_shape}') + if isinstance(th_params[th_name], torch.nn.parameter.Parameter): + value = th_params[th_name].data.numpy() + else: + value = th_params[th_name].numpy() + if len(value.shape) == 2 and transpose: + value = value.transpose((1, 0)) + pd_params[pd_name].set_value(value) + + # 1. get paddle and torch model parameters + pd_params = {} + for name, param in paddle_model.named_parameters(): + pd_params[name] = param + for name, buff in paddle_model.named_buffers(): + pd_params[name] = buff + + th_params = torch_model_state_dict + + # 2. get name mapping pairs + mapping = torch_to_paddle_mapping() + + # 3. set torch param values to paddle params: may needs transpose on weights + for th_name, pd_name in mapping: + if th_name in th_params.keys(): # nn.Parameters + _set_value(th_name, pd_name) + else: # weight & bias + if f'{th_name}.weight' in th_params.keys(): + th_name_w = f'{th_name}.weight' + pd_name_w = f'{pd_name}.weight' + _set_value(th_name_w, pd_name_w) + + if f'{th_name}.bias' in th_params.keys(): + th_name_b = f'{th_name}.bias' + pd_name_b = f'{pd_name}.bias' + _set_value(th_name_b, pd_name_b) + + return paddle_model + + +def get_nested_tensors(): + with open('./t.npy', 'rb') as infile: + t = np.load(infile) + m = np.load(infile) + gts = np.load(infile, allow_pickle=True).item() + + #print(t.shape) + #print(m.shape) + + tt = torch.Tensor(t) + mm = torch.Tensor(m) + th_in = th_utils.NestedTensor(tt, mm) + + ttt = paddle.to_tensor(t) + mmm = paddle.to_tensor(m) + pp_in = NestedTensor(ttt, mmm) + + #print(th_in, th_in.tensors.shape) + #print(pp_in, pp_in.tensors.shape) + + targets = {} + for key, gt in gts.items(): + targets[key] = [] + for val in gt: + targets[key].append(paddle.to_tensor(val)) + pp_gt = targets + + + return pp_in, th_in, pp_gt + + + + +#def get_nested_tensors(): +# samples = paddle.load(path='./batch_samples_01.pdtensor') +# pp_in = NestedTensor(samples['tensors'], samples['mask']) +# pp_target = paddle.load(path='./batch_targets_01.pdtensor') +# +# samples_tensor = samples['tensors'].cpu().numpy() +# samples_mask = samples['mask'].cpu().numpy() +# th_tensor = torch.Tensor(samples_tensor) +# th_mask = torch.Tensor(samples_mask) +# th_in = ThNestedTensor(th_tensor, th_mask) +# th_target = [] +# for item in pp_target: +# sample_gt = dict() +# for key, val in item.items(): +# th_tensor = torch.Tensor(val.cpu().numpy()) +# sample_gt[key] = th_tensor +# th_target.append(sample_gt) +# +# return th_in, th_target, pp_in, pp_target + + +def get_nested_tensors_random(): + x = np.random.randn(1, 3, 224, 224).astype('float32') + mask = np.ones([1, 224, 224]) + + pp_x = 
paddle.to_tensor(x) + pp_mask = paddle.to_tensor(mask) + pp_in = NestedTensor(pp_x, pp_mask) + th_tensor = torch.Tensor(x) + th_mask = torch.Tensor(mask) + th_in = ThNestedTensor(th_tensor, th_mask) + th_target = [] + pp_target = [] + + return th_in, th_target, pp_in, pp_target + + +def main(): + + paddle.set_device('cpu') + + #th_in, th_target, pp_in, pp_target = get_nested_tensors() + + paddle_model = build_pvtv2_det(config) + paddle_model.eval() + + print_model_named_params(paddle_model) + print_model_named_buffers(paddle_model) + print('------------paddle model finish ----------------------') + + + #device = torch.device('cpu') + #torch_model = + #torch_model = torch_model.to(device) + #torch_model.eval() + + #print_model_named_params(torch_model) + #print_model_named_buffers(torch_model) + #print('----------torch model finish------------------------') + + torch_state_dict = torch.load('./pth_weights/mask_rcnn_pvt_v2_b2_fpn_1x_coco.pth') + # dict_keys(['meta', 'state_dict', 'optimizer']) + for key, val in torch_state_dict['state_dict'].items(): + print(key, val.shape) + print('----------torch model finish------------------------') + torch_model_state_dict = torch_state_dict['state_dict'] + + # convert weights + paddle_model = convert_from_torch_state_dict(torch_model_state_dict, paddle_model) + + + # check correctness + #th_in, th_target, pp_in, pp_target = get_nested_tensors() + #th_in, th_target, pp_in, pp_target = get_nested_tensors_random() + #x = np.random.randn(1, 3, 224, 224).astype('float32') + #x_paddle = paddle.to_tensor(x) + #x_torch = torch.Tensor(x).to(device) + + + + #print(pp_in.tensors) + #print(pp_in.mask) + #print('-------- pp in finish ------------------') + + + #print(th_in.tensors, th_in.tensors.shape) + #print(th_in.mask, th_in.mask.shape) + #print('-------- th in finish ------------------') + + # save weights for paddle model + model_path = os.path.join('./pvtv2_b2_maskrcnn.pdparams') + paddle.save(paddle_model.state_dict(), model_path) + + + + # pp_in, th_in, pp_gt = get_nested_tensors() + # print('pp_in: ', pp_in.tensors.shape) + + # out_paddle = paddle_model(pp_in, pp_gt) + # print('paddle_out = ', out_paddle) + + + + + #loss = paddle_criterion(out_paddle, pp_gt) + #print('=============== loss =============') + #for key, val in loss.items(): + # print(key, val.cpu().numpy()) + + #print(out_paddle['pred_logits'], out_paddle['pred_logits'].shape) + #print(out_paddle['pred_boxes'], out_paddle['pred_boxes'].shape) + #print('---------- paddle out finish ------------------------') + + #out_torch = torch_model(th_in) + #print(out_torch['pred_logits'], out_torch['pred_logits'].shape) + #print(out_torch['pred_boxes'], out_torch['pred_boxes'].shape) + #print('---------- torch out finish ------------------------') + + #out_torch = out_torch.data.cpu().numpy() + #out_paddle = out_paddle.cpu().numpy() + + #print(out_torch.shape, out_paddle.shape) + #print(out_torch[0:100]) + #print(out_paddle[0:100]) + #assert np.allclose(out_torch, out_paddle, atol = 1e-5) +# + # save weights for paddle model + #model_path = os.path.join('./detr_resnet50.pdparams') + #paddle.save(paddle_model.state_dict(), model_path) + + +if __name__ == "__main__": + main() diff --git a/object_detection/PVTv2/ported_weights/load_pytorch_weights_b2_linear.py b/object_detection/PVTv2/ported_weights/load_pytorch_weights_b2_linear.py new file mode 100644 index 00000000..d5d7a2b4 --- /dev/null +++ b/object_detection/PVTv2/ported_weights/load_pytorch_weights_b2_linear.py @@ -0,0 +1,356 @@ +# Copyright (c) 2021 
PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import sys +sys.path.append('/root/.cache/torch/hub/facebookresearch_detr_master/util/') + +#from misc import NestedTensor as ThNestedTensor +import os +import argparse +import numpy as np +import paddle +import torch +from config import get_config +from pvtv2_det import build_pvtv2_det +from model_utils import DropPath + +#from pvt_det_pth.PVT.detection + +#import timm +#from transformer import * +#from config import * +#from detr import build_detr +from utils import NestedTensor +from misc import NestedTensor as ThNestedTensor + +import misc as th_utils +#config = get_config() +#parser = argparse.ArgumentParser('') +#parser.add_argument('-cfg', type=str, default='./configs/vit_large_patch16_224.yaml') +##parser.add_argument('-dataset', type=str, default="imagenet2012") +#parser.add_argument('-dataset', type=str, default="cifar10") +#parser.add_argument('-batch_size', type=int, default=4) +#parser.add_argument('-image_size', type=int, default=224) +#parser.add_argument('-data_path', type=str, default='/dataset/imagenet/') +#parser.add_argument('-eval', action="store_true") +#parser.add_argument('-pretrained', type=str, default=None) +#args = parser.parse_args() +# +#config = get_config() +#config = update_config(config, args) +#print(config) + + +config = get_config('./configs/pvtv2_b2_linear.yaml') + + +def print_model_named_params(model): + for name, param in model.named_parameters(): + print(name, param.shape) + + +def print_model_named_buffers(model): + for name, buff in model.named_buffers(): + print(name, buff.shape) + + +def torch_to_paddle_mapping(): + map1 = torch_to_paddle_mapping_backbone() + map2 = torch_to_paddle_mapping_neck() + map3 = torch_to_paddle_mapping_head() + map1.extend(map2) + map1.extend(map3) + return map1 + + +def torch_to_paddle_mapping_neck(): + mapping = [] + for i in range(len(config.MODEL.TRANS.OUT_INDICES)): + th_prefix = f'neck.lateral_convs.{i}.conv' + pp_prefix = f'neck.fpn_lateral{i+2}.conv' + mapping.append((th_prefix, pp_prefix)) + + th_prefix = f'neck.fpn_convs.{i}.conv' + pp_prefix = f'neck.fpn_output{i+2}.conv' + mapping.append((th_prefix, pp_prefix)) + + return mapping + + +def torch_to_paddle_mapping_head(): + mapping = [ + ('rpn_head.rpn_conv', 'rpnhead.conv'), + ('rpn_head.rpn_cls', 'rpnhead.objectness_logits'), + ('rpn_head.rpn_reg', 'rpnhead.anchor_deltas'), + ('roi_head.bbox_head.fc_cls', 'roihead.predictor.cls_fc'), + ('roi_head.bbox_head.fc_reg', 'roihead.predictor.reg_fc'), + ('roi_head.bbox_head.shared_fcs.0', 'roihead.predictor.forward_net.linear0'), + ('roi_head.bbox_head.shared_fcs.1', 'roihead.predictor.forward_net.linear1'), + ] + # Add mask head + + return mapping + + +def torch_to_paddle_mapping_backbone(): + mapping = [] + + for embed_idx in range(1, 5): + th_embed_prefix = f'backbone.patch_embed{embed_idx}' + pp_embed_prefix = f'backbone.patch_embedding{embed_idx}' + + mapping.append((f'{th_embed_prefix}.proj', + 
f'{pp_embed_prefix}.patch_embed')) + mapping.append((f'{th_embed_prefix}.norm', + f'{pp_embed_prefix}.norm')) + + for i in range(5): + mapping.append((f'backbone.norm{i}', + f'backbone.norm{i}')) + + block_depth = config.MODEL.TRANS.STAGE_DEPTHS # [2, 2, 2, 2] + + for block_idx in range(1, len(block_depth) + 1): + th_block_prefix = f'backbone.block{block_idx}' + pp_block_prefix = f'backbone.block{block_idx}' + + for layer_idx in range(block_depth[block_idx-1]): + th_prefix = f'{th_block_prefix}.{layer_idx}' + pp_prefix = f'{pp_block_prefix}.{layer_idx}' + layer_mapping = [ + (f'{th_prefix}.norm1', f'{pp_prefix}.norm1'), + (f'{th_prefix}.attn.q', f'{pp_prefix}.attn.q'), + (f'{th_prefix}.attn.kv', f'{pp_prefix}.attn.kv'), + (f'{th_prefix}.attn.proj', f'{pp_prefix}.attn.proj'), + (f'{th_prefix}.attn.sr', f'{pp_prefix}.attn.sr'), + (f'{th_prefix}.attn.norm', f'{pp_prefix}.attn.norm'), + (f'{th_prefix}.norm2', f'{pp_prefix}.norm2'), + (f'{th_prefix}.mlp.fc1', f'{pp_prefix}.mlp.fc1'), + (f'{th_prefix}.mlp.fc2', f'{pp_prefix}.mlp.fc2'), + (f'{th_prefix}.mlp.dwconv.dwconv', f'{pp_prefix}.mlp.dwconv.dwconv'), + ] + mapping.extend(layer_mapping) + return mapping + + +def convert_from_torch_state_dict(torch_model_state_dict, paddle_model): + def _set_value(th_name, pd_name, transpose=True): + th_shape = th_params[th_name].shape + pd_shape = tuple(pd_params[pd_name].shape) # paddle shape default type is list + #assert th_shape == pd_shape, f'{th_shape} != {pd_shape}' + print(f'***SET*** {th_name} {th_shape} ***TO*** {pd_name} {pd_shape}') + if isinstance(th_params[th_name], torch.nn.parameter.Parameter): + value = th_params[th_name].data.numpy() + else: + value = th_params[th_name].numpy() + if len(value.shape) == 2 and transpose: + value = value.transpose((1, 0)) + pd_params[pd_name].set_value(value) + + # 1. get paddle and torch model parameters + pd_params = {} + for name, param in paddle_model.named_parameters(): + pd_params[name] = param + for name, buff in paddle_model.named_buffers(): + pd_params[name] = buff + + th_params = torch_model_state_dict + + # 2. get name mapping pairs + mapping = torch_to_paddle_mapping() + + # 3. 
set torch param values to paddle params: may needs transpose on weights + for th_name, pd_name in mapping: + if th_name in th_params.keys(): # nn.Parameters + _set_value(th_name, pd_name) + else: # weight & bias + if f'{th_name}.weight' in th_params.keys(): + th_name_w = f'{th_name}.weight' + pd_name_w = f'{pd_name}.weight' + _set_value(th_name_w, pd_name_w) + + if f'{th_name}.bias' in th_params.keys(): + th_name_b = f'{th_name}.bias' + pd_name_b = f'{pd_name}.bias' + _set_value(th_name_b, pd_name_b) + + return paddle_model + + +def get_nested_tensors(): + with open('./t.npy', 'rb') as infile: + t = np.load(infile) + m = np.load(infile) + gts = np.load(infile, allow_pickle=True).item() + + #print(t.shape) + #print(m.shape) + + tt = torch.Tensor(t) + mm = torch.Tensor(m) + th_in = th_utils.NestedTensor(tt, mm) + + ttt = paddle.to_tensor(t) + mmm = paddle.to_tensor(m) + pp_in = NestedTensor(ttt, mmm) + + #print(th_in, th_in.tensors.shape) + #print(pp_in, pp_in.tensors.shape) + + targets = {} + for key, gt in gts.items(): + targets[key] = [] + for val in gt: + targets[key].append(paddle.to_tensor(val)) + pp_gt = targets + + + return pp_in, th_in, pp_gt + + + + +#def get_nested_tensors(): +# samples = paddle.load(path='./batch_samples_01.pdtensor') +# pp_in = NestedTensor(samples['tensors'], samples['mask']) +# pp_target = paddle.load(path='./batch_targets_01.pdtensor') +# +# samples_tensor = samples['tensors'].cpu().numpy() +# samples_mask = samples['mask'].cpu().numpy() +# th_tensor = torch.Tensor(samples_tensor) +# th_mask = torch.Tensor(samples_mask) +# th_in = ThNestedTensor(th_tensor, th_mask) +# th_target = [] +# for item in pp_target: +# sample_gt = dict() +# for key, val in item.items(): +# th_tensor = torch.Tensor(val.cpu().numpy()) +# sample_gt[key] = th_tensor +# th_target.append(sample_gt) +# +# return th_in, th_target, pp_in, pp_target + + +def get_nested_tensors_random(): + x = np.random.randn(1, 3, 224, 224).astype('float32') + mask = np.ones([1, 224, 224]) + + pp_x = paddle.to_tensor(x) + pp_mask = paddle.to_tensor(mask) + pp_in = NestedTensor(pp_x, pp_mask) + th_tensor = torch.Tensor(x) + th_mask = torch.Tensor(mask) + th_in = ThNestedTensor(th_tensor, th_mask) + th_target = [] + pp_target = [] + + return th_in, th_target, pp_in, pp_target + + +def main(): + + paddle.set_device('cpu') + + #th_in, th_target, pp_in, pp_target = get_nested_tensors() + + paddle_model = build_pvtv2_det(config) + paddle_model.eval() + + print_model_named_params(paddle_model) + print_model_named_buffers(paddle_model) + print('------------paddle model finish ----------------------') + + + #device = torch.device('cpu') + #torch_model = + #torch_model = torch_model.to(device) + #torch_model.eval() + + #print_model_named_params(torch_model) + #print_model_named_buffers(torch_model) + #print('----------torch model finish------------------------') + + torch_state_dict = torch.load('./pth_weights/mask_rcnn_pvt_v2_b2_li_fpn_1x_coco.pth') + # dict_keys(['meta', 'state_dict', 'optimizer']) + for key, val in torch_state_dict['state_dict'].items(): + print(key, val.shape) + print('----------torch model finish------------------------') + torch_model_state_dict = torch_state_dict['state_dict'] + + # convert weights + paddle_model = convert_from_torch_state_dict(torch_model_state_dict, paddle_model) + + + # check correctness + #th_in, th_target, pp_in, pp_target = get_nested_tensors() + #th_in, th_target, pp_in, pp_target = get_nested_tensors_random() + #x = np.random.randn(1, 3, 224, 224).astype('float32') + 
#x_paddle = paddle.to_tensor(x) + #x_torch = torch.Tensor(x).to(device) + + + + #print(pp_in.tensors) + #print(pp_in.mask) + #print('-------- pp in finish ------------------') + + + #print(th_in.tensors, th_in.tensors.shape) + #print(th_in.mask, th_in.mask.shape) + #print('-------- th in finish ------------------') + + # save weights for paddle model + model_path = os.path.join('./pvtv2_b2_linear_maskrcnn.pdparams') + paddle.save(paddle_model.state_dict(), model_path) + + + + # pp_in, th_in, pp_gt = get_nested_tensors() + # print('pp_in: ', pp_in.tensors.shape) + + # out_paddle = paddle_model(pp_in, pp_gt) + # print('paddle_out = ', out_paddle) + + + + + #loss = paddle_criterion(out_paddle, pp_gt) + #print('=============== loss =============') + #for key, val in loss.items(): + # print(key, val.cpu().numpy()) + + #print(out_paddle['pred_logits'], out_paddle['pred_logits'].shape) + #print(out_paddle['pred_boxes'], out_paddle['pred_boxes'].shape) + #print('---------- paddle out finish ------------------------') + + #out_torch = torch_model(th_in) + #print(out_torch['pred_logits'], out_torch['pred_logits'].shape) + #print(out_torch['pred_boxes'], out_torch['pred_boxes'].shape) + #print('---------- torch out finish ------------------------') + + #out_torch = out_torch.data.cpu().numpy() + #out_paddle = out_paddle.cpu().numpy() + + #print(out_torch.shape, out_paddle.shape) + #print(out_torch[0:100]) + #print(out_paddle[0:100]) + #assert np.allclose(out_torch, out_paddle, atol = 1e-5) +# + # save weights for paddle model + #model_path = os.path.join('./detr_resnet50.pdparams') + #paddle.save(paddle_model.state_dict(), model_path) + + +if __name__ == "__main__": + main() diff --git a/object_detection/PVTv2/ported_weights/load_pytorch_weights_b3.py b/object_detection/PVTv2/ported_weights/load_pytorch_weights_b3.py new file mode 100644 index 00000000..8115c1c2 --- /dev/null +++ b/object_detection/PVTv2/ported_weights/load_pytorch_weights_b3.py @@ -0,0 +1,356 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
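+
+# Port the PyTorch Mask R-CNN + PVTv2-B3 detection weights to Paddle:
+# build the Paddle model from ./configs/pvtv2_b3.yaml, load the torch
+# checkpoint ./pth_weights/mask_rcnn_pvt_v2_b3_fpn_1x_coco.pth, copy each
+# parameter through the torch-to-paddle name mapping defined below
+# (2-D linear weights are transposed), and save the converted state dict
+# to ./pvtv2_b3_maskrcnn.pdparams.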
+ +import sys +sys.path.append('/root/.cache/torch/hub/facebookresearch_detr_master/util/') + +#from misc import NestedTensor as ThNestedTensor +import os +import argparse +import numpy as np +import paddle +import torch +from config import get_config +from pvtv2_det import build_pvtv2_det +from model_utils import DropPath + +#from pvt_det_pth.PVT.detection + +#import timm +#from transformer import * +#from config import * +#from detr import build_detr +from utils import NestedTensor +from misc import NestedTensor as ThNestedTensor + +import misc as th_utils +#config = get_config() +#parser = argparse.ArgumentParser('') +#parser.add_argument('-cfg', type=str, default='./configs/vit_large_patch16_224.yaml') +##parser.add_argument('-dataset', type=str, default="imagenet2012") +#parser.add_argument('-dataset', type=str, default="cifar10") +#parser.add_argument('-batch_size', type=int, default=4) +#parser.add_argument('-image_size', type=int, default=224) +#parser.add_argument('-data_path', type=str, default='/dataset/imagenet/') +#parser.add_argument('-eval', action="store_true") +#parser.add_argument('-pretrained', type=str, default=None) +#args = parser.parse_args() +# +#config = get_config() +#config = update_config(config, args) +#print(config) + + +config = get_config('./configs/pvtv2_b3.yaml') + + +def print_model_named_params(model): + for name, param in model.named_parameters(): + print(name, param.shape) + + +def print_model_named_buffers(model): + for name, buff in model.named_buffers(): + print(name, buff.shape) + + +def torch_to_paddle_mapping(): + map1 = torch_to_paddle_mapping_backbone() + map2 = torch_to_paddle_mapping_neck() + map3 = torch_to_paddle_mapping_head() + map1.extend(map2) + map1.extend(map3) + return map1 + + +def torch_to_paddle_mapping_neck(): + mapping = [] + for i in range(len(config.MODEL.TRANS.OUT_INDICES)): + th_prefix = f'neck.lateral_convs.{i}.conv' + pp_prefix = f'neck.fpn_lateral{i+2}.conv' + mapping.append((th_prefix, pp_prefix)) + + th_prefix = f'neck.fpn_convs.{i}.conv' + pp_prefix = f'neck.fpn_output{i+2}.conv' + mapping.append((th_prefix, pp_prefix)) + + return mapping + + +def torch_to_paddle_mapping_head(): + mapping = [ + ('rpn_head.rpn_conv', 'rpnhead.conv'), + ('rpn_head.rpn_cls', 'rpnhead.objectness_logits'), + ('rpn_head.rpn_reg', 'rpnhead.anchor_deltas'), + ('roi_head.bbox_head.fc_cls', 'roihead.predictor.cls_fc'), + ('roi_head.bbox_head.fc_reg', 'roihead.predictor.reg_fc'), + ('roi_head.bbox_head.shared_fcs.0', 'roihead.predictor.forward_net.linear0'), + ('roi_head.bbox_head.shared_fcs.1', 'roihead.predictor.forward_net.linear1'), + ] + # Add mask head + + return mapping + + +def torch_to_paddle_mapping_backbone(): + mapping = [] + + for embed_idx in range(1, 5): + th_embed_prefix = f'backbone.patch_embed{embed_idx}' + pp_embed_prefix = f'backbone.patch_embedding{embed_idx}' + + mapping.append((f'{th_embed_prefix}.proj', + f'{pp_embed_prefix}.patch_embed')) + mapping.append((f'{th_embed_prefix}.norm', + f'{pp_embed_prefix}.norm')) + + for i in range(5): + mapping.append((f'backbone.norm{i}', + f'backbone.norm{i}')) + + block_depth = config.MODEL.TRANS.STAGE_DEPTHS # [2, 2, 2, 2] + + for block_idx in range(1, len(block_depth) + 1): + th_block_prefix = f'backbone.block{block_idx}' + pp_block_prefix = f'backbone.block{block_idx}' + + for layer_idx in range(block_depth[block_idx-1]): + th_prefix = f'{th_block_prefix}.{layer_idx}' + pp_prefix = f'{pp_block_prefix}.{layer_idx}' + layer_mapping = [ + (f'{th_prefix}.norm1', f'{pp_prefix}.norm1'), + 
(f'{th_prefix}.attn.q', f'{pp_prefix}.attn.q'), + (f'{th_prefix}.attn.kv', f'{pp_prefix}.attn.kv'), + (f'{th_prefix}.attn.proj', f'{pp_prefix}.attn.proj'), + (f'{th_prefix}.attn.sr', f'{pp_prefix}.attn.sr'), + (f'{th_prefix}.attn.norm', f'{pp_prefix}.attn.norm'), + (f'{th_prefix}.norm2', f'{pp_prefix}.norm2'), + (f'{th_prefix}.mlp.fc1', f'{pp_prefix}.mlp.fc1'), + (f'{th_prefix}.mlp.fc2', f'{pp_prefix}.mlp.fc2'), + (f'{th_prefix}.mlp.dwconv.dwconv', f'{pp_prefix}.mlp.dwconv.dwconv'), + ] + mapping.extend(layer_mapping) + return mapping + + +def convert_from_torch_state_dict(torch_model_state_dict, paddle_model): + def _set_value(th_name, pd_name, transpose=True): + th_shape = th_params[th_name].shape + pd_shape = tuple(pd_params[pd_name].shape) # paddle shape default type is list + #assert th_shape == pd_shape, f'{th_shape} != {pd_shape}' + print(f'***SET*** {th_name} {th_shape} ***TO*** {pd_name} {pd_shape}') + if isinstance(th_params[th_name], torch.nn.parameter.Parameter): + value = th_params[th_name].data.numpy() + else: + value = th_params[th_name].numpy() + if len(value.shape) == 2 and transpose: + value = value.transpose((1, 0)) + pd_params[pd_name].set_value(value) + + # 1. get paddle and torch model parameters + pd_params = {} + for name, param in paddle_model.named_parameters(): + pd_params[name] = param + for name, buff in paddle_model.named_buffers(): + pd_params[name] = buff + + th_params = torch_model_state_dict + + # 2. get name mapping pairs + mapping = torch_to_paddle_mapping() + + # 3. set torch param values to paddle params: may needs transpose on weights + for th_name, pd_name in mapping: + if th_name in th_params.keys(): # nn.Parameters + _set_value(th_name, pd_name) + else: # weight & bias + if f'{th_name}.weight' in th_params.keys(): + th_name_w = f'{th_name}.weight' + pd_name_w = f'{pd_name}.weight' + _set_value(th_name_w, pd_name_w) + + if f'{th_name}.bias' in th_params.keys(): + th_name_b = f'{th_name}.bias' + pd_name_b = f'{pd_name}.bias' + _set_value(th_name_b, pd_name_b) + + return paddle_model + + +def get_nested_tensors(): + with open('./t.npy', 'rb') as infile: + t = np.load(infile) + m = np.load(infile) + gts = np.load(infile, allow_pickle=True).item() + + #print(t.shape) + #print(m.shape) + + tt = torch.Tensor(t) + mm = torch.Tensor(m) + th_in = th_utils.NestedTensor(tt, mm) + + ttt = paddle.to_tensor(t) + mmm = paddle.to_tensor(m) + pp_in = NestedTensor(ttt, mmm) + + #print(th_in, th_in.tensors.shape) + #print(pp_in, pp_in.tensors.shape) + + targets = {} + for key, gt in gts.items(): + targets[key] = [] + for val in gt: + targets[key].append(paddle.to_tensor(val)) + pp_gt = targets + + + return pp_in, th_in, pp_gt + + + + +#def get_nested_tensors(): +# samples = paddle.load(path='./batch_samples_01.pdtensor') +# pp_in = NestedTensor(samples['tensors'], samples['mask']) +# pp_target = paddle.load(path='./batch_targets_01.pdtensor') +# +# samples_tensor = samples['tensors'].cpu().numpy() +# samples_mask = samples['mask'].cpu().numpy() +# th_tensor = torch.Tensor(samples_tensor) +# th_mask = torch.Tensor(samples_mask) +# th_in = ThNestedTensor(th_tensor, th_mask) +# th_target = [] +# for item in pp_target: +# sample_gt = dict() +# for key, val in item.items(): +# th_tensor = torch.Tensor(val.cpu().numpy()) +# sample_gt[key] = th_tensor +# th_target.append(sample_gt) +# +# return th_in, th_target, pp_in, pp_target + + +def get_nested_tensors_random(): + x = np.random.randn(1, 3, 224, 224).astype('float32') + mask = np.ones([1, 224, 224]) + + pp_x = 
paddle.to_tensor(x) + pp_mask = paddle.to_tensor(mask) + pp_in = NestedTensor(pp_x, pp_mask) + th_tensor = torch.Tensor(x) + th_mask = torch.Tensor(mask) + th_in = ThNestedTensor(th_tensor, th_mask) + th_target = [] + pp_target = [] + + return th_in, th_target, pp_in, pp_target + + +def main(): + + paddle.set_device('cpu') + + #th_in, th_target, pp_in, pp_target = get_nested_tensors() + + paddle_model = build_pvtv2_det(config) + paddle_model.eval() + + print_model_named_params(paddle_model) + print_model_named_buffers(paddle_model) + print('------------paddle model finish ----------------------') + + + #device = torch.device('cpu') + #torch_model = + #torch_model = torch_model.to(device) + #torch_model.eval() + + #print_model_named_params(torch_model) + #print_model_named_buffers(torch_model) + #print('----------torch model finish------------------------') + + torch_state_dict = torch.load('./pth_weights/mask_rcnn_pvt_v2_b3_fpn_1x_coco.pth') + # dict_keys(['meta', 'state_dict', 'optimizer']) + for key, val in torch_state_dict['state_dict'].items(): + print(key, val.shape) + print('----------torch model finish------------------------') + torch_model_state_dict = torch_state_dict['state_dict'] + + # convert weights + paddle_model = convert_from_torch_state_dict(torch_model_state_dict, paddle_model) + + + # check correctness + #th_in, th_target, pp_in, pp_target = get_nested_tensors() + #th_in, th_target, pp_in, pp_target = get_nested_tensors_random() + #x = np.random.randn(1, 3, 224, 224).astype('float32') + #x_paddle = paddle.to_tensor(x) + #x_torch = torch.Tensor(x).to(device) + + + + #print(pp_in.tensors) + #print(pp_in.mask) + #print('-------- pp in finish ------------------') + + + #print(th_in.tensors, th_in.tensors.shape) + #print(th_in.mask, th_in.mask.shape) + #print('-------- th in finish ------------------') + + # save weights for paddle model + model_path = os.path.join('./pvtv2_b3_maskrcnn.pdparams') + paddle.save(paddle_model.state_dict(), model_path) + + + + # pp_in, th_in, pp_gt = get_nested_tensors() + # print('pp_in: ', pp_in.tensors.shape) + + # out_paddle = paddle_model(pp_in, pp_gt) + # print('paddle_out = ', out_paddle) + + + + + #loss = paddle_criterion(out_paddle, pp_gt) + #print('=============== loss =============') + #for key, val in loss.items(): + # print(key, val.cpu().numpy()) + + #print(out_paddle['pred_logits'], out_paddle['pred_logits'].shape) + #print(out_paddle['pred_boxes'], out_paddle['pred_boxes'].shape) + #print('---------- paddle out finish ------------------------') + + #out_torch = torch_model(th_in) + #print(out_torch['pred_logits'], out_torch['pred_logits'].shape) + #print(out_torch['pred_boxes'], out_torch['pred_boxes'].shape) + #print('---------- torch out finish ------------------------') + + #out_torch = out_torch.data.cpu().numpy() + #out_paddle = out_paddle.cpu().numpy() + + #print(out_torch.shape, out_paddle.shape) + #print(out_torch[0:100]) + #print(out_paddle[0:100]) + #assert np.allclose(out_torch, out_paddle, atol = 1e-5) +# + # save weights for paddle model + #model_path = os.path.join('./detr_resnet50.pdparams') + #paddle.save(paddle_model.state_dict(), model_path) + + +if __name__ == "__main__": + main() diff --git a/object_detection/PVTv2/ported_weights/load_pytorch_weights_b4.py b/object_detection/PVTv2/ported_weights/load_pytorch_weights_b4.py new file mode 100644 index 00000000..c9221108 --- /dev/null +++ b/object_detection/PVTv2/ported_weights/load_pytorch_weights_b4.py @@ -0,0 +1,356 @@ +# Copyright (c) 2021 PPViT Authors. 
All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import sys +sys.path.append('/root/.cache/torch/hub/facebookresearch_detr_master/util/') + +#from misc import NestedTensor as ThNestedTensor +import os +import argparse +import numpy as np +import paddle +import torch +from config import get_config +from pvtv2_det import build_pvtv2_det +from model_utils import DropPath + +#from pvt_det_pth.PVT.detection + +#import timm +#from transformer import * +#from config import * +#from detr import build_detr +from utils import NestedTensor +from misc import NestedTensor as ThNestedTensor + +import misc as th_utils +#config = get_config() +#parser = argparse.ArgumentParser('') +#parser.add_argument('-cfg', type=str, default='./configs/vit_large_patch16_224.yaml') +##parser.add_argument('-dataset', type=str, default="imagenet2012") +#parser.add_argument('-dataset', type=str, default="cifar10") +#parser.add_argument('-batch_size', type=int, default=4) +#parser.add_argument('-image_size', type=int, default=224) +#parser.add_argument('-data_path', type=str, default='/dataset/imagenet/') +#parser.add_argument('-eval', action="store_true") +#parser.add_argument('-pretrained', type=str, default=None) +#args = parser.parse_args() +# +#config = get_config() +#config = update_config(config, args) +#print(config) + + +config = get_config('./configs/pvtv2_b4.yaml') + + +def print_model_named_params(model): + for name, param in model.named_parameters(): + print(name, param.shape) + + +def print_model_named_buffers(model): + for name, buff in model.named_buffers(): + print(name, buff.shape) + + +def torch_to_paddle_mapping(): + map1 = torch_to_paddle_mapping_backbone() + map2 = torch_to_paddle_mapping_neck() + map3 = torch_to_paddle_mapping_head() + map1.extend(map2) + map1.extend(map3) + return map1 + + +def torch_to_paddle_mapping_neck(): + mapping = [] + for i in range(len(config.MODEL.TRANS.OUT_INDICES)): + th_prefix = f'neck.lateral_convs.{i}.conv' + pp_prefix = f'neck.fpn_lateral{i+2}.conv' + mapping.append((th_prefix, pp_prefix)) + + th_prefix = f'neck.fpn_convs.{i}.conv' + pp_prefix = f'neck.fpn_output{i+2}.conv' + mapping.append((th_prefix, pp_prefix)) + + return mapping + + +def torch_to_paddle_mapping_head(): + mapping = [ + ('rpn_head.rpn_conv', 'rpnhead.conv'), + ('rpn_head.rpn_cls', 'rpnhead.objectness_logits'), + ('rpn_head.rpn_reg', 'rpnhead.anchor_deltas'), + ('roi_head.bbox_head.fc_cls', 'roihead.predictor.cls_fc'), + ('roi_head.bbox_head.fc_reg', 'roihead.predictor.reg_fc'), + ('roi_head.bbox_head.shared_fcs.0', 'roihead.predictor.forward_net.linear0'), + ('roi_head.bbox_head.shared_fcs.1', 'roihead.predictor.forward_net.linear1'), + ] + # Add mask head + + return mapping + + +def torch_to_paddle_mapping_backbone(): + mapping = [] + + for embed_idx in range(1, 5): + th_embed_prefix = f'backbone.patch_embed{embed_idx}' + pp_embed_prefix = f'backbone.patch_embedding{embed_idx}' + + mapping.append((f'{th_embed_prefix}.proj', + f'{pp_embed_prefix}.patch_embed')) + 
mapping.append((f'{th_embed_prefix}.norm', + f'{pp_embed_prefix}.norm')) + + for i in range(5): + mapping.append((f'backbone.norm{i}', + f'backbone.norm{i}')) + + block_depth = config.MODEL.TRANS.STAGE_DEPTHS # [2, 2, 2, 2] + + for block_idx in range(1, len(block_depth) + 1): + th_block_prefix = f'backbone.block{block_idx}' + pp_block_prefix = f'backbone.block{block_idx}' + + for layer_idx in range(block_depth[block_idx-1]): + th_prefix = f'{th_block_prefix}.{layer_idx}' + pp_prefix = f'{pp_block_prefix}.{layer_idx}' + layer_mapping = [ + (f'{th_prefix}.norm1', f'{pp_prefix}.norm1'), + (f'{th_prefix}.attn.q', f'{pp_prefix}.attn.q'), + (f'{th_prefix}.attn.kv', f'{pp_prefix}.attn.kv'), + (f'{th_prefix}.attn.proj', f'{pp_prefix}.attn.proj'), + (f'{th_prefix}.attn.sr', f'{pp_prefix}.attn.sr'), + (f'{th_prefix}.attn.norm', f'{pp_prefix}.attn.norm'), + (f'{th_prefix}.norm2', f'{pp_prefix}.norm2'), + (f'{th_prefix}.mlp.fc1', f'{pp_prefix}.mlp.fc1'), + (f'{th_prefix}.mlp.fc2', f'{pp_prefix}.mlp.fc2'), + (f'{th_prefix}.mlp.dwconv.dwconv', f'{pp_prefix}.mlp.dwconv.dwconv'), + ] + mapping.extend(layer_mapping) + return mapping + + +def convert_from_torch_state_dict(torch_model_state_dict, paddle_model): + def _set_value(th_name, pd_name, transpose=True): + th_shape = th_params[th_name].shape + pd_shape = tuple(pd_params[pd_name].shape) # paddle shape default type is list + #assert th_shape == pd_shape, f'{th_shape} != {pd_shape}' + print(f'***SET*** {th_name} {th_shape} ***TO*** {pd_name} {pd_shape}') + if isinstance(th_params[th_name], torch.nn.parameter.Parameter): + value = th_params[th_name].data.numpy() + else: + value = th_params[th_name].numpy() + if len(value.shape) == 2 and transpose: + value = value.transpose((1, 0)) + pd_params[pd_name].set_value(value) + + # 1. get paddle and torch model parameters + pd_params = {} + for name, param in paddle_model.named_parameters(): + pd_params[name] = param + for name, buff in paddle_model.named_buffers(): + pd_params[name] = buff + + th_params = torch_model_state_dict + + # 2. get name mapping pairs + mapping = torch_to_paddle_mapping() + + # 3. 
set torch param values to paddle params: may needs transpose on weights + for th_name, pd_name in mapping: + if th_name in th_params.keys(): # nn.Parameters + _set_value(th_name, pd_name) + else: # weight & bias + if f'{th_name}.weight' in th_params.keys(): + th_name_w = f'{th_name}.weight' + pd_name_w = f'{pd_name}.weight' + _set_value(th_name_w, pd_name_w) + + if f'{th_name}.bias' in th_params.keys(): + th_name_b = f'{th_name}.bias' + pd_name_b = f'{pd_name}.bias' + _set_value(th_name_b, pd_name_b) + + return paddle_model + + +def get_nested_tensors(): + with open('./t.npy', 'rb') as infile: + t = np.load(infile) + m = np.load(infile) + gts = np.load(infile, allow_pickle=True).item() + + #print(t.shape) + #print(m.shape) + + tt = torch.Tensor(t) + mm = torch.Tensor(m) + th_in = th_utils.NestedTensor(tt, mm) + + ttt = paddle.to_tensor(t) + mmm = paddle.to_tensor(m) + pp_in = NestedTensor(ttt, mmm) + + #print(th_in, th_in.tensors.shape) + #print(pp_in, pp_in.tensors.shape) + + targets = {} + for key, gt in gts.items(): + targets[key] = [] + for val in gt: + targets[key].append(paddle.to_tensor(val)) + pp_gt = targets + + + return pp_in, th_in, pp_gt + + + + +#def get_nested_tensors(): +# samples = paddle.load(path='./batch_samples_01.pdtensor') +# pp_in = NestedTensor(samples['tensors'], samples['mask']) +# pp_target = paddle.load(path='./batch_targets_01.pdtensor') +# +# samples_tensor = samples['tensors'].cpu().numpy() +# samples_mask = samples['mask'].cpu().numpy() +# th_tensor = torch.Tensor(samples_tensor) +# th_mask = torch.Tensor(samples_mask) +# th_in = ThNestedTensor(th_tensor, th_mask) +# th_target = [] +# for item in pp_target: +# sample_gt = dict() +# for key, val in item.items(): +# th_tensor = torch.Tensor(val.cpu().numpy()) +# sample_gt[key] = th_tensor +# th_target.append(sample_gt) +# +# return th_in, th_target, pp_in, pp_target + + +def get_nested_tensors_random(): + x = np.random.randn(1, 3, 224, 224).astype('float32') + mask = np.ones([1, 224, 224]) + + pp_x = paddle.to_tensor(x) + pp_mask = paddle.to_tensor(mask) + pp_in = NestedTensor(pp_x, pp_mask) + th_tensor = torch.Tensor(x) + th_mask = torch.Tensor(mask) + th_in = ThNestedTensor(th_tensor, th_mask) + th_target = [] + pp_target = [] + + return th_in, th_target, pp_in, pp_target + + +def main(): + + paddle.set_device('cpu') + + #th_in, th_target, pp_in, pp_target = get_nested_tensors() + + paddle_model = build_pvtv2_det(config) + paddle_model.eval() + + print_model_named_params(paddle_model) + print_model_named_buffers(paddle_model) + print('------------paddle model finish ----------------------') + + + #device = torch.device('cpu') + #torch_model = + #torch_model = torch_model.to(device) + #torch_model.eval() + + #print_model_named_params(torch_model) + #print_model_named_buffers(torch_model) + #print('----------torch model finish------------------------') + + torch_state_dict = torch.load('./pth_weights/mask_rcnn_pvt_v2_b4_fpn_1x_coco.pth') + # dict_keys(['meta', 'state_dict', 'optimizer']) + for key, val in torch_state_dict['state_dict'].items(): + print(key, val.shape) + print('----------torch model finish------------------------') + torch_model_state_dict = torch_state_dict['state_dict'] + + # convert weights + paddle_model = convert_from_torch_state_dict(torch_model_state_dict, paddle_model) + + + # check correctness + #th_in, th_target, pp_in, pp_target = get_nested_tensors() + #th_in, th_target, pp_in, pp_target = get_nested_tensors_random() + #x = np.random.randn(1, 3, 224, 224).astype('float32') + 
#x_paddle = paddle.to_tensor(x) + #x_torch = torch.Tensor(x).to(device) + + + + #print(pp_in.tensors) + #print(pp_in.mask) + #print('-------- pp in finish ------------------') + + + #print(th_in.tensors, th_in.tensors.shape) + #print(th_in.mask, th_in.mask.shape) + #print('-------- th in finish ------------------') + + # save weights for paddle model + model_path = os.path.join('./pvtv2_b4_maskrcnn.pdparams') + paddle.save(paddle_model.state_dict(), model_path) + + + + # pp_in, th_in, pp_gt = get_nested_tensors() + # print('pp_in: ', pp_in.tensors.shape) + + # out_paddle = paddle_model(pp_in, pp_gt) + # print('paddle_out = ', out_paddle) + + + + + #loss = paddle_criterion(out_paddle, pp_gt) + #print('=============== loss =============') + #for key, val in loss.items(): + # print(key, val.cpu().numpy()) + + #print(out_paddle['pred_logits'], out_paddle['pred_logits'].shape) + #print(out_paddle['pred_boxes'], out_paddle['pred_boxes'].shape) + #print('---------- paddle out finish ------------------------') + + #out_torch = torch_model(th_in) + #print(out_torch['pred_logits'], out_torch['pred_logits'].shape) + #print(out_torch['pred_boxes'], out_torch['pred_boxes'].shape) + #print('---------- torch out finish ------------------------') + + #out_torch = out_torch.data.cpu().numpy() + #out_paddle = out_paddle.cpu().numpy() + + #print(out_torch.shape, out_paddle.shape) + #print(out_torch[0:100]) + #print(out_paddle[0:100]) + #assert np.allclose(out_torch, out_paddle, atol = 1e-5) +# + # save weights for paddle model + #model_path = os.path.join('./detr_resnet50.pdparams') + #paddle.save(paddle_model.state_dict(), model_path) + + +if __name__ == "__main__": + main() diff --git a/object_detection/PVTv2/ported_weights/load_pytorch_weights_b5.py b/object_detection/PVTv2/ported_weights/load_pytorch_weights_b5.py new file mode 100644 index 00000000..3a63cc2d --- /dev/null +++ b/object_detection/PVTv2/ported_weights/load_pytorch_weights_b5.py @@ -0,0 +1,356 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
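+
+# Port the PyTorch Mask R-CNN + PVTv2-B5 detection weights to Paddle:
+# build the Paddle model from ./configs/pvtv2_b5.yaml, load the torch
+# checkpoint ./pth_weights/mask_rcnn_pvt_v2_b5_fpn_1x_coco.pth, copy each
+# parameter through the torch-to-paddle name mapping defined below
+# (2-D linear weights are transposed), and save the converted state dict
+# to ./pvtv2_b5_maskrcnn.pdparams.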
+ +import sys +sys.path.append('/root/.cache/torch/hub/facebookresearch_detr_master/util/') + +#from misc import NestedTensor as ThNestedTensor +import os +import argparse +import numpy as np +import paddle +import torch +from config import get_config +from pvtv2_det import build_pvtv2_det +from model_utils import DropPath + +#from pvt_det_pth.PVT.detection + +#import timm +#from transformer import * +#from config import * +#from detr import build_detr +from utils import NestedTensor +from misc import NestedTensor as ThNestedTensor + +import misc as th_utils +#config = get_config() +#parser = argparse.ArgumentParser('') +#parser.add_argument('-cfg', type=str, default='./configs/vit_large_patch16_224.yaml') +##parser.add_argument('-dataset', type=str, default="imagenet2012") +#parser.add_argument('-dataset', type=str, default="cifar10") +#parser.add_argument('-batch_size', type=int, default=4) +#parser.add_argument('-image_size', type=int, default=224) +#parser.add_argument('-data_path', type=str, default='/dataset/imagenet/') +#parser.add_argument('-eval', action="store_true") +#parser.add_argument('-pretrained', type=str, default=None) +#args = parser.parse_args() +# +#config = get_config() +#config = update_config(config, args) +#print(config) + + +config = get_config('./configs/pvtv2_b5.yaml') + + +def print_model_named_params(model): + for name, param in model.named_parameters(): + print(name, param.shape) + + +def print_model_named_buffers(model): + for name, buff in model.named_buffers(): + print(name, buff.shape) + + +def torch_to_paddle_mapping(): + map1 = torch_to_paddle_mapping_backbone() + map2 = torch_to_paddle_mapping_neck() + map3 = torch_to_paddle_mapping_head() + map1.extend(map2) + map1.extend(map3) + return map1 + + +def torch_to_paddle_mapping_neck(): + mapping = [] + for i in range(len(config.MODEL.TRANS.OUT_INDICES)): + th_prefix = f'neck.lateral_convs.{i}.conv' + pp_prefix = f'neck.fpn_lateral{i+2}.conv' + mapping.append((th_prefix, pp_prefix)) + + th_prefix = f'neck.fpn_convs.{i}.conv' + pp_prefix = f'neck.fpn_output{i+2}.conv' + mapping.append((th_prefix, pp_prefix)) + + return mapping + + +def torch_to_paddle_mapping_head(): + mapping = [ + ('rpn_head.rpn_conv', 'rpnhead.conv'), + ('rpn_head.rpn_cls', 'rpnhead.objectness_logits'), + ('rpn_head.rpn_reg', 'rpnhead.anchor_deltas'), + ('roi_head.bbox_head.fc_cls', 'roihead.predictor.cls_fc'), + ('roi_head.bbox_head.fc_reg', 'roihead.predictor.reg_fc'), + ('roi_head.bbox_head.shared_fcs.0', 'roihead.predictor.forward_net.linear0'), + ('roi_head.bbox_head.shared_fcs.1', 'roihead.predictor.forward_net.linear1'), + ] + # Add mask head + + return mapping + + +def torch_to_paddle_mapping_backbone(): + mapping = [] + + for embed_idx in range(1, 5): + th_embed_prefix = f'backbone.patch_embed{embed_idx}' + pp_embed_prefix = f'backbone.patch_embedding{embed_idx}' + + mapping.append((f'{th_embed_prefix}.proj', + f'{pp_embed_prefix}.patch_embed')) + mapping.append((f'{th_embed_prefix}.norm', + f'{pp_embed_prefix}.norm')) + + for i in range(5): + mapping.append((f'backbone.norm{i}', + f'backbone.norm{i}')) + + block_depth = config.MODEL.TRANS.STAGE_DEPTHS # [2, 2, 2, 2] + + for block_idx in range(1, len(block_depth) + 1): + th_block_prefix = f'backbone.block{block_idx}' + pp_block_prefix = f'backbone.block{block_idx}' + + for layer_idx in range(block_depth[block_idx-1]): + th_prefix = f'{th_block_prefix}.{layer_idx}' + pp_prefix = f'{pp_block_prefix}.{layer_idx}' + layer_mapping = [ + (f'{th_prefix}.norm1', f'{pp_prefix}.norm1'), + 
(f'{th_prefix}.attn.q', f'{pp_prefix}.attn.q'), + (f'{th_prefix}.attn.kv', f'{pp_prefix}.attn.kv'), + (f'{th_prefix}.attn.proj', f'{pp_prefix}.attn.proj'), + (f'{th_prefix}.attn.sr', f'{pp_prefix}.attn.sr'), + (f'{th_prefix}.attn.norm', f'{pp_prefix}.attn.norm'), + (f'{th_prefix}.norm2', f'{pp_prefix}.norm2'), + (f'{th_prefix}.mlp.fc1', f'{pp_prefix}.mlp.fc1'), + (f'{th_prefix}.mlp.fc2', f'{pp_prefix}.mlp.fc2'), + (f'{th_prefix}.mlp.dwconv.dwconv', f'{pp_prefix}.mlp.dwconv.dwconv'), + ] + mapping.extend(layer_mapping) + return mapping + + +def convert_from_torch_state_dict(torch_model_state_dict, paddle_model): + def _set_value(th_name, pd_name, transpose=True): + th_shape = th_params[th_name].shape + pd_shape = tuple(pd_params[pd_name].shape) # paddle shape default type is list + #assert th_shape == pd_shape, f'{th_shape} != {pd_shape}' + print(f'***SET*** {th_name} {th_shape} ***TO*** {pd_name} {pd_shape}') + if isinstance(th_params[th_name], torch.nn.parameter.Parameter): + value = th_params[th_name].data.numpy() + else: + value = th_params[th_name].numpy() + if len(value.shape) == 2 and transpose: + value = value.transpose((1, 0)) + pd_params[pd_name].set_value(value) + + # 1. get paddle and torch model parameters + pd_params = {} + for name, param in paddle_model.named_parameters(): + pd_params[name] = param + for name, buff in paddle_model.named_buffers(): + pd_params[name] = buff + + th_params = torch_model_state_dict + + # 2. get name mapping pairs + mapping = torch_to_paddle_mapping() + + # 3. set torch param values to paddle params: may needs transpose on weights + for th_name, pd_name in mapping: + if th_name in th_params.keys(): # nn.Parameters + _set_value(th_name, pd_name) + else: # weight & bias + if f'{th_name}.weight' in th_params.keys(): + th_name_w = f'{th_name}.weight' + pd_name_w = f'{pd_name}.weight' + _set_value(th_name_w, pd_name_w) + + if f'{th_name}.bias' in th_params.keys(): + th_name_b = f'{th_name}.bias' + pd_name_b = f'{pd_name}.bias' + _set_value(th_name_b, pd_name_b) + + return paddle_model + + +def get_nested_tensors(): + with open('./t.npy', 'rb') as infile: + t = np.load(infile) + m = np.load(infile) + gts = np.load(infile, allow_pickle=True).item() + + #print(t.shape) + #print(m.shape) + + tt = torch.Tensor(t) + mm = torch.Tensor(m) + th_in = th_utils.NestedTensor(tt, mm) + + ttt = paddle.to_tensor(t) + mmm = paddle.to_tensor(m) + pp_in = NestedTensor(ttt, mmm) + + #print(th_in, th_in.tensors.shape) + #print(pp_in, pp_in.tensors.shape) + + targets = {} + for key, gt in gts.items(): + targets[key] = [] + for val in gt: + targets[key].append(paddle.to_tensor(val)) + pp_gt = targets + + + return pp_in, th_in, pp_gt + + + + +#def get_nested_tensors(): +# samples = paddle.load(path='./batch_samples_01.pdtensor') +# pp_in = NestedTensor(samples['tensors'], samples['mask']) +# pp_target = paddle.load(path='./batch_targets_01.pdtensor') +# +# samples_tensor = samples['tensors'].cpu().numpy() +# samples_mask = samples['mask'].cpu().numpy() +# th_tensor = torch.Tensor(samples_tensor) +# th_mask = torch.Tensor(samples_mask) +# th_in = ThNestedTensor(th_tensor, th_mask) +# th_target = [] +# for item in pp_target: +# sample_gt = dict() +# for key, val in item.items(): +# th_tensor = torch.Tensor(val.cpu().numpy()) +# sample_gt[key] = th_tensor +# th_target.append(sample_gt) +# +# return th_in, th_target, pp_in, pp_target + + +def get_nested_tensors_random(): + x = np.random.randn(1, 3, 224, 224).astype('float32') + mask = np.ones([1, 224, 224]) + + pp_x = 
paddle.to_tensor(x) + pp_mask = paddle.to_tensor(mask) + pp_in = NestedTensor(pp_x, pp_mask) + th_tensor = torch.Tensor(x) + th_mask = torch.Tensor(mask) + th_in = ThNestedTensor(th_tensor, th_mask) + th_target = [] + pp_target = [] + + return th_in, th_target, pp_in, pp_target + + +def main(): + + paddle.set_device('cpu') + + #th_in, th_target, pp_in, pp_target = get_nested_tensors() + + paddle_model = build_pvtv2_det(config) + paddle_model.eval() + + print_model_named_params(paddle_model) + print_model_named_buffers(paddle_model) + print('------------paddle model finish ----------------------') + + + #device = torch.device('cpu') + #torch_model = + #torch_model = torch_model.to(device) + #torch_model.eval() + + #print_model_named_params(torch_model) + #print_model_named_buffers(torch_model) + #print('----------torch model finish------------------------') + + torch_state_dict = torch.load('./pth_weights/mask_rcnn_pvt_v2_b5_fpn_1x_coco.pth') + # dict_keys(['meta', 'state_dict', 'optimizer']) + for key, val in torch_state_dict['state_dict'].items(): + print(key, val.shape) + print('----------torch model finish------------------------') + torch_model_state_dict = torch_state_dict['state_dict'] + + # convert weights + paddle_model = convert_from_torch_state_dict(torch_model_state_dict, paddle_model) + + + # check correctness + #th_in, th_target, pp_in, pp_target = get_nested_tensors() + #th_in, th_target, pp_in, pp_target = get_nested_tensors_random() + #x = np.random.randn(1, 3, 224, 224).astype('float32') + #x_paddle = paddle.to_tensor(x) + #x_torch = torch.Tensor(x).to(device) + + + + #print(pp_in.tensors) + #print(pp_in.mask) + #print('-------- pp in finish ------------------') + + + #print(th_in.tensors, th_in.tensors.shape) + #print(th_in.mask, th_in.mask.shape) + #print('-------- th in finish ------------------') + + # save weights for paddle model + model_path = os.path.join('./pvtv2_b5_maskrcnn.pdparams') + paddle.save(paddle_model.state_dict(), model_path) + + + + # pp_in, th_in, pp_gt = get_nested_tensors() + # print('pp_in: ', pp_in.tensors.shape) + + # out_paddle = paddle_model(pp_in, pp_gt) + # print('paddle_out = ', out_paddle) + + + + + #loss = paddle_criterion(out_paddle, pp_gt) + #print('=============== loss =============') + #for key, val in loss.items(): + # print(key, val.cpu().numpy()) + + #print(out_paddle['pred_logits'], out_paddle['pred_logits'].shape) + #print(out_paddle['pred_boxes'], out_paddle['pred_boxes'].shape) + #print('---------- paddle out finish ------------------------') + + #out_torch = torch_model(th_in) + #print(out_torch['pred_logits'], out_torch['pred_logits'].shape) + #print(out_torch['pred_boxes'], out_torch['pred_boxes'].shape) + #print('---------- torch out finish ------------------------') + + #out_torch = out_torch.data.cpu().numpy() + #out_paddle = out_paddle.cpu().numpy() + + #print(out_torch.shape, out_paddle.shape) + #print(out_torch[0:100]) + #print(out_paddle[0:100]) + #assert np.allclose(out_torch, out_paddle, atol = 1e-5) +# + # save weights for paddle model + #model_path = os.path.join('./detr_resnet50.pdparams') + #paddle.save(paddle_model.state_dict(), model_path) + + +if __name__ == "__main__": + main() diff --git a/object_detection/PVTv2/pvtv2.png b/object_detection/PVTv2/pvtv2.png new file mode 100644 index 00000000..d9eab43c Binary files /dev/null and b/object_detection/PVTv2/pvtv2.png differ diff --git a/object_detection/PVTv2/pvtv2_backbone.py b/object_detection/PVTv2/pvtv2_backbone.py new file mode 100644 index 
00000000..a4bda870 --- /dev/null +++ b/object_detection/PVTv2/pvtv2_backbone.py @@ -0,0 +1,449 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Implement Transformer Class for PVTv2 +""" + +import copy +import paddle +import paddle.nn as nn +from model_utils import DropPath + + +class Identity(nn.Layer): + """ Identity layer + + The output of this layer is the input without any change. + Use this layer to avoid if condition in some forward methods + + """ + def __init__(self): + super(Identity, self).__init__() + + def forward(self, input): + return input + + +class DWConv(nn.Layer): + """Depth-Wise convolution 3x3 + + Improve the local continuity of features. + + """ + def __init__(self, dim=768): + super(DWConv, self).__init__() + self.dwconv = nn.Conv2D(dim, dim, 3, 1, 1, bias_attr=True, groups=dim) + + def forward(self, x, H, W): + B, _, C = x.shape + x = x.transpose([0,2,1]).reshape([B, C, H, W]) + x = self.dwconv(x) + x = x.flatten(2).transpose([0,2,1]) + + return x + + +class OverlapPatchEmbedding(nn.Layer): + """Overlapping Patch Embedding + + Apply Overlapping Patch Embedding on input images. Embeddings is implemented using a Conv2D op. + Making adjacent windows overlap by half of the area, and pad the feature map with zeros to keep + the resolution. + + Attributes: + image_size: int, input image size, default: 224 + patch_size: int, size of patch, default: 7 + in_channels: int, input image channels, default: 3 + embed_dim: int, embedding dimension, default: 768 + """ + + def __init__(self, image_size=224, patch_size=7, stride=4, in_channels=3, embed_dim=768): + super().__init__() + image_size = (image_size, image_size) # TODO: add to_2tuple + patch_size = (patch_size, patch_size) + + self.image_size = image_size + self.patch_size = patch_size + self.H, self.W = image_size[0] // patch_size[0], image_size[1] // patch_size[1] + self.num_patches = self.H * self.W + + self.patch_embed = nn.Conv2D(in_channels=in_channels, + out_channels=embed_dim, + kernel_size=patch_size, + stride=stride, + padding=(patch_size[0] // 2, patch_size[1] // 2)) + self.norm = nn.LayerNorm(embed_dim, epsilon=1e-6) + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform()) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform()) + return weight_attr, bias_attr + + def forward(self, x): + x = self.patch_embed(x) # [batch, embed_dim, h, w] h,w = patch_resolution + _, _, H, W = x.shape + x = x.flatten(start_axis=2, stop_axis=-1) # [batch, embed_dim, h*w] h*w = num_patches + x = x.transpose([0, 2, 1]) # [batch, h*w, embed_dim] + x = self.norm(x) # [batch, num_patches, embed_dim] + + return x, H, W + + +class Mlp(nn.Layer): + """ MLP module + + Impl using nn.Linear and activation is GELU, dropout is applied. 
+ Ops: fc -> dwconv -> act -> dropout -> fc -> dropout + + Attributes: + fc1: nn.Linear + fc2: nn.Linear + dwconv: Depth-Wise Convolution + act: GELU + dropout1: dropout after fc1 + dropout2: dropout after fc2 + """ + + def __init__(self, in_features, hidden_features, dropout=0.0, linear=False): + super(Mlp, self).__init__() + w_attr_1, b_attr_1 = self._init_weights() + self.fc1 = nn.Linear(in_features, + hidden_features, + weight_attr=w_attr_1, + bias_attr=b_attr_1) + + self.dwconv = DWConv(hidden_features) + + w_attr_2, b_attr_2 = self._init_weights() + self.fc2 = nn.Linear(hidden_features, + in_features, + weight_attr=w_attr_2, + bias_attr=b_attr_2) + self.act = nn.GELU() + self.dropout = nn.Dropout(dropout) + self.linear = linear + if self.linear: + self.relu = nn.ReLU() + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.XavierUniform()) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Normal(std=1e-6)) + return weight_attr, bias_attr + + def forward(self, x, H, W): + x = self.fc1(x) + if self.linear: + x = self.relu(x) + x = self.dwconv(x, H, W) + x = self.act(x) + x = self.dropout(x) + x = self.fc2(x) + x = self.dropout(x) + return x + + +class Attention(nn.Layer): + """ Attention module + + Attention module for PvT, here q, k, v are assumed the same. + The qkv mappings are stored as one single param. + + Attributes: + dim: int, input dimension (channels) + num_heads: number of heads + q: a nn.Linear for q mapping + kv: a nn.Linear for kv mapping + qkv_bias: bool, if True, enable learnable bias to q,k,v, default: True + qk_scale: float, override default qk scale head_dim**-0.5 if set, default: None + attn_dropout: dropout for attention + proj_dropout: final dropout before output + softmax: softmax op for attention + linear: bool, if True, use linear spatial reduction attention instead of spatial reduction attention + sr_ratio: the spatial reduction ratio of SRA (linear spatial reduction attention) + """ + + def __init__(self, + dim, + num_heads, + qkv_bias=False, + qk_scale=None, + attention_dropout=0., + dropout=0., + sr_ratio=1, + linear=False): + """init Attention""" + super(Attention, self).__init__() + self.num_heads = num_heads + self.dim = dim + self.dim_head = dim // num_heads + self.scale = qk_scale or self.dim_head ** -0.5 + + self.q = nn.Linear(dim, dim, bias_attr=qkv_bias) + self.kv = nn.Linear(dim, dim * 2, bias_attr=qkv_bias) + self.attn_dropout = nn.Dropout(attention_dropout) + self.proj = nn.Linear(dim, dim) + self.proj_dropout = nn.Dropout(dropout) + self.softmax = nn.Softmax(axis=-1) + + self.linear = linear + self.sr_ratio = sr_ratio + if not linear: + if sr_ratio > 1: + self.sr = nn.Conv2D(dim, dim, kernel_size=sr_ratio, stride=sr_ratio) + self.norm = nn.LayerNorm(dim, epsilon=1e-5) + else: + self.pool = nn.AdaptiveAvgPool2D(7) + self.sr = nn.Conv2D(dim, dim, kernel_size=1, stride=1) + self.norm = nn.LayerNorm(dim, epsilon=1e-5) + self.act = nn.GELU() + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform()) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform()) + return weight_attr, bias_attr + + def forward(self, x, H, W): + B, N, C = x.shape + q = self.q(x).reshape([B, N, self.num_heads, C // self.num_heads]).transpose([0, 2, 1, 3]) + + if not self.linear: + if self.sr_ratio > 1: + x_ = x.transpose([0, 2, 1]).reshape([B, C, H, W]) + x_ = self.sr(x_).reshape([B, C, -1]).transpose([0, 2, 1]) + x_ = self.norm(x_) + kv = 
self.kv(x_).reshape([B, -1, 2, self.num_heads, C // self.num_heads]).transpose([2, 0, 3, 1, 4]) + else: + kv = self.kv(x).reshape([B, -1, 2, self.num_heads, C // self.num_heads]).transpose([2, 0, 3, 1, 4]) + else: + x_ = x.transpose([0, 2, 1]).reshape([B, C, H, W]) + x_ = self.sr(self.pool(x_)).reshape([B, C, -1]).transpose([0, 2, 1]) + x_ = self.norm(x_) + x_ = self.act(x_) + kv = self.kv(x_).reshape([B, -1, 2, self.num_heads, C // self.num_heads]).transpose([2, 0, 3, 1, 4]) + k, v = kv[0], kv[1] + + attn = paddle.matmul(q, k, transpose_y=True) + attn = attn * self.scale + attn = self.softmax(attn) + attn = self.attn_dropout(attn) + + z = paddle.matmul(attn, v) + z = z.transpose([0, 2, 1, 3]) + new_shape = z.shape[:-2] + [self.dim] + z = z.reshape(new_shape) + z = self.proj(z) + z = self.proj_dropout(z) + + return z + + +class PvTv2Block(nn.Layer): + """Pyramid VisionTransformerV2 block + + Contains multi head efficient self attention, droppath, mlp, norm. + + Attributes: + dim: int, input dimension (channels) + num_heads: int, number of attention heads + mlp_ratio: float, ratio of mlp hidden dim and input embedding dim, default: 4. + sr_ratio: the spatial reduction ratio of SRA (linear spatial reduction attention) + qkv_bias: bool, if True, enable learnable bias to q,k,v, default: True + qk_scale: float, override default qk scale head_dim**-0.5 if set, default: None + dropout: float, dropout for output, default: 0. + attention_dropout: float, dropout of attention, default: 0. + drop_path: float, drop path rate, default: 0. + """ + + def __init__(self, dim, num_heads, mlp_ratio=4., qkv_bias=False, qk_scale=None, dropout=0., + attention_dropout=0., drop_path=0., sr_ratio=1, linear=False): + super(PvTv2Block, self).__init__() + self.norm1 = nn.LayerNorm(dim, epsilon=1e-6) + self.attn = Attention(dim, + num_heads=num_heads, + qkv_bias=qkv_bias, + qk_scale=qk_scale, + attention_dropout=attention_dropout, + dropout=dropout, + sr_ratio=sr_ratio, + linear=linear) + + self.drop_path = DropPath(drop_path) if drop_path > 0. else Identity() + self.norm2 = nn.LayerNorm(dim, epsilon=1e-6) + self.mlp = Mlp(in_features=dim, + hidden_features=int(dim*mlp_ratio), + dropout=dropout, + linear=linear) + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform()) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform()) + return weight_attr, bias_attr + + def forward(self, x, H, W): + x = x + self.drop_path(self.attn(self.norm1(x), H, W)) + x = x + self.drop_path(self.mlp(self.norm2(x), H, W)) + + return x + + +class PyramidVisionTransformerV2(nn.Layer): + """PyramidVisionTransformerV2 class + + Attributes: + patch_size: int, size of patch + image_size: int, size of image + num_classes: int, num of image classes + in_channels: int, channel of input image + num_heads: int, num of heads in attention module + num_stages: int, num of stages contains OverlapPatch embedding and PvTv2 blocks + depths: list of int, num of PvTv2 blocks in each stage + mlp_ratio: float, hidden dimension of mlp layer is mlp_ratio * mlp input dim + sr_ratio: the spatial reduction ratio of SRA (linear spatial reduction attention) + qkv_bias: bool, if True, set qkv layers have bias enabled + qk_scale: float, scale factor for qk. + embed_dims: list of int, output dimension of patch embedding + dropout: float, dropout rate for linear layer + attention_dropout: float, dropout rate for attention + drop_path: float, drop path rate, default: 0. 
+ linear: bool, if True, use linear spatial reduction attention instead of spatial reduction attention + patch_embedding: PatchEmbedding, patch embedding instance + norm: nn.LayerNorm, norm layer applied after transformer + fc: nn.Linear, classifier op. + """ + + def __init__(self, + image_size=224, + patch_size=4, + embed_dims=[32, 64, 160, 256], + num_classes=1000, + in_channels=3, + num_heads=[1, 2, 5, 8], + depths=[2, 2, 2, 2], + mlp_ratio=[8, 8, 4, 4], + sr_ratio=[8, 4, 2, 1], + qkv_bias=True, + qk_scale=None, + dropout=0., + attention_dropout=0., + drop_path=0., + linear=False, + pretrained=None): + super(PyramidVisionTransformerV2, self).__init__() + + self.patch_size = patch_size + self.image_size = image_size + #self.num_classes = num_classes + self.in_channels = in_channels + self.num_heads = num_heads + self.depths = depths + self.num_stages = len(self.depths) + self.mlp_ratio = mlp_ratio + self.sr_ratio = sr_ratio + self.qkv_bias = qkv_bias + self.qk_scale = qk_scale + self.embed_dims = embed_dims + self.dropout = dropout + self.attention_dropout = attention_dropout + self.drop_path = drop_path + self.linear = linear + + depth_decay = [x.item() for x in paddle.linspace(0, self.drop_path, sum(self.depths))] + cur = 0 + + for i in range(self.num_stages): + patch_embedding = OverlapPatchEmbedding(image_size=self.image_size if i == 0 else self.image_size // (2 ** (i + 1)), + patch_size=7 if i == 0 else 3, + stride=4 if i == 0 else 2, + in_channels=self.in_channels if i == 0 else self.embed_dims[i - 1], + embed_dim=self.embed_dims[i]) + + block = nn.LayerList([copy.deepcopy(PvTv2Block( + dim=self.embed_dims[i], num_heads=self.num_heads[i], mlp_ratio=self.mlp_ratio[i], qkv_bias=self.qkv_bias, + qk_scale=self.qk_scale, dropout=self.dropout, attention_dropout=self.attention_dropout, + drop_path=depth_decay[cur + j], sr_ratio=self.sr_ratio[i], linear=self.linear)) + for j in range(self.depths[i])]) + norm = nn.LayerNorm(self.embed_dims[i], epsilon=1e-6) + cur += self.depths[i] + + setattr(self, f"patch_embedding{i + 1}", patch_embedding) + setattr(self, f"block{i + 1}", block) + setattr(self, f"norm{i + 1}", norm) + + # classification head + #self.head = nn.Linear(self.embed_dims[3], self.num_classes) if self.num_classes > 0 else Identity() + + self.init_weights(pretrained) + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform()) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform()) + return weight_attr, bias_attr + + def init_weights(self, pretrained=None): + if isinstance(pretrained, str): + model_state_dict = paddle.load(pretrained) + self.set_state_dict(model_state_dict) + + def freeze_patch_embedding(self): + self.patch_embedding1.requires_grad = False + + def forward_features(self, x): + B = x.shape[0] + outs = [] + + for i in range(self.num_stages): + patch_embedding = getattr(self, f"patch_embedding{i + 1}") + block = getattr(self, f"block{i + 1}") + norm = getattr(self, f"norm{i + 1}") + x, H, W = patch_embedding(x) + + for idx, blk in enumerate(block): + x = blk(x, H, W) + x = norm(x) + #if i != self.num_stages - 1: + # x = x.reshape([B, H, W, -1]).transpose([0, 3, 1, 2]) + + x = x.reshape([B, H, W, -1]).transpose([0, 3, 1, 2]) + outs.append(x) + + return outs + + def forward(self, x): + x = self.forward_features(x) + #x = self.head(x) + + return x + + +def build_pvtv2(config): + model = PyramidVisionTransformerV2( + image_size=config.DATA.IMAGE_SIZE, + patch_size=config.MODEL.TRANS.PATCH_SIZE, + 
embed_dims=config.MODEL.TRANS.EMBED_DIMS, + num_classes=config.MODEL.NUM_CLASSES, + in_channels=config.MODEL.TRANS.IN_CHANNELS, + num_heads=config.MODEL.TRANS.NUM_HEADS, + depths=config.MODEL.TRANS.STAGE_DEPTHS, + mlp_ratio=config.MODEL.TRANS.MLP_RATIO, + sr_ratio=config.MODEL.TRANS.SR_RATIO, + qkv_bias=config.MODEL.TRANS.QKV_BIAS, + qk_scale=config.MODEL.TRANS.QK_SCALE, + dropout=config.MODEL.DROPOUT, + attention_dropout=config.MODEL.ATTENTION_DROPOUT, + drop_path=config.MODEL.DROP_PATH, + linear=config.MODEL.TRANS.LINEAR, + pretrained=None) + #pretrained='/workspace/ppvit_github/weights/pvtv2/pvtv2_b0.pdparams') + return model diff --git a/object_detection/PVTv2/pvtv2_det.py b/object_detection/PVTv2/pvtv2_det.py new file mode 100644 index 00000000..8478cf2b --- /dev/null +++ b/object_detection/PVTv2/pvtv2_det.py @@ -0,0 +1,68 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""PVTv2 Object Detection""" + +import paddle +import paddle.nn as nn +from config import get_config +from pvtv2_backbone import build_pvtv2 +from det_necks.fpn import FPN, LastLevelMaxPool +from det_heads.maskrcnn_head.rpn_head import RPNHead +from det_heads.maskrcnn_head.roi_head import RoIHead + +cfg = get_config() + +class PVTv2Det(nn.Layer): + def __init__(self, config): + super().__init__() + self.backbone = build_pvtv2(config) + self.neck = FPN( + in_channels=config.FPN.IN_CHANNELS, + out_channel=config.FPN.OUT_CHANNELS, + strides=config.FPN.STRIDES, + use_c5=config.FPN.USE_C5, + top_block=LastLevelMaxPool(), + use_bias=True + ) + self.rpnhead = RPNHead(config) + self.roihead = RoIHead(config) + + self.config = config + + def forward(self, x, gt=None): + feats = self.neck(self.backbone(x.tensors)) + rpn_out = self.rpnhead(feats, gt) + + if self.training and self.config.ROI.PAT_GT_AS_PRO: + proposals = [] + for proposal, gt_box in zip(rpn_out[0], gt["gt_boxes"]): + proposals.append(paddle.concat([proposal, gt_box])) + else: + proposals = rpn_out[0] + + final_out = self.roihead(feats, proposals, gt) + #print('final_out:', final_out) + + if self.training: + rpn_losses = rpn_out[2] + # if training, final_out returns losses, now we combine the losses dicts + final_out.update(rpn_losses) + + return final_out + + +def build_pvtv2_det(config): + model = PVTv2Det(config) + return model diff --git a/object_detection/PVTv2/random_erasing.py b/object_detection/PVTv2/random_erasing.py new file mode 100644 index 00000000..a3f7d3b5 --- /dev/null +++ b/object_detection/PVTv2/random_erasing.py @@ -0,0 +1,108 @@ +import random +import math +import paddle + + +def _get_pixels(per_pixel, rand_color, patch_size, dtype="float32"): + if per_pixel: + return paddle.normal(shape=patch_size).astype(dtype) + elif rand_color: + return paddle.normal(shape=(patch_size[0], 1, 1)).astype(dtype) + else: + return paddle.zeros((patch_size[0], 1, 1)).astype(dtype) + +class RandomErasing(object): + """ + Args: + prob: probability of performing random erasing + min_area: Minimum percentage 
of erased area wrt input image area + max_area: Maximum percentage of erased area wrt input image area + min_aspect: Minimum aspect ratio of earsed area + max_aspect: Maximum aspect ratio of earsed area + mode: pixel color mode, in ['const', 'rand', 'pixel'] + 'const' - erase block is constant valued 0 for all channels + 'rand' - erase block is valued random color (same per-channel) + 'pixel' - erase block is vauled random color per pixel + min_count: Minimum # of ereasing blocks per image. + max_count: Maximum # of ereasing blocks per image. Area per box is scaled by count + per-image count is randomly chosen between min_count to max_count + """ + def __init__(self, prob=0.5, min_area=0.02, max_area=1/3, min_aspect=0.3, max_aspect=None, + mode='const', min_count=1, max_count=None, num_splits=0): + self.prob = prob + self.min_area = min_area + self.max_area = max_area + max_aspect = max_aspect or 1 / min_aspect + self.log_aspect_ratio = (math.log(min_aspect), math.log(max_aspect)) + self.min_count = min_count + self.max_count = max_count or min_count + self.num_splits = num_splits + mode = mode.lower() + self.rand_color = False + self.per_pixel = False + if mode == "rand": + self.rand_color = True + elif mode == "pixel": + self.per_pixel = True + else: + assert not mode or mode == "const" + + def _erase(self, img, chan, img_h, img_w, dtype): + if random.random() > self.prob: + return + area = img_h * img_w + count = self.min_count if self.min_count == self.max_count else \ + random.randint(self.min_count, self.max_count) + for _ in range(count): + for attempt in range(10): + target_area = random.uniform(self.min_area, self.max_area) * area / count + aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio)) + h = int(round(math.sqrt(target_area * aspect_ratio))) + w = int(round(math.sqrt(target_area / aspect_ratio))) + #print(h, w) + if w < img_w and h < img_h: + top = random.randint(0, img_h - h) + left = random.randint(0, img_w - w) + #print(top, left) + + img[:, top:top+h, left:left+w] = _get_pixels( + self.per_pixel, self.rand_color, (chan, h, w), + dtype=dtype) + #print(_get_pixels( + # self.per_pixel, self.rand_color, (chan, h, w), + # dtype=dtype)) + break + + def __call__(self, input): + if len(input.shape) == 3: + self._erase(input, *input.shape, input.dtype) + else: + batch_size, chan, img_h, img_w = input.shape + batch_start = batch_size // self.num_splits if self.num_splits > 1 else 0 + for i in range(batch_start, batch_size): + self._erase(input[i], chan, img_h, img_w, input.dtype) + return input + + + +def main(): + re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='rand') + #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='const') + #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='pixel') + import PIL.Image as Image + import numpy as np + paddle.set_device('cpu') + img = paddle.to_tensor(np.asarray(Image.open('./lenna.png'))).astype('float32') + img = img / 255.0 + img = paddle.transpose(img, [2, 0, 1]) + new_img = re(img) + new_img = new_img * 255.0 + new_img = paddle.transpose(new_img, [1, 2, 0]) + new_img = new_img.cpu().numpy() + new_img = Image.fromarray(new_img.astype('uint8')) + new_img.save('./res.png') + + + +if __name__ == "__main__": + main() diff --git a/object_detection/PVTv2/run_eval.sh b/object_detection/PVTv2/run_eval.sh new file mode 100644 index 00000000..47e7a97f --- /dev/null +++ b/object_detection/PVTv2/run_eval.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ 
+-cfg='./configs/pvtv2_b0.yaml' \ +-dataset='coco' \ +-batch_size=2 \ +-data_path='/dataset/coco' \ +-eval \ +-pretrained='./pvtv2_b0_maskrcnn' diff --git a/object_detection/PVTv2/run_eval_multi.sh b/object_detection/PVTv2/run_eval_multi.sh new file mode 100644 index 00000000..8609fde7 --- /dev/null +++ b/object_detection/PVTv2/run_eval_multi.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=4,5,6,7 \ +python main_multi_gpu.py \ +-cfg='./configs/pvtv2_b0.yaml' \ +-dataset='coco' \ +-batch_size=4 \ +-data_path='/dataset/coco' \ +-eval \ +-pretrained='./pvtv2_b0_maskrcnn' diff --git a/object_detection/PVTv2/run_eval_multi_b1.sh b/object_detection/PVTv2/run_eval_multi_b1.sh new file mode 100644 index 00000000..0fc60034 --- /dev/null +++ b/object_detection/PVTv2/run_eval_multi_b1.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=4,5,6,7 \ +python main_multi_gpu.py \ +-cfg='./configs/pvtv2_b1.yaml' \ +-dataset='coco' \ +-batch_size=4 \ +-data_path='/dataset/coco' \ +-eval \ +-pretrained='./pvtv2_b1_maskrcnn' diff --git a/object_detection/PVTv2/run_eval_multi_b2.sh b/object_detection/PVTv2/run_eval_multi_b2.sh new file mode 100644 index 00000000..85b58bf8 --- /dev/null +++ b/object_detection/PVTv2/run_eval_multi_b2.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ +-cfg='./configs/pvtv2_b2.yaml' \ +-dataset='coco' \ +-batch_size=4 \ +-data_path='/dataset/coco' \ +-eval \ +-pretrained='./pvtv2_b2_maskrcnn' diff --git a/object_detection/PVTv2/run_eval_multi_b2_linear.sh b/object_detection/PVTv2/run_eval_multi_b2_linear.sh new file mode 100644 index 00000000..edd988d2 --- /dev/null +++ b/object_detection/PVTv2/run_eval_multi_b2_linear.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=4,5,6,7 \ +python main_multi_gpu.py \ +-cfg='./configs/pvtv2_b2_linear.yaml' \ +-dataset='coco' \ +-batch_size=4 \ +-data_path='/dataset/coco' \ +-eval \ +-pretrained='./pvtv2_b2_linear_maskrcnn' diff --git a/object_detection/PVTv2/run_eval_multi_b3.sh b/object_detection/PVTv2/run_eval_multi_b3.sh new file mode 100644 index 00000000..fb3f73a9 --- /dev/null +++ b/object_detection/PVTv2/run_eval_multi_b3.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=4,5,6,7 \ +python main_multi_gpu.py \ +-cfg='./configs/pvtv2_b3.yaml' \ +-dataset='coco' \ +-batch_size=4 \ +-data_path='/dataset/coco' \ +-eval \ +-pretrained='./pvtv2_b3_maskrcnn' diff --git a/object_detection/PVTv2/run_eval_multi_b4.sh b/object_detection/PVTv2/run_eval_multi_b4.sh new file mode 100644 index 00000000..22cfbcc6 --- /dev/null +++ b/object_detection/PVTv2/run_eval_multi_b4.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=4,5,6,7 \ +python main_multi_gpu.py \ +-cfg='./configs/pvtv2_b4.yaml' \ +-dataset='coco' \ +-batch_size=4 \ +-data_path='/dataset/coco' \ +-eval \ +-pretrained='./pvtv2_b4_maskrcnn' diff --git a/object_detection/PVTv2/run_eval_multi_b5.sh b/object_detection/PVTv2/run_eval_multi_b5.sh new file mode 100644 index 00000000..e275ed46 --- /dev/null +++ b/object_detection/PVTv2/run_eval_multi_b5.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ +-cfg='./configs/pvtv2_b5.yaml' \ +-dataset='coco' \ +-batch_size=4 \ +-data_path='/dataset/coco' \ +-eval \ +-pretrained='./pvtv2_b5_maskrcnn' diff --git a/object_detection/PVTv2/run_train.sh b/object_detection/PVTv2/run_train.sh new file mode 100644 index 00000000..b0eacb7b --- /dev/null +++ b/object_detection/PVTv2/run_train.sh @@ -0,0 +1,7 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ +-cfg='./configs/pvtv2_b0.yaml' \ +-dataset='coco' \ +-batch_size=2 \ 
+-data_path='/dataset/coco' \ +-pretrained='./pvtv2_b0_maskrcnn' diff --git a/object_detection/PVTv2/run_train_multi.sh b/object_detection/PVTv2/run_train_multi.sh new file mode 100644 index 00000000..bc939891 --- /dev/null +++ b/object_detection/PVTv2/run_train_multi.sh @@ -0,0 +1,7 @@ +CUDA_VISIBLE_DEVICES=4,5,6,7 \ +python main_multi_gpu.py \ +-cfg='./configs/pvtv2_b0.yaml' \ +-dataset='coco' \ +-batch_size=4 \ +-data_path='/dataset/coco' \ +-pretrained='./pvtv2_b0_maskrcnn' diff --git a/object_detection/PVTv2/transforms.py b/object_detection/PVTv2/transforms.py new file mode 100644 index 00000000..65c1f8ca --- /dev/null +++ b/object_detection/PVTv2/transforms.py @@ -0,0 +1,399 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" Transforms for image data and detection targets""" + +import random +import numpy as np +import PIL +import paddle +import paddle.vision.transforms as T +from paddle.vision.transforms import functional as F +from random_erasing import RandomErasing +from box_ops import box_xyxy_to_cxcywh +from box_ops import box_xyxy_to_cxcywh_numpy + + +def crop(image, target, region): + cropped_image = T.crop(image, *region) + target = target.copy() + i, j, h, w = region + #target['size'] = paddle.to_tensor([h, w]).cpu() + target['size'] = np.array([h, w], dtype='float32') + fields = ['labels', 'area', 'iscrowd'] + + if 'boxes' in target: + boxes = target['boxes'] + #max_size = paddle.to_tensor([h, w], dtype='float32').cpu() + max_size = np.array([h, w], dtype='float32') + #cropped_boxes = boxes - paddle.to_tensor([j, i, j, i], dtype='float32').cpu() # box are (x1, y1, x2, y2) + cropped_boxes = boxes - np.array([j, i, j, i], dtype='float32') # box are (x1, y1, x2, y2) + #cropped_boxes = paddle.minimum(cropped_boxes.reshape([-1, 2, 2]), max_size) + cropped_boxes = np.minimum(cropped_boxes.reshape([-1, 2, 2]), max_size) + cropped_boxes = cropped_boxes.clip(min=0) + area = (cropped_boxes[:, 1, :] - cropped_boxes[:, 0, :]).prod(axis=1) + target['boxes'] = cropped_boxes.reshape([-1, 4]) + target['area'] = area + fields.append('boxes') + + if 'masks' in target: + target['masks'] = target['masks'][:, i:i + h, j:j + w] + fields.append('masks') + + + # remove the boxe or mask if the area is zero + if 'boxes' in target or 'masks' in target: + if 'boxes' in target: + cropped_boxes = target['boxes'].reshape((-1, 2, 2)) + # FIXME: select indices where x2 > x1 and y2 > y1 + # This paddle api will raise error in current env + #keep = paddle.all(cropped_boxes[:, 1, :] > cropped_boxes[:, 0, :], axis=1) + # Instead we use numpy for temp fix + #cropped_boxes = cropped_boxes.cpu().numpy() + keep = np.all(cropped_boxes[:, 1, :] > cropped_boxes[:, 0, :], axis=1) + #keep = keep.cpu().numpy() + else: + keep = target['masks'].flatten(1).any(1) + #keep = keep.cpu().numpy() + + keep_idx = np.where(keep)[0].astype('int32') + #keep = paddle.to_tensor(keep_idx).cpu() + keep = keep_idx + + for field in fields: + #target[field] = 
target[field].index_select(keep, axis=0) + target[field] = target[field][keep] + + return cropped_image, target + + +def hflip(image, target): + flipped_image = T.hflip(image) + w, h = image.size + target = target.copy() + if 'boxes' in target: + boxes = target['boxes'] # n x 4 + #boxes = boxes.index_select(paddle.to_tensor([2, 1, 0, 3], dtype='int32').cpu(), axis=1) + boxes = boxes[:, [2, 1, 0, 3]] + #boxes = boxes * paddle.to_tensor( + # [-1, 1, -1, 1], dtype='float32').cpu() + paddle.to_tensor([w, 0, w, 0], dtype='float32').cpu() + boxes = boxes * np.array([-1, 1, -1, 1], dtype='float32') + np.array([w, 0, w, 0], dtype='float32') + target['boxes'] = boxes + + if 'masks' in target: + target['masks'] = (target['masks']).flip(axis=[-1]) + + return flipped_image, target + + +def resize(image, target, size, max_size=None): + def get_size_with_aspect_ratio(image_size, size, max_size=None): + """ get new image size for rescale, aspect ratio is kept, and longer side must < max_size + Args: + image_size: tuple/list of image width and height + size: length of shorter side of scaled image + max_size: max length of longer side of scaled image + Returns: + size: output image size in (h, w) order. + """ + w, h = image_size + if max_size is not None: + min_original_size = float(min(w, h)) + max_original_size = float(max(w, h)) + # size is shorter side and keep the aspect ratio, if the longer side + # is larger than the max_size + if max_original_size / min_original_size * size > max_size: + # longer side is the max_size, shorter side size is: + size = int(round(max_size * min_original_size / max_original_size)) + if (w <= h and w == size) or (h <= w and h == size): + return (h, w) + + if w < h: + ow = size + oh = int(size * h / w) + else: + oh = size + ow = int(size * w / h) + + return (oh, ow) + + def get_size(image_size, size, max_size=None): + """"get new image size to rescale + Args: + image_size: tuple, Pillow image size, (width, height) + size: int or list/tuple, if size is list or tuple, return + this size as the new image size to rescale, if size is a + single int, then compute the new image size by this size + (as shorter side) and max_size (as longer side), also keep + the same aspect_ratio as original image. 
+ max_size: longest side max size of new image size + Return: + size: tuple, (width, height) + """ + if isinstance(size, (list, tuple)): + return size[::-1] + else: + return get_size_with_aspect_ratio(image_size, size, max_size) + + # STEP0: get new image size + size = get_size(image.size, size, max_size) + # STEP1: resize image with new size + rescaled_image = T.resize(image, size, interpolation='bicubic') # here size is (h, w) + # STEP2: resize targets + if target is None: + return rescaled_image, None + + ratios = tuple(float(s) / float(s_orig) for s, s_orig in zip(rescaled_image.size, image.size)) + ratio_width, ratio_height = ratios + + target = target.copy() + if 'boxes' in target: + boxes = target['boxes'] + if boxes.shape[0] == 0: # empty boxes + scaled_boxes = boxes + else: # this line works well in pytorch, but not in paddle + #scaled_boxes = boxes * paddle.to_tensor([ratio_width, ratio_height, ratio_width, ratio_height]).cpu() + scaled_boxes = boxes * np.array([ratio_width, ratio_height, ratio_width, ratio_height], dtype='float32') + target['boxes'] = scaled_boxes + + if 'area' in target: + area = target['area'] + scaled_area = area * (ratio_width * ratio_height) + target['area'] = scaled_area + + h, w = size + #target['size'] = paddle.to_tensor([h, w]).cpu() + target['size'] = np.array([h, w], dtype='float32') + + if 'masks' in target: + masks = target['masks'] # [N, H, W] + masks = masks.unsqueeze(-1).astype('float32') #[N, H, W, 1] + masks = paddle.to_tensor(masks).cpu() + masks = paddle.nn.functional.interpolate( + masks, size, data_format='NHWC') #[N, H', W', 1] + masks = masks[:, :, :, 0] > 0.5 + masks = masks.astype('int32') + masks = masks.numpy() + target['masks'] = masks + + return rescaled_image, target + + +def pad(image, target, padding=None, size_divisor=None): + + if size_divisor is not None: + pad_w = int(np.ceil(image.size[0] / size_divisor)) * size_divisor + pad_h = int(np.ceil(image.size[1] / size_divisor)) * size_divisor + padding = [pad_w - image.size[0], pad_h - image.size[1]] + + #print('image size = ', image.size) + #print('pad_w = ', pad_w) + #print('pad_h = ', pad_h) + #print('padding = ', padding) + + padded_image = T.pad(image, (0, 0, padding[0], padding[1])) + if target is None: + return padded_image, None + target = target.copy() + #target['size'] = paddle.to_tensor(padded_image.size[::-1]).cpu() + target['size'] = np.array(padded_image.size[::-1], dtype='float32') + if 'masks' in target: + target['masks'] = T.pad(target['masks'], (0, padding[0], 0, padding[1])) + return padded_image, target + + +class RandomCrop(): + def __init__(self, size): + self.size = size + + @staticmethod + def get_param(image, output_size): + def _get_image_size(img): + if F._is_pil_image(img): + return img.size + elif F._is_numpy_image(img): + return img.shape[:2][::-1] + elif F._is_tensor_image(img): + return img.shape[1:][::-1] # chw + else: + raise TypeError("Unexpected type {}".format(type(img))) + + w, h = _get_image_size(image) + th, tw = output_size + if w == tw and h == th: + return 0, 0, h, w + + i = random.randint(0, h - th + 1) + j = random.randint(0, w - tw + 1) + return i, j, th, tw + + def __call__(self, image, target): + region = RandomCrop.get_param(image, self.size) + return crop(image, target, region) + + +class RandomSizeCrop(): + def __init__(self, min_size, max_size): + self.min_size = min_size + self.max_size = max_size + + def __call__(self, image, target): + w = random.randint(self.min_size, min(image.width, self.max_size)) + h = 
random.randint(self.min_size, min(image.height, self.max_size)) + region = RandomCrop.get_param(image, (h, w)) + return crop(image, target, region) + + +class CenterCrop(): + def __init__(self, size): + self.size = size + + def __call__(self, image, target): + image_width, image_height = image.size + crop_height, crop_width = self.size + crop_top = int(round((image_height - crop_height) / 2.)) + crop_left = int(round((image_width - crop_width) / 2.)) + return crop(image, target, (crop_top, crop_left, crop_height, crop_width)) + + +class RandomHorizontalFlip(): + def __init__(self, p=0.5): + self.p = p + + def __call__(self, image, target): + if random.random() < self.p: + return hflip(image, target) + return image, target + + +class RandomResize(): + def __init__(self, sizes, max_size=None): + assert isinstance(sizes, (list, tuple)) + self.sizes = sizes + self.max_size = max_size + + def __call__(self, image, target=None): + size = random.choice(self.sizes) + return resize(image, target, size, self.max_size) + + +class Pad(): + def __init__(self, pad_x=None, pad_y=None, size_divisor=None): + self.pad_x = pad_x + self.pad_y = pad_y + self.size_divisor = size_divisor + + def __call__(self, image, target): + if self.size_divisor is not None: + return pad(image, target, size_divisor=self.size_divisor) + else: + return pad(image, target, (pad_x, pad_y)) + +class RandomPad(): + def __init__(self, max_pad): + self.max_pad = max_pad + + def __call__(self, image, target): + pad_x = random.randint(0, self.max_pad) + pad_y = random.randint(0, self.max_pad) + return pad(image, target, (pad_x, pad_y)) + + +class RandomSelect(): + """ Random select one the transforms to apply with probablity p""" + def __init__(self, transforms1, transforms2, p=0.5): + self.transforms1 = transforms1 + self.transforms2 = transforms2 + self.p = p + + def __call__(self, image, target): + if random.random() > self.p: + return self.transforms1(image, target) + return self.transforms2(image, target) + + +class ToTensor(): + def __call__(self, image, target): + return T.to_tensor(image), target + + +class RandomErasing(): + def __init__(self, *args, **kwargs): + self.eraser = RandomErasing(*args, **kwargs) + + def __call__(self, image, target): + return self.eraser(image), target + + +class Normalize(): + """Normalization for image and labels. 
+ + Specifically, image is normalized with -mean and /std, + boxes are converted to [cx, cy, w, h] format and scaled to + [0, 1] according to image size + """ + + def __init__(self, mean, std, norm_gt=False): + self.mean = mean + self.std = std + self.norm_gt = norm_gt + + def __call__(self, image, target=None): + image = T.functional.normalize(image, mean=self.mean, std=self.std) + if target is None: + return image, None + + if not self.norm_gt: + return image, target + + target = target.copy() + h, w = image.shape[-2:] + if 'boxes' in target and target['boxes'].shape[0] != 0: + boxes = target['boxes'] + boxes = box_xyxy_to_cxcywh_numpy(boxes) + #boxes = boxes / paddle.to_tensor([w, h, w, h], dtype='float32').cpu() + boxes = boxes / np.array([w, h, w, h], dtype='float32') + target['boxes'] = boxes + + return image, target + + +class Compose(): + def __init__(self, transforms): + self.transforms = transforms + + def __call__(self, image, target): + for t in self.transforms: + image, target = t(image, target) + return image, target + + def __repr__(self): + format_string = self.__class__.__name__ + "(" + for t in self.transforms: + format_string += '\n' + format_string += ' {0}'.format(t) + format_string += '\n)' + return format_string + + + + + + + + + + + + diff --git a/object_detection/PVTv2/utils.py b/object_detection/PVTv2/utils.py new file mode 100644 index 00000000..48d47ee8 --- /dev/null +++ b/object_detection/PVTv2/utils.py @@ -0,0 +1,225 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Utilities""" + +import copy +import pickle +import numpy as np +import paddle +import paddle.distributed as dist +from paddle.optimizer.lr import LRScheduler + + +class AverageMeter(): + """ Meter for monitoring losses""" + def __init__(self): + self.avg = 0 + self.sum = 0 + self.cnt = 0 + self.reset() + + def reset(self): + """reset all values to zeros""" + self.avg = 0 + self.sum = 0 + self.cnt = 0 + + def update(self, val, n=1): + """update avg by val and n, where val is the avg of n values""" + self.sum += val * n + self.cnt += n + self.avg = self.sum / self.cnt + + +def _max_by_axis(the_list): + maxes = the_list[0] + for sublist in the_list[1:]: + for idx, item in enumerate(sublist): + maxes[idx] = max(maxes[idx], item) + return maxes + + +class NestedTensor(): + """Each NestedTensor has .tensor and .mask attributes, which are paddle.Tensors""" + def __init__(self, tensors, mask): + self.tensors = tensors + self.mask = mask + + def decompose(self): + return self.tensors, self.mask + + def __repr__(self): + return str(self.tensors) + + +def nested_tensor_from_tensor_list(tensor_list, size_divisibility): + """make the batch handle different image sizes + + This method take a list of tensors with different sizes, + then max size is selected as the final batch size, + smaller samples are padded with zeros(bottom-right), + and corresponding masks are generated. 
+ + """ + max_size = _max_by_axis([list(img.shape) for img in tensor_list]) + + if size_divisibility > 1: + stride = size_divisibility + max_size[1] = (max_size[1] + (stride -1)) // stride * stride + max_size[2] = (max_size[2] + (stride -1)) // stride * stride + + batch_shape = [len(tensor_list)] + max_size # len is the num of images in this batch + b, c, h, w = batch_shape + dtype = tensor_list[0].dtype + data_tensor = paddle.zeros(batch_shape, dtype=dtype) + mask = paddle.ones((b, h, w), dtype='int32') + # zip has broadcast for tensor and mask + #print('===== inside nested_tensor_from_tensor_list') + # zip cannot used in paddle, which will create a new tensor. in pytorch it works well + #for img, pad_img, m in zip(tensor_list, tensor, mask): + # pad_img[: img.shape[0], : img.shape[1], : img.shape[2]] = img + # m[: img.shape[0], :img.shape[1]] = 0 + for idx in range(b): + s0 = tensor_list[idx].shape[0] + s1 = tensor_list[idx].shape[1] + s2 = tensor_list[idx].shape[2] + # direct set value raise error in current env, we use numpy to bypass + #data_tensor[idx, : s0, : s1, : s2] = tensor_list[idx].cpu().numpy() + data_tensor[idx, : s0, : s1, : s2] = tensor_list[idx] + mask[idx, : s1, : s2] = 0 + return NestedTensor(data_tensor, mask) + + +def reduce_dict(input_dict, average=True): + """Impl all_reduce for dict of tensors in DDP""" + world_size = dist.get_world_size() + if world_size < 2: + return input_dict + with paddle.no_grad(): + names = [] + values = [] + for k in sorted(input_dict.keys()): + names.append(k) + values.append(input_dict[k]) + values = paddle.stack(values, axis=0) + dist.all_reduce(values) + if average: + values /= world_size + reduced_dict = {k: v for k, v in zip(names, values)} + return reduced_dict + + +@paddle.no_grad() +def accuracy(output, target, topk=(1,)): + if target.numel() == 0: + return [paddle.zeros([])] + maxk = max(topk) + batch_size = target.size(0) + + _, pred = output.topk(maxk, 1, True, True) + pred = pred.t() + correct = pred.eq(target.reshape(1, -1).expand_as(pred)) + + res = [] + for k in topk: + correct_k = correct[:k].reshape(-1).astype('float32').sum(0) + res.append(correct_k.mul_(100.0 / batch_size)) + return res + + +class WarmupCosineScheduler(LRScheduler): + """Warmup Cosine Scheduler + + First apply linear warmup, then apply cosine decay schedule. + Linearly increase learning rate from "warmup_start_lr" to "start_lr" over "warmup_epochs" + Cosinely decrease learning rate from "start_lr" to "end_lr" over remaining + "total_epochs - warmup_epochs" + + Attributes: + learning_rate: the starting learning rate (without warmup), not used here! 
+ warmup_start_lr: warmup starting learning rate + start_lr: the starting learning rate (without warmup) + end_lr: the ending learning rate after whole loop + warmup_epochs: # of epochs for warmup + total_epochs: # of total epochs (include warmup) + """ + def __init__(self, + learning_rate, + warmup_start_lr, + start_lr, + end_lr, + warmup_epochs, + total_epochs, + cycles=0.5, + last_epoch=-1, + verbose=False): + """init WarmupCosineScheduler """ + self.warmup_epochs = warmup_epochs + self.total_epochs = total_epochs + self.warmup_start_lr = warmup_start_lr + self.start_lr = start_lr + self.end_lr = end_lr + self.cycles = cycles + super(WarmupCosineScheduler, self).__init__(learning_rate, last_epoch, verbose) + + def get_lr(self): + """ return lr value """ + if self.last_epoch < self.warmup_epochs: + val = (self.start_lr - self.warmup_start_lr) * float( + self.last_epoch)/float(self.warmup_epochs) + self.warmup_start_lr + return val + + progress = float(self.last_epoch - self.warmup_epochs) / float( + max(1, self.total_epochs - self.warmup_epochs)) + val = max(0.0, 0.5 * (1. + math.cos(math.pi * float(self.cycles) * 2.0 * progress))) + val = max(0.0, val * (self.start_lr - self.end_lr) + self.end_lr) + return val + + +def all_gather(data): + """ run all_gather on any picklable data (do not requires tensors) + Args: + data: picklable object + Returns: + data_list: list of data gathered from each rank + """ + world_size = dist.get_world_size() + if world_size == 1: + return [data] + + buffer = pickle.dumps(data) #write data into Bytes and stores in buffer + np_buffer = np.frombuffer(buffer, dtype=np.int8) + tensor = paddle.to_tensor(np_buffer, dtype='int32') # uint8 doese not have many ops in paddle + + # obtain Tensor size of each rank + local_size = paddle.to_tensor([tensor.shape[0]]) + size_list = [] + dist.all_gather(size_list, local_size) + max_size = max(size_list) + + # receiving tensors from all ranks, + # all_gather does not support different shape, so we use padding + tensor_list = [] + if local_size != max_size: + padding = paddle.empty(shape=(max_size - local_size, ), dtype='int32') + tensor = paddle.concat((tensor, padding), axis=0) + dist.all_gather(tensor_list, tensor) + + data_list = [] + for size, tensor in zip(size_list, tensor_list): + buffer = tensor.astype('uint8').cpu().numpy().tobytes()[:size] + data_list.append(pickle.loads(buffer)) + + return data_list diff --git a/object_detection/Swin/README.md b/object_detection/Swin/README.md new file mode 100644 index 00000000..e989b00d --- /dev/null +++ b/object_detection/Swin/README.md @@ -0,0 +1,176 @@ +# Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, [arxiv](https://arxiv.org/abs/2103.14030) + +PaddlePaddle training/validation code and pretrained models for **Swin Detection**. + +The official pytorch implementation is [here](https://github.com/SwinTransformer/Swin-Transformer-Object-Detection). + +This implementation is developed by [PaddleViT](https://github.com/BR-IDL/PaddleViT). + + + +drawing +
Swin Model Overview
+ +### Update +Update (2021-09-15): Code is released and Mask R-CNN ported weights are uploaded. + +## Models Zoo +| Model | backbone | box_mAP | Model | +|-------|-----------|---------|-------------------------------------------------------------------------------------------------------------------------------------------------------------| +| Mask R-CNN | Swin-T 1x | 43.7 | [google](https://drive.google.com/file/d/1OpbCH5HuIlxwakNz4PzrAlJF3CxkLSYp/view?usp=sharing)/[baidu](https://pan.baidu.com/s/18HALSo2RHMBsX-Gbsi-YOw)(qev7) | +| Mask R-CNN | Swin-T 3x | 46.0 | [google](https://drive.google.com/file/d/1oREwIk1ORhSsJcs4Y-Cfd0XrSEfPFP3-/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1tw607oogDWQ7Iz91ItfuGQ)(m8fg) | +| Mask R-CNN | Swin-S 3x | 48.4 | [google](https://drive.google.com/file/d/1ZPWkz0zMzHJycHd6_s2hWDHIsW8SdZcK/view?usp=sharing)/[baidu](https://pan.baidu.com/s/1ubC5_CKSq0ExQSINohukVg)(hdw5) | + +> - The results are evaluated on COCO validation set. +> - 1x/3x is the 'Lr Schd' in the official repo. + +* Backbone model weights can be found in Swin Transformer Classification [here](../../image_classification/SwinTransformer) + +## Notebooks +We provide a few notebooks in aistudio to help you get started: + +**\*(coming soon)\*** + + +## Requirements +- Python>=3.6 +- yaml>=0.2.5 +- [PaddlePaddle](https://www.paddlepaddle.org.cn/documentation/docs/en/install/index_en.html)>=2.1.0 +- [yacs](https://github.com/rbgirshick/yacs)>=0.1.8 + +## Data +COCO2017 dataset is used in the following folder structure: +``` +COCO dataset folder +├── annotations +│ ├── captions_train2017.json +│ ├── captions_val2017.json +│ ├── instances_train2017.json +│ ├── instances_val2017.json +│ ├── person_keypoints_train2017.json +│ └── person_keypoints_val2017.json +├── train2017 +│ ├── 000000000009.jpg +│ ├── 000000000025.jpg +│ ├── 000000000030.jpg +│ ├── 000000000034.jpg +| ... +└── val2017 + ├── 000000000139.jpg + ├── 000000000285.jpg + ├── 000000000632.jpg + ├── 000000000724.jpg + ... +``` + +More details about the COCO dataset can be found [here](../../docs/paddlevit-coco.md) and COCO [official dataset](https://cocodataset.org/#download). + +## Usage +To use the model with pretrained weights, download the `.pdparam` weight file and change related file paths in the following python scripts. The model config files are located in `./configs/`. + +For example, assume the downloaded weight file is stored in `./mask_rcnn_swin_tiny_patch4_window7.pdparams`, to use the `swin_t_maskrcnn` model in python: +```python +from config import get_config +from swin_det import build_swin_det +# config files in ./configs/ +config = get_config('./configs/swin_t_maskrcnn.yaml') +# build model +model = build_swin_det(config) +# load pretrained weights +model_state_dict = paddle.load('./mask_rcnn_swin_tiny_patch4_window7.pdparams') +model.set_dict(model_state_dict) +``` + +## Evaluation +To evaluate Swin detection model performance on COCO2017 with a single GPU, run the following script using command line: +```shell +sh run_eval.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ + -cfg=./configs/swin_t_maskrcnn.yaml \ + -dataset=coco \ + -batch_size=4 \ + -data_path=/path/to/dataset/coco/val \ + -eval \ + -pretrained=/path/to/pretrained/model/mask_rcnn_swin_tiny_patch4_window7 # .pdparams is NOT needed +``` + +
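+
+The `# .pdparams is NOT needed` note means the extension is appended when the weights are
+loaded. A minimal sketch of what the script presumably does internally (variable names are
+illustrative, not the exact ones used in `main_single_gpu.py`):
+
+```python
+import paddle
+
+# value passed via -pretrained, without the extension
+pretrained_path = '/path/to/pretrained/model/mask_rcnn_swin_tiny_patch4_window7'
+
+# the eval/train scripts are assumed to append '.pdparams' before loading
+state_dict = paddle.load(pretrained_path + '.pdparams')
+model.set_dict(state_dict)  # model built as in the Usage section above
+```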
+ + +Run evaluation using multi-GPUs: + + + +```shell +sh run_eval_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg=./configs/swin_t_maskrcnn.yaml \ + -dataset=coco \ + -batch_size=4 \ + -data_path=/path/to/dataset/coco/val \ + -eval \ + -pretrained=/path/to/pretrained/model/mask_rcnn_swin_tiny_patch4_window7 # .pdparams is NOT needed +``` + +
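+
+The `box_mAP` values in the Models Zoo table are standard COCO `AP@[IoU=0.50:0.95]` on the
+validation set. If you dump detections to a json file, they can be recomputed directly with
+`pycocotools`; a small sketch (file paths and the predictions file name are placeholders):
+
+```python
+from pycocotools.coco import COCO
+from pycocotools.cocoeval import COCOeval
+
+coco_gt = COCO('/path/to/dataset/coco/annotations/instances_val2017.json')
+# predictions.json: list of {"image_id", "category_id", "bbox", "score"} dicts
+coco_dt = coco_gt.loadRes('predictions.json')
+
+coco_eval = COCOeval(coco_gt, coco_dt, iouType='bbox')
+coco_eval.evaluate()
+coco_eval.accumulate()
+coco_eval.summarize()  # the first printed AP corresponds to box_mAP above
+```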
+ + +## Training +To train the Swin detection model on COCO2017 with single GPU, run the following script using command line: +```shell +sh run_train.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=1 \ +python main_single_gpu.py \ + -cfg=./configs/swin_t_maskrcnn.yaml \ + -dataset=coco \ + -batch_size=2 \ + -data_path=/path/to/dataset/coco/train \ + -pretrained=/path/to/pretrained/model/swin_tiny_patch4_window7_224.pdparams # .pdparams is NOT needed +``` +The `pretrained` arguments sets the pretrained backbone weights, which can be found in Swin classification [here](../../image_classification/SwinTransformer). +
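+
+For training, `-pretrained` points to the *classification backbone* weights rather than full
+detector weights; the neck and heads start from their own initialization. Conceptually this
+amounts to the following sketch (the `backbone` attribute name is an assumption based on the
+detection models in this repo):
+
+```python
+import paddle
+from config import get_config
+from swin_det import build_swin_det
+
+config = get_config('./configs/swin_t_maskrcnn.yaml')
+model = build_swin_det(config)
+
+# load ImageNet-pretrained Swin-T classification weights into the backbone only
+backbone_state = paddle.load('./swin_tiny_patch4_window7_224.pdparams')
+model.backbone.set_dict(backbone_state)
+```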
+ + +Run training using multi-GPUs: + + + +```shell +sh run_train_multi.sh +``` +or +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ + -cfg=./configs/swin_t_maskrcnn.yaml \ + -dataset=coco \ + -batch_size=2 \ + -data_path=/path/to/dataset/coco/train \ + -pretrained=/path/to/pretrained/model/swin_tiny_patch4_window7_224.pdparams # .pdparams is NOT needed +``` +The `pretrained` arguments sets the pretrained backbone weights, which can be found in Swin classification [here](../../image_classification/SwinTransformer). +
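+
+`main_multi_gpu.py` launches one process per GPU listed in `CUDA_VISIBLE_DEVICES`. A rough
+sketch of the launch pattern it is expected to follow (names are illustrative, not the
+script's exact API):
+
+```python
+import paddle.distributed as dist
+
+def main_worker(*args):
+    # one process per visible GPU; initializes the parallel (NCCL) environment
+    dist.init_parallel_env()
+    # ... build the model, wrap it with paddle.DataParallel, run the training loop ...
+
+if __name__ == '__main__':
+    # nprocs should match the number of GPUs listed in CUDA_VISIBLE_DEVICES
+    dist.spawn(main_worker, nprocs=4)
+```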
+ +## Visualization +coming soon + +## Reference +``` +@article{liu2021swin, + title={Swin transformer: Hierarchical vision transformer using shifted windows}, + author={Liu, Ze and Lin, Yutong and Cao, Yue and Hu, Han and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Guo, Baining}, + journal={arXiv preprint arXiv:2103.14030}, + year={2021} +} +``` diff --git a/object_detection/Swin/box_ops.py b/object_detection/Swin/box_ops.py new file mode 100644 index 00000000..66260b98 --- /dev/null +++ b/object_detection/Swin/box_ops.py @@ -0,0 +1,180 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np +import paddle + + +def box_xyxy_to_cxcywh_numpy(box): + """convert box from top-left/bottom-right format: + [x0, y0, x1, y1] + to center-size format: + [center_x, center_y, width, height] + + Args: + box: numpy array, last_dim=4, stop-left/bottom-right format boxes + Return: + numpy array, last_dim=4, center-size format boxes + """ + + #x0, y0, x1, y1 = box.unbind(-1) + x0 = box[:, 0] + y0 = box[:, 1] + x1 = box[:, 2] + y1 = box[:, 3] + xc = x0 + (x1-x0)/2 + yc = y0 + (y1-y0)/2 + w = x1 - x0 + h = y1 - y0 + return np.stack([xc, yc, w, h], axis=-1) + + + +def box_cxcywh_to_xyxy(box): + """convert box from center-size format: + [center_x, center_y, width, height] + to top-left/bottom-right format: + [x0, y0, x1, y1] + + Args: + box: paddle.Tensor, last_dim=4, stores center-size format boxes + Return: + paddle.Tensor, last_dim=4, top-left/bottom-right format boxes + """ + + x_c, y_c, w, h = box.unbind(-1) + x0 = x_c - 0.5 * w + y0 = y_c - 0.5 * h + x1 = x_c + 0.5 * w + y1 = y_c + 0.5 * h + return paddle.stack([x0, y0, x1, y1], axis=-1) + + +def box_xyxy_to_cxcywh(box): + """convert box from top-left/bottom-right format: + [x0, y0, x1, y1] + to center-size format: + [center_x, center_y, width, height] + + Args: + box: paddle.Tensor, last_dim=4, stop-left/bottom-right format boxes + Return: + paddle.Tensor, last_dim=4, center-size format boxes + """ + + x0, y0, x1, y1 = box.unbind(-1) + xc = x0 + (x1-x0)/2 + yc = y0 + (y1-y0)/2 + w = x1 - x0 + h = y1 - y0 + return paddle.stack([xc, yc, w, h], axis=-1) + + +def box_area(boxes): + """ compute area of a set of boxes in (x1, y1, x2, y2) format + Args: + boxes: paddle.Tensor, shape = Nx4, must in (x1, y1, x2, y2) format + Return: + areas: paddle.Tensor, N, areas of each box + """ + + return (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]) + + +def box_iou(boxes1, boxes2): + """compute iou of 2 sets of boxes in (x1, y1, x2, y2) format + + This method returns the iou between every pair of boxes + in two sets of boxes. 
+ + Args: + boxes1: paddle.Tensor, shape=N x 4, boxes are stored in (x1, y1, x2, y2) format + boxes2: paddle.Tensor, shape=N x 4, boxes are stored in (x1, y1, x2, y2) format + Return: + iou: iou ratios between each pair of boxes in boxes1 and boxes2 + union: union areas between each pair of boxes in boxes1 and boxes2 + """ + + area1 = box_area(boxes1) + area2 = box_area(boxes2) + + boxes1 = boxes1.unsqueeze(1) # N x 1 x 4 + lt = paddle.maximum(boxes1[:, :, :2], boxes2[:, :2]) + rb = paddle.minimum(boxes1[:, :, 2:], boxes2[:, 2:]) + + wh = (rb - lt).clip(min=0) + inter = wh[:, :, 0] * wh[:, :, 1] + + union = area1.unsqueeze(1) + area2 - inter # broadcast + + iou = inter / union + return iou, union + + +def generalized_box_iou(boxes1, boxes2): + """Compute GIoU of each pais in boxes1 and boxes2 + + GIoU = IoU - |A_c - U| / |A_c| + where A_c is the smallest convex hull that encloses both boxes, U is the union of boxes + Details illustrations can be found in https://giou.stanford.edu/ + + Args: + boxes1: paddle.Tensor, shape=N x 4, boxes are stored in (x1, y1, x2, y2) format + boxes2: paddle.Tensor, shape=N x 4, boxes are stored in (x1, y1, x2, y2) format + Return: + giou: giou ratios between each pair of boxes in boxes1 and boxes2 + """ + + iou, union = box_iou(boxes1, boxes2) + + boxes1 = boxes1.unsqueeze(1) # N x 1 x 4 + lt = paddle.minimum(boxes1[:, :, :2], boxes2[:, :2]) + rb = paddle.maximum(boxes1[:, :, 2:], boxes2[:, 2:]) + + wh = (rb - lt).clip(min=0) + area = wh[:, :, 0] * wh[:, :, 1] + + return iou - (area-union) / area + + +def masks_to_boxes(masks): + """convert masks to bboxes + + Args: + masks: paddle.Tensor, NxHxW + Return: + boxes: paddle.Tensor, Nx4 + """ + + if masks.numel() == 0: + return paddle.zeros((0, 4)) + h, w = masks.shape[-2:] + y = paddle.arange(0, h, dtype='float32') + x = paddle.arange(0, w, dtype='float32') + y, x = paddle.meshgrid(y, x) + + x_mask = (masks * x.unsqueeze(0)) + x_max = x_mask.flatten(1).max(-1)[0] + + #x_min = x_mask.masked_fill(~(masks.bool()), 1e8).flatten(1).min(-1) + x_min = paddle.where(masks == 0, paddle.ones_like(x_mask)*float(1e8), x_mask) + x_min = x_min.flatten(1).min(-1)[0] + + y_mask = (masks * y.unsqueeze(0)) + y_max = y_mask.flatten(1).max(-1)[0] + #y_min = y_mask.masked_fill(~(masks.bool()), 1e8).flatten(1).min(-1)[0] + y_min = paddle.where(masks == 0, paddle.ones_like(y_mask) * float(1e8), y_mask) + y_min = y_min.flatten(1).min(-1)[0] + + return paddle.stack([x_min, y_min, x_max, y_max], 1) diff --git a/object_detection/Swin/coco.py b/object_detection/Swin/coco.py new file mode 100644 index 00000000..cf8ebda0 --- /dev/null +++ b/object_detection/Swin/coco.py @@ -0,0 +1,320 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +""" +Dataset(COCO2017) related classes and methods for DETR training and validation +""" + +import os +import copy +import numpy as np +from PIL import Image +import paddle +from pycocotools.coco import COCO +from pycocotools import mask as coco_mask +import transforms as T +from utils import nested_tensor_from_tensor_list + + +class CocoDetection(paddle.io.Dataset): + """ COCO Detection dataset + + This class gets images and annotations for paddle training and validation. + Transform(preprocessing) can be applied in __getitem__ method. + + Attributes: + img_folder: path where coco images is stored, e.g.{COCO_PATH}/train2017 + anno_file: path where annotation json file is stored + transforms: transforms applied on data, see make_coco_transform for details + return_masks: if true, return coco masks, default: False (now only support False) + """ + + def __init__(self, img_folder, anno_file, transforms, return_masks): + super(CocoDetection, self).__init__() + self.coco = COCO(anno_file) + # coco all image ids + ids = list(sorted(self.coco.imgs.keys())) + # remove ids where anno has no bboxes + self.ids = self._remove_images_without_annotations(ids) + self._transforms = transforms + # prepare filters labels and put image and label to paddle tensors + self.prepare = ConvertCocoPolysToMasks(return_masks) + self.root = img_folder + self.ids2cats = {id:cat for id, cat in enumerate(self.coco.getCatIds())} + self.cats2ids = {cat:id for id, cat in enumerate(self.coco.getCatIds())} + + def _remove_images_without_annotations(self, ids): + new_ids = [] + rm_cnt = 0 + for idx in ids: + annos = self._load_target(idx) + boxes = [] + for anno in annos: + if 'bbox' in anno: + boxes.append(anno['bbox']) + if len(boxes) == 0: + rm_cnt += 1 + continue + new_ids.append(idx) + print(f'loading coco data, {rm_cnt} imgs without annos are removed') + return new_ids + + def _load_image(self, idx): + """ Return PIL Image (RGB) according to COCO image id""" + path = self.coco.loadImgs(idx)[0]['file_name'] + return Image.open(os.path.join(self.root, path)).convert('RGB') + + def _load_target(self, idx): + """ Return image annos according to COCO image id""" + return self.coco.loadAnns(self.coco.getAnnIds(idx)) + + def _tgt2rcnn(self, target): + target['gt_boxes'] = target['boxes'] + # target['gt_classes'] = target['labels'] + gt_cats = target['labels'] + target['gt_classes'] = np.array( + [self.cats2ids[int(gt_cats[i])] for i in range(len(gt_cats))], dtype='float32') + + target['imgs_shape'] = target['size'].astype("float32") + target['scale_factor_wh'] = np.array( + [float(target['size'][1]) / float(target['orig_size'][1]), + float(target['size'][0]) / float(target['orig_size'][0])], dtype='float32') + + target.pop("boxes") + target.pop("labels") + target.pop("size") + + return target + + def __len__(self): + return len(self.ids) + + def __getitem__(self, idx): + """idx is for training image id, not COCO image id""" + image_id = self.ids[idx] + image = self._load_image(image_id) + target = self._load_target(image_id) + target = {'image_id': image_id, 'annotations': target} + + image, target = self.prepare(image, target) + if self._transforms is not None: + image, target = self._transforms(image, target) + + target = self._tgt2rcnn(target) + + return image, target + + +def convert_coco_poly_to_mask(segmentations, height, width): + """ Convert coco anno from polygons to image masks""" + masks = [] + for polygons in segmentations: + rles = coco_mask.frPyObjects(polygons, height, width) + mask = coco_mask.decode(rles) + 
if len(mask.shape) < 3: + mask = mask[..., None] + mask = mask.any(axis=2).squeeze(-1) # w x h + masks.append(mask) + if masks: + masks = np.stack(masks, axis=0) + else: + mask = np.zeros((0, height, width), dtype='int32') + return masks + + +class ConvertCocoPolysToMasks(): + """ Prepare coco annotations to paddle tensors""" + def __init__(self, return_masks=False): + self.return_masks = return_masks + + def __call__(self, image, target): + w, h = image.size + image_id = target['image_id'] + + anno = target['annotations'] + anno = [obj for obj in anno if 'iscrowd' not in obj or obj['iscrowd'] == 0] + + boxes = [obj['bbox'] for obj in anno] + boxes = np.array(boxes, dtype='float32') + boxes = boxes.reshape([-1, 4]) + boxes[:, 2:] += boxes[:, :2] + boxes[:, 0::2].clip(0, w) + boxes[:, 1::2].clip(0, h) + + classes = [obj['category_id'] for obj in anno] + classes = np.array(classes, dtype='float32') #TODO: check dtype + + if self.return_masks: + segmentations = [obj['segmentation'] for obj in anno] + masks = convert_coco_poly_to_mask(segmentations, h, w) # [N, H, W] int32 array + + keypoints = None + if anno and 'keypoints' in anno[0]: + keypoints = [obj['keypoints'] for obj in anno] + keypoints = np.array(keypoints, dtype='float32') + num_keypoints = keypoints.shape[0] + if num_keypoints: + keypoints = keypoints.reshape((num_keypoints, -1, 3)) + + boxes_tmp = boxes + keep = (boxes_tmp[:, 3] > boxes_tmp[:, 1]) & (boxes_tmp[:, 2] > boxes_tmp[:, 0]) + #keep_idx = np.where(keep)[0].astype('int32') + + boxes = boxes[keep] + classes = classes[keep] + + if self.return_masks: + masks = masks[keep] + if keypoints is not None: + keypoints = keypoints[keep] + + target = {} + target['boxes'] = boxes + target['labels'] = classes + if self.return_masks: + target['masks'] = masks + if keypoints is not None: + target['keypoints'] = keypoints + target['image_id'] = image_id + + area = np.array([obj['area'] for obj in anno]) + iscrowd = np.array([obj['iscrowd'] if 'iscrowd' in obj else 0 for obj in anno]) + target['area'] = area + target['iscrowd'] = iscrowd[keep] + + target['orig_size'] = np.array([int(h), int(w)], dtype='float32') + target['size'] = np.array([int(h), int(w)], dtype='float32') + + return image, target + + +def make_coco_transforms(image_set): + """ return transforms(class defined in ./transforms.py) for coco train and val""" + normalize = T.Compose([ + T.ToTensor(), + T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]) + ]) + + scales = [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800] + + if image_set == 'train': + return T.Compose([ + T.RandomHorizontalFlip(), + T.RandomResize(scales, max_size=1333), + normalize, + ]) + + if image_set == 'val': + return T.Compose([ + T.RandomResize([800], max_size=1333), + normalize, + ]) + + raise ValueError(f'Unknown {image_set}') + + +def build_coco(image_set, coco_path, masks=False): + """Return CocoDetection dataset according to image_set: ['train', 'val']""" + assert image_set in ['train', 'val'], f'image_set {image_set} not supported' + assert os.path.exists(coco_path), f'provided COCO path {coco_path} does not exist' + mode = 'instances' + paths = { + 'train': (os.path.join(coco_path, 'train2017'), + os.path.join(coco_path, 'annotations', f'{mode}_train2017.json')), + 'val': (os.path.join(coco_path, 'val2017'), + os.path.join(coco_path, 'annotations', f'{mode}_val2017.json')), + } + img_folder, anno_file = paths[image_set] + dataset = CocoDetection(img_folder, + anno_file, + transforms=make_coco_transforms(image_set), + 
return_masks=masks) + return dataset + + +def get_dataloader(dataset, batch_size, mode='train', multi_gpu=False): + """ return dataloader on train/val set for single/multi gpu + Arguments: + dataset: paddle.io.Dataset, coco dataset + batch_size: int, num of samples in one batch + mode: str, ['train', 'val'], dataset to use + multi_gpu: bool, if True, DistributedBatchSampler is used for DDP + """ + if multi_gpu: + sampler = paddle.io.DistributedBatchSampler( + dataset, + batch_size=batch_size, + shuffle=(mode == 'train'), + drop_last=True) + #TODO: may need to fix this drop_last of multi-gpu dataloading error + # currently, val may drop several samples, which will lower the performance + # an idea is to pad the last batch in collate_fn + dataloader = paddle.io.DataLoader(dataset, + batch_sampler=sampler, + collate_fn=collate_fn) + else: + dataloader = paddle.io.DataLoader(dataset, + batch_size=batch_size, + shuffle=(mode == 'train'), + collate_fn=collate_fn) + return dataloader + + +def collate_fn(batch): + """Collate function for batching samples + Samples varies in sizes, here convert samples to NestedTensor which pads the tensor, + and generate the corresponding mask, so that the whole batch is of the same size. + """ + # eliminate invalid data (where boxes is [] tensor) + old_batch_len = len(batch) + batch = [x for x in batch if x[1]['gt_boxes'].shape[0] != 0] + # try refill empty sample by other sample in current batch + + new_batch_len = len(batch) + for i in range(new_batch_len, old_batch_len): + batch.append(copy.deepcopy(batch[i%new_batch_len])) + + batch = list(zip(*batch)) # batch[0]: data tensor, batch[1]: targets dict + + batch[0] = nested_tensor_from_tensor_list(batch[0], 32) + + val_batch = [list(x.values()) for x in batch[1]] + key_batch = list(batch[1][0].keys()) + tgt_batch = {} + + for k, data in zip(key_batch, zip(*val_batch)): + if isinstance(data, (list, tuple)): + res = [] + for item in data: + res.append(paddle.to_tensor(item)) + tgt_batch[k] = res + else: + tgt_batch[k] = paddle.to_tensor(data) + + #batch_target = [] + #for single_target in batch[1]: + # target_tensor_dict = {} + # for key, val in single_target.items(): + # if isinstance(val, (list, tuple)): + # res = [] + # for item in val: + # res.append(paddle.to_tensor(item)) + # target_tensor_dict[key] = res + # else: + # target_tensor_dict[key] = paddle.to_tensor(val) + # batch_target.append(target_tensor_dict) + + + batch[1] = tgt_batch + return tuple(batch) diff --git a/object_detection/Swin/coco_eval.py b/object_detection/Swin/coco_eval.py new file mode 100644 index 00000000..df28f1fd --- /dev/null +++ b/object_detection/Swin/coco_eval.py @@ -0,0 +1,252 @@ +import os +import contextlib +import copy +import numpy as np + +from pycocotools.cocoeval import COCOeval +from pycocotools.coco import COCO +import pycocotools.mask as mask_util + +from utils import all_gather + +class CocoEvaluator(): + def __init__(self, coco_gt, iou_types): + assert isinstance(iou_types, (list, tuple)) + coco_gt = copy.deepcopy(coco_gt) + self.coco_gt = coco_gt + self.iou_types = iou_types + self.coco_eval = {} + for iou_type in iou_types: + self.coco_eval[iou_type] = COCOeval(coco_gt, iouType=iou_type) + self.img_ids = [] + self.eval_imgs = {k: [] for k in iou_types} + + self.ids2cats = {id:cat for id, cat in enumerate(self.coco_gt.getCatIds())} + self.cats2ids = {cat:id for id, cat in enumerate(self.coco_gt.getCatIds())} + + def update(self, predictions): + img_ids = list(np.unique(list(predictions.keys()))) + 
self.img_ids.extend(img_ids) + + for iou_type in self.iou_types: + results = self.prepare(predictions, iou_type) + + with open(os.devnull, 'w') as devnull: + with contextlib.redirect_stdout(devnull): + coco_dt = COCO.loadRes(self.coco_gt, results) if results else COCO() + coco_eval = self.coco_eval[iou_type] + + coco_eval.cocoDt = coco_dt + coco_eval.params.imgIds = list(img_ids) + img_ids, eval_imgs = evaluate(coco_eval) + #print('eval_imgs shape: ', eval_imgs.shape) + + self.eval_imgs[iou_type].append(eval_imgs) + + def synchronize_between_processes(self): + for iou_type in self.iou_types: + self.eval_imgs[iou_type] = np.concatenate(self.eval_imgs[iou_type], 2) + create_common_coco_eval(self.coco_eval[iou_type], + self.img_ids, + self.eval_imgs[iou_type]) + + def accumulate(self): + for coco_eval in self.coco_eval.values(): + coco_eval.accumulate() + + def summarize(self): + stats_dict = {} + for iou_type, coco_eval in self.coco_eval.items(): + print(f'IoU metric: {iou_type}') + coco_eval.summarize() + stats_dict[iou_type] = coco_eval.stats + return stats_dict + + def prepare(self, predictions, iou_type): + if iou_type == 'bbox': + return self.prepare_for_coco_detection(predictions) + elif iou_type == 'segm': + return self.prepare_for_coco_segmentation(predictions) + elif iou_type == 'keypoints': + return self.prepare_for_coco_keypoint(predictions) + else: + raise ValueError(f'Unknown iou type {iou_type}') + + def prepare_for_coco_detection(self, predictions): + coco_results = [] + for original_id, prediction in predictions.items(): + if len(prediction) == 0: + continue + boxes = prediction['boxes'] + boxes = convert_to_xywh(boxes).tolist() + scores = prediction['scores'].tolist() + labels = prediction['labels'].tolist() + labels = [self.ids2cats[i] for i in labels] + + coco_results.extend( + [ + { + 'image_id': original_id, + 'category_id': labels[k], + 'bbox': box, + 'score': scores[k], + } + for k, box in enumerate(boxes) + ] + ) + return coco_results + + def prepare_for_coco_segmentation(self, predictions): + coco_results = [] + for original_id, prediction in predictions.items(): + if len(prediction) == 0: + continue + scores = prediction['scores'].tolist() + labels = prediction['labels'].tolist() + masks = prediction['masks'] + masks = masks > 0.5 + + rles = [ + mask_util.encode(np.array(mask[0, :, :, np.newaxis], dtype=np.uint8, order='F'))[0] + for mask in masks + ] + for rle in rles: + rle['counts'] = rle['counts'].decode('utf-8') + + coco_results.extend( + [ + { + 'image_id': original_id, + 'category_id': labels[k], + 'segmentation': rle, + 'score': scores[k], + } + for k, rle in enumerate(rles) + ] + ) + return coco_results + + + def prepare_for_coco_keypoint(self, predictions): + coco_results = [] + for original_id, prediction in predictions.items(): + if len(prediction) == 0: + continue + boxes = prediction['boxes'] + boxes = convert_to_xywh(boxes).tolist() + scores = prediction['scores'].tolist() + labels = prediction['labels'].tolist() + keypoints = prediction['keypoints'] + keypoints = keypoints.flatten(start_dim=1).tolist() + + coco_results.extend( + [ + { + 'image_id': original_id, + 'category_id': labels[k], + 'keypoints': keypoint, + 'score': scores[k], + } + for k, keypoint in enumerate(keypoints) + ] + ) + return coco_results + + +def convert_to_xywh(boxes): + #xmin, ymin, xmax, ymax = boxes.unbind(1) + #return paddle.stack((xmin, ymin, xmax - xmin, ymax - ymin), axis=1) + xmin, ymin, xmax, ymax = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3] + return 
np.stack((xmin, ymin, xmax-xmin, ymax-ymin), axis=1) + + +def merge(img_ids, eval_imgs): + #all_img_ids = [img_ids] + #all_eval_imgs = [eval_imgs] + all_img_ids = all_gather(img_ids) + all_eval_imgs = all_gather(eval_imgs) + + merged_img_ids = [] + for p in all_img_ids: + merged_img_ids.extend(p) + + merged_eval_imgs = [] + for p in all_eval_imgs: + merged_eval_imgs.append(p) + + merged_img_ids = np.array(merged_img_ids) + merged_eval_imgs = np.concatenate(merged_eval_imgs, 2) + + merged_img_ids, idx = np.unique(merged_img_ids, return_index=True) + merged_eval_imgs = merged_eval_imgs[..., idx] + + return merged_img_ids, merged_eval_imgs + + +def create_common_coco_eval(coco_eval, img_ids, eval_imgs): + img_ids, eval_imgs = merge(img_ids, eval_imgs) + img_ids = list(img_ids) + eval_imgs = list(eval_imgs.flatten()) + + coco_eval.evalImgs = eval_imgs + coco_eval.params.imgIds = img_ids + coco_eval._paramsEval = copy.deepcopy(coco_eval.params) + + +################################################################# +# From pycocotools, just removed the prints and fixed +# a Python3 bug about unicode not defined +################################################################# + + +def evaluate(self): + ''' + Run per image evaluation on given images and store results (a list of dict) in self.evalImgs + :return: None + ''' + # tic = time.time() + # print('Running per image evaluation...') + p = self.params + # add backward compatibility if useSegm is specified in params + if p.useSegm is not None: + p.iouType = 'segm' if p.useSegm == 1 else 'bbox' + print('useSegm (deprecated) is not None. Running {} evaluation'.format(p.iouType)) + # print('Evaluate annotation type *{}*'.format(p.iouType)) + p.imgIds = list(np.unique(p.imgIds)) + if p.useCats: + p.catIds = list(np.unique(p.catIds)) + p.maxDets = sorted(p.maxDets) + self.params = p + + + self._prepare() + # loop through images, area range, max detection number + catIds = p.catIds if p.useCats else [-1] + + if p.iouType == 'segm' or p.iouType == 'bbox': + computeIoU = self.computeIoU + elif p.iouType == 'keypoints': + computeIoU = self.computeOks + self.ious = { + (imgId, catId): computeIoU(imgId, catId) + for imgId in p.imgIds + for catId in catIds} + + evaluateImg = self.evaluateImg + maxDet = p.maxDets[-1] + evalImgs = [ + evaluateImg(imgId, catId, areaRng, maxDet) + for catId in catIds + for areaRng in p.areaRng + for imgId in p.imgIds + ] + # this is NOT in the pycocotools code, but could be done outside + evalImgs = np.asarray(evalImgs).reshape(len(catIds), len(p.areaRng), len(p.imgIds)) + self._paramsEval = copy.deepcopy(self.params) + # toc = time.time() + # print('DONE (t={:0.2f}s).'.format(toc-tic)) + return p.imgIds, evalImgs + +################################################################# +# end of straight copy from pycocotools, just removing the prints +################################################################# diff --git a/object_detection/Swin/config.py b/object_detection/Swin/config.py new file mode 100644 index 00000000..242634c3 --- /dev/null +++ b/object_detection/Swin/config.py @@ -0,0 +1,222 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Configuration + +Configuration for data, model archtecture, and training, etc. +Config can be set by .yaml file or by argparser(limited usage) + + +""" + +import os +from yacs.config import CfgNode as CN +import yaml + +_C = CN() +_C.BASE = [''] + +# data settings +_C.DATA = CN() +_C.DATA.BATCH_SIZE = 8 #1024 batch_size for single GPU +_C.DATA.BATCH_SIZE_EVAL = 1 #1024 batch_size for single GPU +_C.DATA.WEIGHT_PATH = "./weights/mask_rcnn_swin_small_patch4_window7.pdparams" +_C.DATA.VAL_DATA_PATH = "/dataset/coco/" # path to dataset +_C.DATA.DATASET = 'coco' # dataset name +_C.DATA.IMAGE_SIZE = 640 # input image size +_C.DATA.CROP_PCT = 0.9 # input image scale ratio, scale is applied before centercrop in eval mode +_C.DATA.NUM_WORKERS = 1 # number of data loading threads + +# model settings +_C.MODEL = CN() +_C.MODEL.TYPE = 'Swin' +_C.MODEL.NAME = 'Swin' +_C.MODEL.RESUME = None +_C.MODEL.PRETRAINED = None +_C.MODEL.NUM_CLASSES = 1000 +_C.MODEL.DROPOUT = 0.0 +_C.MODEL.ATTENTION_DROPOUT = 0.0 +_C.MODEL.DROP_PATH = 0.0 # TODO: droppath may raise cuda error on paddle.rand method + +# transformer settings +_C.MODEL.TRANS = CN() +_C.MODEL.TRANS.PRETRAIN_IMAGE_SIZE = 224 +_C.MODEL.TRANS.PATCH_SIZE = 4 # image_size = patch_size x window_size x num_windows +_C.MODEL.TRANS.WINDOW_SIZE = 7 +_C.MODEL.TRANS.IN_CHANNELS = 3 +_C.MODEL.TRANS.EMBED_DIM = 96 # same as HIDDEN_SIZE in ViT +_C.MODEL.TRANS.STAGE_DEPTHS = [2, 2, 18, 2] # tiny [2, 2, 6, 2] small [2, 2, 18, 2] +_C.MODEL.TRANS.NUM_HEADS = [3, 6, 12, 24] +_C.MODEL.TRANS.MLP_RATIO = 4. +_C.MODEL.TRANS.QKV_BIAS = True +_C.MODEL.TRANS.QK_SCALE = None +_C.MODEL.TRANS.APE = False # absolute positional embeddings +_C.MODEL.TRANS.PATCH_NORM = True +_C.MODEL.TRANS.OUT_INDICES = (0, 1, 2, 3) +_C.MODEL.TRANS.FROZEN_STAGES = -1 + + +# fpn settings +_C.FPN = CN() +_C.FPN.OUT_CHANNELS = 256 +_C.FPN.IN_CHANNELS = [96, 192, 384, 768] # [256, 512, 1024, 2048] +_C.FPN.USE_C5 = False +_C.FPN.STRIDES = [4, 8, 16, 32] + +# maskrcnn_head settings +_C.RPN = CN() +_C.ROI = CN() +_C.ROI.BOX_HEAD = CN() + +_C.RPN.ANCHOR_SIZE = [[32], [64], [128], [256], [512]] +_C.RPN.ASPECT_RATIOS = [0.5, 1.0, 2.0] +_C.RPN.STRIDES = [4, 8, 16, 32, 64] +_C.RPN.OFFSET = 0.0 +_C.RPN.PRE_NMS_TOP_N_TRAIN = 2000 +_C.RPN.POST_NMS_TOP_N_TRAIN = 1000 +_C.RPN.PRE_NMS_TOP_N_TEST = 1000 +_C.RPN.POST_NMS_TOP_N_TEST = 1000 +_C.RPN.NMS_THRESH = 0.7 +_C.RPN.MIN_SIZE = 0.0 +_C.RPN.TOPK_AFTER_COLLECT = True +_C.RPN.POSITIVE_THRESH = 0.7 +_C.RPN.NEGATIVE_THRESH = 0.3 +_C.RPN.BATCH_SIZE_PER_IMG = 256 +_C.RPN.POSITIVE_FRACTION = 0.5 +_C.RPN.LOW_QUALITY_MATCHES = True + +_C.ROI.SCORE_THRESH_INFER = 0.05 +_C.ROI.NMS_THRESH_INFER = 0.5 +_C.ROI.NMS_KEEP_TOPK_INFER = 100 +_C.ROI.NUM_ClASSES = 80 +_C.ROI.POSITIVE_THRESH = 0.5 +_C.ROI.NEGATIVE_THRESH = 0.5 +_C.ROI.BATCH_SIZE_PER_IMG = 512 +_C.ROI.POSITIVE_FRACTION = 0.25 +_C.ROI.LOW_QUALITY_MATCHES = False +_C.ROI.BOX_HEAD.REG_WEIGHTS = [10.0, 10.0, 5.0, 5.0] +_C.ROI.BOX_HEAD.NUM_CONV = 0 +_C.ROI.BOX_HEAD.CONV_DIM = 256 +_C.ROI.BOX_HEAD.NUM_FC = 2 +_C.ROI.BOX_HEAD.FC_DIM = 1024 +_C.ROI.SCALES = [1./4., 1./8., 1./16., 1./32, 1./64.] 
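+# RoIAlign settings: pooled output size, sampling ratio, and the canonical box size/level used to map RoIs to FPN levels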
+_C.ROI.ALIGN_OUTPUT_SIZE = 7 +_C.ROI.SAMPLING_RATIO = 0 +_C.ROI.CANONICAL_BOX_SIZE = 224 +_C.ROI.CANONICAL_LEVEL = 4 +_C.ROI.MIN_LEVEL = 0 +_C.ROI.MAX_LEVEL = 3 +_C.ROI.ALIGNED = True +_C.ROI.PAT_GT_AS_PRO = True # when training set True, val set to False + +# training settings +_C.TRAIN = CN() +_C.TRAIN.LAST_EPOCH = 0 +_C.TRAIN.NUM_EPOCHS = 300 +_C.TRAIN.WARMUP_EPOCHS = 20 +_C.TRAIN.WEIGHT_DECAY = 0.05 +_C.TRAIN.BASE_LR = 0.001 +_C.TRAIN.WARMUP_START_LR = 0.0 +_C.TRAIN.END_LR = 0.0 +_C.TRAIN.GRAD_CLIP = 1.0 +_C.TRAIN.ACCUM_ITER = 2 + +_C.TRAIN.LR_SCHEDULER = CN() +_C.TRAIN.LR_SCHEDULER.NAME = 'warmupcosine' +_C.TRAIN.LR_SCHEDULER.MILESTONES = "30, 60, 90" # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_EPOCHS = 30 # only used in StepLRScheduler +_C.TRAIN.LR_SCHEDULER.DECAY_RATE = 0.1 # only used in StepLRScheduler + +_C.TRAIN.OPTIMIZER = CN() +_C.TRAIN.OPTIMIZER.NAME = 'SGD' +_C.TRAIN.OPTIMIZER.EPS = 1e-8 +_C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999) +_C.TRAIN.OPTIMIZER.MOMENTUM = 0.9 + +# augmentation +_C.AUG = CN() +_C.AUG.COLOR_JITTER = 0.4 # color jitter factor +_C.AUG.AUTO_AUGMENT = 'rand-m9-mstd0.5-inc1' +_C.AUG.RE_PROB = 0.25 # random earse prob +_C.AUG.RE_MODE = 'pixel' # random earse mode +_C.AUG.RE_COUNT = 1 # random earse count +_C.AUG.MIXUP = 0.8 # mixup alpha, enabled if >0 +_C.AUG.CUTMIX = 1.0 # cutmix alpha, enabled if >0 +_C.AUG.CUTMIX_MINMAX = None # cutmix min/max ratio, overrides alpha +_C.AUG.MIXUP_PROB = 1.0 # prob of mixup or cutmix when either/both is enabled +_C.AUG.MIXUP_SWITCH_PROB = 0.5 # prob of switching cutmix when both mixup and cutmix enabled +_C.AUG.MIXUP_MODE = 'batch' #how to apply mixup/curmix params, per 'batch', 'pair', or 'elem' + +# misc +_C.SAVE = "./output" +_C.TAG = "default" +_C.SAVE_FREQ = 20 # freq to save chpt +_C.REPORT_FREQ = 50 # freq to logging info +_C.VALIDATE_FREQ = 20 # freq to do validation +_C.SEED = 0 +_C.EVAL = False # run evaluation only +_C.LOCAL_RANK = 0 +_C.NGPUS = -1 + + +def _update_config_from_file(config, cfg_file): + config.defrost() + with open(cfg_file, 'r') as infile: + yaml_cfg = yaml.load(infile, Loader=yaml.FullLoader) + for cfg in yaml_cfg.setdefault('BASE', ['']): + if cfg: + _update_config_from_file( + config, os.path.join(os.path.dirname(cfg_file), cfg) + ) + print('merging config from {}'.format(cfg_file)) + config.merge_from_file(cfg_file) + config.freeze() + + +def update_config(config, args): + """Update config by ArgumentParser + Args: + args: ArgumentParser contains options + Return: + config: updated config + """ + if args.cfg: + _update_config_from_file(config, args.cfg) + config.defrost() + if args.dataset: + config.DATA.DATASET = args.dataset + if args.batch_size: + config.DATA.BATCH_SIZE = args.batch_size + if args.data_path: + config.DATA.DATA_PATH = args.data_path + if args.ngpus: + config.NGPUS = args.ngpus + if args.eval: + config.EVAL = True + config.DATA.BATCH_SIZE_EVAL = args.batch_size + if args.pretrained: + config.MODEL.PRETRAINED = args.pretrained + if args.resume: + config.MODEL.RESUME = args.resume + if args.last_epoch: + config.MODEL.LAST_EPOCH = args.last_epoch + + #config.freeze() + return config + + +def get_config(): + """Return a clone config""" + config = _C.clone() + return config diff --git a/object_detection/Swin/configs/swin_s_maskrcnn.yaml b/object_detection/Swin/configs/swin_s_maskrcnn.yaml new file mode 100644 index 00000000..82622464 --- /dev/null +++ b/object_detection/Swin/configs/swin_s_maskrcnn.yaml @@ -0,0 +1,20 @@ +DATA: + BATCH_SIZE: 2 +MODEL: + DROPOUT: 0.1 
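+  # transformer backbone settings (Swin-Small: stage depths [2, 2, 18, 2])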
+ TRANS: + PATCH_SIZE: 4 + WINDOW_SIZE: 7 + EMBED_DIM: 96 + STAGE_DEPTHS: [2, 2, 18, 2] + NUM_HEADS: [3, 6, 12, 24] + MLP_RATIO: 4. + +TRAIN: + BASE_LR: 1e-4 + GRAD_CLIP: 0.1 + WEIGHT_DECAY: 1e-4 + NUM_EPOCHS: 300 + + + diff --git a/object_detection/Swin/configs/swin_t_maskrcnn.yaml b/object_detection/Swin/configs/swin_t_maskrcnn.yaml new file mode 100644 index 00000000..0f31b8dd --- /dev/null +++ b/object_detection/Swin/configs/swin_t_maskrcnn.yaml @@ -0,0 +1,20 @@ +DATA: + BATCH_SIZE: 2 +MODEL: + DROPOUT: 0.1 + TRANS: + PATCH_SIZE: 4 + WINDOW_SIZE: 7 + EMBED_DIM: 96 + STAGE_DEPTHS: [2, 2, 6, 2] + NUM_HEADS: [3, 6, 12, 24] + MLP_RATIO: 4. + +TRAIN: + BASE_LR: 1e-4 + GRAD_CLIP: 0.1 + WEIGHT_DECAY: 1e-4 + NUM_EPOCHS: 300 + + + diff --git a/object_detection/Swin/det_heads/__init__.py b/object_detection/Swin/det_heads/__init__.py new file mode 100644 index 00000000..16a69b52 --- /dev/null +++ b/object_detection/Swin/det_heads/__init__.py @@ -0,0 +1,3 @@ +from . import maskrcnn_head +from . import retinanet_head +from . import det_utils diff --git a/object_detection/Swin/det_heads/det_utils/box_utils.py b/object_detection/Swin/det_heads/det_utils/box_utils.py new file mode 100644 index 00000000..4d97829f --- /dev/null +++ b/object_detection/Swin/det_heads/det_utils/box_utils.py @@ -0,0 +1,325 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import math + +import paddle +from paddle.fluid.framework import in_dygraph_mode +from paddle.fluid import core +from paddle.fluid.layer_helper import LayerHelper + +def bbox2delta(src_boxes, tgt_boxes, weights=[1.0, 1.0, 1.0, 1.0]): + ''' + The function is used to compute two tensor boxes difference among (x, y, w, h). + + Args: + src_boxes (tensor): shape [N, 4]. + tgt_boxes (tensor): shape [N, 4]. + weights (list[float]): balance the dx, dy, dw, dh. + + Returns: + deltas (tensor): shape[N, 4]. + ''' + src_w = src_boxes[:, 2] - src_boxes[:, 0] + src_h = src_boxes[:, 3] - src_boxes[:, 1] + src_ctr_x = src_boxes[:, 0] + 0.5 * src_w + src_ctr_y = src_boxes[:, 1] + 0.5 * src_h + + tgt_w = tgt_boxes[:, 2] - tgt_boxes[:, 0] + tgt_h = tgt_boxes[:, 3] - tgt_boxes[:, 1] + tgt_ctr_x = tgt_boxes[:, 0] + 0.5 * tgt_w + tgt_ctr_y = tgt_boxes[:, 1] + 0.5 * tgt_h + + wx, wy, ww, wh = weights + dx = wx * (tgt_ctr_x - src_ctr_x) / src_w + dy = wy * (tgt_ctr_y - src_ctr_y) / src_h + dw = ww * paddle.log(tgt_w / src_w) + dh = wh * paddle.log(tgt_h / src_h) + + deltas = paddle.stack((dx, dy, dw, dh), axis=1) + return deltas + + +def delta2bbox(deltas, boxes, weights=[1.0, 1.0, 1.0, 1.0]): + ''' + The inverse process of bbox2delta. 
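+
+    Args:
+        deltas (tensor): shape [N, 4*C], C groups of predicted offsets (dx, dy, dw, dh).
+        boxes (tensor): shape [N, 4], reference boxes in (x1, y1, x2, y2) format.
+        weights (list[float]): balance the dx, dy, dw, dh, same meaning as in bbox2delta.
+
+    Returns:
+        pred_boxes (tensor): shape [N, C, 4], decoded boxes in (x1, y1, x2, y2) format.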
+ ''' + clip_scale = math.log(1000.0 / 16) + + widths = boxes[:, 2] - boxes[:, 0] + heights = boxes[:, 3] - boxes[:, 1] + ctr_x = boxes[:, 0] + 0.5 * widths + ctr_y = boxes[:, 1] + 0.5 * heights + + wx, wy, ww, wh = weights + dx = deltas[:, 0::4] / wx + dy = deltas[:, 1::4] / wy + dw = deltas[:, 2::4] / ww + dh = deltas[:, 3::4] / wh + # Prevent sending too large values into paddle.exp() + dw = paddle.clip(dw, max=clip_scale) + dh = paddle.clip(dh, max=clip_scale) + + pred_ctr_x = dx * widths.unsqueeze(1) + ctr_x.unsqueeze(1) + pred_ctr_y = dy * heights.unsqueeze(1) + ctr_y.unsqueeze(1) + pred_w = paddle.exp(dw) * widths.unsqueeze(1) + pred_h = paddle.exp(dh) * heights.unsqueeze(1) + + pred_boxes = [] + pred_boxes.append(pred_ctr_x - 0.5 * pred_w) + pred_boxes.append(pred_ctr_y - 0.5 * pred_h) + pred_boxes.append(pred_ctr_x + 0.5 * pred_w) + pred_boxes.append(pred_ctr_y + 0.5 * pred_h) + pred_boxes = paddle.stack(pred_boxes, axis=-1) + + return pred_boxes + + +def boxes_area(boxes): + ''' + Compute boxes area. + + Args: + boxes (tensor): shape [M, 4] | [N, M, 4]. + + Returns: + areas (tensor): shape [M] | [N, M]. + ''' + assert boxes.shape[-1] == 4 + if boxes.dim() == 2: + boxes_wh = boxes[:, 2:] - boxes[:, :2] + return (boxes_wh[:, 0] * boxes_wh[:, 1]).clip(min=0) + + elif boxes.dim() == 3: + boxes_wh = boxes[:, :, 2:] - boxes[:, :, :2] + return (boxes_wh[:, :, 0] * boxes_wh[:, :, 1]).clip(min=0) + + else: + raise ValueError("The dim of boxes must be 2 or 3!") + + +def boxes_iou(boxes1, boxes2, mode='a'): + ''' + Compute the ious of two boxes tensor and the coordinate format of boxes is xyxy. + + Args: + boxes1 (tensor): when mode == 'a': shape [M, 4]; when mode == 'b': shape [M, 4] + boxes2 (tensor): when mode == 'a': shape [R, 4]; when mode == 'b': shape [M, 4] + mode (string | 'a' or 'b'): when mode == 'a': compute one to many; + when mode == 'b': compute one to one. + + Returns: + ious (tensor): when mode == 'a': shape [M, R]; when mode == 'b': shape [M] + ''' + area1 = boxes_area(boxes1) + area2 = boxes_area(boxes2) + + if mode == 'a': + lt = paddle.maximum(boxes1.unsqueeze(-2)[:, :, :2], boxes2.unsqueeze(0)[:, :, :2]) + rb = paddle.minimum(boxes1.unsqueeze(-2)[:, :, 2:], boxes2.unsqueeze(0)[:, :, 2:]) + + inter_wh = (rb - lt).astype("float32").clip(min=0) + inter_area = inter_wh[:, :, 0] * inter_wh[:, :, 1] + + union_area = area1.unsqueeze(-1) + area2 - inter_area + 1e-6 + + ious = paddle.where(inter_area > 0, + inter_area / union_area, + paddle.zeros_like(inter_area, dtype="float32")) + + return ious, union_area + + elif mode == 'b': + assert boxes1.shape[0] == boxes2.shape[0] + + lt = paddle.maximum(boxes1[:, :2], boxes2[:, :2]) + rb = paddle.minimum(boxes1[:, 2:], boxes2[:, 2:]) + + inter_wh = (rb - lt).astype("float32").clip(min=0) + inter_area = inter_wh[:, 0] * inter_wh[:, 1] + + union_area = area1 + area2 - inter_area + 1e-6 + + ious = paddle.where(inter_area > 0, + inter_area / union_area, + paddle.zeros_like(inter_area, dtype="float32")) + + return ious, union_area + + else: + raise ValueError("Only support mode 'a' or 'b'") + + +def batch_iou(boxes1, boxes2, mode='a'): + ''' + Compute the ious of two boxes tensor and the coordinate format of boxes is xyxy. 
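+    Unlike boxes_iou, both inputs carry a leading batch dimension N.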
+ + Args: + boxes1 (tensor): when mode == 'a': shape [N, M, 4]; when mode == 'b': shape [N, M, 4] + boxes2 (tensor): when mode == 'a': shape [N, R, 4]; when mode == 'b': shape [N, M, 4] + mode (string | 'a' or 'b'): when mode == 'a': compute one to many; + when mode == 'b': compute one to one + + Returns: + ious (tensor): when mode == 'a': shape [N, M, R]; when mode == 'b': shape [N, M] + ''' + area1 = boxes_area(boxes1) + area2 = boxes_area(boxes2) + + if mode == 'a': + lt = paddle.maximum(boxes1.unsqueeze(-2)[:, :, :, :2], boxes2.unsqueeze(1)[:, :, :, :2]) + rb = paddle.minimum(boxes1.unsqueeze(-2)[:, :, :, 2:], boxes2.unsqueeze(1)[:, :, :, 2:]) + + inter_wh = (rb - lt).astype("float32").clip(min=0) + inter_area = inter_wh[:, :, :, 0] * inter_wh[:, :, :, 1] + + union_area = area1.unsqueeze(-1) + area2.unsqueeze(-2) - inter_area + 1e-6 + + ious = paddle.where(inter_area > 0, + inter_area / union_area, + paddle.zeros_like(inter_area, dtype="float32")) + + return ious, union_area + + elif mode == 'b': + assert boxes1.shape[0] == boxes2.shape[0] + + lt = paddle.maximum(boxes1[:, :, :2], boxes2[:, :, :2]) + rb = paddle.minimum(boxes1[:, :, 2:], boxes2[:, :, 2:]) + + inter_wh = (rb - lt).astype("float32").clip(min=0) + inter_area = inter_wh[:, :, 0] * inter_wh[:, :, 1] + + union_area = area1 + area2 - inter_area + 1e-6 + + ious = paddle.where(inter_area > 0, + inter_area / union_area, + paddle.zeros_like(inter_area, dtype="float32")) + + return ious, union_area + else: + raise ValueError("Only support mode 'a' or 'b'") + + +def nonempty_bbox(boxes, min_size=0, return_mask=False): + w = boxes[:, 2] - boxes[:, 0] + h = boxes[:, 3] - boxes[:, 1] + mask = paddle.logical_and(h > min_size, w > min_size) + if return_mask: + return mask + keep = paddle.nonzero(mask).flatten() + return keep + + +def multiclass_nms(bboxes, + scores, + score_threshold, + keep_top_k, + nms_top_k=-1, + nms_threshold=0.3, + normalized=True, + nms_eta=1., + background_label=-1, + return_index=False, + return_rois_num=True, + rois_num=None, + name=None): + """ + This operator is to do multi-class non maximum suppression (NMS) on + boxes and scores. + In the NMS step, this operator greedily selects a subset of detection bounding + boxes that have high scores larger than score_threshold, if providing this + threshold, then selects the largest nms_top_k confidences scores if nms_top_k + is larger than -1. Then this operator pruns away boxes that have high IOU + (intersection over union) overlap with already selected boxes by adaptive + threshold NMS based on parameters of nms_threshold and nms_eta. + Aftern NMS step, at most keep_top_k number of total bboxes are to be kept + per image if keep_top_k is larger than -1. + Args: + bboxes (tensor): Two types of bboxes are supported: + 1. (tensor) A 3-D Tensor with shape + [N, M, 4 or 8 16 24 32] represents the + predicted locations of M bounding bboxes, + N is the batch size. Each bounding box has four + coordinate values and the layout is + [xmin, ymin, xmax, ymax], when box size equals to 4. + 2. (tensor) A 3-D Tensor with shape [M, C, 4] + M is the number of bounding boxes, C is the + class number + scores (tensor): Two types of scores are supported: + 1. (tensor) A 3-D Tensor with shape [N, C, M] + represents the predicted confidence predictions. + N is the batch size, C is the class number, M is + number of bounding boxes. For each category there + are total M scores which corresponding M bounding + boxes. Please note, M is equal to the 2nd dimension + of BBoxes. + 2. 
(LoDTensor) A 2-D LoDTensor with shape [M, C]. + M is the number of bbox, C is the class number. + In this case, input BBoxes should be the second + case with shape [M, C, 4]. + background_label (int): The index of background label, the background + label will be ignored. If set to -1, then all + categories will be considered. Default: 0 + score_threshold (float): Threshold to filter out bounding boxes with + low confidence score. If not provided, + consider all boxes. + nms_top_k (int): Maximum number of detections to be kept according to + the confidences after the filtering detections based + on score_threshold. + nms_threshold (float): The threshold to be used in NMS. Default: 0.3 + nms_eta (float): The threshold to be used in NMS. Default: 1.0 + keep_top_k (int): Number of total bboxes to be kept per image after NMS + step. -1 means keeping all bboxes after NMS step. + normalized (bool): Whether detections are normalized. Default: True + return_index(bool): Whether return selected index. Default: False + rois_num(Tensor): 1-D Tensor contains the number of RoIs in each image. + The shape is [B] and data type is int32. B is the number of images. + If it is not None then return a list of 1-D Tensor. Each element + is the output RoIs' number of each image on the corresponding level + and the shape is [B]. None by default. + name(str): Name of the multiclass nms op. Default: None. + + Returns: + A tuple with two Variables: (Out, Index) if return_index is True, + otherwise, a tuple with one Variable(Out) is returned. + Out: A 2-D LoDTensor with shape [No, 6] represents the detections. + Each row has 6 values: [label, confidence, xmin, ymin, xmax, ymax] + or A 2-D LoDTensor with shape [No, 10] represents the detections. + Each row has 10 values: [label, confidence, x1, y1, x2, y2, x3, y3, + x4, y4]. No is the total number of detections. + If all images have not detected results, all elements in LoD will be + 0, and output tensor is empty (None). + Index: Only return when return_index is True. A 2-D LoDTensor with + shape [No, 1] represents the selected index which type is Integer. + The index is the absolute value cross batches. No is the same number + as Out. If the index is used to gather other attribute such as age, + one needs to reshape the input(N, M, 1) to (N * M, 1) as first, where + N is the batch size and M is the number of boxes. + """ + helper = LayerHelper('multiclass_nms3', **locals()) + + if in_dygraph_mode(): + attrs = ('background_label', background_label, 'score_threshold', + score_threshold, 'nms_top_k', nms_top_k, 'nms_threshold', + nms_threshold, 'keep_top_k', keep_top_k, 'nms_eta', nms_eta, + 'normalized', normalized) + + output, index, nms_rois_num = core.ops.multiclass_nms3(bboxes, scores, + rois_num, *attrs) + if not return_index: + index = None + + return output, nms_rois_num, index \ No newline at end of file diff --git a/object_detection/Swin/det_heads/det_utils/generator_utils.py b/object_detection/Swin/det_heads/det_utils/generator_utils.py new file mode 100644 index 00000000..092c620a --- /dev/null +++ b/object_detection/Swin/det_heads/det_utils/generator_utils.py @@ -0,0 +1,500 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import math + +import paddle +import paddle.nn as nn +from paddle.fluid.framework import Variable, in_dygraph_mode +from paddle.fluid import core + +class AnchorGenerator(nn.Layer): + """ + Compute anchors in the standard ways described in + "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks". + + Attributes: + anchor_size (list[list[float]] | list[float]): + If ``anchor_size`` is list[list[float]], ``anchor_size[i]`` is the list of anchor sizes + (i.e. sqrt of anchor area) to use for the i-th feature map. + If ``anchor_size`` is list[float], ``anchor_size`` is used for all feature maps. + Anchor anchor_size are given in absolute lengths in units of + the input image; they do not dynamically scale if the input image size changes. + aspect_ratios (list[list[float]] or list[float]): list of aspect ratios + (i.e. height / width) to use for anchors. Same "broadcast" rule for `sizes` applies. + strides (list[int]): stride of each input feature. + offset (float): Relative offset between the center of the first anchor and the top-left + corner of the image. Value has to be in [0, 1). + Recommend to use 0.5, which means half stride. + """ + + def __init__(self, + anchor_sizes = [[32], [64], [128], [256], [512]], + aspect_ratios = [0.5, 1.0, 2.0], + strides = [4, 8, 16, 32, 64], + offset = 0.5): + super(AnchorGenerator, self).__init__() + + self.anchor_sizes = anchor_sizes + self.aspect_ratios = aspect_ratios + self.strides = strides + self.offset = offset + self.base_anchors = self._compute_anchors() + + assert 0. <= self.offset <= 1.0 + + def generate_anchors(self, + sizes = [32, 64, 128, 256, 512], + aspect_ratios = [0.5, 1.0, 2.0]): + """ + Generate a tensor storing canonical anchor boxes, which are all anchor + boxes of different sizes and aspect_ratios centered at (0, 0). + We can later build the set of anchors for a full feature map by + shifting and tiling these tensors (see `meth:_grid_anchors`). + Args: + sizes (list[float] | tuple[float]): + aspect_ratios (list[float] | tuple[float]]): + Returns: + Tensor of shape (len(sizes) * len(aspect_ratios), 4) storing anchor boxes + in xyxy format. 
+ """ + anchors = [] + + for size in sizes: + area = size ** 2.0 + for ratio in aspect_ratios: + w = math.sqrt(area / ratio) + h = ratio * w + x0, y0, x1, y1 = -w / 2.0, -h / 2.0, w / 2.0, h / 2.0 + anchors.append([x0, y0, x1, y1]) + + return paddle.to_tensor(anchors, dtype='float32') + + def _broadcast_params(self, params, num_features): + if not isinstance(params[0], (list, tuple)): + return [params] * num_features + if len(params) == 1: + return params * num_features + return params + + def _compute_anchors(self): + sizes = self._broadcast_params(self.anchor_sizes, len(self.strides)) + aspect_ratios = self._broadcast_params(self.aspect_ratios, len(self.strides)) + + base_anchors = [self.generate_anchors(s, a) for s, a in zip(sizes, aspect_ratios)] + + [self.register_buffer(t.name, t, persistable=False) for t in base_anchors] + + return base_anchors + + def _grid_anchors(self, grid_sizes): + anchors = [] + + for grid_size, stride, base_anchor in zip(grid_sizes, self.strides, self.base_anchors): + grid_h, grid_w = grid_size + + grid_x = paddle.arange( + self.offset * stride, grid_w * stride, step = stride, dtype='float32' + ) + grid_y = paddle.arange( + self.offset * stride, grid_h * stride, step = stride, dtype='float32' + ) + + grid_y, grid_x = paddle.meshgrid(grid_y, grid_x) + grid_x = grid_x.reshape([-1]) + grid_y = grid_y.reshape([-1]) + + grid_coord = paddle.stack([grid_x, grid_y, grid_x, grid_y], axis=1) + + anchors.append((grid_coord.unsqueeze(1) + base_anchor.unsqueeze(0)).reshape([-1, 4])) + + return anchors + + def forward(self, feats): + grid_sizes = [feat.shape[-2:] for feat in feats] + anchor_over_all_feat_maps = self._grid_anchors(grid_sizes) + + return anchor_over_all_feat_maps + + @property + def num_anchors(self): + return [len(num_a) for num_a in self.base_anchors][0] + +# feats = [] +# h, w = 800., 800 +# for i in range(4): +# feats.append(paddle.rand([4, 256, h / (2 ** (i + 2)), w / (2 ** (i + 2))])) + +# anchorgenerator = AnchorGenerator() +# res = anchorgenerator(feats) +# print(anchorgenerator.num_anchors) +# print(res) +def generate_proposals(scores, + bbox_deltas, + im_shape, + anchors, + variances, + pre_nms_top_n=6000, + post_nms_top_n=1000, + nms_thresh=0.5, + min_size=0.1, + eta=1.0, + pixel_offset=False, + return_rois_num=False, + name=None): + """ + **Generate proposal Faster-RCNN** + This operation proposes RoIs according to each box with their + probability to be a foreground object and + the box can be calculated by anchors. Bbox_deltais and scores + to be an object are the output of RPN. Final proposals + could be used to train detection net. + For generating proposals, this operation performs following steps: + 1. Transposes and resizes scores and bbox_deltas in size of + (H*W*A, 1) and (H*W*A, 4) + 2. Calculate box locations as proposals candidates. + 3. Clip boxes to image + 4. Remove predicted boxes with small area. + 5. Apply NMS to get final proposals as output. + + Args: + scores (tensor): A 4-D Tensor with shape [N, A, H, W] represents + the probability for each box to be an object. + N is batch size, A is number of anchors, H and W are height and + width of the feature map. The data type must be float32. + bbox_deltas (tensor): A 4-D Tensor with shape [N, 4*A, H, W] + represents the difference between predicted box location and + anchor location. The data type must be float32. + im_shape (tensor): A 2-D Tensor with shape [N, 2] represents H, W, the + origin image size or input size. The data type can be float32 or + float64. 
+ anchors (tensor): A 4-D Tensor represents the anchors with a layout + of [H, W, A, 4] or [H * W * A, 4]. H and W are height and width of the feature map, + num_anchors is the box count of each position. Each anchor is + in (xmin, ymin, xmax, ymax) format an unnormalized. The data type must be float32. + variances (tensor): A 4-D Tensor. The expanded variances of anchors with a layout of + [H, W, num_priors, 4]. Each variance is in (xcenter, ycenter, w, h) format. + The data type must be float32. + pre_nms_top_n (float): Number of total bboxes to be kept per image before NMS. + The data type must be float32. `6000` by default. + post_nms_top_n (float): Number of total bboxes to be kept per image after NMS. The data type must be float32. + `1000` by default. + nms_thresh (float): Threshold in NMS. The data type must be float32. `0.5` by default. + min_size (float): Remove predicted boxes with either height or + width < min_size. The data type must be float32. `0.1` by default. + eta (float): Apply in adaptive NMS, if adaptive `threshold > 0.5`, + `adaptive_threshold = adaptive_threshold * eta` in each iteration. + return_rois_num (bool): When setting True, it will return a 1D Tensor with shape [N, ] that includes Rois's + num of each image in one batch. The N is the image's num. For example, the tensor has values [4,5] that represents + the first image has 4 Rois, the second image has 5 Rois. It only used in rcnn model. + 'False' by default. + name(str, optional): For detailed information, please refer + to :ref:`api_guide_Name`. Usually name is no need to set and + None by default. + Returns: + tuple: + A tuple with format ``(rpn_rois, rpn_roi_probs)``. + - **rpn_rois**: The generated RoIs. 2-D Tensor with shape ``[N, 4]`` while ``N`` is the number of RoIs. + The data type is the same as ``scores``. + - **rpn_roi_probs**: The scores of generated RoIs. 2-D Tensor with shape ``[N, 1]`` while ``N`` is the number of RoIs. + The data type is the same as ``scores``. + """ + assert in_dygraph_mode() + assert return_rois_num, "return_rois_num should be True in dygraph mode." + attrs = ('pre_nms_topN', pre_nms_top_n, 'post_nms_topN', post_nms_top_n, + 'nms_thresh', nms_thresh, 'min_size', min_size, 'eta', eta, + 'pixel_offset', pixel_offset) + rpn_rois, rpn_roi_probs, rpn_rois_num = core.ops.generate_proposals_v2( + scores, bbox_deltas, im_shape, anchors, variances, *attrs) + + return rpn_rois, rpn_roi_probs, rpn_rois_num + + +class ProposalGenerator(object): + """ + For each feature map, select the `pre_nms_topk` highest scoring proposals, + apply NMS, clip proposals, and remove small boxes. Return the `post_nms_topk` + highest scoring proposals among all the feature maps for each image. + + Attributes: + pre_nms_top_n (int): number of top k scoring proposals to keep before applying NMS. + When RPN is run on multiple feature maps (as in FPN) this number is per + feature map.Default 6000 + post_nms_top_n (int): number of top k scoring proposals to keep after applying NMS. + When RPN is run on multiple feature maps (as in FPN) this number is total, + over all feature maps.Default 1000 + nms_thresh (float): Threshold in NMS. default 0.5 + min_size (float): minimum proposal box side length in pixels (absolute units + wrt input images). + eta (float): Apply in adaptive NMS, if adaptive `threshold > 0.5`, + `adaptive_threshold = adaptive_threshold * eta` in each iteration. + default 1. + topk_after_collect (bool): whether to adopt topk after batch + collection. 
If topk_after_collect is true, box filter will not be + used after NMS at each image in proposal generation. default false + """ + + def __init__(self, + pre_nms_top_n = 6000, + post_nms_top_n = 1000, + nms_thresh = .5, + min_size = .1, + eta = 1., + topk_after_collect = False): + super(ProposalGenerator, self).__init__() + self.pre_nms_top_n = pre_nms_top_n + self.post_nms_top_n = post_nms_top_n + self.nms_thresh = nms_thresh + self.min_size = min_size + self.eta = eta + self.topk_after_collect = topk_after_collect + + def __call__(self, scores, bbox_deltas, anchors, imgs_shape): + top_n = self.pre_nms_top_n if self.topk_after_collect else self.post_nms_top_n + variances = paddle.ones_like(anchors) + rpn_rois, rpn_rois_prob, rpn_rois_num = generate_proposals( + scores, + bbox_deltas, + imgs_shape, + anchors, + variances, + pre_nms_top_n=self.pre_nms_top_n, + post_nms_top_n=top_n, + nms_thresh=self.nms_thresh, + min_size=self.min_size, + eta=self.eta, + return_rois_num=True + ) + + return rpn_rois, rpn_rois_prob, rpn_rois_num, self.post_nms_top_n + + +def roi_align(input, + rois, + output_size, + spatial_scale=1.0, + sampling_ratio=-1, + rois_num=None, + aligned=True): + """ + Region of interest align (also known as RoI align) is to perform + bilinear interpolation on inputs of nonuniform sizes to obtain + fixed-size feature maps (e.g. 7*7). + + Args: + input (Tensor): Input feature, 4D-Tensor with the shape of [N,C,H,W], + where N is the batch size, C is the input channel, H is Height, W is weight. + The data type is float32 or float64. + rois (Tensor): ROIs (Regions of Interest) to pool over.It should be + a 2-D Tensor or 2-D LoDTensor of shape (num_rois, 4), the lod level is 1. + The data type is float32 or float64. Given as [[x1, y1, x2, y2], ...], + (x1, y1) is the top left coordinates, and (x2, y2) is the bottom right coordinates. + output_size (list[int, int] | tuple[int, int]): The pooled output size(h, w), data type is int32. + spatial_scale (list[float32], optional): Multiplicative spatial scale factor to translate ROI coords + from their input scale to the scale used when pooling. Default: 1.0 + sampling_ratio(int32, optional): number of sampling points in the interpolation grid. + If <=0, then grid points are adaptive to roi_width and pooled_w, likewise for height. Default: -1 + rois_num (Tensor): The number of RoIs in each image. Default: None + name(str, optional): For detailed information, please refer + to :ref:`api_guide_Name`. Usually name is no need to set and + None by default. + + Returns: + Tensor: + Output: The output of ROIAlignOp is a 4-D tensor with shape (num_rois, channels, pooled_h, pooled_w). + The data type is float32 or float64. + """ + + if isinstance(output_size, int): + output_size = (output_size, output_size) + + pooled_height, pooled_width = output_size + + if in_dygraph_mode(): + assert rois_num is not None, "rois_num should not be None in dygraph mode." 
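+        # invoke the Paddle RoIAlign op; output shape is (num_rois, channels, pooled_height, pooled_width)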
+ align_out = core.ops.roi_align( + input, rois, rois_num, "pooled_height", pooled_height, + "pooled_width", pooled_width, "spatial_scale", spatial_scale, + "sampling_ratio", sampling_ratio, "aligned", aligned) + + return align_out + + +def distribute_fpn_proposals(fpn_rois, + min_level, + max_level, + refer_level, + refer_scale, + pixel_offset=False, + rois_num=None): + """ + + **This op only takes LoDTensor as input.** In Feature Pyramid Networks + (FPN) models, it is needed to distribute all proposals into different FPN + level, with respect to scale of the proposals, the referring scale and the + referring level. Besides, to restore the order of proposals, we return an + array which indicates the original index of rois in current proposals. + + Args: + fpn_rois(tensor): 2-D Tensor with shape [N, 4] and data type is + float32 or float64. The input fpn_rois. + min_level(int32): The lowest level of FPN layer where the proposals come + from. + max_level(int32): The highest level of FPN layer where the proposals + come from. + refer_level(int32): The referring level of FPN layer with specified scale. + refer_scale(int32): The referring scale of FPN layer with specified level. + rois_num(tensor): 1-D Tensor contains the number of RoIs in each image. + The shape is [B] and data type is int32. B is the number of images. + If it is not None then return a list of 1-D Tensor. Each element + is the output RoIs' number of each image on the corresponding level + and the shape is [B]. None by default. + + Returns: + Tuple: + multi_rois(list[tensor]) : A list of 2-D LoDTensor with shape [M, 4] + and data type of float32 and float64. The length is + max_level-min_level+1. The proposals in each FPN level. + restore_ind(tensor): A 2-D Tensor with shape [N, 1], N is + the number of total rois. The data type is int32. It is + used to restore the order of fpn_rois. + rois_num_per_level(list(tensor)): A list of 1-D Tensor and each Tensor is + the RoIs' number in each image on the corresponding level. The shape + is [B] and data type of int32. B is the number of images. + + """ + num_lvl = max_level - min_level + 1 + + if in_dygraph_mode(): + assert rois_num is not None, "rois_num should not be None in dygraph mode." + attrs = ('min_level', min_level, 'max_level', max_level, 'refer_level', + refer_level, 'refer_scale', refer_scale, 'pixel_offset', + pixel_offset) + multi_rois, restore_ind, rois_num_per_level = core.ops.distribute_fpn_proposals( + fpn_rois, rois_num, num_lvl, num_lvl, *attrs) + + return multi_rois, restore_ind, rois_num_per_level + + +class RoIAlign(object): + ''' + Region of interest feature map pooler that supports pooling from + one or more feature maps. + ''' + def __init__( + self, + output_size, + scales, + sampling_ratio, + canonical_box_size=224, + canonical_level=4, + min_level=0, + max_level=3, + aligned=True + ): + ''' + Attributes: + output_size (int): output size of the pooled region. + scales (list[float]): The scale for each low-level pooling op relative to + the input image. For a feature map with stride s relative to the input + image, scale is defined as 1/s. The stride must be power of 2. + When there are multiple scales, they must form a pyramid, i.e. they must be + a monotically decreasing geometric sequence with a factor of 1/2. + sampling_ratio (int): The `sampling_ratio` parameter for the ROIAlign op. + canonical_box_size (int): A canonical box size in pixels (sqrt(box area)). 
The default + is heuristically defined as 224 pixels in the FPN paper (based on ImageNet + pre-training). + canonical_level (int): The feature map level index from which a canonically-sized box + should be placed. The default is defined as level 4 (stride=16) in the FPN paper, + i.e., a box of size 224x224 will be placed on the feature with stride=16. + The box placement for all boxes will be determined from their sizes w.r.t + canonical_box_size. For example, a box whose area is 4x that of a canonical box + should be used to pool features from feature level ``canonical_level+1``. + Note that the actual input feature maps given to this module may not have + sufficiently many levels for the input boxes. If the boxes are too large or too + small for the input feature maps, the closest level will be used. + start_level (int): The start level of FPN layer to extract RoI feature, default 0. + end_level (int): The end level of FPN layer to extract RoI feature, default 3. + aligned (bool): Whether to add offset to rois' coord in roi_align. default True. + ''' + super(RoIAlign, self).__init__() + self.output_size = output_size + self.scales = scales + self.sampling_ratio = sampling_ratio + self.canonical_box_size = canonical_box_size + self.canonical_level = canonical_level + self.min_level = min_level + self.max_level = max_level + self.aligned = aligned + + def __call__(self, feats, rois, rois_num): + ''' + Args: + feats (list[tensor]): features from fpn. + rois (list[tensor]): proposals from rpn. + rois_num (list[int]): the number of each img's proposals. + + Returns: + roi_features (tensor): A tensor of shape (M, C, output_size, output_size) + where M is the total number of boxes aggregated over all N batch images + and C is the number of channels in `x`. + ''' + if isinstance(rois_num, list): + rois_num = paddle.to_tensor(rois_num).astype("int32") + rois = paddle.concat(rois) + + if len(feats) == 1: + roi_features = roi_align( + feats[self.min_level], + rois, + self.output_size, + self.scales[0], + self.sampling_ratio, + rois_num=rois_num, + aligned=self.aligned + ) + + else: + rois_per_level, original_ind, rois_num_per_level = distribute_fpn_proposals( + rois, + self.min_level + 2, + self.max_level + 2, + self.canonical_level, + self.canonical_box_size, + rois_num=rois_num + ) + + roi_features_per_level = [] + + for l in range(self.min_level, self.max_level + 1): + roi_feats = roi_align( + feats[l], + rois_per_level[l], + self.output_size, + self.scales[l], + self.sampling_ratio, + rois_num=rois_num_per_level[l], + aligned = self.aligned + ) + + roi_features_per_level.append(roi_feats) + + roi_features = paddle.gather( + paddle.concat(roi_features_per_level), + original_ind + ) + + return roi_features + diff --git a/object_detection/Swin/det_heads/det_utils/target_assign.py b/object_detection/Swin/det_heads/det_utils/target_assign.py new file mode 100644 index 00000000..05f52019 --- /dev/null +++ b/object_detection/Swin/det_heads/det_utils/target_assign.py @@ -0,0 +1,304 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + + +import paddle +from .box_utils import boxes_iou, bbox2delta + +def anchor_target_matcher(match_quality_matrix, + positive_thresh, + negative_thresh, + allow_low_quality_matches, + low_thresh = -float("inf")): + ''' + This class assigns to each predicted "element" (e.g., a box) a ground-truth + element. Each predicted element will have exactly zero or one matches; each + ground-truth element may be matched to zero or more predicted elements. + + Args: + match_quality_matrix (tensor): an MxN tensor, containing the pairwise quality + between M ground-truth elements and N predicted elements. + positive_thresh (float): the positive class threshold of iou between anchors and gt. + negative_thresh (float): the negative class threshold of iou between anchors and gt. + allow_low_quality_matches (bool): if True, produce additional matches + for predictions with maximum match quality lower than high_threshold. + + Returns: + matches (tensor): a vector of length M, where matches[i] is a matched + ground-truth index in [0, M). + match_labels (tensor): a vector of length M, where pred_labels[i] indicates + whether a prediction is a true or false positive or ignored. + + ''' + # matches is 1 x M, the index of anchors matching gt + matched_vals, matches = paddle.topk(match_quality_matrix, k = 1, axis = 0) + match_labels = paddle.full(matches.shape, -1, dtype = "int32") + neg_idx = paddle.logical_and(matched_vals > low_thresh, + matched_vals < negative_thresh) + + match_labels = paddle.where(matched_vals >= positive_thresh, + paddle.ones_like(match_labels), + match_labels) + match_labels = paddle.where(neg_idx, + paddle.zeros_like(match_labels), + match_labels) + + # highest_quality_foreach_gt is N x 1 + # For each gt, find the prediction with which it has highest quality + if allow_low_quality_matches: + highest_quality_foreach_gt = match_quality_matrix.max(axis=1, keepdim=True) + pred_inds_with_highest_quality = paddle.logical_and( + match_quality_matrix > 0, match_quality_matrix == highest_quality_foreach_gt).cast('int32').sum( + 0, keepdim=True) + match_labels = paddle.where(pred_inds_with_highest_quality > 0, + paddle.ones_like(match_labels), + match_labels) + + matches = matches.flatten() + match_labels = match_labels.flatten() + + return matches, match_labels + + +# reference: https://github.com/facebookresearch/detectron2/blob/master/detectron2/modeling/sampling.py +def subsample_labels(labels, + num_samples, + positive_fraction, + bg_label=0): + """ + Return `num_samples` (or fewer, if not enough found) + random samples from `labels` which is a mixture of positives & negatives. + It will try to return as many positives as possible without + exceeding `positive_fraction * num_samples`, and then try to + fill the remaining slots with negatives. + + Args: + labels (tensor): shape (N, ) label vector with values: + * -1: ignore + * bg_label: background ("negative") class + * otherwise: one or more foreground ("positive") classes + num_samples (int): The total number of labels with value >= 0 to return. + Values that are not sampled will be filled with -1 (ignore). + positive_fraction (float): The number of subsampled labels with values > 0 + is `min(num_positives, int(positive_fraction * num_samples))`. The number + of negatives sampled is `min(num_negatives, num_samples - num_positives_sampled)`. + In order words, if there are not enough positives, the sample is filled with + negatives. 
If there are also not enough negatives, then as many elements are + sampled as is possible. + bg_label (int): label index of background ("negative") class. + + Returns: + pos_idx, neg_idx (tensor): + 1D vector of indices. The total length of both is `num_samples` or fewer. + """ + positive = paddle.nonzero(paddle.logical_and(labels != -1, labels != bg_label)) + negative = paddle.nonzero(labels == bg_label) + + num_pos = int(num_samples * positive_fraction) + # protect against not enough positive examples + num_pos = min(positive.numel(), num_pos) + num_neg = num_samples - num_pos + # protect against not enough negative examples + num_neg = min(negative.numel(), num_neg) + + if num_pos == 0 and num_neg == 0: + pos_idx = paddle.zeros([0], dtype='int32') + neg_idx = paddle.zeros([0], dtype='int32') + return pos_idx, neg_idx + + # randomly select positive and negative examples + negative = negative.cast('int32').flatten() + neg_perm = paddle.randperm(negative.numel(), dtype='int32')[:int(num_neg)] + neg_idx = paddle.gather(negative, neg_perm) + + if num_pos == 0: + pos_idx = paddle.zeros([0], dtype='int32') + return pos_idx, neg_idx + + positive = positive.cast('int32').flatten() + pos_perm = paddle.randperm(positive.numel(), dtype='int32')[:int(num_pos)] + pos_idx = paddle.gather(positive, pos_perm) + + return pos_idx, neg_idx + + +def anchor_target_assign(anchors, + gt_boxes, + positive_thresh, + negative_thresh, + batch_size_per_image, + positive_fraction, + allow_low_quality_matches=False, + is_crowd=None, + weights=[1., 1., 1., 1.]): + ''' + Args: + anchors (tensor): shape [-1, 4] the sum of muti-level anchors. + gt_boxes (list): gt_boxes[i] is the i-th img's gt_boxes. + positive_thresh (float): the positive class threshold of iou between anchors and gt. + negative_thresh (float): the negative class threshold of iou between anchors and gt. + batch_size_per_image (int): number of anchors per image to sample for training. + positive_fraction (float): fraction of foreground anchors to sample for training. + allow_low_quality_matches (bool): if True, produce additional matches + for predictions with maximum match quality lower than high_threshold. + is_crowd (list | None): is_crowd[i] is is_crowd label of the i-th img's gt_boxes. + weights (list): more detail please see bbox2delta. + + Returns: + tgt_labels (list[tensor]): tgt_labels[i].shape is [Ni], the label(positive or negative) of anchors. + tgt_bboxes (list[tensor]): tgt_bboxes[i].shape is [Ni, 4], the matched gt_boxes. + tgt_deltas (list[tensor]): tgt_deltas[i].shape is [Ni, 4], the deltas between anchors and gt_boxes. 
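+
+    Example (illustrative call only; the values follow the RPN defaults in config.py):
+
+        tgt_labels, tgt_bboxes, tgt_deltas = anchor_target_assign(
+            anchors, gt_boxes,
+            positive_thresh=0.7, negative_thresh=0.3,
+            batch_size_per_image=256, positive_fraction=0.5,
+            allow_low_quality_matches=True)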
+ ''' + tgt_labels = [] + tgt_bboxes = [] + tgt_deltas = [] + + low_thresh = -float("inf") + for i in range(len(gt_boxes)): + gt_bbox = gt_boxes[i] + n_gt = gt_bbox.shape[0] + + if n_gt == 0 or is_crowd is None: + n_is_crowd = 0 + else: + is_crowd_i = is_crowd[i] + n_is_crowd = paddle.nonzero(is_crowd_i).shape[0] + + match_quality_matrix, _ = boxes_iou(gt_bbox, anchors) + assert match_quality_matrix.dim() == 2 + + # ignore the iou between anchor and crowded ground-truth + if n_is_crowd > 0: + n_a = anchors.shape[0] + ones = paddle.ones([n_a]) + mask = is_crowd_i * ones + match_quality_matrix = match_quality_matrix * (1 - mask) - mask + low_thresh = -1 + # match_quality_matrix is N (gt) x M (predicted) + # assert (match_quality_matrix >= 0).all() + if match_quality_matrix.shape[0] == 0 or n_gt == n_is_crowd: + matches = paddle.full((match_quality_matrix.shape[1], ), 0, dtype='int64') + match_labels = paddle.full((match_quality_matrix.shape[1], ), 0, dtype='int32') + else: + matches, match_labels = anchor_target_matcher(match_quality_matrix, + positive_thresh, + negative_thresh, + allow_low_quality_matches, + low_thresh) + + pos_idx, neg_idx = subsample_labels(match_labels, + batch_size_per_image, + positive_fraction) + + # Fill with the ignore label (-1), then set positive and negative labels + labels = paddle.full(match_labels.shape, -1, dtype='int32') + if neg_idx.shape[0] > 0: + labels = paddle.scatter(labels, neg_idx, paddle.zeros_like(neg_idx)) + if pos_idx.shape[0] > 0: + labels = paddle.scatter(labels, pos_idx, paddle.ones_like(pos_idx)) + + if n_gt == 0: + matched_gt_boxes = paddle.zeros([0, 4]) + tgt_delta = paddle.zeros([0, 4]) + else: + matched_gt_boxes = paddle.gather(gt_bbox, matches) + tgt_delta = bbox2delta(anchors, matched_gt_boxes, weights) + matched_gt_boxes.stop_gradient = True + tgt_delta.stop_gradient = True + + labels.stop_gradient = True + tgt_labels.append(labels) + tgt_bboxes.append(matched_gt_boxes) + tgt_deltas.append(tgt_delta) + + return tgt_labels, tgt_bboxes, tgt_deltas + + +def roi_target_assign(proposals, + gt_boxes, + gt_classes, + num_classes, + positive_thresh, + negative_thresh, + batch_size_per_image, + positive_fraction, + allow_low_quality_matches=False): + ''' + It performs box matching between "roi" and "target",and assigns training labels + to the proposals. + + Args: + proposals (list[tensor]): the batch RoIs from rpn_head. + gt_boxes (list[tensor]): gt_boxes[i] is the i'th img's gt_boxes. + gt_classes (list[tensor]): gt_classes[i] is the i'th img's gt_classes. + num_classes (int): the number of class. + + Returns: + proposals_info (dict): a dict contains the information of proposals. 
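+            The keys are "num_fg", "proposals", "num_proposals", "gt_boxes" and
+            "gt_classes"; each value is a per-image list.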
+ ''' + + proposals_info = {} + num_fg_samples = [] + proposals_samples = [] + num_proposals = [] + gt_boxes_samples = [] + gt_cls_samples = [] + + for proposals_single_img, bbox_single_img, label_single_img in zip(proposals, gt_boxes, gt_classes): + match_quality_matrix, _ = boxes_iou(bbox_single_img, proposals_single_img) + matched_idxs, matched_labels = anchor_target_matcher(match_quality_matrix, + positive_thresh, + negative_thresh, + allow_low_quality_matches) + + if label_single_img.numel() > 0: + label_single_img = label_single_img.flatten() # squeeze may get scalar + label_single_img = paddle.gather(label_single_img, matched_idxs) + label_single_img = paddle.where(matched_labels == 0, + paddle.full_like(label_single_img, num_classes), + label_single_img) + + label_single_img = paddle.where(matched_labels == -1, + paddle.full_like(label_single_img, -1), + label_single_img) + else: + label_single_img = paddle.zeros_like(matched_idxs) + num_classes + sample_gt_box = paddle.zeros_like(bbox_single_img) + + sampled_fg_idxs, sampled_bg_idxs = subsample_labels(label_single_img, + batch_size_per_image, + positive_fraction, + num_classes) + + sampled_idxs = paddle.concat([sampled_fg_idxs, sampled_bg_idxs]) + sample_proposal = paddle.gather(proposals_single_img, sampled_idxs) + sample_gt_cls = paddle.gather(label_single_img, sampled_idxs) + + if label_single_img.numel() > 0: + sample_box_idx = paddle.gather(matched_idxs, sampled_idxs) + sample_gt_box = paddle.gather(bbox_single_img, sample_box_idx) + + num_fg_samples.append(sampled_fg_idxs.shape[0]) + proposals_samples.append(sample_proposal) + num_proposals.append(sampled_idxs.shape[0]) + gt_boxes_samples.append(sample_gt_box) + gt_cls_samples.append(sample_gt_cls) + + proposals_info["num_fg"] = num_fg_samples + proposals_info["proposals"] = proposals_samples + proposals_info["num_proposals"] = num_proposals + proposals_info["gt_boxes"] = gt_boxes_samples + proposals_info["gt_classes"] = gt_cls_samples + + return proposals_info diff --git a/object_detection/Swin/det_heads/maskrcnn_head/config.py b/object_detection/Swin/det_heads/maskrcnn_head/config.py new file mode 100644 index 00000000..5293c9ec --- /dev/null +++ b/object_detection/Swin/det_heads/maskrcnn_head/config.py @@ -0,0 +1,51 @@ +import sys +import numpy as np +import paddle +from yacs.config import CfgNode as CN + +config = CN() +config.FPN = CN() +config.RPN = CN() +config.ROI = CN() +config.ROI.BOX_HEAD = CN() + +config.FPN.OUT_CHANNELS = 256 +config.RPN.ANCHOR_SIZE = [[32], [64], [128], [256], [512]] +config.RPN.ASPECT_RATIOS = [0.5, 1.0, 2.0] +config.RPN.STRIDES = [4, 8, 16, 32, 64] +config.RPN.OFFSET = 0.0 +config.RPN.PRE_NMS_TOP_N_TRAIN = 2000 +config.RPN.POST_NMS_TOP_N_TRAIN = 1000 +config.RPN.PRE_NMS_TOP_N_TEST = 1000 +config.RPN.POST_NMS_TOP_N_TEST = 1000 +config.RPN.NMS_THRESH = 0.7 +config.RPN.MIN_SIZE = 0.0 +config.RPN.TOPK_AFTER_COLLECT = True +config.RPN.POSITIVE_THRESH = 0.7 +config.RPN.NEGATIVE_THRESH = 0.3 +config.RPN.BATCH_SIZE_PER_IMG = 256 +config.RPN.POSITIVE_FRACTION = 0.5 +config.RPN.LOW_QUALITY_MATCHES = True + +config.ROI.SCORE_THRESH_INFER = 0.05 +config.ROI.NMS_THRESH_INFER = 0.5 +config.ROI.NMS_KEEP_TOPK_INFER =100 +config.ROI.NUM_ClASSES = 80 +config.ROI.POSITIVE_THRESH = 0.5 +config.ROI.NEGATIVE_THRESH = 0.5 +config.ROI.BATCH_SIZE_PER_IMG = 512 +config.ROI.POSITIVE_FRACTION = 0.25 +config.ROI.LOW_QUALITY_MATCHES = True +config.ROI.BOX_HEAD.REG_WEIGHTS = [10.0, 10.0, 5.0, 5.0] +config.ROI.BOX_HEAD.NUM_CONV = 0 +config.ROI.BOX_HEAD.CONV_DIM = 256 
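+# No extra convs and two 1024-d FC layers, i.e. the familiar two-FC Fast R-CNN box head (see BoxHead in roi_head.py).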
+config.ROI.BOX_HEAD.NUM_FC = 2 +config.ROI.BOX_HEAD.FC_DIM = 1024 +config.ROI.SCALES = [1./4., 1./8., 1./16., 1./32., 1./64.] +config.ROI.ALIGN_OUTPUT_SIZE = 7 +config.ROI.SAMPLING_RATIO = 0 +config.ROI.CANONICAL_BOX_SIZE = 224 +config.ROI.CANONICAL_LEVEL = 4 +config.ROI.MIN_LEVEL = 0 +config.ROI.MAX_LEVEL = 3 +config.ROI.ALIGNED = True diff --git a/object_detection/Swin/det_heads/maskrcnn_head/roi_head.py b/object_detection/Swin/det_heads/maskrcnn_head/roi_head.py new file mode 100644 index 00000000..a5ebd342 --- /dev/null +++ b/object_detection/Swin/det_heads/maskrcnn_head/roi_head.py @@ -0,0 +1,310 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import numpy as np + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from paddle.nn.initializer import XavierNormal, XavierUniform, Normal + +from ..det_utils.target_assign import roi_target_assign +from ..det_utils.generator_utils import RoIAlign +from ..det_utils.box_utils import bbox2delta, delta2bbox, multiclass_nms + + +class BoxHead(nn.Layer): + """ + A head with several 3x3 conv layers (each followed by norm & relu), then + several fc layers (each followed by relu) and followed by two linear layers + for predicting Fast R-CNN outputs. + """ + + def __init__( + self, + num_classes, + in_channels, + output_size, + num_conv, + conv_dim, + num_fc, + fc_dim, + ): + ''' + Attributes: + num_classes (int): the number of class. + in_channels (int): the channels of inputs. + output_size (int): the size of output from pooler. + num_conv (int): the number of conv. + conv_dim (int): the output channels of each conv. + num_fc (int): the number of fc. + fc_dim (int): the output channels of each fc. 
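+
+            With the defaults in config.py (num_conv=0, num_fc=2, fc_dim=1024,
+            output_size=7, in_channels=256) the shared trunk is
+            Flatten -> Linear(12544, 1024) -> ReLU -> Linear(1024, 1024) -> ReLU,
+            followed by a (num_classes + 1)-way classification layer and a
+            (num_classes * 4)-dim regression layer.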
+ ''' + + super(BoxHead, self).__init__() + conv_dims = [conv_dim] * num_conv + fc_dims = [fc_dim] * num_fc + self.forward_net = nn.Sequential() + + for i, channel in enumerate(conv_dims): + conv = nn.Conv2D( + in_channels=in_channels, + out_channels=channel, + kernel_size=3, + padding=1, + weight_attr=paddle.ParamAttr(initializer=XavierNormal(fan_in=0.0)), + bias_attr=True + ) + + self.forward_net.add_sublayer("conv{}".format(i), conv) + self.forward_net.add_sublayer("act_c{}".format(i), nn.ReLU()) + in_channels = channel + + in_dim = output_size * output_size *in_channels + for i, out_dim in enumerate(fc_dims): + if i == 0: + self.forward_net.add_sublayer("flatten", nn.Flatten()) + + fc = nn.Linear(in_dim, + out_dim, + weight_attr=paddle.ParamAttr(initializer=XavierUniform(fan_in=in_dim, fan_out=in_dim))) + + self.forward_net.add_sublayer("linear{}".format(i), fc) + self.forward_net.add_sublayer("act_f{}".format(i), nn.ReLU()) + in_dim = out_dim + + self.cls_fc = nn.Linear(in_dim, + num_classes + 1, + weight_attr=paddle.ParamAttr(initializer=Normal(mean=0., std=0.01))) + + self.reg_fc = nn.Linear(in_dim, + num_classes * 4, + weight_attr=paddle.ParamAttr(initializer=Normal(mean=0., std=0.001))) + + def forward(self, inputs): + feats = self.forward_net(inputs) + pred_scores = self.cls_fc(feats) + pred_deltas = self.reg_fc(feats) + + return [pred_scores, pred_deltas] + + +class RoIHead(nn.Layer): + ''' + RoIHead will match proposals from RPNHead with gt (when training), + crop the regions and extract per-region features using proposals, + and make per-region predictions. + ''' + def __init__(self, config): + super(RoIHead, self).__init__() + self.config = config + + self.pooler = RoIAlign( + output_size=config.ROI.ALIGN_OUTPUT_SIZE, + scales=config.ROI.SCALES, + sampling_ratio=config.ROI.SAMPLING_RATIO, + canonical_box_size=config.ROI.CANONICAL_BOX_SIZE, + canonical_level=config.ROI.CANONICAL_LEVEL, + min_level=config.ROI.MIN_LEVEL, + max_level=config.ROI.MAX_LEVEL, + aligned=config.ROI.ALIGNED + ) + + self.predictor = BoxHead( + num_classes=config.ROI.NUM_ClASSES, + in_channels=config.FPN.OUT_CHANNELS, + output_size=config.ROI.ALIGN_OUTPUT_SIZE, + num_conv=config.ROI.BOX_HEAD.NUM_CONV, + conv_dim=config.ROI.BOX_HEAD.CONV_DIM, + num_fc=config.ROI.BOX_HEAD.NUM_FC, + fc_dim=config.ROI.BOX_HEAD.FC_DIM + ) + + def _det_forward(self, feats, proposals_info): + roi = proposals_info["proposals"] + rois_num = paddle.to_tensor(proposals_info["num_proposals"]).astype("int32") + roi_feats = self.pooler(feats, roi, rois_num) + predictions = self.predictor(roi_feats) + + return predictions + + def _get_loss(self, preds, proposals_info): + ''' + Args: + preds (list[tensor]): + pred_scores (tensor) shape is (num_proposals, num_cls + 1), The pred class score. + pred_deltas (tensor) shape is (num_proposals, num_cls * 4), The pred location. 
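+            proposals_info (dict): the sampled proposals and their matched targets
+                produced by roi_target_assign.
+
+        Returns:
+            losses (dict): "loss_cls", a softmax cross-entropy over num_classes + 1
+                classes, and "loss_reg", an L1 loss on the foreground box deltas
+                normalized by the number of sampled proposals.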
+ ''' + pred_scores, pred_deltas = preds + n_s = pred_deltas.shape[0] + + proposals = proposals_info["proposals"] + gt_classes = paddle.concat(proposals_info["gt_classes"]).reshape([-1]) + gt_boxes = paddle.concat(proposals_info["gt_boxes"]) + + if len(proposals) == 0: + proposals = paddle.zeros(shape=[n_s, 4], dtype="float32") + tgt_scores = paddle.full(shape=[n_s,], fill_value=-1, dtype="float32") + tgt_boxes = paddle.zeros(shape=[n_s, 4], dtype="float32") + else: + proposals = paddle.concat(proposals) + tgt_scores = gt_classes.reshape([-1, 1]) + tgt_boxes = gt_boxes.reshape([-1, 4]) + + losses = { + "loss_cls": F.cross_entropy(pred_scores, tgt_scores.astype("int64"), reduction='mean') + } + + fg_idx = paddle.nonzero( + paddle.logical_and(gt_classes >= 0, gt_classes < self.config.ROI.NUM_ClASSES) + ).flatten() + + #TODO: errors raised when fg_idx is [] tensor, when train from scratch + fg_cls_base = paddle.gather(x=gt_classes, index=fg_idx) + fg_cls_start = paddle.arange(0, self.config.ROI.NUM_ClASSES * fg_idx.shape[0], self.config.ROI.NUM_ClASSES) + fg_cls_idx = fg_cls_base + fg_cls_start + fg_cls_idx = fg_cls_idx.astype('int64') + + fg_idx.stop_gradient = True + tgt_boxes.stop_gradient = True + proposals.stop_gradient = True + tgt_scores.stop_gradient = True + fg_cls_base.stop_gradient = True + fg_cls_start.stop_gradient = True + + pred_deltas = pred_deltas.reshape([-1, self.config.ROI.NUM_ClASSES, 4]) + pred_deltas = paddle.gather(pred_deltas, fg_idx, axis=0).reshape([-1, 4]) + + pred_deltas = paddle.gather(pred_deltas, fg_cls_idx) + + tgt_boxes = paddle.gather(tgt_boxes, fg_idx) + proposals = paddle.gather(proposals, fg_idx) + + tgt_deltas = bbox2delta(proposals, tgt_boxes, self.config.ROI.BOX_HEAD.REG_WEIGHTS) + + loss_reg = F.l1_loss(pred_deltas, tgt_deltas, reduction="sum") / max(gt_classes.numel(), 1.0) + + losses["loss_reg"] = loss_reg + + return losses + + def _inference(self, preds, proposals_info, inputs): + num_proposals = proposals_info["num_proposals"] + proposals = proposals_info["proposals"] + proposals = paddle.concat(proposals) + + if not len(num_proposals): + return None + + pred_scores, pred_deltas = preds + + # pred_bbox shape [num_proposals_all, num_classes, 4] + pred_bbox = delta2bbox(pred_deltas, + proposals, + self.config.ROI.BOX_HEAD.REG_WEIGHTS) + + pred_bbox_list = paddle.split(pred_bbox, num_proposals) + pred_bbox_list = paddle.split(pred_bbox, num_proposals) + pred_scores = F.softmax(pred_scores) + pred_scores_list = paddle.split(pred_scores, num_proposals) + + post_pred = [] + for i in range(len(pred_bbox_list)): + num_p = num_proposals[i] + img_pred_boxes = pred_bbox_list[i] + img_pred_scores = pred_scores_list[i] + img_hw = inputs["imgs_shape"][i] + img_scale_factor = inputs["scale_factor_wh"][i] + + img_pred_boxes[:, :, 0::2] = paddle.clip( + img_pred_boxes[:, :, 0::2], min=0, max=img_hw[1] + ) / img_scale_factor[0] + + img_pred_boxes[:, :, 1::2] = paddle.clip( + img_pred_boxes[:, :, 1::2], min=0, max=img_hw[0] + ) / img_scale_factor[1] + + + output = multiclass_nms(bboxes=img_pred_boxes, + scores=img_pred_scores[:, :-1], + score_threshold=self.config.ROI.SCORE_THRESH_INFER, + keep_top_k=self.config.ROI.NMS_KEEP_TOPK_INFER, + nms_threshold=self.config.ROI.NMS_THRESH_INFER, + background_label=self.config.ROI.NUM_ClASSES, + rois_num=paddle.to_tensor([num_p]).astype("int32")) + + if output[1][0] == 0: + post_pred.append(paddle.to_tensor([])) + continue + + post_label = output[0][:, 0:1] + post_score = output[0][:, 1:2] + post_boxes = output[0][:, 2:] + + 
boxes_w = post_boxes[:, 2] - post_boxes[:, 0] + boxes_h = post_boxes[:, 3] - post_boxes[:, 1] + + keep = paddle.nonzero(paddle.logical_and(boxes_w > 0., boxes_h > 0.)).flatten() + + post_label = paddle.gather(post_label, keep) + post_score = paddle.gather(post_score, keep) + post_boxes = paddle.gather(post_boxes, keep) + + final_output = paddle.concat([post_label, post_score, post_boxes], axis=-1) + post_pred.append(final_output) + + return post_pred + + def forward(self, feats, proposals, inputs): + ''' + Args: + feats (list[tensor]): the outputs of fpn. + proposals (list[tensor]): list[i] denotes the proposals of the i'th imgs + from rpn head. + inputs (dict): the gt info, eg. gt_boxes, gt_classes, imgs_wh and so on. + + Returns: + losses (dict) | outputs (list[tensor]): + losses contains cls_losses and reg_losses. + the shape of outputs[i] is [M, 6], M is the number of final preds, + Each row has 6 values: [label, score, xmin, ymin, xmax, ymax] + ''' + + if self.training: + proposals_info = roi_target_assign( + proposals, + inputs["gt_boxes"], + inputs["gt_classes"], + self.config.ROI.NUM_ClASSES, + self.config.ROI.POSITIVE_THRESH, + self.config.ROI.NEGATIVE_THRESH, + self.config.ROI.BATCH_SIZE_PER_IMG, + self.config.ROI.POSITIVE_FRACTION, + self.config.ROI.LOW_QUALITY_MATCHES + ) + + predictions = self._det_forward(feats, proposals_info) + losses = self._get_loss(predictions, proposals_info) + + return losses + + else: + proposals_info = {"num_proposals": [len(proposal) for proposal in proposals]} + proposals_info["proposals"] = proposals + + predictions = self._det_forward(feats, proposals_info) + outputs = self._inference(predictions, proposals_info, inputs) + + return outputs diff --git a/object_detection/Swin/det_heads/maskrcnn_head/rpn_head.py b/object_detection/Swin/det_heads/maskrcnn_head/rpn_head.py new file mode 100644 index 00000000..9f98ebf3 --- /dev/null +++ b/object_detection/Swin/det_heads/maskrcnn_head/rpn_head.py @@ -0,0 +1,236 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from paddle.nn.initializer import Normal + +import sys +from ..det_utils.generator_utils import AnchorGenerator, ProposalGenerator +from ..det_utils.target_assign import anchor_target_assign + + +class RPNHead(nn.Layer): + """ + Region Proposal Network uses a 3x3 conv to produce a shared hidden state from which one 1x1 conv + predicts objectness logits for each anchor and a second 1x1 conv predicts bounding-box deltas. + + Attributes: + anchor_generator (class): the generator of anchor. + train_proposal (class): configure of proposals generation at the stage of training. + test_proposal (class): configure of proposals generation at the stage of prediction. + in_channels (int): channel of input feature maps which can be derived by from_config. 
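+
+    The forward pass returns the per-image proposal boxes and their counts, plus a dict
+    with "loss_rpn_cls" and "loss_rpn_reg" when training (None otherwise).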
+ """ + def __init__(self, config): + super(RPNHead, self).__init__() + self.anchor_generator = AnchorGenerator(anchor_sizes=config.RPN.ANCHOR_SIZE, + aspect_ratios=config.RPN.ASPECT_RATIOS, + strides=config.RPN.STRIDES, + offset=config.RPN.OFFSET) + self.train_proposal = ProposalGenerator(pre_nms_top_n=config.RPN.PRE_NMS_TOP_N_TRAIN, + post_nms_top_n=config.RPN.POST_NMS_TOP_N_TRAIN, + nms_thresh=config.RPN.NMS_THRESH, + min_size=config.RPN.MIN_SIZE, + topk_after_collect=config.RPN.TOPK_AFTER_COLLECT) + self.test_proposal = ProposalGenerator(pre_nms_top_n=config.RPN.PRE_NMS_TOP_N_TEST, + post_nms_top_n=config.RPN.POST_NMS_TOP_N_TEST, + nms_thresh=config.RPN.NMS_THRESH, + min_size=config.RPN.MIN_SIZE, + topk_after_collect=config.RPN.TOPK_AFTER_COLLECT) + + self.num_anchors = self.anchor_generator.num_anchors + + num_channels = config.FPN.OUT_CHANNELS + self.conv = nn.Conv2D(num_channels, + num_channels, + kernel_size=3, + padding=1, + weight_attr=paddle.ParamAttr(initializer=Normal(mean=0., std=0.01))) + + self.objectness_logits = nn.Conv2D(num_channels, + self.num_anchors, + kernel_size=1, + padding=0, + weight_attr=paddle.ParamAttr(initializer=Normal(mean=0., std=0.01))) + + self.anchor_deltas = nn.Conv2D(num_channels, + self.num_anchors * 4, + kernel_size=1, + padding=0, + weight_attr=paddle.ParamAttr(initializer=Normal(mean=0., std=0.01))) + + self.config = config + + def predict(self, feats): + ''' + Predict the logits of each feature and the deltas of the anchors in each feature. + + Args: + feats (list[tensor]): Mutil-level feature from fpn. + + Returns: + pred_objectness_logits (list[tensor]): A list of L elements.Element i is a tensor of shape (N, A, Hi, Wi) representing + the predicted objectness logits for all anchors. A is the number of cell anchors. + pred_anchor_deltas (list[tensor]): A list of L elements. Element i is a tensor of shape (N, A * 4, Hi, Wi) + representing the predicted "deltas" used to transform anchors to proposals. + ''' + + pred_objectness_logits = [] + pred_anchor_deltas = [] + for feat in feats: + out = F.relu(self.conv(feat)) + pred_objectness_logits.append(self.objectness_logits(out)) + pred_anchor_deltas.append(self.anchor_deltas(out)) + + return pred_objectness_logits, pred_anchor_deltas + + def _get_proposals(self, scores, bbox_deltas, anchors, inputs): + ''' + Args: + scores (list[tensor]): the prediction logits of the mutil-level features. + scores[i].shape is [N, A, Hi, Wi] + bbox_deltas (list[tensor]): the prediction anchor deltas of the mutil-level features. + bbox_deltas[i].shape is [N, 4 * A, Hi, Wi] + anchors (list[tensor]): the prediction anchor of the mutil-level features. + anchors[i].shape is [Hi * Wi * A, 4] + inputs (dict): ground truth info. 
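+
+        Returns:
+            batch_proposal_rois (list[tensor]): per-image proposal boxes kept after
+                per-level NMS and the final top-k selection.
+            batch_proposal_rois_num (list[int]): the number of kept proposals per image.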
+ ''' + proposal_gen = self.train_proposal if self.training else self.test_proposal + + imgs_shape = inputs["imgs_shape"] + if isinstance(imgs_shape, list): + imgs_shape = paddle.stack(imgs_shape).astype("float32") + + batch_size = len(imgs_shape) + + batch_proposal_rois = [] + batch_proposal_rois_num = [] + for i in range(batch_size): + single_img_rois_list = [] + single_img_prob_list = [] + + for level_scores, level_deltas, level_anchors in zip(scores, bbox_deltas, anchors): + level_rois, level_rois_prob, _, post_nms_top_n = proposal_gen( + scores = level_scores[i:i+1], + bbox_deltas = level_deltas[i:i+1], + anchors = level_anchors, + imgs_shape = imgs_shape[i:i+1] + ) + if level_rois.shape[0] > 0: + single_img_rois_list.append(level_rois) + single_img_prob_list.append(level_rois_prob) + + if len(single_img_rois_list) == 0: + single_img_rois = paddle.zeros(shape=[0, 4]).astype("float32") + else: + single_img_rois = paddle.concat(single_img_rois_list) + single_img_prob = paddle.concat(single_img_prob_list).flatten() + + if single_img_prob.shape[0] > post_nms_top_n: + single_img_topk_prob, topk_inds = paddle.topk(single_img_prob, post_nms_top_n) + single_img_topk_rois = paddle.gather(single_img_rois, topk_inds) + else: + single_img_topk_rois = single_img_rois + + batch_proposal_rois.append(single_img_topk_rois) + batch_proposal_rois_num.append(single_img_topk_rois.shape[0]) + + return batch_proposal_rois, batch_proposal_rois_num + + def _get_losses(self, pred_logits, pred_loc, anchors, inputs): + anchors = paddle.concat(anchors) + gt_boxes = inputs["gt_boxes"] + is_crowd = inputs.get("is_crowd", None) + + tgt_scores, tgt_bboxes, tgt_deltas = anchor_target_assign( + anchors, + gt_boxes, + positive_thresh = self.config.RPN.POSITIVE_THRESH, + negative_thresh = self.config.RPN.NEGATIVE_THRESH, + batch_size_per_image = self.config.RPN.BATCH_SIZE_PER_IMG, + positive_fraction = self.config.RPN.POSITIVE_FRACTION, + allow_low_quality_matches = self.config.RPN.LOW_QUALITY_MATCHES, + is_crowd = is_crowd + ) + + # reshape to [N, Hi * Wi * A, 1] for compute loss + pred_scores = [ + s.transpose([0, 2, 3, 1]).reshape([s.shape[0], -1, 1]) for s in pred_logits + ] + + pred_deltas = [ + d.transpose([0, 2, 3, 1]).reshape([d.shape[0], -1, 4]) for d in pred_loc + ] + + pred_scores = paddle.concat(pred_scores, axis = 1).reshape([-1]) + pred_deltas = paddle.concat(pred_deltas, axis = 1).reshape([-1, 4]) + + tgt_scores = paddle.concat(tgt_scores).astype("float32") + tgt_deltas = paddle.concat(tgt_deltas).astype("float32") + tgt_scores.stop_gradient = True + tgt_deltas.stop_gradient = True + + pos_idx = paddle.nonzero(tgt_scores == 1) + valid_idx = paddle.nonzero(tgt_scores >= 0) + + if valid_idx.shape[0] == 0: + loss_rpn_cls = paddle.zeros([1]).astype("float32") + else: + pred_scores = paddle.gather(pred_scores, valid_idx) + tgt_scores = paddle.gather(tgt_scores, valid_idx).astype("float32") + tgt_scores.stop_gradient = True + loss_rpn_cls = F.binary_cross_entropy_with_logits( + logit=pred_scores, + label=tgt_scores, + reduction="sum" + ) + + if pos_idx.shape[0] == 0: + loss_rpn_reg = paddle.zeros([1]).astype("float32") + else: + pred_deltas = paddle.gather(pred_deltas, pos_idx) + tgt_deltas = paddle.gather(tgt_deltas, pos_idx) + loss_rpn_reg = paddle.abs(pred_deltas - tgt_deltas).sum() + + norm = self.config.RPN.BATCH_SIZE_PER_IMG * len(gt_boxes) + + return { + 'loss_rpn_cls': loss_rpn_cls / norm, + 'loss_rpn_reg': loss_rpn_reg / norm + } + + def forward(self, feats, inputs): + ''' + Args: + feats 
(list[tensor]): Mutil-level feature from fpn. + inputs (dict): ground truth info. + + Returns: + rois (list[tensor]): rois[i] is proposals of the i'th img. + rois_num (list[int]): rois[i] is number of the i'th img's proposals. + losses_dict (dict | None): when training is dict contains loss_rpn_cls and loss_rpn_reg. + ''' + pred_objectness_logits, pred_anchor_deltas = self.predict(feats) + anchors = self.anchor_generator(feats) + + rois, rois_num = self._get_proposals(pred_objectness_logits, pred_anchor_deltas, anchors, inputs) + + if self.training: + losses_dict = self._get_losses(pred_objectness_logits, pred_anchor_deltas, anchors, inputs) + + return rois, rois_num, losses_dict + else: + return rois, rois_num, None diff --git a/object_detection/Swin/det_heads/retinanet_head/config.py b/object_detection/Swin/det_heads/retinanet_head/config.py new file mode 100644 index 00000000..8799956c --- /dev/null +++ b/object_detection/Swin/det_heads/retinanet_head/config.py @@ -0,0 +1,27 @@ +import numpy as np +import paddle +from yacs.config import CfgNode as CN + +config = CN() +config.RETINANET = CN() + +config.RETINANET.NUM_CONVS = 4 +config.RETINANET.INPUT_CHANNELS = 256 +config.RETINANET.NORM = "" +config.RETINANET.PRIOR_PROB = 0.01 +config.RETINANET.NUM_CLASSES = 80 +config.RETINANET.FOCAL_LOSS_ALPHA = 0.25 +config.RETINANET.FOCAL_LOSS_GAMMA = 2 +config.RETINANET.SMOOTHL1_LOSS_DELTA = 0 +config.RETINANET.POSITIVE_THRESH = 0.5 +config.RETINANET.NEGATIVE_THRESH = 0.4 +config.RETINANET.ALLOW_LOW_QUALITY = True +config.RETINANET.WEIGHTS = [1.0, 1.0, 1.0, 1.0] +config.RETINANET.SCORE_THRESH = 0.05 +config.RETINANET.KEEP_TOPK = 100 +config.RETINANET.NMS_TOPK = 1000 +config.RETINANET.NMS_THRESH = 0.5 +config.RETINANET.ANCHOR_SIZE = [[x, x * 2**(1.0/3), x * 2**(2.0/3)] for x in [32, 64, 128, 256, 512 ]] +config.RETINANET.ASPECT_RATIOS = [0.5, 1.0, 2.0] +config.RETINANET.STRIDES = [8.0, 16.0, 32.0, 64.0, 128.0] +config.RETINANET.OFFSET = 0 \ No newline at end of file diff --git a/object_detection/Swin/det_heads/retinanet_head/post_process.py b/object_detection/Swin/det_heads/retinanet_head/post_process.py new file mode 100644 index 00000000..79a5def8 --- /dev/null +++ b/object_detection/Swin/det_heads/retinanet_head/post_process.py @@ -0,0 +1,121 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn.functional as F + +from det_utils.box_utils import nonempty_bbox, delta2bbox, multiclass_nms + +class RetinaNetPostProcess(object): + ''' + This class used to post_process the RetianNet-Head's output. 
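+
+    It decodes the predicted deltas of every FPN level against their anchors, clips the
+    boxes to the image, rescales them to the original resolution and then runs
+    multi-class NMS over the concatenated levels.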
+ ''' + def __init__(self, + score_threshold, + keep_top_k, + nms_top_k, + nms_threshold, + bbox_reg_weights=[1.0, 1.0, 1.0, 1.0]): + super(RetinaNetPostProcess, self).__init__() + self.score_threshold=score_threshold + self.keep_topk=keep_top_k + self.topk_candidates=nms_top_k + self.num_thresh=nms_threshold + self.bbox_reg_weights = bbox_reg_weights + + def _process_single_level_pred(self, box_lvl, score_lvl, anchors, scale_factor_wh, img_whwh): + if isinstance(scale_factor_wh, list): + scale_factor_wh = paddle.concat(scale_factor_wh) + if isinstance(img_whwh, list): + img_whwh = paddle.concat(img_whwh) + + score_lvl = paddle.transpose(score_lvl, [0, 2, 1]) + score_lvl = F.sigmoid(score_lvl) + + batch_lvl = [] + for i in range(len(img_whwh)): + box_lvl_i = delta2bbox(box_lvl[i], + anchors, + self.bbox_reg_weights).reshape(anchors.shape) + + box_lvl_i[:, 0::2] = paddle.clip( + box_lvl_i[:, 0::2], min=0, max=img_whwh[i][0] + ) / scale_factor_wh[i][0] + box_lvl_i[:, 1::2] = paddle.clip( + box_lvl_i[:, 1::2], min=0, max=img_whwh[i][1] + ) / scale_factor_wh[i][1] + + batch_lvl.append(box_lvl_i) + + box_lvl = paddle.stack(batch_lvl) + + return box_lvl, score_lvl + + def __call__(self, pred_scores_list, pred_boxes_list, anchors, scale_factor_wh, img_whwh): + """ + Args: + pred_scores_list (list[Tensor]): tensor of shape (batch_size, R, num_classes). + The tensor predicts the classification probability for each proposal. + pred_boxes_list (list[Tensor]): tensors of shape (batch_size, R, 4). + The tensor predicts anchor's delta + anchors (list[Tensor]): mutil-level anchors. + scale_factor_wh (Tensor): tensors of shape [batch_size, 2] the scalor of per img + img_whwh (Tensor): tensors of shape [batch_size, 4] + Returns: + bbox_pred (Tensor): tensors of shape [num_boxes, 6] Each row has 6 values: + [label, confidence, xmin, ymin, xmax, ymax] + bbox_num (Tensor): tensors of shape [batch_size] the number of RoIs in each image. + """ + assert len(pred_boxes_list[0]) == len(scale_factor_wh) == len(img_whwh) + assert len(pred_boxes_list) == len(anchors) + + mutil_level_bbox = [] + mutil_level_score = [] + + for i in range(len(pred_boxes_list)): + lvl_res_b, lvl_res_s = self._process_single_level_pred( + pred_boxes_list[i], + pred_scores_list[i], + anchors[i], + scale_factor_wh, + img_whwh) + + mutil_level_bbox.append(lvl_res_b) + mutil_level_score.append(lvl_res_s) + + pred_boxes = paddle.concat(mutil_level_bbox, axis=1) # [N, \sum_{i=0}^{n} (Hi * Wi), 4] + pred_scores = paddle.concat(mutil_level_score, axis=2) + + assert pred_boxes.shape[1] == pred_scores.shape[2] + + bbox_pred, bbox_num, _ = multiclass_nms( + pred_boxes, + pred_scores, + score_threshold=self.score_threshold, + keep_top_k=self.keep_topk, + nms_top_k=self.topk_candidates, + nms_threshold=self.num_thresh, + ) + + pred_label = bbox_pred[:, 0:1] + pred_score = bbox_pred[:, 1:2] + pred_bbox = bbox_pred[:, 2:] + keep_mask = nonempty_bbox(pred_bbox, return_mask=True) + keep_mask = paddle.unsqueeze(keep_mask, [1]) + pred_label = paddle.where(keep_mask, pred_label, + paddle.ones_like(pred_label) * -1) + + pred_result = paddle.concat([pred_label, pred_score, pred_bbox], axis=1) + + return pred_result, bbox_num diff --git a/object_detection/Swin/det_heads/retinanet_head/retinanet_head.py b/object_detection/Swin/det_heads/retinanet_head/retinanet_head.py new file mode 100644 index 00000000..2230323f --- /dev/null +++ b/object_detection/Swin/det_heads/retinanet_head/retinanet_head.py @@ -0,0 +1,166 @@ +# Copyright (c) 2021 PPViT Authors. 
All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import math + +import paddle +import paddle.nn as nn + +from paddle.nn.initializer import Normal, Constant + +from retinanet_loss import RetinaNetLoss +from post_process import RetinaNetPostProcess +from det_utils.generator_utils import AnchorGenerator + +class RetinaNetHead(nn.Layer): + ''' + The head used in RetinaNet for object classification and box regression. + It has two subnets for the two tasks, with a common structure but separate parameters. + ''' + def __init__(self, config): + ''' + Args: + input_shape (List[ShapeSpec]): input shape. + num_classes (int): number of classes. Used to label background proposals. + num_anchors (int): number of generated anchors. + conv_dims (List[int]): dimensions for each convolution layer. + norm (str or callable): + Normalization for conv layers except for the two output layers. + See :func:`detectron2.layers.get_norm` for supported types. + loss_func (class): the class is used to compute loss. + prior_prob (float): Prior weight for computing bias. + ''' + super(RetinaNetHead, self).__init__() + + num_convs = config.RETINANET.NUM_CONVS + input_channels = config.RETINANET.INPUT_CHANNELS + norm = config.RETINANET.NORM + prior_prob = config.RETINANET.PRIOR_PROB + + self.num_classes = config.RETINANET.NUM_CLASSES + self.get_loss = RetinaNetLoss( + focal_loss_alpha=config.RETINANET.FOCAL_LOSS_ALPHA, + focal_loss_gamma=config.RETINANET.FOCAL_LOSS_GAMMA, + smoothl1_loss_delta=config.RETINANET.SMOOTHL1_LOSS_DELTA, + positive_thresh=config.RETINANET.POSITIVE_THRESH, + negative_thresh=config.RETINANET.NEGATIVE_THRESH, + allow_low_quality=config.RETINANET.ALLOW_LOW_QUALITY, + num_classes=config.RETINANET.NUM_CLASSES, + weights=config.RETINANET.WEIGHTS + ) + self.postprocess = RetinaNetPostProcess( + score_threshold=config.RETINANET.SCORE_THRESH, + keep_top_k=config.RETINANET.KEEP_TOPK, + nms_top_k=config.RETINANET.NMS_TOPK, + nms_threshold=config.RETINANET.NMS_THRESH, + bbox_reg_weights=config.RETINANET.WEIGHTS + ) + self.anchor_generator = AnchorGenerator(anchor_sizes=config.RETINANET.ANCHOR_SIZE, + aspect_ratios=config.RETINANET.ASPECT_RATIOS, + strides=config.RETINANET.STRIDES, + offset=config.RETINANET.OFFSET) + + num_anchors = self.anchor_generator.num_anchors + conv_dims = [input_channels] * num_convs + + cls_net = [] + reg_net = [] + + for in_channels, out_channels in zip( + [input_channels] + list(conv_dims), conv_dims + ): + cls_net.append( + nn.Conv2D(in_channels, out_channels, kernel_size=3, stride=1, padding=1, + weight_attr=paddle.ParamAttr(initializer=Normal(mean=0., std=0.01))) + ) + if norm == "bn": + cls_net.append(nn.BatchNorm2D(out_channels)) + cls_net.append(nn.ReLU()) + + reg_net.append( + nn.Conv2D(in_channels, out_channels, kernel_size=3, stride=1, padding=1, + weight_attr=paddle.ParamAttr(initializer=Normal(mean=0., std=0.01))) + ) + if norm == "bn": + reg_net.append(nn.BatchNorm2D(out_channels)) + reg_net.append(nn.ReLU()) + + self.cls_net = 
nn.Sequential(*cls_net) + self.reg_net = nn.Sequential(*reg_net) + + bias_value = -math.log((1 - prior_prob) / prior_prob) + self.cls_score = nn.Conv2D( + conv_dims[-1], num_anchors * self.num_classes, kernel_size=3, stride=1, padding=1, + weight_attr=paddle.ParamAttr(initializer=Normal(mean=0., std=0.01)), + bias_attr=paddle.ParamAttr(initializer=Constant(bias_value)) + ) + self.bbox_pred = nn.Conv2D( + conv_dims[-1], num_anchors * 4, kernel_size=3, stride=1, padding=1, + weight_attr=paddle.ParamAttr(initializer=Normal(mean=0., std=0.01)) + ) + + def forward(self, feats, inputs): + ''' + Returns: + loss_dict (dict) | pred_result(tensor), bbox_num(tensor): + loss_dict: contains cls_losses and reg_losses. + pred_result: the shape is [M, 6], M is the number of final preds, + Each row has 6 values: [label, score, xmin, ymin, xmax, ymax] + bbox_num: the shape is [N], N is the num of batch_size, + bbox_num[i] means the i'th img have bbox_num[i] boxes. + ''' + anchors = self.anchor_generator(feats) + + pred_scores = [] + pred_boxes = [] + + for feat in feats: + pred_scores.append(self.cls_score(self.cls_net(feat))) + pred_boxes.append(self.bbox_pred(self.reg_net(feat))) + + pred_scores_list = [ + transpose_to_bs_hwa_k(s, self.num_classes) for s in pred_scores + ] + pred_boxes_list = [ + transpose_to_bs_hwa_k(s, 4) for s in pred_boxes + ] + + if self.training: + anchors = paddle.concat(anchors) + loss_dict = self.get_loss(anchors, [pred_scores_list, pred_boxes_list], inputs) + + return loss_dict + + else: + img_whwh = paddle.concat([inputs["imgs_shape"][:, 1:2], + inputs["imgs_shape"][:, 0:1]], axis=-1) + pred_result, bbox_num = self.postprocess( + pred_scores_list, + pred_boxes_list, + anchors, + inputs["scale_factor_wh"], + img_whwh + ) + + return pred_result, bbox_num + + +def transpose_to_bs_hwa_k(tensor, k): + assert tensor.dim() == 4 + bs, _, h, w = tensor.shape + tensor = tensor.reshape([bs, -1, k, h, w]) + tensor = tensor.transpose([0, 3, 4, 1, 2]) + + return tensor.reshape([bs, -1, k]) diff --git a/object_detection/Swin/det_heads/retinanet_head/retinanet_loss.py b/object_detection/Swin/det_heads/retinanet_head/retinanet_loss.py new file mode 100644 index 00000000..53cf722b --- /dev/null +++ b/object_detection/Swin/det_heads/retinanet_head/retinanet_loss.py @@ -0,0 +1,142 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
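+
+"""RetinaNet losses: sigmoid focal loss for classification and (smooth) L1 loss for box regression."""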
+ + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + +import sys +sys.path.append("PPViT-od_head/object_detection/head") +from det_utils.box_utils import bbox2delta, boxes_iou +from det_utils.target_assign import anchor_target_matcher + +class RetinaNetLoss(nn.Layer): + def __init__( + self, + focal_loss_alpha, + focal_loss_gamma, + smoothl1_loss_delta, + positive_thresh, + negative_thresh, + allow_low_quality=True, + num_classes=80, + weights=[1.0, 1.0, 1.0, 1.0] + ): + super(RetinaNetLoss, self).__init__() + + self.num_classes = num_classes + self.focal_loss_alpha = focal_loss_alpha + self.focal_loss_gamma = focal_loss_gamma + self.smoothl1_loss_delta = smoothl1_loss_delta + self.positive_thresh = positive_thresh + self.negative_thresh = negative_thresh + self.allow_low_quality = allow_low_quality + self.weights = weights + + self.loss_normalizer = 100 + self.loss_normalizer_momentum = 0.9 + + def label_anchors(self, anchors, gt): + batch_gt_box = gt["gt_boxes"] + batch_gt_class = gt["gt_classes"] + + gt_labels_list = [] + gt_boxes_list = [] + + for i in range(len(batch_gt_box)): + gt_boxes = batch_gt_box[i] + gt_classes = batch_gt_class[i].flatten() + + match_quality_matrix, _ = boxes_iou(gt_boxes, anchors) + matches_idxs, match_labels = anchor_target_matcher( + match_quality_matrix, + self.positive_thresh, + self.negative_thresh, + self.allow_low_quality, + low_thresh = -float("inf") + ) + + if len(gt_boxes) > 0: + matched_boxes_i = paddle.gather(gt_boxes, matches_idxs) + matched_classes_i = paddle.gather(gt_classes, matches_idxs) + matched_classes_i = paddle.where(match_labels == 0, + paddle.full_like(matched_classes_i, self.num_classes), + matched_classes_i) + matched_classes_i = paddle.where(match_labels == -1, + paddle.full_like(matched_classes_i, -1), + matched_classes_i) + else: + matched_boxes_i = paddle.zeros_like(anchors) + matched_classes_i = paddle.zeros_like(matches_idxs) + self.num_classes + + gt_boxes_list.append(matched_boxes_i) + gt_labels_list.append(matched_classes_i) + + return gt_boxes_list, gt_labels_list + + def forward(self, anchors, preds, inputs): + + pred_scores_list, pred_boxes_list = preds + + p_s = paddle.concat(pred_scores_list, axis=1) + p_b = paddle.concat(pred_boxes_list, axis=1) # [N, R, 4] + + gt_boxes, gt_classes = self.label_anchors(anchors, inputs) + gt_labels = paddle.stack(gt_classes).reshape([-1]) # [N * R] + + valid_idx = paddle.nonzero(gt_labels >= 0) + pos_mask = paddle.logical_and(gt_labels >= 0, gt_labels != self.num_classes) + pos_idx = paddle.nonzero(pos_mask).flatten() + num_pos = pos_idx.shape[0] + + self.loss_normalizer = self.loss_normalizer_momentum * self.loss_normalizer + ( + 1 - self.loss_normalizer_momentum + ) * max(num_pos, 1) + + p_s = paddle.reshape(p_s, [-1, self.num_classes]) + pred_logits = paddle.gather(p_s, valid_idx) + + gt_labels = F.one_hot(paddle.gather(gt_labels, valid_idx), num_classes=self.num_classes + 1)[ + :, :-1 + ] + + gt_labels.stop_gradient = True + + cls_loss = F.sigmoid_focal_loss(pred_logits, + gt_labels, + alpha=self.focal_loss_alpha, + gamma=self.focal_loss_gamma, + reduction='sum') + + gt_deltas_list = [ + bbox2delta(anchors, gt_boxes[i], self.weights) for i in range(len(gt_boxes)) + ] + + gt_deltas = paddle.concat(gt_deltas_list) + gt_deltas = paddle.gather(gt_deltas, pos_idx) + gt_deltas.stop_gradient = True + + p_b = paddle.reshape(p_b, [-1, 4]) + pred_deltas = paddle.gather(p_b, pos_idx) + + if self.smoothl1_loss_delta > 0: + reg_loss = F.smooth_l1_loss(pred_deltas, 
gt_deltas, reduction="sum", delta=self.smoothl1_loss_delta) + else: + reg_loss = F.l1_loss(pred_deltas, gt_deltas, reduction="sum") + + return { + "cls_loss": cls_loss / self.loss_normalizer, + "reg_loss": reg_loss / self.loss_normalizer + } diff --git a/object_detection/Swin/det_necks/__init__.py b/object_detection/Swin/det_necks/__init__.py new file mode 100644 index 00000000..e0a8f9c1 --- /dev/null +++ b/object_detection/Swin/det_necks/__init__.py @@ -0,0 +1 @@ +from . import fpn diff --git a/object_detection/Swin/det_necks/fpn.py b/object_detection/Swin/det_necks/fpn.py new file mode 100644 index 00000000..1cafeb77 --- /dev/null +++ b/object_detection/Swin/det_necks/fpn.py @@ -0,0 +1,208 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +"""FPN Lyaer for object detection""" +import math +import paddle +import paddle.nn as nn +from paddle.nn.initializer import XavierUniform +import paddle.nn.functional as F + + +class ConvNorm(nn.Layer): + """ Conv + BatchNorm (optional) layers + Args: + in_channels: int, num of input channels + out_channels: int, num of output channels + kernel_size: int, conv kernel size + stride: int, stride in conv layer, default: 1 + padding: int, padding in conv layer, default: 0 + dilation: int, dilation in conv layer, default: 1 + groups: int, groups in conv layer, default: 1 + padding_mode: str, padding mode, default: 'zeros' + weight_attr: ParamAttr, paddle param setting for weight, default: None + bias_attr: ParamAttr, paddle param setting for bias, default: None + norm: string, type of norm layer, default: bn + """ + def __init__(self, + in_channels, + out_channels, + kernel_size, + stride=1, + padding=0, + dilation=1, + groups=1, + padding_mode='zeros', + weight_attr=None, + bias_attr=None, + norm="bn", + use_bias=False): + super(ConvNorm, self).__init__() + + if norm is None: + use_bias = None + + self.conv = nn.Conv2D( + in_channels=in_channels, + out_channels=out_channels, + kernel_size=kernel_size, + stride=stride, + padding=padding, + dilation=dilation, + groups=groups, + padding_mode=padding_mode, + weight_attr=weight_attr, + bias_attr=use_bias + ) + + if norm == "bn": + self.norm = nn.BatchNorm2D(out_channels) + else: + self.norm = None + + def forward(self, x): + out = self.conv(x) + + if self.norm is not None: + out = self.norm(out) + + return out + + +class FPN(nn.Layer): + """Feature Pyramid Network (FPN) Layer + Args: + in_channels: list of int, num of input channels for each output layer + out_channels: list of int, num of output channels for each output layer + stride: list, spatial strides between each feature layer to the original image size + fuse_type: str, how to fuse current and prev feature in FPN, avg or sum, default: sum + use_c5: bool, if True, use C5 as the input of extra stage, default: True + top_block: nn.Layer, if use a downsample after output (see LastLevelMaxPool), default: None + norm: str, type of norm layer, default: None + """ + def __init__(self, + in_channels, + 
out_channel, + strides, + fuse_type="sum", + use_c5=True, + top_block=None, + norm=None): + super(FPN, self).__init__() + assert len(strides) == len(in_channels) + + self.fuse_type = fuse_type + self.top_block = top_block + self.use_c5 = use_c5 + + lateral_convs = [] + output_convs = [] + + name_idx = [int(math.log2(s)) for s in strides] + + for idx, in_channel in enumerate(in_channels): + # 1x1 conv + lateral_conv = ConvNorm( + in_channels=in_channel, + out_channels=out_channel, + kernel_size=1, + weight_attr=paddle.ParamAttr(initializer=XavierUniform(fan_out=in_channel)), + norm=norm) + # 3x3 conv after upsampling + output_conv = ConvNorm( + in_channels=out_channel, + out_channels=out_channel, + kernel_size=3, + padding=1, + weight_attr=paddle.ParamAttr(initializer=XavierUniform(fan_out=9*out_channel)), + norm=norm) + + self.add_sublayer("fpn_lateral{}".format(name_idx[idx]), lateral_conv) + self.add_sublayer("fpn_output{}".format(name_idx[idx]), output_conv) + + lateral_convs.append(lateral_conv) + output_convs.append(output_conv) + + self.lateral_convs = lateral_convs[::-1] # Now from small feature map to large feature map + self.output_convs = output_convs[::-1] + + def forward(self, feats): + res = [] + lateral_out = self.lateral_convs[0](feats[-1]) # feats is from large to small feature map + res.append(self.output_convs[0](lateral_out)) + + for idx, (lateral_conv, output_conv) in enumerate( + zip(self.lateral_convs, self.output_convs)): + if idx > 0: # not include lateral_convs[0] + top2down_feat = F.interpolate(lateral_out, scale_factor=2.0, mode="nearest") + prev_out = lateral_conv(feats[-1-idx]) + lateral_out = prev_out + top2down_feat # fuse == 'sum' + if self.fuse_type == "avg": + lateral_out /= 2 + res.insert(0, output_conv(lateral_out)) + + if self.top_block is not None: + if self.use_c5: + top_block_out = self.top_block(feats[-1]) + else: + top_block_out = self.top_block(res[-1]) + + res.extend(top_block_out) + + return res + + +class LastLevelMaxPool(nn.Layer): + """ + This module is used in the original FPN to generate a downsampled + P6 feature from P5. + """ + + def __init__(self): + super().__init__() + + def forward(self, x): + return [F.max_pool2d(x, kernel_size=1, stride=2)] + + +class TopFeatP6P7(nn.Layer): + """ + This module is used in RetinaNet to generate extra layers, P6 and P7 from + C5 feature. + """ + def __init__(self, in_channel, out_channel): + + self.p6 = nn.Conv2D( + in_channels=in_channel, + out_channels=out_channel, + kernel_size=3, + stride=2, + padding=1, + weight_attr=paddle.ParamAttr(initializer=XavierUniform(fan_out=9*in_channel)) + ) + self.p7 = nn.Conv2D( + in_channels=in_channel, + out_channels=out_channel, + kernel_size=3, + stride=2, + padding=1, + weight_attr=paddle.ParamAttr(initializer=XavierUniform(fan_out=9*out_channel)) + ) + + def forward(self, feat): + p6 = self.p6(feat) + p7 = self.p7(F.relu(p6)) + + return [p6, p7] diff --git a/object_detection/Swin/main_multi_gpu.py b/object_detection/Swin/main_multi_gpu.py new file mode 100644 index 00000000..e251ed58 --- /dev/null +++ b/object_detection/Swin/main_multi_gpu.py @@ -0,0 +1,420 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Swin Det training/validation using multiple GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +import paddle.distributed as dist +from coco import build_coco +from coco import get_dataloader +from coco_eval import CocoEvaluator +from swin_det import build_swin_det +from utils import AverageMeter +from utils import WarmupCosineScheduler +from config import get_config +from config import update_config + + +parser = argparse.ArgumentParser('Swin-Det') +parser.add_argument('-cfg', type=str, default=None) +parser.add_argument('-dataset', type=str, default=None) +parser.add_argument('-batch_size', type=int, default=None) +parser.add_argument('-data_path', type=str, default=None) +parser.add_argument('-backbone', type=str, default=None) +parser.add_argument('-ngpus', type=int, default=None) +parser.add_argument('-pretrained', type=str, default=None) +parser.add_argument('-resume', type=str, default=None) +parser.add_argument('-last_epoch', type=int, default=None) +parser.add_argument('-eval', action='store_true') +arguments = parser.parse_args() + +log_format = "%(asctime)s %(message)s" +logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + +# get default config +config = get_config() +# update config by arguments +config = update_config(config, arguments) + +# set output folder +if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) +else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + +#config.freeze() + +if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + +# set logging format +logger = logging.getLogger() +fh = logging.FileHandler(os.path.join(config.SAVE, 'log.txt')) +fh.setFormatter(logging.Formatter(log_format)) +logger.addHandler(fh) +logger.info(f'config= {config}') + + +def train(dataloader, + model, + base_ds, + optimizer, + epoch, + total_batch, + debug_steps=100, + accum_iter=1): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, det model + base_ds: coco api instance + optimizer: optimizer + epoch: int, current epoch + total_epoch: int, total num of epoch, for logging + debug_steps: int, num of iters to log info + accum_iter: int, num of iters for accumulating gradients + Returns: + train_loss_cls_meter.avg + train_loss_reg_meter.avg + train_loss_rpn_cls_meter.avg + train_loss_rpn_reg_meter.avg + train_time + """ + model.train() + train_loss_cls_meter = AverageMeter() + train_loss_reg_meter = AverageMeter() + train_loss_rpn_cls_meter = AverageMeter() + train_loss_rpn_reg_meter = AverageMeter() + + time_st = time.time() + + #iou_types = ('bbox', ) + #coco_evaluator = CocoEvaluator(base_ds, iou_types) + + for batch_id, data in enumerate(dataloader): + samples = data[0] + targets = data[1] + + loss_dict = model(samples, targets) + losses = sum(loss for loss in loss_dict.values()) + losses.backward() + + if 
((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + # logging losses + batch_size = samples.tensors.shape[0] + train_loss_cls_meter.update(loss_dict['loss_cls'].numpy()[0], batch_size) + train_loss_reg_meter.update(loss_dict['loss_reg'].numpy()[0], batch_size) + train_loss_rpn_cls_meter.update(loss_dict['loss_rpn_cls'].numpy()[0], batch_size) + train_loss_rpn_reg_meter.update(loss_dict['loss_rpn_reg'].numpy()[0], batch_size) + + if batch_id > 0 and batch_id % debug_steps == 0: + logger.info( + f"Train Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg loss_cls: {train_loss_cls_meter.avg:.4f}, " + + f"Avg loss_reg: {train_loss_reg_meter.avg:.4f}, " + + f"Avg loss_rpn_cls: {train_loss_rpn_cls_meter.avg:.4f}, " + + f"Avg loss_rpn_reg: {train_loss_rpn_reg_meter.avg:.4f}") + + train_time = time.time() - time_st + return (train_loss_cls_meter.avg, + train_loss_reg_meter.avg, + train_loss_rpn_cls_meter.avg, + train_loss_rpn_reg_meter.avg, + train_time) + + +def validate(dataloader, model, base_ds, total_batch, debug_steps=100): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: criterion + postprocessors: postprocessor for generating bboxes + base_ds: COCO instance + total_epoch: int, total num of epoch, for logging + debug_steps: int, num of iters to log info + Returns: + val_loss_meter.avg + val_acc_meter.avg + val_time + """ + model.eval() + time_st = time.time() + + iou_types = ('bbox', ) + coco_evaluator = CocoEvaluator(base_ds, iou_types) + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + samples = data[0] + targets = data[1] + + prediction = model(samples, targets) + + if batch_id > 0 and batch_id % debug_steps == 0: + logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], done") + + #res = {target_id: output for target_id, output in zip(targets['image_id'], prediction)} + res = {} + for target_id, output in zip(targets['image_id'], prediction): + target_id = target_id.cpu().numpy()[0] + output = output.cpu().numpy() + if output.shape[0] != 0: + pred_dict = {'boxes': output[:, 2::], + 'scores': output[:, 1], + 'labels': output[:, 0]} + res[int(target_id)] = pred_dict + else: + res[int(target_id)] = {} + + if coco_evaluator is not None: + coco_evaluator.update(res) + + if coco_evaluator is not None: + coco_evaluator.synchronize_between_processes() + coco_evaluator.accumulate() + stats_dict = coco_evaluator.summarize() + # for det only + all_eval_result = stats_dict['bbox'] + + val_time = time.time() - time_st + return val_time, all_eval_result + + +def main_worker(*args): + # 0. Preparation + dist.init_parallel_env() + last_epoch = config.TRAIN.LAST_EPOCH + world_size = paddle.distributed.get_world_size() + local_rank = paddle.distributed.get_rank() + logger.info(f'----- world_size = {world_size}, local_rank = {local_rank}') + seed = config.SEED + local_rank + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + # 1. Create model + model = build_swin_det(config) + model = paddle.DataParallel(model) + # 2. 
Create train and val dataloader + dataset_train, dataset_val = args[0], args[1] + total_batch_train = 0 + if not config.EVAL: + dataloader_train = get_dataloader(dataset_train, + batch_size=config.DATA.BATCH_SIZE, + mode='train', + multi_gpu=True) + total_batch_train = len(dataloader_train) + + dataloader_val = get_dataloader(dataset_val, + batch_size=config.DATA.BATCH_SIZE_EVAL, + mode='val', + multi_gpu=True) + total_batch_val = len(dataloader_val) + base_ds = dataset_val.coco # pycocotools.coco.COCO(anno_file) + + logging.info(f'----- Total # of train batch (single gpu): {total_batch_train}') + logging.info(f'----- Total # of val batch (single gpu): {total_batch_val}') + # 4. Define optimizer and lr_scheduler + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH, + ) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestones, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + logging.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip, + #apply_decay_param_fun=get_exclude_from_weight_decay_fn(['pos_embed', 'cls_token']), + ) + else: + logging.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # 5. 
Load pretrained model / load resumt model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + + # if from classification weights, add prefix 'backbone' and set state dict + if sum(['backbone' in key for key in model_state.keys()]) == 0: + logger.info(f"----- Pretrained: Load backbone from {config.MODEL.PRETRAINED}") + new_model_state = dict() + for key, val in model_state.items(): + new_model_state['backbone.' + key] = val + model.set_state_dict(new_model_state) + else: + logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + model.set_state_dict(model_state) + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME+'.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME+'.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME+'.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') + optimizer.set_state_dict(opt_state) + logger.info( + f"----- Resume Training: Load model and optmizer states from {config.MODEL.RESUME}") + + # 6. Validation + if config.EVAL: + logger.info('----- Start Validating') + val_time, all_eval_result = validate( + dataloader=dataloader_val, + model=model, + base_ds=base_ds, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ) + + logger.info('IoU metric: bbox') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50:0.95":<9} | area={"all":>6s} | maxDets={100:>3d} ] = {all_eval_result[0]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50":<9} | area={"all":>6s} | maxDets={100:>3d} ] = {all_eval_result[1]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.75":<9} | area={"all":>6s} | maxDets={100:>3d} ] = {all_eval_result[2]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50:0.95":<9} | area={" small":>6s} | maxDets={100:>3d} ] = {all_eval_result[3]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50:0.95":<9} | area={"medium":>6s} | maxDets={100:>3d} ] = {all_eval_result[4]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50:0.95":<9} | area={" large":>6s} | maxDets={100:>3d} ] = {all_eval_result[5]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"all":>6s} | maxDets={1:>3d} ] = {all_eval_result[6]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"all":>6s} | maxDets={10:>3d} ] = {all_eval_result[7]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"all":>6s} | maxDets={100:>3d} ] = {all_eval_result[8]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"small":>6s} | maxDets={100:>3d} ] = {all_eval_result[9]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"medium":>6s} | maxDets={100:>3d} ] = {all_eval_result[10]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"large":>6s} | maxDets={100:>3d} ] = {all_eval_result[11]:0.3f}') + logger.info(f"Val time: {val_time:.2f}") + return + + # 6. Start training and validation + logging.info(f"Start training from epoch {last_epoch+1}.") + for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + logging.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") + train_loss_cls, train_loss_reg, train_loss_rpn_cls, train_loss_rpn_reg, train_time = train( + dataloader=dataloader_train, + model=model, + base_ds=base_ds, + optimizer=optimizer, + epoch=epoch, + total_batch=len(dataloader_train), + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER) + scheduler.step() + + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss cls: {train_loss_cls:.4f}, " + + f"Train Loss reg: {train_loss_reg:.4f}, " + + f"Train Loss rpn cls: {train_loss_rpn_cls:.4f}, " + + f"Train Loss rpn reg: {train_loss_rpn_reg:.4f}, " + + f"time: {train_time:.2f}") + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + logger.info(f'----- Validation after Epoch: {epoch}') + val_time, all_eval_result = validate( + dataloader=dataloader_val, + model=model, + base_ds=base_ds, + total_batch=total_batch_val, + debug_steps=config.REPORT_FREQ) + + logger.info('IoU metric: bbox') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50:0.95":<9} | area={"all":>6s} | maxDets={100:>3d} ] = {all_eval_result[0]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50":<9} | area={"all":>6s} | maxDets={100:>3d} ] = {all_eval_result[1]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.75":<9} | area={"all":>6s} | maxDets={100:>3d} ] = {all_eval_result[2]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50:0.95":<9} | area={" small":>6s} | maxDets={100:>3d} ] = {all_eval_result[3]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50:0.95":<9} | area={"medium":>6s} | maxDets={100:>3d} ] = {all_eval_result[4]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50:0.95":<9} | area={" large":>6s} | maxDets={100:>3d} ] = {all_eval_result[5]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"all":>6s} | maxDets={1:>3d} ] = {all_eval_result[6]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"all":>6s} | maxDets={10:>3d} ] = {all_eval_result[7]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"all":>6s} | maxDets={100:>3d} ] = {all_eval_result[8]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"small":>6s} | maxDets={100:>3d} ] = {all_eval_result[9]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"medium":>6s} | maxDets={100:>3d} ] = {all_eval_result[10]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"large":>6s} | maxDets={100:>3d} ] = {all_eval_result[11]:0.3f}') + logger.info(f"Val time: {val_time:.2f}") + + # model save + if local_rank == 0: + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join( + config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + logger.info(f"----- Save model: {model_path}.pdparams") + logger.info(f"----- Save optim: {model_path}.pdopt") + + +def main(): + if not config.EVAL: + dataset_train = build_coco('train', config.DATA.DATA_PATH) + else: + dataset_train = None + dataset_val = build_coco('val', config.DATA.DATA_PATH) + config.NGPUS = len(paddle.static.cuda_places()) if config.NGPUS == -1 else config.NGPUS + dist.spawn(main_worker, args=(dataset_train, dataset_val, ), nprocs=config.NGPUS) 
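Before the `__main__` guard below, it may help to see the multi-process launch pattern this script relies on in isolation. The following is a minimal editorial sketch (not part of the patch), assuming Paddle 2.x; `run` and `tag` are illustrative stand-ins for `main_worker` and its arguments.

```python
# Editorial sketch: the dist.spawn launch pattern used by main_multi_gpu.py,
# reduced to its core. Assumes Paddle 2.x; `run`/`tag` are placeholder names.
import paddle
import paddle.distributed as dist


def run(tag):
    dist.init_parallel_env()                  # set up the NCCL communicator for this worker
    rank = dist.get_rank()                    # one process per GPU
    paddle.seed(1000 + rank)                  # per-rank seed, mirroring config.SEED + local_rank above
    model = paddle.DataParallel(paddle.nn.Linear(8, 8))  # gradients are synchronized on backward
    print(f"{tag}: rank {rank} ready, params on {paddle.get_device()}")


if __name__ == "__main__":
    # spawn one worker per visible GPU, mirroring dist.spawn(main_worker, ...) above
    dist.spawn(run, args=("demo",), nprocs=len(paddle.static.cuda_places()))
```

Each spawned worker runs in its own process on its own GPU; seeding with the rank offset, as `main_worker` does, keeps the augmentation streams decorrelated across ranks while staying reproducible.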
+ + +if __name__ == "__main__": + main() diff --git a/object_detection/Swin/main_single_gpu.py b/object_detection/Swin/main_single_gpu.py new file mode 100644 index 00000000..4cecc654 --- /dev/null +++ b/object_detection/Swin/main_single_gpu.py @@ -0,0 +1,399 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Swin Det training/validation using single GPU """ + +import sys +import os +import time +import logging +import argparse +import random +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +import paddle.distributed as dist +from coco import build_coco +from coco import get_dataloader +from coco_eval import CocoEvaluator +from swin_det import build_swin_det +from utils import AverageMeter +from utils import WarmupCosineScheduler +from config import get_config +from config import update_config + + +parser = argparse.ArgumentParser('Swin-Det') +parser.add_argument('-cfg', type=str, default=None) +parser.add_argument('-dataset', type=str, default=None) +parser.add_argument('-batch_size', type=int, default=None) +parser.add_argument('-data_path', type=str, default=None) +parser.add_argument('-backbone', type=str, default=None) +parser.add_argument('-ngpus', type=int, default=None) +parser.add_argument('-pretrained', type=str, default=None) +parser.add_argument('-resume', type=str, default=None) +parser.add_argument('-last_epoch', type=int, default=None) +parser.add_argument('-eval', action='store_true') +arguments = parser.parse_args() + +log_format = "%(asctime)s %(message)s" +logging.basicConfig(stream=sys.stdout, level=logging.INFO, + format=log_format, datefmt="%m%d %I:%M:%S %p") + +# get default config +config = get_config() +# update config by arguments +config = update_config(config, arguments) + +# set output folder +if not config.EVAL: + config.SAVE = '{}/train-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) +else: + config.SAVE = '{}/eval-{}'.format(config.SAVE, time.strftime('%Y%m%d-%H-%M-%S')) + +#config.freeze() + +if not os.path.exists(config.SAVE): + os.makedirs(config.SAVE, exist_ok=True) + +# set logging format +logger = logging.getLogger() +fh = logging.FileHandler(os.path.join(config.SAVE, 'log.txt')) +fh.setFormatter(logging.Formatter(log_format)) +logger.addHandler(fh) +logger.info(f'config= {config}') + + +def train(dataloader, + model, + base_ds, + optimizer, + epoch, + total_batch, + debug_steps=100, + accum_iter=1): + """Training for one epoch + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, det model + base_ds: coco api instance + optimizer: optimizer + epoch: int, current epoch + total_epoch: int, total num of epoch, for logging + debug_steps: int, num of iters to log info + accum_iter: int, num of iters for accumulating gradients + Returns: + train_loss_cls_meter.avg + train_loss_reg_meter.avg + train_loss_rpn_cls_meter.avg + train_loss_rpn_reg_meter.avg + train_time + """ + model.train() + train_loss_cls_meter = 
AverageMeter() + train_loss_reg_meter = AverageMeter() + train_loss_rpn_cls_meter = AverageMeter() + train_loss_rpn_reg_meter = AverageMeter() + + time_st = time.time() + + #iou_types = ('bbox', ) + #coco_evaluator = CocoEvaluator(base_ds, iou_types) + + for batch_id, data in enumerate(dataloader): + samples = data[0] + targets = data[1] + + loss_dict = model(samples, targets) + losses = sum(loss for loss in loss_dict.values()) + losses.backward() + + if ((batch_id +1) % accum_iter == 0) or (batch_id + 1 == len(dataloader)): + optimizer.step() + optimizer.clear_grad() + + # logging losses + batch_size = samples.tensors.shape[0] + train_loss_cls_meter.update(loss_dict['loss_cls'].numpy()[0], batch_size) + train_loss_reg_meter.update(loss_dict['loss_reg'].numpy()[0], batch_size) + train_loss_rpn_cls_meter.update(loss_dict['loss_rpn_cls'].numpy()[0], batch_size) + train_loss_rpn_reg_meter.update(loss_dict['loss_rpn_reg'].numpy()[0], batch_size) + + if batch_id > 0 and batch_id % debug_steps == 0: + logger.info( + f"Train Step[{batch_id:04d}/{total_batch:04d}], " + + f"Avg loss_cls: {train_loss_cls_meter.avg:.4f}, " + + f"Avg loss_reg: {train_loss_reg_meter.avg:.4f}, " + + f"Avg loss_rpn_cls: {train_loss_rpn_cls_meter.avg:.4f}, " + + f"Avg loss_rpn_reg: {train_loss_rpn_reg_meter.avg:.4f}") + + train_time = time.time() - time_st + return (train_loss_cls_meter.avg, + train_loss_reg_meter.avg, + train_loss_rpn_cls_meter.avg, + train_loss_rpn_reg_meter.avg, + train_time) + + +def validate(dataloader, model, base_ds, total_batch, debug_steps=100): + """Validation for whole dataset + Args: + dataloader: paddle.io.DataLoader, dataloader instance + model: nn.Layer, a ViT model + criterion: criterion + postprocessors: postprocessor for generating bboxes + base_ds: COCO instance + total_epoch: int, total num of epoch, for logging + debug_steps: int, num of iters to log info + Returns: + val_loss_meter.avg + val_acc_meter.avg + val_time + """ + model.eval() + time_st = time.time() + + iou_types = ('bbox', ) + coco_evaluator = CocoEvaluator(base_ds, iou_types) + + with paddle.no_grad(): + for batch_id, data in enumerate(dataloader): + samples = data[0] + targets = data[1] + + prediction = model(samples, targets) + + if batch_id > 0 and batch_id % debug_steps == 0: + logger.info( + f"Val Step[{batch_id:04d}/{total_batch:04d}], done") + + #res = {target_id: output for target_id, output in zip(targets['image_id'], prediction)} + res = {} + for target_id, output in zip(targets['image_id'], prediction): + target_id = target_id.cpu().numpy()[0] + output = output.cpu().numpy() + if output.shape[0] != 0: + pred_dict = {'boxes': output[:, 2::], + 'scores': output[:, 1], + 'labels': output[:, 0]} + res[int(target_id)] = pred_dict + else: + res[int(target_id)] = {} + + if coco_evaluator is not None: + coco_evaluator.update(res) + + if coco_evaluator is not None: + coco_evaluator.synchronize_between_processes() + coco_evaluator.accumulate() + stats_dict = coco_evaluator.summarize() + # for det only + all_eval_result = stats_dict['bbox'] + + val_time = time.time() - time_st + return val_time, all_eval_result + + +def main(): + # 0. Preparation + last_epoch = config.TRAIN.LAST_EPOCH + seed = config.SEED + paddle.seed(seed) + np.random.seed(seed) + random.seed(seed) + # 1. Create model and criterion + model = build_swin_det(config) + # 2. 
Create train and val dataloader + if not config.EVAL: + dataset_train = build_coco('train', config.DATA.DATA_PATH) + dataloader_train = get_dataloader(dataset_train, + batch_size=config.DATA.BATCH_SIZE, + mode='train', + multi_gpu=False) + + dataset_val = build_coco('val', config.DATA.DATA_PATH) + dataloader_val = get_dataloader(dataset_val, + batch_size=config.DATA.BATCH_SIZE_EVAL, + mode='val', + multi_gpu=False) + + base_ds = dataset_val.coco # pycocotools.coco.COCO(anno_file) + # 3. Define lr_scheduler + scheduler = None + if config.TRAIN.LR_SCHEDULER.NAME == "warmupcosine": + scheduler = WarmupCosineScheduler(learning_rate=config.TRAIN.BASE_LR, + warmup_start_lr=config.TRAIN.WARMUP_START_LR, + start_lr=config.TRAIN.BASE_LR, + end_lr=config.TRAIN.END_LR, + warmup_epochs=config.TRAIN.WARMUP_EPOCHS, + total_epochs=config.TRAIN.NUM_EPOCHS, + last_epoch=config.TRAIN.LAST_EPOCH, + ) + elif config.TRAIN.LR_SCHEDULER.NAME == "cosine": + scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=config.TRAIN.BASE_LR, + T_max=config.TRAIN.NUM_EPOCHS, + last_epoch=last_epoch) + elif config.scheduler == "multi-step": + milestones = [int(v.strip()) for v in config.TRAIN.LR_SCHEDULER.MILESTONES.split(",")] + scheduler = paddle.optimizer.lr.MultiStepDecay(learning_rate=config.TRAIN.BASE_LR, + milestones=milestons, + gamma=config.TRAIN.LR_SCHEDULER.DECAY_RATE, + last_epoch=last_epoch) + else: + logging.fatal(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + raise NotImplementedError(f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.") + + # 5. Define optimizer + if config.TRAIN.OPTIMIZER.NAME == "SGD": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.Momentum( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + weight_decay=config.TRAIN.WEIGHT_DECAY, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + grad_clip=clip) + elif config.TRAIN.OPTIMIZER.NAME == "AdamW": + if config.TRAIN.GRAD_CLIP: + clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP) + else: + clip = None + optimizer = paddle.optimizer.AdamW( + parameters=model.parameters(), + learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + weight_decay=config.TRAIN.WEIGHT_DECAY, + epsilon=config.TRAIN.OPTIMIZER.EPS, + grad_clip=clip, + #apply_decay_param_fun=get_exclude_from_weight_decay_fn(['pos_embed', 'cls_token']), + ) + else: + logging.fatal(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + raise NotImplementedError(f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.") + + # 6. Load pretrained model or load resume model and optimizer states + if config.MODEL.PRETRAINED: + if (config.MODEL.PRETRAINED).endswith('.pdparams'): + raise ValueError(f'{config.MODEL.PRETRAINED} should not contain .pdparams') + assert os.path.isfile(config.MODEL.PRETRAINED + '.pdparams') is True + model_state = paddle.load(config.MODEL.PRETRAINED+'.pdparams') + + # if from classification weights, add prefix 'backbone' and set state dict + if sum(['backbone' in key for key in model_state.keys()]) == 0: + logger.info(f"----- Pretrained: Load backbone from {config.MODEL.PRETRAINED}") + new_model_state = dict() + for key, val in model_state.items(): + new_model_state['backbone.' 
+ key] = val + model.set_state_dict(new_model_state) + else: + logger.info(f"----- Pretrained: Load model state from {config.MODEL.PRETRAINED}") + model.set_state_dict(model_state) + + if config.MODEL.RESUME: + assert os.path.isfile(config.MODEL.RESUME+'.pdparams') is True + assert os.path.isfile(config.MODEL.RESUME+'.pdopt') is True + model_state = paddle.load(config.MODEL.RESUME+'.pdparams') + model.set_dict(model_state) + opt_state = paddle.load(config.MODEL.RESUME+'.pdopt') + optimizer.set_state_dict(opt_state) + logger.info( + f"----- Resume Training: Load model and optmizer states from {config.MODEL.RESUME}") + + # 6. Validation + if config.EVAL: + logger.info('----- Start Validating') + val_time, all_eval_result = validate( + dataloader=dataloader_val, + model=model, + base_ds=base_ds, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ) + + logger.info('IoU metric: bbox') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50:0.95":<9} | area={"all":>6s} | maxDets={100:>3d} ] = {all_eval_result[0]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50":<9} | area={"all":>6s} | maxDets={100:>3d} ] = {all_eval_result[1]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.75":<9} | area={"all":>6s} | maxDets={100:>3d} ] = {all_eval_result[2]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50:0.95":<9} | area={" small":>6s} | maxDets={100:>3d} ] = {all_eval_result[3]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50:0.95":<9} | area={"medium":>6s} | maxDets={100:>3d} ] = {all_eval_result[4]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50:0.95":<9} | area={" large":>6s} | maxDets={100:>3d} ] = {all_eval_result[5]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"all":>6s} | maxDets={1:>3d} ] = {all_eval_result[6]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"all":>6s} | maxDets={10:>3d} ] = {all_eval_result[7]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"all":>6s} | maxDets={100:>3d} ] = {all_eval_result[8]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"small":>6s} | maxDets={100:>3d} ] = {all_eval_result[9]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"medium":>6s} | maxDets={100:>3d} ] = {all_eval_result[10]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"large":>6s} | maxDets={100:>3d} ] = {all_eval_result[11]:0.3f}') + logger.info(f"Val time: {val_time:.2f}") + return + + # 8. Start training and validation + logging.info(f"Start training from epoch {last_epoch+1}.") + for epoch in range(last_epoch+1, config.TRAIN.NUM_EPOCHS+1): + # train + logging.info(f"Now training epoch {epoch}. 
LR={optimizer.get_lr():.6f}") + train_loss_cls, train_loss_reg, train_loss_rpn_cls, train_loss_rpn_reg, train_time = train( + dataloader=dataloader_train, + model=model, + base_ds=base_ds, + optimizer=optimizer, + epoch=epoch, + total_batch=len(dataloader_train), + debug_steps=config.REPORT_FREQ, + accum_iter=config.TRAIN.ACCUM_ITER) + scheduler.step() + logger.info(f"----- Epoch[{epoch:03d}/{config.TRAIN.NUM_EPOCHS:03d}], " + + f"Train Loss cls: {train_loss_cls:.4f}, " + + f"Train Loss reg: {train_loss_reg:.4f}, " + + f"Train Loss rpn cls: {train_loss_rpn_cls:.4f}, " + + f"Train Loss rpn reg: {train_loss_rpn_reg:.4f}, " + + f"time: {train_time:.2f}") + # validation + if epoch % config.VALIDATE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + logger.info(f'----- Validation after Epoch: {epoch}') + val_time, all_eval_result = validate( + dataloader=dataloader_val, + model=model, + base_ds=base_ds, + total_batch=len(dataloader_val), + debug_steps=config.REPORT_FREQ) + + logger.info('IoU metric: bbox') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50:0.95":<9} | area={"all":>6s} | maxDets={100:>3d} ] = {all_eval_result[0]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50":<9} | area={"all":>6s} | maxDets={100:>3d} ] = {all_eval_result[1]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.75":<9} | area={"all":>6s} | maxDets={100:>3d} ] = {all_eval_result[2]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50:0.95":<9} | area={" small":>6s} | maxDets={100:>3d} ] = {all_eval_result[3]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50:0.95":<9} | area={"medium":>6s} | maxDets={100:>3d} ] = {all_eval_result[4]:0.3f}') + logger.info(f'{"Average Precision":<18} (AP) @[ IoU={"0.50:0.95":<9} | area={" large":>6s} | maxDets={100:>3d} ] = {all_eval_result[5]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"all":>6s} | maxDets={1:>3d} ] = {all_eval_result[6]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"all":>6s} | maxDets={10:>3d} ] = {all_eval_result[7]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"all":>6s} | maxDets={100:>3d} ] = {all_eval_result[8]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"small":>6s} | maxDets={100:>3d} ] = {all_eval_result[9]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"medium":>6s} | maxDets={100:>3d} ] = {all_eval_result[10]:0.3f}') + logger.info(f'{"Average Recall":<18} (AR) @[ IoU={"0.50:0.95":<9} | area={"large":>6s} | maxDets={100:>3d} ] = {all_eval_result[11]:0.3f}') + logger.info(f"Val time: {val_time:.2f}") + + # model save + if epoch % config.SAVE_FREQ == 0 or epoch == config.TRAIN.NUM_EPOCHS: + model_path = os.path.join(config.SAVE, f"{config.MODEL.TYPE}-Epoch-{epoch}-Loss-{train_loss}") + paddle.save(model.state_dict(), model_path + '.pdparams') + paddle.save(optimizer.state_dict(), model_path + '.pdopt') + logger.info(f"----- Save model: {model_path}.pdparams") + logger.info(f"----- Save optim: {model_path}.pdopt") + + +if __name__ == "__main__": + main() diff --git a/object_detection/Swin/model_utils.py b/object_detection/Swin/model_utils.py new file mode 100644 index 00000000..3ad98600 --- /dev/null +++ b/object_detection/Swin/model_utils.py @@ -0,0 +1,60 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Droppath, reimplement from https://github.com/yueatsprograms/Stochastic_Depth +""" +from itertools import repeat +import collections.abc +import numpy as np +import paddle +import paddle.nn as nn + +def _ntuple(n): + def parse(x): + if isinstance(x, collections.abc.Iterable): + return x + return tuple(repeat(x, n)) + return parse + + +class DropPath(nn.Layer): + """DropPath class""" + def __init__(self, drop_prob=None): + super(DropPath, self).__init__() + self.drop_prob = drop_prob + + def drop_path(self, inputs): + """drop path op + Args: + input: tensor with arbitrary shape + drop_prob: float number of drop path probability, default: 0.0 + training: bool, if current mode is training, default: False + Returns: + output: output tensor after drop path + """ + # if prob is 0 or eval mode, return original input + if self.drop_prob == 0. or not self.training: + return inputs + keep_prob = 1 - self.drop_prob + keep_prob = paddle.to_tensor(keep_prob, dtype='float32') + shape = (inputs.shape[0], ) + (1, ) * (inputs.ndim - 1) # shape=(N, 1, 1, 1) + random_tensor = keep_prob + paddle.rand(shape, dtype=inputs.dtype) + random_tensor = random_tensor.floor() # mask + output = inputs.divide(keep_prob) * random_tensor # divide is to keep same output expectation + return output + + def forward(self, inputs): + return self.drop_path(inputs) + diff --git a/object_detection/Swin/nohup.out b/object_detection/Swin/nohup.out new file mode 100644 index 00000000..54cf5bdf --- /dev/null +++ b/object_detection/Swin/nohup.out @@ -0,0 +1,859 @@ +merging config from ./configs/swin_t_maskrcnn.yaml +0906 06:42:32 PM config= AUG: + AUTO_AUGMENT: rand-m9-mstd0.5-inc1 + COLOR_JITTER: 0.4 + CUTMIX: 1.0 + CUTMIX_MINMAX: None + MIXUP: 0.8 + MIXUP_MODE: batch + MIXUP_PROB: 1.0 + MIXUP_SWITCH_PROB: 0.5 + RE_COUNT: 1 + RE_MODE: pixel + RE_PROB: 0.25 +BASE: [''] +DATA: + BATCH_SIZE: 8 + BATCH_SIZE_EVAL: 8 + CROP_PCT: 0.9 + DATASET: coco + DATA_PATH: /dataset/coco + IMAGE_SIZE: 640 + NUM_WORKERS: 2 + VAL_DATA_PATH: /dataset/coco/ + WEIGHT_PATH: ./weights/mask_rcnn_swin_small_patch4_window7.pdparams +EVAL: True +FPN: + IN_CHANNELS: [96, 192, 384, 768] + OUT_CHANNELS: 256 + STRIDES: [4, 8, 16, 32] + USE_C5: False +LOCAL_RANK: 0 +MODEL: + ATTENTION_DROPOUT: 0.0 + DROPOUT: 0.1 + DROP_PATH: 0.2 + NAME: Swin + NUM_CLASSES: 1000 + PRETRAINED: ./weights/mask_rcnn_swin_tiny_patch4_window7_1x + RESUME: None + TRANS: + APE: False + EMBED_DIM: 96 + FROZEN_STAGES: -1 + IN_CHANNELS: 3 + MLP_RATIO: 4.0 + NUM_HEADS: [3, 6, 12, 24] + OUT_INDICES: (0, 1, 2, 3) + PATCH_NORM: True + PATCH_SIZE: 4 + PRETRAIN_IMAGE_SIZE: 224 + QKV_BIAS: True + QK_SCALE: None + STAGE_DEPTHS: [2, 2, 6, 2] + WINDOW_SIZE: 7 + TYPE: Swin +NGPUS: -1 +REPORT_FREQ: 50 +ROI: + ALIGNED: True + ALIGN_OUTPUT_SIZE: 7 + BATCH_SIZE_PER_IMG: 512 + BOX_HEAD: + CONV_DIM: 256 + FC_DIM: 1024 + NUM_CONV: 0 + NUM_FC: 2 + REG_WEIGHTS: [10.0, 10.0, 5.0, 5.0] + CANONICAL_BOX_SIZE: 224 + CANONICAL_LEVEL: 4 + LOW_QUALITY_MATCHES: False + 
MAX_LEVEL: 3 + MIN_LEVEL: 0 + NEGATIVE_THRESH: 0.5 + NMS_KEEP_TOPK_INFER: 100 + NMS_THRESH_INFER: 0.5 + NUM_ClASSES: 80 + PAT_GT: False + POSITIVE_FRACTION: 0.25 + POSITIVE_THRESH: 0.5 + SAMPLING_RATIO: 0 + SCALES: [0.25, 0.125, 0.0625, 0.03125, 0.015625] + SCORE_THRESH_INFER: 0.05 +RPN: + ANCHOR_SIZE: [[32], [64], [128], [256], [512]] + ASPECT_RATIOS: [0.5, 1.0, 2.0] + BATCH_SIZE_PER_IMG: 256 + LOW_QUALITY_MATCHES: True + MIN_SIZE: 0.0 + NEGATIVE_THRESH: 0.3 + NMS_THRESH: 0.7 + OFFSET: 0.0 + POSITIVE_FRACTION: 0.5 + POSITIVE_THRESH: 0.7 + POST_NMS_TOP_N_TEST: 1000 + POST_NMS_TOP_N_TRAIN: 1000 + PRE_NMS_TOP_N_TEST: 1000 + PRE_NMS_TOP_N_TRAIN: 2000 + STRIDES: [4, 8, 16, 32, 64] + TOPK_AFTER_COLLECT: True +SAVE: ./output/eval-20210906-18-42-32 +SAVE_FREQ: 20 +SEED: 0 +TAG: default +TRAIN: + ACCUM_ITER: 2 + BASE_LR: 0.0001 + END_LR: 0.0 + GRAD_CLIP: 0.1 + LAST_EPOCH: 0 + LR_SCHEDULER: + DECAY_EPOCHS: 30 + DECAY_RATE: 0.1 + MILESTONES: 30, 60, 90 + NAME: warmupcosine + NUM_EPOCHS: 300 + OPTIMIZER: + BETAS: (0.9, 0.999) + EPS: 1e-08 + MOMENTUM: 0.9 + NAME: SGD + WARMUP_EPOCHS: 20 + WARMUP_START_LR: 0.0 + WEIGHT_DECAY: 0.0001 +VALIDATE_FREQ: 20 +W0906 18:42:32.705507 24069 device_context.cc:404] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W0906 18:42:32.712786 24069 device_context.cc:422] device: 0, cuDNN Version: 7.6. +loading annotations into memory... +Done (t=0.59s) +creating index... +index created! +loading coco data, 48 imgs without annos are removed +0906 06:42:44 PM ----- Pretrained: Load model state from ./weights/mask_rcnn_swin_tiny_patch4_window7_1x +0906 06:42:44 PM ----- Start Validating +merging config from ./configs/swin_t_maskrcnn.yaml +0906 06:42:46 PM config= AUG: + AUTO_AUGMENT: rand-m9-mstd0.5-inc1 + COLOR_JITTER: 0.4 + CUTMIX: 1.0 + CUTMIX_MINMAX: None + MIXUP: 0.8 + MIXUP_MODE: batch + MIXUP_PROB: 1.0 + MIXUP_SWITCH_PROB: 0.5 + RE_COUNT: 1 + RE_MODE: pixel + RE_PROB: 0.25 +BASE: [''] +DATA: + BATCH_SIZE: 8 + BATCH_SIZE_EVAL: 8 + CROP_PCT: 0.9 + DATASET: coco + DATA_PATH: /dataset/coco + IMAGE_SIZE: 640 + NUM_WORKERS: 2 + VAL_DATA_PATH: /dataset/coco/ + WEIGHT_PATH: ./weights/mask_rcnn_swin_small_patch4_window7.pdparams +EVAL: True +FPN: + IN_CHANNELS: [96, 192, 384, 768] + OUT_CHANNELS: 256 + STRIDES: [4, 8, 16, 32] + USE_C5: False +LOCAL_RANK: 0 +MODEL: + ATTENTION_DROPOUT: 0.0 + DROPOUT: 0.1 + DROP_PATH: 0.2 + NAME: Swin + NUM_CLASSES: 1000 + PRETRAINED: ./weights/mask_rcnn_swin_tiny_patch4_window7_1x + RESUME: None + TRANS: + APE: False + EMBED_DIM: 96 + FROZEN_STAGES: -1 + IN_CHANNELS: 3 + MLP_RATIO: 4.0 + NUM_HEADS: [3, 6, 12, 24] + OUT_INDICES: (0, 1, 2, 3) + PATCH_NORM: True + PATCH_SIZE: 4 + PRETRAIN_IMAGE_SIZE: 224 + QKV_BIAS: True + QK_SCALE: None + STAGE_DEPTHS: [2, 2, 6, 2] + WINDOW_SIZE: 7 + TYPE: Swin +NGPUS: -1 +REPORT_FREQ: 50 +ROI: + ALIGNED: True + ALIGN_OUTPUT_SIZE: 7 + BATCH_SIZE_PER_IMG: 512 + BOX_HEAD: + CONV_DIM: 256 + FC_DIM: 1024 + NUM_CONV: 0 + NUM_FC: 2 + REG_WEIGHTS: [10.0, 10.0, 5.0, 5.0] + CANONICAL_BOX_SIZE: 224 + CANONICAL_LEVEL: 4 + LOW_QUALITY_MATCHES: False + MAX_LEVEL: 3 + MIN_LEVEL: 0 + NEGATIVE_THRESH: 0.5 + NMS_KEEP_TOPK_INFER: 100 + NMS_THRESH_INFER: 0.5 + NUM_ClASSES: 80 + PAT_GT: False + POSITIVE_FRACTION: 0.25 + POSITIVE_THRESH: 0.5 + SAMPLING_RATIO: 0 + SCALES: [0.25, 0.125, 0.0625, 0.03125, 0.015625] + SCORE_THRESH_INFER: 0.05 +RPN: + ANCHOR_SIZE: [[32], [64], [128], [256], [512]] + ASPECT_RATIOS: [0.5, 1.0, 2.0] + BATCH_SIZE_PER_IMG: 256 + LOW_QUALITY_MATCHES: True + 
MIN_SIZE: 0.0 + NEGATIVE_THRESH: 0.3 + NMS_THRESH: 0.7 + OFFSET: 0.0 + POSITIVE_FRACTION: 0.5 + POSITIVE_THRESH: 0.7 + POST_NMS_TOP_N_TEST: 1000 + POST_NMS_TOP_N_TRAIN: 1000 + PRE_NMS_TOP_N_TEST: 1000 + PRE_NMS_TOP_N_TRAIN: 2000 + STRIDES: [4, 8, 16, 32, 64] + TOPK_AFTER_COLLECT: True +SAVE: ./output/eval-20210906-18-42-46 +SAVE_FREQ: 20 +SEED: 0 +TAG: default +TRAIN: + ACCUM_ITER: 2 + BASE_LR: 0.0001 + END_LR: 0.0 + GRAD_CLIP: 0.1 + LAST_EPOCH: 0 + LR_SCHEDULER: + DECAY_EPOCHS: 30 + DECAY_RATE: 0.1 + MILESTONES: 30, 60, 90 + NAME: warmupcosine + NUM_EPOCHS: 300 + OPTIMIZER: + BETAS: (0.9, 0.999) + EPS: 1e-08 + MOMENTUM: 0.9 + NAME: SGD + WARMUP_EPOCHS: 20 + WARMUP_START_LR: 0.0 + WEIGHT_DECAY: 0.0001 +VALIDATE_FREQ: 20 +loading annotations into memory... +Done (t=0.57s) +creating index... +index created! +loading coco data, 48 imgs without annos are removed +merging config from ./configs/swin_t_maskrcnn.yaml +0906 06:42:49 PM config= AUG: + AUTO_AUGMENT: rand-m9-mstd0.5-inc1 + COLOR_JITTER: 0.4 + CUTMIX: 1.0 + CUTMIX_MINMAX: None + MIXUP: 0.8 + MIXUP_MODE: batch + MIXUP_PROB: 1.0 + MIXUP_SWITCH_PROB: 0.5 + RE_COUNT: 1 + RE_MODE: pixel + RE_PROB: 0.25 +BASE: [''] +DATA: + BATCH_SIZE: 8 + BATCH_SIZE_EVAL: 8 + CROP_PCT: 0.9 + DATASET: coco + DATA_PATH: /dataset/coco + IMAGE_SIZE: 640 + NUM_WORKERS: 2 + VAL_DATA_PATH: /dataset/coco/ + WEIGHT_PATH: ./weights/mask_rcnn_swin_small_patch4_window7.pdparams +EVAL: True +FPN: + IN_CHANNELS: [96, 192, 384, 768] + OUT_CHANNELS: 256 + STRIDES: [4, 8, 16, 32] + USE_C5: False +LOCAL_RANK: 0 +MODEL: + ATTENTION_DROPOUT: 0.0 + DROPOUT: 0.1 + DROP_PATH: 0.2 + NAME: Swin + NUM_CLASSES: 1000 + PRETRAINED: ./weights/mask_rcnn_swin_tiny_patch4_window7_1x + RESUME: None + TRANS: + APE: False + EMBED_DIM: 96 + FROZEN_STAGES: -1 + IN_CHANNELS: 3 + MLP_RATIO: 4.0 + NUM_HEADS: [3, 6, 12, 24] + OUT_INDICES: (0, 1, 2, 3) + PATCH_NORM: True + PATCH_SIZE: 4 + PRETRAIN_IMAGE_SIZE: 224 + QKV_BIAS: True + QK_SCALE: None + STAGE_DEPTHS: [2, 2, 6, 2] + WINDOW_SIZE: 7 + TYPE: Swin +NGPUS: -1 +REPORT_FREQ: 50 +ROI: + ALIGNED: True + ALIGN_OUTPUT_SIZE: 7 + BATCH_SIZE_PER_IMG: 512 + BOX_HEAD: + CONV_DIM: 256 + FC_DIM: 1024 + NUM_CONV: 0 + NUM_FC: 2 + REG_WEIGHTS: [10.0, 10.0, 5.0, 5.0] + CANONICAL_BOX_SIZE: 224 + CANONICAL_LEVEL: 4 + LOW_QUALITY_MATCHES: False + MAX_LEVEL: 3 + MIN_LEVEL: 0 + NEGATIVE_THRESH: 0.5 + NMS_KEEP_TOPK_INFER: 100 + NMS_THRESH_INFER: 0.5 + NUM_ClASSES: 80 + PAT_GT: False + POSITIVE_FRACTION: 0.25 + POSITIVE_THRESH: 0.5 + SAMPLING_RATIO: 0 + SCALES: [0.25, 0.125, 0.0625, 0.03125, 0.015625] + SCORE_THRESH_INFER: 0.05 +RPN: + ANCHOR_SIZE: [[32], [64], [128], [256], [512]] + ASPECT_RATIOS: [0.5, 1.0, 2.0] + BATCH_SIZE_PER_IMG: 256 + LOW_QUALITY_MATCHES: True + MIN_SIZE: 0.0 + NEGATIVE_THRESH: 0.3 + NMS_THRESH: 0.7 + OFFSET: 0.0 + POSITIVE_FRACTION: 0.5 + POSITIVE_THRESH: 0.7 + POST_NMS_TOP_N_TEST: 1000 + POST_NMS_TOP_N_TRAIN: 1000 + PRE_NMS_TOP_N_TEST: 1000 + PRE_NMS_TOP_N_TRAIN: 2000 + STRIDES: [4, 8, 16, 32, 64] + TOPK_AFTER_COLLECT: True +SAVE: ./output/eval-20210906-18-42-49 +SAVE_FREQ: 20 +SEED: 0 +TAG: default +TRAIN: + ACCUM_ITER: 2 + BASE_LR: 0.0001 + END_LR: 0.0 + GRAD_CLIP: 0.1 + LAST_EPOCH: 0 + LR_SCHEDULER: + DECAY_EPOCHS: 30 + DECAY_RATE: 0.1 + MILESTONES: 30, 60, 90 + NAME: warmupcosine + NUM_EPOCHS: 300 + OPTIMIZER: + BETAS: (0.9, 0.999) + EPS: 1e-08 + MOMENTUM: 0.9 + NAME: SGD + WARMUP_EPOCHS: 20 + WARMUP_START_LR: 0.0 + WEIGHT_DECAY: 0.0001 +VALIDATE_FREQ: 20 +W0906 18:42:49.483649 24104 gen_comm_id_helper.cc:120] connect 
addr=127.0.0.1:43378 failed 1 times with reason: Connection refused retry after 0.5 seconds +W0906 18:42:49.983848 24104 gen_comm_id_helper.cc:120] connect addr=127.0.0.1:43378 failed 2 times with reason: Connection refused retry after 1 seconds +W0906 18:42:50.984026 24104 gen_comm_id_helper.cc:120] connect addr=127.0.0.1:43378 failed 3 times with reason: Connection refused retry after 1.5 seconds +merging config from ./configs/swin_t_maskrcnn.yaml +0906 06:42:51 PM config= AUG: + AUTO_AUGMENT: rand-m9-mstd0.5-inc1 + COLOR_JITTER: 0.4 + CUTMIX: 1.0 + CUTMIX_MINMAX: None + MIXUP: 0.8 + MIXUP_MODE: batch + MIXUP_PROB: 1.0 + MIXUP_SWITCH_PROB: 0.5 + RE_COUNT: 1 + RE_MODE: pixel + RE_PROB: 0.25 +BASE: [''] +DATA: + BATCH_SIZE: 8 + BATCH_SIZE_EVAL: 8 + CROP_PCT: 0.9 + DATASET: coco + DATA_PATH: /dataset/coco + IMAGE_SIZE: 640 + NUM_WORKERS: 2 + VAL_DATA_PATH: /dataset/coco/ + WEIGHT_PATH: ./weights/mask_rcnn_swin_small_patch4_window7.pdparams +EVAL: True +FPN: + IN_CHANNELS: [96, 192, 384, 768] + OUT_CHANNELS: 256 + STRIDES: [4, 8, 16, 32] + USE_C5: False +LOCAL_RANK: 0 +MODEL: + ATTENTION_DROPOUT: 0.0 + DROPOUT: 0.1 + DROP_PATH: 0.2 + NAME: Swin + NUM_CLASSES: 1000 + PRETRAINED: ./weights/mask_rcnn_swin_tiny_patch4_window7_1x + RESUME: None + TRANS: + APE: False + EMBED_DIM: 96 + FROZEN_STAGES: -1 + IN_CHANNELS: 3 + MLP_RATIO: 4.0 + NUM_HEADS: [3, 6, 12, 24] + OUT_INDICES: (0, 1, 2, 3) + PATCH_NORM: True + PATCH_SIZE: 4 + PRETRAIN_IMAGE_SIZE: 224 + QKV_BIAS: True + QK_SCALE: None + STAGE_DEPTHS: [2, 2, 6, 2] + WINDOW_SIZE: 7 + TYPE: Swin +NGPUS: -1 +REPORT_FREQ: 50 +ROI: + ALIGNED: True + ALIGN_OUTPUT_SIZE: 7 + BATCH_SIZE_PER_IMG: 512 + BOX_HEAD: + CONV_DIM: 256 + FC_DIM: 1024 + NUM_CONV: 0 + NUM_FC: 2 + REG_WEIGHTS: [10.0, 10.0, 5.0, 5.0] + CANONICAL_BOX_SIZE: 224 + CANONICAL_LEVEL: 4 + LOW_QUALITY_MATCHES: False + MAX_LEVEL: 3 + MIN_LEVEL: 0 + NEGATIVE_THRESH: 0.5 + NMS_KEEP_TOPK_INFER: 100 + NMS_THRESH_INFER: 0.5 + NUM_ClASSES: 80 + PAT_GT: False + POSITIVE_FRACTION: 0.25 + POSITIVE_THRESH: 0.5 + SAMPLING_RATIO: 0 + SCALES: [0.25, 0.125, 0.0625, 0.03125, 0.015625] + SCORE_THRESH_INFER: 0.05 +RPN: + ANCHOR_SIZE: [[32], [64], [128], [256], [512]] + ASPECT_RATIOS: [0.5, 1.0, 2.0] + BATCH_SIZE_PER_IMG: 256 + LOW_QUALITY_MATCHES: True + MIN_SIZE: 0.0 + NEGATIVE_THRESH: 0.3 + NMS_THRESH: 0.7 + OFFSET: 0.0 + POSITIVE_FRACTION: 0.5 + POSITIVE_THRESH: 0.7 + POST_NMS_TOP_N_TEST: 1000 + POST_NMS_TOP_N_TRAIN: 1000 + PRE_NMS_TOP_N_TEST: 1000 + PRE_NMS_TOP_N_TRAIN: 2000 + STRIDES: [4, 8, 16, 32, 64] + TOPK_AFTER_COLLECT: True +SAVE: ./output/eval-20210906-18-42-51 +SAVE_FREQ: 20 +SEED: 0 +TAG: default +TRAIN: + ACCUM_ITER: 2 + BASE_LR: 0.0001 + END_LR: 0.0 + GRAD_CLIP: 0.1 + LAST_EPOCH: 0 + LR_SCHEDULER: + DECAY_EPOCHS: 30 + DECAY_RATE: 0.1 + MILESTONES: 30, 60, 90 + NAME: warmupcosine + NUM_EPOCHS: 300 + OPTIMIZER: + BETAS: (0.9, 0.999) + EPS: 1e-08 + MOMENTUM: 0.9 + NAME: SGD + WARMUP_EPOCHS: 20 + WARMUP_START_LR: 0.0 + WEIGHT_DECAY: 0.0001 +VALIDATE_FREQ: 20 +I0906 18:42:51.917227 24119 gen_comm_id_helper.cc:181] Server listening on: 127.0.0.1:43378 successful. 
+W0906 18:42:52.484287 24104 gen_comm_id_helper.cc:120] connect addr=127.0.0.1:11180 failed 1 times with reason: Connection refused retry after 0.5 seconds +W0906 18:42:52.984428 24104 gen_comm_id_helper.cc:120] connect addr=127.0.0.1:11180 failed 2 times with reason: Connection refused retry after 1 seconds +W0906 18:42:53.984596 24104 gen_comm_id_helper.cc:120] connect addr=127.0.0.1:11180 failed 3 times with reason: Connection refused retry after 1.5 seconds +merging config from ./configs/swin_t_maskrcnn.yaml +0906 06:42:53 PM config= AUG: + AUTO_AUGMENT: rand-m9-mstd0.5-inc1 + COLOR_JITTER: 0.4 + CUTMIX: 1.0 + CUTMIX_MINMAX: None + MIXUP: 0.8 + MIXUP_MODE: batch + MIXUP_PROB: 1.0 + MIXUP_SWITCH_PROB: 0.5 + RE_COUNT: 1 + RE_MODE: pixel + RE_PROB: 0.25 +BASE: [''] +DATA: + BATCH_SIZE: 8 + BATCH_SIZE_EVAL: 8 + CROP_PCT: 0.9 + DATASET: coco + DATA_PATH: /dataset/coco + IMAGE_SIZE: 640 + NUM_WORKERS: 2 + VAL_DATA_PATH: /dataset/coco/ + WEIGHT_PATH: ./weights/mask_rcnn_swin_small_patch4_window7.pdparams +EVAL: True +FPN: + IN_CHANNELS: [96, 192, 384, 768] + OUT_CHANNELS: 256 + STRIDES: [4, 8, 16, 32] + USE_C5: False +LOCAL_RANK: 0 +MODEL: + ATTENTION_DROPOUT: 0.0 + DROPOUT: 0.1 + DROP_PATH: 0.2 + NAME: Swin + NUM_CLASSES: 1000 + PRETRAINED: ./weights/mask_rcnn_swin_tiny_patch4_window7_1x + RESUME: None + TRANS: + APE: False + EMBED_DIM: 96 + FROZEN_STAGES: -1 + IN_CHANNELS: 3 + MLP_RATIO: 4.0 + NUM_HEADS: [3, 6, 12, 24] + OUT_INDICES: (0, 1, 2, 3) + PATCH_NORM: True + PATCH_SIZE: 4 + PRETRAIN_IMAGE_SIZE: 224 + QKV_BIAS: True + QK_SCALE: None + STAGE_DEPTHS: [2, 2, 6, 2] + WINDOW_SIZE: 7 + TYPE: Swin +NGPUS: -1 +REPORT_FREQ: 50 +ROI: + ALIGNED: True + ALIGN_OUTPUT_SIZE: 7 + BATCH_SIZE_PER_IMG: 512 + BOX_HEAD: + CONV_DIM: 256 + FC_DIM: 1024 + NUM_CONV: 0 + NUM_FC: 2 + REG_WEIGHTS: [10.0, 10.0, 5.0, 5.0] + CANONICAL_BOX_SIZE: 224 + CANONICAL_LEVEL: 4 + LOW_QUALITY_MATCHES: False + MAX_LEVEL: 3 + MIN_LEVEL: 0 + NEGATIVE_THRESH: 0.5 + NMS_KEEP_TOPK_INFER: 100 + NMS_THRESH_INFER: 0.5 + NUM_ClASSES: 80 + PAT_GT: False + POSITIVE_FRACTION: 0.25 + POSITIVE_THRESH: 0.5 + SAMPLING_RATIO: 0 + SCALES: [0.25, 0.125, 0.0625, 0.03125, 0.015625] + SCORE_THRESH_INFER: 0.05 +RPN: + ANCHOR_SIZE: [[32], [64], [128], [256], [512]] + ASPECT_RATIOS: [0.5, 1.0, 2.0] + BATCH_SIZE_PER_IMG: 256 + LOW_QUALITY_MATCHES: True + MIN_SIZE: 0.0 + NEGATIVE_THRESH: 0.3 + NMS_THRESH: 0.7 + OFFSET: 0.0 + POSITIVE_FRACTION: 0.5 + POSITIVE_THRESH: 0.7 + POST_NMS_TOP_N_TEST: 1000 + POST_NMS_TOP_N_TRAIN: 1000 + PRE_NMS_TOP_N_TEST: 1000 + PRE_NMS_TOP_N_TRAIN: 2000 + STRIDES: [4, 8, 16, 32, 64] + TOPK_AFTER_COLLECT: True +SAVE: ./output/eval-20210906-18-42-53 +SAVE_FREQ: 20 +SEED: 0 +TAG: default +TRAIN: + ACCUM_ITER: 2 + BASE_LR: 0.0001 + END_LR: 0.0 + GRAD_CLIP: 0.1 + LAST_EPOCH: 0 + LR_SCHEDULER: + DECAY_EPOCHS: 30 + DECAY_RATE: 0.1 + MILESTONES: 30, 60, 90 + NAME: warmupcosine + NUM_EPOCHS: 300 + OPTIMIZER: + BETAS: (0.9, 0.999) + EPS: 1e-08 + MOMENTUM: 0.9 + NAME: SGD + WARMUP_EPOCHS: 20 + WARMUP_START_LR: 0.0 + WEIGHT_DECAY: 0.0001 +VALIDATE_FREQ: 20 +I0906 18:42:54.355432 24133 gen_comm_id_helper.cc:181] Server listening on: 127.0.0.1:11180 successful. 
+W0906 18:42:55.484872 24104 gen_comm_id_helper.cc:120] connect addr=127.0.0.1:24060 failed 1 times with reason: Connection refused retry after 0.5 seconds +W0906 18:42:55.985054 24104 gen_comm_id_helper.cc:120] connect addr=127.0.0.1:24060 failed 2 times with reason: Connection refused retry after 1 seconds +merging config from ./configs/swin_t_maskrcnn.yaml +0906 06:42:56 PM config= AUG: + AUTO_AUGMENT: rand-m9-mstd0.5-inc1 + COLOR_JITTER: 0.4 + CUTMIX: 1.0 + CUTMIX_MINMAX: None + MIXUP: 0.8 + MIXUP_MODE: batch + MIXUP_PROB: 1.0 + MIXUP_SWITCH_PROB: 0.5 + RE_COUNT: 1 + RE_MODE: pixel + RE_PROB: 0.25 +BASE: [''] +DATA: + BATCH_SIZE: 8 + BATCH_SIZE_EVAL: 8 + CROP_PCT: 0.9 + DATASET: coco + DATA_PATH: /dataset/coco + IMAGE_SIZE: 640 + NUM_WORKERS: 2 + VAL_DATA_PATH: /dataset/coco/ + WEIGHT_PATH: ./weights/mask_rcnn_swin_small_patch4_window7.pdparams +EVAL: True +FPN: + IN_CHANNELS: [96, 192, 384, 768] + OUT_CHANNELS: 256 + STRIDES: [4, 8, 16, 32] + USE_C5: False +LOCAL_RANK: 0 +MODEL: + ATTENTION_DROPOUT: 0.0 + DROPOUT: 0.1 + DROP_PATH: 0.2 + NAME: Swin + NUM_CLASSES: 1000 + PRETRAINED: ./weights/mask_rcnn_swin_tiny_patch4_window7_1x + RESUME: None + TRANS: + APE: False + EMBED_DIM: 96 + FROZEN_STAGES: -1 + IN_CHANNELS: 3 + MLP_RATIO: 4.0 + NUM_HEADS: [3, 6, 12, 24] + OUT_INDICES: (0, 1, 2, 3) + PATCH_NORM: True + PATCH_SIZE: 4 + PRETRAIN_IMAGE_SIZE: 224 + QKV_BIAS: True + QK_SCALE: None + STAGE_DEPTHS: [2, 2, 6, 2] + WINDOW_SIZE: 7 + TYPE: Swin +NGPUS: -1 +REPORT_FREQ: 50 +ROI: + ALIGNED: True + ALIGN_OUTPUT_SIZE: 7 + BATCH_SIZE_PER_IMG: 512 + BOX_HEAD: + CONV_DIM: 256 + FC_DIM: 1024 + NUM_CONV: 0 + NUM_FC: 2 + REG_WEIGHTS: [10.0, 10.0, 5.0, 5.0] + CANONICAL_BOX_SIZE: 224 + CANONICAL_LEVEL: 4 + LOW_QUALITY_MATCHES: False + MAX_LEVEL: 3 + MIN_LEVEL: 0 + NEGATIVE_THRESH: 0.5 + NMS_KEEP_TOPK_INFER: 100 + NMS_THRESH_INFER: 0.5 + NUM_ClASSES: 80 + PAT_GT: False + POSITIVE_FRACTION: 0.25 + POSITIVE_THRESH: 0.5 + SAMPLING_RATIO: 0 + SCALES: [0.25, 0.125, 0.0625, 0.03125, 0.015625] + SCORE_THRESH_INFER: 0.05 +RPN: + ANCHOR_SIZE: [[32], [64], [128], [256], [512]] + ASPECT_RATIOS: [0.5, 1.0, 2.0] + BATCH_SIZE_PER_IMG: 256 + LOW_QUALITY_MATCHES: True + MIN_SIZE: 0.0 + NEGATIVE_THRESH: 0.3 + NMS_THRESH: 0.7 + OFFSET: 0.0 + POSITIVE_FRACTION: 0.5 + POSITIVE_THRESH: 0.7 + POST_NMS_TOP_N_TEST: 1000 + POST_NMS_TOP_N_TRAIN: 1000 + PRE_NMS_TOP_N_TEST: 1000 + PRE_NMS_TOP_N_TRAIN: 2000 + STRIDES: [4, 8, 16, 32, 64] + TOPK_AFTER_COLLECT: True +SAVE: ./output/eval-20210906-18-42-56 +SAVE_FREQ: 20 +SEED: 0 +TAG: default +TRAIN: + ACCUM_ITER: 2 + BASE_LR: 0.0001 + END_LR: 0.0 + GRAD_CLIP: 0.1 + LAST_EPOCH: 0 + LR_SCHEDULER: + DECAY_EPOCHS: 30 + DECAY_RATE: 0.1 + MILESTONES: 30, 60, 90 + NAME: warmupcosine + NUM_EPOCHS: 300 + OPTIMIZER: + BETAS: (0.9, 0.999) + EPS: 1e-08 + MOMENTUM: 0.9 + NAME: SGD + WARMUP_EPOCHS: 20 + WARMUP_START_LR: 0.0 + WEIGHT_DECAY: 0.0001 +VALIDATE_FREQ: 20 +I0906 18:42:56.622375 24147 gen_comm_id_helper.cc:181] Server listening on: 127.0.0.1:24060 successful. 
+I0906 18:42:56.985328 24104 nccl_context.cc:74] init nccl context nranks: 4 local rank: 0 gpu id: 0 ring id: 0 +I0906 18:42:56.985343 24119 nccl_context.cc:74] init nccl context nranks: 4 local rank: 1 gpu id: 1 ring id: 0 +I0906 18:42:56.985344 24147 nccl_context.cc:74] init nccl context nranks: 4 local rank: 3 gpu id: 3 ring id: 0 +I0906 18:42:56.985340 24133 nccl_context.cc:74] init nccl context nranks: 4 local rank: 2 gpu id: 2 ring id: 0 +W0906 18:42:59.592660 24133 device_context.cc:404] Please NOTE: device: 2, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W0906 18:42:59.592682 24104 device_context.cc:404] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W0906 18:42:59.592675 24119 device_context.cc:404] Please NOTE: device: 1, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W0906 18:42:59.592749 24147 device_context.cc:404] Please NOTE: device: 3, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2 +W0906 18:42:59.597329 24104 device_context.cc:422] device: 0, cuDNN Version: 7.6. +W0906 18:42:59.597333 24133 device_context.cc:422] device: 2, cuDNN Version: 7.6. +W0906 18:42:59.597333 24119 device_context.cc:422] device: 1, cuDNN Version: 7.6. +W0906 18:42:59.597337 24147 device_context.cc:422] device: 3, cuDNN Version: 7.6. +0906 06:43:05 PM ----- world_size = 4, local_rank = 0 +0906 06:43:05 PM ----- world_size = 4, local_rank = 1 +0906 06:43:05 PM ----- world_size = 4, local_rank = 2 +0906 06:43:05 PM ----- world_size = 4, local_rank = 3 +0906 06:43:06 PM ----- Total # of train batch (single gpu): 0 +0906 06:43:06 PM ----- Total # of val batch (single gpu): 154 +0906 06:43:06 PM ----- Total # of train batch (single gpu): 0 +0906 06:43:06 PM ----- Total # of val batch (single gpu): 154 +0906 06:43:06 PM ----- Total # of train batch (single gpu): 0 +0906 06:43:06 PM ----- Total # of val batch (single gpu): 154 +0906 06:43:06 PM ----- Total # of train batch (single gpu): 0 +0906 06:43:06 PM ----- Total # of val batch (single gpu): 154 +0906 06:43:08 PM ----- Pretrained: Load model state from ./weights/mask_rcnn_swin_tiny_patch4_window7_1x +0906 06:43:08 PM ----- Start Validating +0906 06:43:08 PM ----- Pretrained: Load model state from ./weights/mask_rcnn_swin_tiny_patch4_window7_1x +0906 06:43:08 PM ----- Start Validating +0906 06:43:08 PM ----- Pretrained: Load model state from ./weights/mask_rcnn_swin_tiny_patch4_window7_1x +0906 06:43:08 PM ----- Start Validating +0906 06:43:08 PM ----- Pretrained: Load model state from ./weights/mask_rcnn_swin_tiny_patch4_window7_1x +0906 06:43:08 PM ----- Start Validating +0906 06:43:43 PM Val Step[0050/0619], done +0906 06:44:08 PM Val Step[0050/0154], done +0906 06:44:08 PM Val Step[0050/0154], done +0906 06:44:08 PM Val Step[0050/0154], done +0906 06:44:09 PM Val Step[0050/0154], done +0906 06:44:37 PM Val Step[0100/0619], done +0906 06:45:01 PM Val Step[0100/0154], done +0906 06:45:02 PM Val Step[0100/0154], done +0906 06:45:03 PM Val Step[0100/0154], done +0906 06:45:03 PM Val Step[0100/0154], done +0906 06:45:31 PM Val Step[0150/0619], done +0906 06:45:53 PM Val Step[0150/0154], done +0906 06:45:54 PM Val Step[0150/0154], done +0906 06:45:54 PM Val Step[0150/0154], done +0906 06:45:56 PM Val Step[0150/0154], done +Traceback (most recent call last): + File "main_multi_gpu.py", line 383, in + main() + File "main_multi_gpu.py", line 379, in main + dist.spawn(main_worker, 
args=(dataset_train, dataset_val, ), nprocs=config.NGPUS) + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/distributed/spawn.py", line 496, in spawn + while not context.join(): + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/distributed/spawn.py", line 296, in join + self._throw_exception(error_index) + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/distributed/spawn.py", line 314, in _throw_exception + raise Exception(msg) +Exception: + +---------------------------------------------- +Process 0 terminated with the following error: +---------------------------------------------- + +Traceback (most recent call last): + File "/opt/conda/envs/py36/lib/python3.6/site-packages/paddle/distributed/spawn.py", line 245, in _func_wrapper + result = func(*args) + File "/workspace/ppvit_github/PaddleViT/object_detection/Swin/main_multi_gpu.py", line 317, in main_worker + debug_steps=config.REPORT_FREQ) + File "/workspace/ppvit_github/PaddleViT/object_detection/Swin/main_multi_gpu.py", line 184, in validate + output = output.cpu().numpy() +AttributeError: 'list' object has no attribute 'cpu' + +0906 06:46:25 PM Val Step[0200/0619], done +0906 06:47:17 PM Val Step[0250/0619], done +0906 06:48:11 PM Val Step[0300/0619], done +0906 06:49:04 PM Val Step[0350/0619], done +0906 06:49:57 PM Val Step[0400/0619], done +0906 06:50:50 PM Val Step[0450/0619], done +0906 06:51:41 PM Val Step[0500/0619], done +0906 06:52:35 PM Val Step[0550/0619], done +0906 06:53:26 PM Val Step[0600/0619], done +Traceback (most recent call last): + File "main_single_gpu.py", line 310, in + main() + File "main_single_gpu.py", line 256, in main + debug_steps=config.REPORT_FREQ) + File "main_single_gpu.py", line 144, in validate + output = output.cpu().numpy() +AttributeError: 'list' object has no attribute 'cpu' diff --git a/object_detection/Swin/random_erasing.py b/object_detection/Swin/random_erasing.py new file mode 100644 index 00000000..a3f7d3b5 --- /dev/null +++ b/object_detection/Swin/random_erasing.py @@ -0,0 +1,108 @@ +import random +import math +import paddle + + +def _get_pixels(per_pixel, rand_color, patch_size, dtype="float32"): + if per_pixel: + return paddle.normal(shape=patch_size).astype(dtype) + elif rand_color: + return paddle.normal(shape=(patch_size[0], 1, 1)).astype(dtype) + else: + return paddle.zeros((patch_size[0], 1, 1)).astype(dtype) + +class RandomErasing(object): + """ + Args: + prob: probability of performing random erasing + min_area: Minimum percentage of erased area wrt input image area + max_area: Maximum percentage of erased area wrt input image area + min_aspect: Minimum aspect ratio of earsed area + max_aspect: Maximum aspect ratio of earsed area + mode: pixel color mode, in ['const', 'rand', 'pixel'] + 'const' - erase block is constant valued 0 for all channels + 'rand' - erase block is valued random color (same per-channel) + 'pixel' - erase block is vauled random color per pixel + min_count: Minimum # of ereasing blocks per image. + max_count: Maximum # of ereasing blocks per image. 
Area per box is scaled by count + per-image count is randomly chosen between min_count to max_count + """ + def __init__(self, prob=0.5, min_area=0.02, max_area=1/3, min_aspect=0.3, max_aspect=None, + mode='const', min_count=1, max_count=None, num_splits=0): + self.prob = prob + self.min_area = min_area + self.max_area = max_area + max_aspect = max_aspect or 1 / min_aspect + self.log_aspect_ratio = (math.log(min_aspect), math.log(max_aspect)) + self.min_count = min_count + self.max_count = max_count or min_count + self.num_splits = num_splits + mode = mode.lower() + self.rand_color = False + self.per_pixel = False + if mode == "rand": + self.rand_color = True + elif mode == "pixel": + self.per_pixel = True + else: + assert not mode or mode == "const" + + def _erase(self, img, chan, img_h, img_w, dtype): + if random.random() > self.prob: + return + area = img_h * img_w + count = self.min_count if self.min_count == self.max_count else \ + random.randint(self.min_count, self.max_count) + for _ in range(count): + for attempt in range(10): + target_area = random.uniform(self.min_area, self.max_area) * area / count + aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio)) + h = int(round(math.sqrt(target_area * aspect_ratio))) + w = int(round(math.sqrt(target_area / aspect_ratio))) + #print(h, w) + if w < img_w and h < img_h: + top = random.randint(0, img_h - h) + left = random.randint(0, img_w - w) + #print(top, left) + + img[:, top:top+h, left:left+w] = _get_pixels( + self.per_pixel, self.rand_color, (chan, h, w), + dtype=dtype) + #print(_get_pixels( + # self.per_pixel, self.rand_color, (chan, h, w), + # dtype=dtype)) + break + + def __call__(self, input): + if len(input.shape) == 3: + self._erase(input, *input.shape, input.dtype) + else: + batch_size, chan, img_h, img_w = input.shape + batch_start = batch_size // self.num_splits if self.num_splits > 1 else 0 + for i in range(batch_start, batch_size): + self._erase(input[i], chan, img_h, img_w, input.dtype) + return input + + + +def main(): + re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='rand') + #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='const') + #re = RandomErasing(prob=1.0, min_area=0.2, max_area=0.6, mode='pixel') + import PIL.Image as Image + import numpy as np + paddle.set_device('cpu') + img = paddle.to_tensor(np.asarray(Image.open('./lenna.png'))).astype('float32') + img = img / 255.0 + img = paddle.transpose(img, [2, 0, 1]) + new_img = re(img) + new_img = new_img * 255.0 + new_img = paddle.transpose(new_img, [1, 2, 0]) + new_img = new_img.cpu().numpy() + new_img = Image.fromarray(new_img.astype('uint8')) + new_img.save('./res.png') + + + +if __name__ == "__main__": + main() diff --git a/object_detection/Swin/run_eval.sh b/object_detection/Swin/run_eval.sh new file mode 100644 index 00000000..805913b8 --- /dev/null +++ b/object_detection/Swin/run_eval.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0 \ +python main_single_gpu.py \ +-cfg='./configs/swin_t_maskrcnn.yaml' \ +-dataset='coco' \ +-batch_size=8 \ +-data_path='/dataset/coco' \ +-eval \ +-pretrained='./weights/mask_rcnn_swin_tiny_patch4_window7' diff --git a/object_detection/Swin/run_eval_multi.sh b/object_detection/Swin/run_eval_multi.sh new file mode 100644 index 00000000..7d15fafa --- /dev/null +++ b/object_detection/Swin/run_eval_multi.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=4,5,6,7 \ +python main_multi_gpu.py \ +-cfg='./configs/swin_t_maskrcnn.yaml' \ +-dataset='coco' \ +-batch_size=8 \ +-data_path='/dataset/coco' \ +-eval \ 
+-pretrained='./weights/mask_rcnn_swin_tiny_patch4_window7' diff --git a/object_detection/Swin/run_eval_multi_s.sh b/object_detection/Swin/run_eval_multi_s.sh new file mode 100644 index 00000000..947976aa --- /dev/null +++ b/object_detection/Swin/run_eval_multi_s.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +python main_multi_gpu.py \ +-cfg='./configs/swin_s_maskrcnn.yaml' \ +-dataset='coco' \ +-batch_size=8 \ +-data_path='/dataset/coco' \ +-eval \ +-pretrained='./weights/mask_rcnn_swin_small_patch4_window7' diff --git a/object_detection/Swin/run_train.sh b/object_detection/Swin/run_train.sh new file mode 100644 index 00000000..f2d9adff --- /dev/null +++ b/object_detection/Swin/run_train.sh @@ -0,0 +1,9 @@ +CUDA_VISIBLE_DEVICES=7 \ +python main_single_gpu.py \ +-cfg='./configs/swin_s_maskrcnn.yaml' \ +-dataset='coco' \ +-batch_size=2 \ +-data_path='/dataset/coco' \ +-pretrained='./weights/mask_rcnn_swin_tiny_patch4_window7' +#-pretrained='./weights/swin_small_patch4_window7_224' + diff --git a/object_detection/Swin/swin.png b/object_detection/Swin/swin.png new file mode 100644 index 00000000..0a45ee47 Binary files /dev/null and b/object_detection/Swin/swin.png differ diff --git a/object_detection/Swin/swin_backbone.py b/object_detection/Swin/swin_backbone.py new file mode 100644 index 00000000..a112c4d5 --- /dev/null +++ b/object_detection/Swin/swin_backbone.py @@ -0,0 +1,677 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Implement Swin Transformer backbone for object detection +""" + +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from model_utils import DropPath, _ntuple + +to_2tuple = _ntuple(2) + + +class Identity(nn.Layer): + """ Identity layer + The output of this layer is the input without any change. + Use this layer to avoid if condition in some forward methods + """ + def __init__(self): + super(Identity, self).__init__() + def forward(self, x): + return x + + +class PatchEmbedding(nn.Layer): + """Patch Embeddings + Apply patch embeddings on input images. Embeddings is implemented using a Conv2D op. 
+ Attributes: + patch_size: int, size of patch, default: 4 + in_channels: int, input image channels, default: 3 + embed_dim: int, embedding dimension, default: 96 + """ + + def __init__(self, patch_size=4, in_channels=3, embed_dim=96): + super().__init__() + #image_size = (image_size, image_size) # TODO: add to_2tuple + patch_size = (patch_size, patch_size) + #patches_resolution = [image_size[0]//patch_size[0], image_size[1]//patch_size[1]] + #self.image_size = image_size + self.patch_size = patch_size + #self.patches_resolution = patches_resolution + #self.num_patches = patches_resolution[0] * patches_resolution[1] + self.in_channels = in_channels + self.embed_dim = embed_dim + self.patch_embed = nn.Conv2D(in_channels=in_channels, + out_channels=embed_dim, + kernel_size=patch_size, + stride=patch_size) + self.norm = nn.LayerNorm(embed_dim) + + def forward(self, x): + # padding + _, _, H, W = x.shape + if W % self.patch_size[1] != 0: + x = F.pad(x, (0, self.patch_size[1] - W % self.patch_size[1])) + if H % self.patch_size[0] != 0: + x = F.pad(x, (0, 0, 0, self.patch_size[0] - H % self.patch_size[0])) + + x = self.patch_embed(x) # [batch, embed_dim, h, w] h,w = patch_resolution + Wh, Ww = x.shape[2], x.shape[3] + x = x.flatten(start_axis=2, stop_axis=-1) # [batch, embed_dim, h*w] h*w = num_patches + x = x.transpose([0, 2, 1]) # [batch, h*w, embed_dim] + x = self.norm(x) # [batch, num_patches, embed_dim] + x = x.transpose([0, 2, 1]) + x = x.reshape([-1, self.embed_dim, Wh, Ww]) + return x + + +class PatchMerging(nn.Layer): + """ Patch Merging class + Merge multiple patch into one path and keep the out dim. + Spefically, merge adjacent 2x2 patches(dim=C) into 1 patch. + The concat dim 4*C is rescaled to 2*C + Attributes: + input_resolution: tuple of ints, the size of input + dim: dimension of single patch + reduction: nn.Linear which maps 4C to 2C dim + norm: nn.LayerNorm, applied after linear layer. + """ + + def __init__(self, dim): + super(PatchMerging, self).__init__() + #self.input_resolution = input_resolution + self.dim = dim + self.reduction = nn.Linear(4*dim, 2*dim, bias_attr=False) + self.norm = nn.LayerNorm(4*dim) + + def forward(self, x, H, W): + #h, w = self.input_resolution + b, _, c = x.shape + x = x.reshape([b, H, W, c]) + + # padding + pad_input = (H % 2 == 1) or (W % 2 == 1) + if pad_input: + x = F.pad(x, (0, 0, 0, W % 2, H % 2)) + + x0 = x[:, 0::2, 0::2, :] # [B, H/2, W/2, C] + x1 = x[:, 1::2, 0::2, :] # [B, H/2, W/2, C] + x2 = x[:, 0::2, 1::2, :] # [B, H/2, W/2, C] + x3 = x[:, 1::2, 1::2, :] # [B, H/2, W/2, C] + x = paddle.concat([x0, x1, x2, x3], -1) #[B, H/2, W/2, 4*C] + x = x.reshape([b, -1, 4*c]) # [B, H/2*W/2, 4*C] + + x = self.norm(x) + x = self.reduction(x) + + return x + + +class Mlp(nn.Layer): + """ MLP module + Impl using nn.Linear and activation is GELU, dropout is applied. 
+ Ops: fc -> act -> dropout -> fc -> dropout + Attributes: + fc1: nn.Linear + fc2: nn.Linear + act: GELU + dropout1: dropout after fc1 + dropout2: dropout after fc2 + """ + + def __init__(self, in_features, hidden_features, dropout): + super(Mlp, self).__init__() + w_attr_1, b_attr_1 = self._init_weights() + self.fc1 = nn.Linear(in_features, + hidden_features, + weight_attr=w_attr_1, + bias_attr=b_attr_1) + + w_attr_2, b_attr_2 = self._init_weights() + self.fc2 = nn.Linear(hidden_features, + in_features, + weight_attr=w_attr_2, + bias_attr=b_attr_2) + self.act = nn.GELU() + self.dropout = nn.Dropout(dropout) + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.XavierUniform()) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Normal(std=1e-6)) + return weight_attr, bias_attr + + def forward(self, x): + x = self.fc1(x) + x = self.act(x) + x = self.dropout(x) + x = self.fc2(x) + x = self.dropout(x) + return x + + +class WindowAttention(nn.Layer): + """Window based multihead attention, with relative position bias. + Both shifted window and non-shifted window are supported. + Attributes: + dim: int, input dimension (channels) + window_size: int, height and width of the window + num_heads: int, number of attention heads + qkv_bias: bool, if True, enable learnable bias to q,k,v, default: True + qk_scale: float, override default qk scale head_dim**-0.5 if set, default: None + attention_dropout: float, dropout of attention + dropout: float, dropout for output + """ + + def __init__(self, + dim, + window_size, + num_heads, + qkv_bias=True, + qk_scale=None, + attention_dropout=0., + dropout=0.): + super(WindowAttention, self).__init__() + self.window_size = window_size + self.num_heads = num_heads + self.dim = dim + self.dim_head = dim // num_heads + self.scale = qk_scale or self.dim_head ** -0.5 + + self.relative_position_bias_table = paddle.create_parameter( + shape=[(2 * window_size[0] -1) * (2 * window_size[1] - 1), num_heads], + dtype='float32', + default_initializer=paddle.nn.initializer.TruncatedNormal(std=.02)) + + # relative position index for each token inside window + coords_h = paddle.arange(self.window_size[0]) + coords_w = paddle.arange(self.window_size[1]) + coords = paddle.stack(paddle.meshgrid([coords_h, coords_w])) # [2, window_h, window_w] + coords_flatten = paddle.flatten(coords, 1) # [2, window_h * window_w] + # 2, window_h * window_w, window_h * window_h + relative_coords = coords_flatten.unsqueeze(2) - coords_flatten.unsqueeze(1) + # winwod_h*window_w, window_h*window_w, 2 + relative_coords = relative_coords.transpose([1, 2, 0]) + relative_coords[:, :, 0] += self.window_size[0] - 1 + relative_coords[:, :, 1] += self.window_size[1] - 1 + relative_coords[:, :, 0] *= 2* self.window_size[1] - 1 + # [window_size * window_size, window_size*window_size] + relative_position_index = relative_coords.sum(-1) + self.register_buffer("relative_position_index", relative_position_index) + + self.qkv = nn.Linear(dim, dim * 3, bias_attr=qkv_bias) + self.attn_dropout = nn.Dropout(attention_dropout) + self.proj = nn.Linear(dim, dim) + self.proj_dropout = nn.Dropout(dropout) + self.softmax = nn.Softmax(axis=-1) + + def transpose_multihead(self, x): + new_shape = x.shape[:-1] + [self.num_heads, self.dim_head] + x = x.reshape(new_shape) + x = x.transpose([0, 2, 1, 3]) + return x + + def get_relative_pos_bias_from_pos_index(self): + # relative_position_bias_table is a ParamBase object + # 
https://github.com/PaddlePaddle/Paddle/blob/067f558c59b34dd6d8626aad73e9943cf7f5960f/python/paddle/fluid/framework.py#L5727 + table = self.relative_position_bias_table # N x num_heads + # index is a tensor + index = self.relative_position_index.reshape([-1]) # window_h*window_w * window_h*window_w + # NOTE: paddle does NOT support indexing Tensor by a Tensor + relative_position_bias = paddle.index_select(x=table, index=index) + return relative_position_bias + + def forward(self, x, mask=None): + qkv = self.qkv(x).chunk(3, axis=-1) + q, k, v = map(self.transpose_multihead, qkv) + q = q * self.scale + attn = paddle.matmul(q, k, transpose_y=True) + + relative_position_bias = self.get_relative_pos_bias_from_pos_index() + + relative_position_bias = relative_position_bias.reshape( + [self.window_size[0] * self.window_size[1], + self.window_size[0] * self.window_size[1], + -1]) + + # nH, window_h*window_w, window_h*window_w + relative_position_bias = relative_position_bias.transpose([2, 0, 1]) + attn = attn + relative_position_bias.unsqueeze(0) + + if mask is not None: + nW = mask.shape[0] + attn = attn.reshape( + [x.shape[0] // nW, nW, self.num_heads, x.shape[1], x.shape[1]]) + attn += mask.unsqueeze(1).unsqueeze(0) + attn = attn.reshape([-1, self.num_heads, x.shape[1], x.shape[1]]) + attn = self.softmax(attn) + else: + attn = self.softmax(attn) + + attn = self.attn_dropout(attn) + + z = paddle.matmul(attn, v) + z = z.transpose([0, 2, 1, 3]) + new_shape = z.shape[:-2] + [self.dim] + z = z.reshape(new_shape) + z = self.proj(z) + z = self.proj_dropout(z) + + return z + + +def windows_partition(x, window_size): + """ partite windows into window_size x window_size + Args: + x: Tensor, shape=[b, h, w, c] + window_size: int, window size + Returns: + x: Tensor, shape=[num_windows*b, window_size, window_size, c] + """ + + B, H, W, C = x.shape + x = x.reshape([B, H//window_size, window_size, W//window_size, window_size, C]) + x = x.transpose([0, 1, 3, 2, 4, 5]) + x = x.reshape([-1, window_size, window_size, C]) #(num_windows*B, window_size, window_size, C) + + return x + + +def windows_reverse(windows, window_size, H, W): + """ Window reverse + Args: + windows: (n_windows * B, window_size, window_size, C) + window_size: (int) window size + H: (int) height of image + W: (int) width of image + Returns: + x: (B, H, W, C) + """ + + B = int(windows.shape[0] / (H * W / window_size / window_size)) + x = windows.reshape([B, H // window_size, W // window_size, window_size, window_size, -1]) + x = x.transpose([0, 1, 3, 2, 4, 5]) + x = x.reshape([B, H, W, -1]) + return x + + +class SwinTransformerBlock(nn.Layer): + """Swin transformer block + Contains window multi head self attention, droppath, mlp, norm and residual. + Attributes: + dim: int, input dimension (channels) + num_heads: int, number of attention heads + windos_size: int, window size, default: 7 + shift_size: int, shift size for SW-MSA, default: 0 + mlp_ratio: float, ratio of mlp hidden dim and input embedding dim, default: 4. + qkv_bias: bool, if True, enable learnable bias to q,k,v, default: True + qk_scale: float, override default qk scale head_dim**-0.5 if set, default: None + dropout: float, dropout for output, default: 0. + attention_dropout: float, dropout of attention, default: 0. + droppath: float, drop path rate, default: 0. 
+ """ + + def __init__(self, dim, num_heads, window_size=7, shift_size=0, + mlp_ratio=4., qkv_bias=True, qk_scale=None, dropout=0., + attention_dropout=0., droppath=0.): + super(SwinTransformerBlock, self).__init__() + self.dim = dim + #self.input_resolution = input_resolution + self.num_heads = num_heads + self.window_size = window_size + self.shift_size = shift_size + self.mlp_ratio = mlp_ratio + #if min(self.input_resolution) <= self.window_size: + # self.shift_size = 0 + # self.window_size = min(self.input_resolution) + + self.norm1 = nn.LayerNorm(dim) + self.attn = WindowAttention(dim, + window_size=(self.window_size, self.window_size), + num_heads=num_heads, + qkv_bias=qkv_bias, + qk_scale=qk_scale, + attention_dropout=attention_dropout, + dropout=dropout) + self.drop_path = DropPath(droppath) if droppath > 0. else None + self.norm2 = nn.LayerNorm(dim) + self.mlp = Mlp(in_features=dim, + hidden_features=int(dim*mlp_ratio), + dropout=dropout) + + self.H = None + self.W = None + #if self.shift_size > 0: + # H, W = self.input_resolution + # img_mask = paddle.zeros((1, H, W, 1)) + # h_slices = (slice(0, -self.window_size), + # slice(-self.window_size, -self.shift_size), + # slice(-self.shift_size, None)) + # w_slices = (slice(0, -self.window_size), + # slice(-self.window_size, -self.shift_size), + # slice(-self.shift_size, None)) + # cnt = 0 + # for h in h_slices: + # for w in w_slices: + # img_mask[:, h, w, :] = cnt + # cnt += 1 + + # mask_windows = windows_partition(img_mask, self.window_size) + # mask_windows = mask_windows.reshape((-1, self.window_size * self.window_size)) + # attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2) + # attn_mask = paddle.where(attn_mask != 0, + # paddle.ones_like(attn_mask) * float(-100.0), + # attn_mask) + # attn_mask = paddle.where(attn_mask == 0, + # paddle.zeros_like(attn_mask), + # attn_mask) + #else: + # attn_mask = None + + #self.register_buffer("attn_mask", attn_mask) + + def forward(self, x, mask_matrix): + #H, W = self.input_resolution + B, L, C = x.shape + H, W = self.H, self.W + + h = x + x = self.norm1(x) + + new_shape = [B, H, W, C] + x = x.reshape(new_shape) + + # pad feature maps to multiples of winsow size + pad_l = pad_t = 0 + pad_r = (self.window_size - W % self.window_size) % self.window_size + pad_b = (self.window_size - H % self.window_size) % self.window_size + + x = x.transpose([0, 3, 1, 2]) #[b,c,h,w] + x = F.pad(x, [pad_l, pad_r, pad_t, pad_b]) + x = x.transpose([0, 2, 3, 1]) + + _, Hp, Wp, _ = x.shape + + + if self.shift_size > 0: + shifted_x = paddle.roll(x, + shifts=(-self.shift_size, -self.shift_size), + axis=(1, 2)) + attn_mask = mask_matrix + else: + shifted_x = x + attn_mask = None + + x_windows = windows_partition(shifted_x, self.window_size) + x_windows = x_windows.reshape([-1, self.window_size * self.window_size, C]) + + # merge windows + attn_windows = self.attn(x_windows, mask=attn_mask) + attn_windows = attn_windows.reshape([-1, self.window_size, self.window_size, C]) + + shifted_x = windows_reverse(attn_windows, self.window_size, Hp, Wp) + + # reverse cyclic shift + if self.shift_size > 0: + x = paddle.roll(shifted_x, + shifts=(self.shift_size, self.shift_size), + axis=(1, 2)) + else: + x = shifted_x + + if pad_r > 0 or pad_b > 0: + x = x[:, :H, :W, :] + + x = x.reshape([B, H*W, C]) + + if self.drop_path is not None: + x = h + self.drop_path(x) + else: + x = h + x + h = x + x = self.norm2(x) + x = self.mlp(x) + if self.drop_path is not None: + x = h + self.drop_path(x) + else: + x = h + x + + return x 
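The shifted-window attention in `SwinTransformerBlock` depends on `windows_partition` and `windows_reverse` (defined earlier in this file) being exact inverses once the feature map has been padded to a multiple of the window size. A small sanity check, assuming `swin_backbone.py` is importable from the working directory and using an input whose spatial size is already a multiple of the window size:

```python
import paddle
from swin_backbone import windows_partition, windows_reverse  # helpers defined above

x = paddle.randn([2, 14, 14, 96])            # [B, H, W, C]; 14 is a multiple of window_size=7
win = windows_partition(x, window_size=7)    # [B * (14//7) * (14//7), 7, 7, 96] == [8, 7, 7, 96]
y = windows_reverse(win, window_size=7, H=14, W=14)

print(win.shape)                              # [8, 7, 7, 96]
print(bool(paddle.allclose(x, y)))            # True: partition followed by reverse is the identity
```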
+ + +class SwinTransformerStage(nn.Layer): + """Stage layers for swin transformer + Stage layers contains a number of Transformer blocks and an optional + patch merging layer, patch merging is not applied after last stage + Attributes: + dim: int, embedding dimension + depth: list, num of blocks in each stage + blocks: nn.LayerList, contains SwinTransformerBlocks for one stage + downsample: PatchMerging, patch merging layer, none if last stage + """ + def __init__(self, dim, depth, num_heads, window_size, + mlp_ratio=4., qkv_bias=True, qk_scale=None, dropout=0., + attention_dropout=0., droppath=0., downsample=None): + super(SwinTransformerStage, self).__init__() + self.depth = depth + self.window_size = window_size + self.shift_size = window_size // 2 + + self.blocks = nn.LayerList() + for i in range(depth): + self.blocks.append( + SwinTransformerBlock( + dim=dim, + num_heads=num_heads, + window_size=window_size, + shift_size=0 if (i % 2 == 0) else window_size // 2, + mlp_ratio=mlp_ratio, + qkv_bias=qkv_bias, + qk_scale=qk_scale, + dropout=dropout, + attention_dropout=attention_dropout, + droppath=droppath[i] if isinstance(droppath, list) else droppath)) + + if downsample is not None: + self.downsample = downsample(dim=dim) + else: + self.downsample = None + + def forward(self, x, H, W): + # calculate attention mask for SW-MSA + Hp = int(np.ceil(H / self.window_size)) * self.window_size + Wp = int(np.ceil(W / self.window_size)) * self.window_size + img_mask = paddle.zeros((1, Hp, Wp, 1)) + h_slices = (slice(0, -self.window_size), + slice(-self.window_size, -self.shift_size), + slice(-self.shift_size, None)) + w_slices = (slice(0, -self.window_size), + slice(-self.window_size, -self.shift_size), + slice(-self.shift_size, None)) + cnt = 0 + for h in h_slices: + for w in w_slices: + img_mask[:, h, w, :] = cnt + cnt += 1 + + mask_windows = windows_partition(img_mask, self.window_size) + mask_windows = mask_windows.reshape((-1, self.window_size * self.window_size)) + attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2) + attn_mask = paddle.where(attn_mask != 0, + paddle.ones_like(attn_mask) * float(-100.0), + attn_mask) + attn_mask = paddle.where(attn_mask == 0, + paddle.zeros_like(attn_mask), + attn_mask) + for block in self.blocks: + block.H, block.W = H, W + x = block(x, attn_mask) + if self.downsample is not None: + x_down = self.downsample(x, H, W) + Wh, Ww = (H + 1) // 2, (W + 1) // 2 + return x, H, W, x_down, Wh, Ww + else: + return x, H, W, x, H, W + + return x + + +class SwinTransformer(nn.Layer): + """SwinTransformer class + Attributes: + num_classes: int, num of image classes + num_stages: int, num of stages contains patch merging and Swin blocks + depths: list of int, num of Swin blocks in each stage + num_heads: int, num of heads in attention module + embed_dim: int, output dimension of patch embedding + num_features: int, output dimension of whole network before classifier + mlp_ratio: float, hidden dimension of mlp layer is mlp_ratio * mlp input dim + qkv_bias: bool, if True, set qkv layers have bias enabled + qk_scale: float, scale factor for qk. 
+ ape: bool, if True, set to use absolute positional embeddings + window_size: int, size of patch window for inputs + dropout: float, dropout rate for linear layer + dropout_attn: float, dropout rate for attention + patch_embedding: PatchEmbedding, patch embedding instance + patch_resolution: tuple, number of patches in row and column + position_dropout: nn.Dropout, dropout op for position embedding + stages: SwinTransformerStage, stage instances. + norm: nn.LayerNorm, norm layer applied after transformer + avgpool: nn.AveragePool2D, pooling layer before classifer + fc: nn.Linear, classifier op. + """ + def __init__(self, config): + super(SwinTransformer, self).__init__() + pretrain_image_size = config.MODEL.TRANS.PRETRAIN_IMAGE_SIZE + patch_size = config.MODEL.TRANS.PATCH_SIZE + in_channels = config.MODEL.TRANS.IN_CHANNELS + embed_dim = config.MODEL.TRANS.EMBED_DIM + depths = config.MODEL.TRANS.STAGE_DEPTHS + num_heads = config.MODEL.TRANS.NUM_HEADS + window_size = config.MODEL.TRANS.WINDOW_SIZE + mlp_ratio = config.MODEL.TRANS.MLP_RATIO + qkv_bias = config.MODEL.TRANS.QKV_BIAS + qk_scale = config.MODEL.TRANS.QK_SCALE + dropout = config.MODEL.DROPOUT + attention_dropout = config.MODEL.ATTENTION_DROPOUT + droppath = config.MODEL.DROP_PATH + out_indices = config.MODEL.TRANS.OUT_INDICES + + self.ape = config.MODEL.TRANS.APE + self.out_indices = out_indices + self.num_stages = len(depths) + self.frozen_stages = config.MODEL.TRANS.FROZEN_STAGES + self.patch_embedding = PatchEmbedding(patch_size=patch_size, + in_channels=in_channels, + embed_dim=embed_dim) + + if self.ape: + pretrain_image_size = to_2tuple(pretrain_image_size) + patch_size = to_2tuple(patch_size) + patches_resolution = [pretrain_image_size[0] // patch_size[0], pretrain_image_size[1] // patch_size[1]] + self.absolute_positional_embedding = paddle.nn.ParameterList([ + paddle.create_parameter( + shape=[1, embed_dim, patches_resolution[0], patches_resolution[1]], dtype='float32', + default_initializer=paddle.nn.initializer.TruncatedNormal(std=.02))]) + + self.position_dropout = nn.Dropout(dropout) + + depth_decay = [x.item() for x in paddle.linspace(0, droppath, sum(depths))] + + self.stages = nn.LayerList() + for stage_idx in range(self.num_stages): + stage = SwinTransformerStage( + dim=int(embed_dim * 2 ** stage_idx), + depth=depths[stage_idx], + num_heads=num_heads[stage_idx], + window_size=window_size, + mlp_ratio=mlp_ratio, + qkv_bias=qkv_bias, + qk_scale=qk_scale, + dropout=dropout, + attention_dropout=attention_dropout, + droppath=depth_decay[ + sum(depths[:stage_idx]):sum(depths[:stage_idx+1])], + downsample=PatchMerging if ( + stage_idx < self.num_stages-1) else None, + ) + self.stages.append(stage) + + #self.norm = nn.LayerNorm(self.num_features) + #self.avgpool = nn.AdaptiveAvgPool1D(1) + #self.fc = nn.Linear(self.num_features, self.num_classes) + num_features = [int(embed_dim * 2 ** i) for i in range(self.num_stages)] + self.num_features = num_features + + # add norm layer for each output + for i_layer in out_indices: + layer = nn.LayerNorm(num_features[i_layer]) + layer_name = f'norm{i_layer}' + self.add_sublayer(layer_name, layer) + + self._freeze_stages() + + def _freeze_stages(self): + if self.frozen_stages >= 0: + self.patch_embedding.eval() + for name, param in self.patch_embedding.named_parameters(): + param.requires_grad = False + + if self.frozen_stages >= 1 and self.ape: + self.absolute_positional_embedding.requires_grad = False + + if self.frozen_stages >= 2: + self.position_dropout.eval() + for i in range(0, 
self.frozen_stages - 1): + m = self.stages[i] + m.eval() + for name, param in m.named_parameters(): + param.requires_grad = False + + def forward(self, x): + x = self.patch_embedding(x) + Wh, Ww = x.shape[2], x.shape[3] + if self.ape: + absolute_positional_embedding = F.interpolate(self.absolute_positional_embedding, + size=(Wh, Ww), mode='bicubic') + x = x + absolute_positional_embedding + x = x.flatten(2) + x = x.transpose([0, 2, 1]) + else: + x = x.flatten(2) + x = x.transpose([0, 2, 1]) + + x = self.position_dropout(x) + + outs = [] + for i in range(self.num_stages): + stage = self.stages[i] + x_out, H, W, x, Wh, Ww = stage(x, Wh, Ww) + + if i in self.out_indices: + norm_layer = getattr(self, f'norm{i}') + x_out = norm_layer(x_out) + + out = x_out.reshape([-1, H, W, self.num_features[i]]) + out = out.transpose([0, 3, 1, 2]) + outs.append(out) + + return tuple(outs) + + def train(self, mode=True): + super(SwinTransformer, self).train(mode) + self._freeze_stages() diff --git a/object_detection/Swin/swin_det.py b/object_detection/Swin/swin_det.py new file mode 100644 index 00000000..26daa268 --- /dev/null +++ b/object_detection/Swin/swin_det.py @@ -0,0 +1,67 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Swin Transformer Object Detection""" + +import paddle +import paddle.nn as nn +from config import get_config +from swin_backbone import SwinTransformer +from det_necks.fpn import FPN, LastLevelMaxPool +from det_heads.maskrcnn_head.rpn_head import RPNHead +from det_heads.maskrcnn_head.roi_head import RoIHead + +cfg = get_config() + +class SwinTransformerDet(nn.Layer): + def __init__(self, config): + super(SwinTransformerDet, self).__init__() + self.backbone = SwinTransformer(config) + self.neck = FPN( + in_channels=config.FPN.IN_CHANNELS, + out_channel=config.FPN.OUT_CHANNELS, + strides=config.FPN.STRIDES, + use_c5=config.FPN.USE_C5, + top_block=LastLevelMaxPool() + ) + self.rpnhead = RPNHead(config) + self.roihead = RoIHead(config) + + self.config = config + + def forward(self, x, gt=None): + feats = self.neck(self.backbone(x.tensors)) + rpn_out = self.rpnhead(feats, gt) + + if self.training and self.config.ROI.PAT_GT_AS_PRO: + proposals = [] + for proposal, gt_box in zip(rpn_out[0], gt["gt_boxes"]): + proposals.append(paddle.concat([proposal, gt_box])) + else: + proposals = rpn_out[0] + + + final_out = self.roihead(feats, proposals, gt) + + if self.training: + rpn_losses = rpn_out[2] + # if training, final_out returns losses, now we combine the losses dicts + final_out.update(rpn_losses) + + return final_out + + +def build_swin_det(config): + model = SwinTransformerDet(config) + return model diff --git a/object_detection/Swin/transforms.py b/object_detection/Swin/transforms.py new file mode 100644 index 00000000..7b6e5038 --- /dev/null +++ b/object_detection/Swin/transforms.py @@ -0,0 +1,376 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" Transforms for image data and detection targets""" + +import random +import numpy as np +import PIL +import paddle +import paddle.vision.transforms as T +from paddle.vision.transforms import functional as F +from random_erasing import RandomErasing +from box_ops import box_xyxy_to_cxcywh +from box_ops import box_xyxy_to_cxcywh_numpy + + +def crop(image, target, region): + cropped_image = T.crop(image, *region) + target = target.copy() + i, j, h, w = region + #target['size'] = paddle.to_tensor([h, w]).cpu() + target['size'] = np.array([h, w], dtype='float32') + fields = ['labels', 'area', 'iscrowd'] + + if 'boxes' in target: + boxes = target['boxes'] + #max_size = paddle.to_tensor([h, w], dtype='float32').cpu() + max_size = np.array([h, w], dtype='float32') + #cropped_boxes = boxes - paddle.to_tensor([j, i, j, i], dtype='float32').cpu() # box are (x1, y1, x2, y2) + cropped_boxes = boxes - np.array([j, i, j, i], dtype='float32') # box are (x1, y1, x2, y2) + #cropped_boxes = paddle.minimum(cropped_boxes.reshape([-1, 2, 2]), max_size) + cropped_boxes = np.minimum(cropped_boxes.reshape([-1, 2, 2]), max_size) + cropped_boxes = cropped_boxes.clip(min=0) + area = (cropped_boxes[:, 1, :] - cropped_boxes[:, 0, :]).prod(axis=1) + target['boxes'] = cropped_boxes.reshape([-1, 4]) + target['area'] = area + fields.append('boxes') + + if 'masks' in target: + target['masks'] = target['masks'][:, i:i + h, j:j + w] + fields.append('masks') + + + # remove the boxe or mask if the area is zero + if 'boxes' in target or 'masks' in target: + if 'boxes' in target: + cropped_boxes = target['boxes'].reshape((-1, 2, 2)) + # FIXME: select indices where x2 > x1 and y2 > y1 + # This paddle api will raise error in current env + #keep = paddle.all(cropped_boxes[:, 1, :] > cropped_boxes[:, 0, :], axis=1) + # Instead we use numpy for temp fix + #cropped_boxes = cropped_boxes.cpu().numpy() + keep = np.all(cropped_boxes[:, 1, :] > cropped_boxes[:, 0, :], axis=1) + #keep = keep.cpu().numpy() + else: + keep = target['masks'].flatten(1).any(1) + #keep = keep.cpu().numpy() + + keep_idx = np.where(keep)[0].astype('int32') + #keep = paddle.to_tensor(keep_idx).cpu() + keep = keep_idx + + for field in fields: + #target[field] = target[field].index_select(keep, axis=0) + target[field] = target[field][keep] + + return cropped_image, target + + +def hflip(image, target): + flipped_image = T.hflip(image) + w, h = image.size + target = target.copy() + if 'boxes' in target: + boxes = target['boxes'] # n x 4 + #boxes = boxes.index_select(paddle.to_tensor([2, 1, 0, 3], dtype='int32').cpu(), axis=1) + boxes = boxes[:, [2, 1, 0, 3]] + #boxes = boxes * paddle.to_tensor( + # [-1, 1, -1, 1], dtype='float32').cpu() + paddle.to_tensor([w, 0, w, 0], dtype='float32').cpu() + boxes = boxes * np.array([-1, 1, -1, 1], dtype='float32') + np.array([w, 0, w, 0], dtype='float32') + target['boxes'] = boxes + + if 'masks' in target: + target['masks'] = (target['masks']).flip(axis=[-1]) + + return 
flipped_image, target + + +def resize(image, target, size, max_size=None): + def get_size_with_aspect_ratio(image_size, size, max_size=None): + """ get new image size for rescale, aspect ratio is kept, and longer side must < max_size + Args: + image_size: tuple/list of image width and height + size: length of shorter side of scaled image + max_size: max length of longer side of scaled image + Returns: + size: output image size in (h, w) order. + """ + w, h = image_size + if max_size is not None: + min_original_size = float(min(w, h)) + max_original_size = float(max(w, h)) + # size is shorter side and keep the aspect ratio, if the longer side + # is larger than the max_size + if max_original_size / min_original_size * size > max_size: + # longer side is the max_size, shorter side size is: + size = int(round(max_size * min_original_size / max_original_size)) + if (w <= h and w == size) or (h <= w and h == size): + return (h, w) + + if w < h: + ow = size + oh = int(size * h / w) + else: + oh = size + ow = int(size * w / h) + + return (oh, ow) + + def get_size(image_size, size, max_size=None): + """"get new image size to rescale + Args: + image_size: tuple, Pillow image size, (width, height) + size: int or list/tuple, if size is list or tuple, return + this size as the new image size to rescale, if size is a + single int, then compute the new image size by this size + (as shorter side) and max_size (as longer side), also keep + the same aspect_ratio as original image. + max_size: longest side max size of new image size + Return: + size: tuple, (width, height) + """ + if isinstance(size, (list, tuple)): + return size[::-1] + else: + return get_size_with_aspect_ratio(image_size, size, max_size) + + # STEP0: get new image size + size = get_size(image.size, size, max_size) + # STEP1: resize image with new size + rescaled_image = T.resize(image, size) # here size is (h, w) + # STEP2: resize targets + if target is None: + return rescaled_image, None + + ratios = tuple(float(s) / float(s_orig) for s, s_orig in zip(rescaled_image.size, image.size)) + ratio_width, ratio_height = ratios + + target = target.copy() + if 'boxes' in target: + boxes = target['boxes'] + if boxes.shape[0] == 0: # empty boxes + scaled_boxes = boxes + else: # this line works well in pytorch, but not in paddle + #scaled_boxes = boxes * paddle.to_tensor([ratio_width, ratio_height, ratio_width, ratio_height]).cpu() + scaled_boxes = boxes * np.array([ratio_width, ratio_height, ratio_width, ratio_height], dtype='float32') + target['boxes'] = scaled_boxes + + if 'area' in target: + area = target['area'] + scaled_area = area * (ratio_width * ratio_height) + target['area'] = scaled_area + + h, w = size + #target['size'] = paddle.to_tensor([h, w]).cpu() + target['size'] = np.array([h, w], dtype='float32') + + if 'masks' in target: + masks = target['masks'] # [N, H, W] + masks = masks.unsqueeze(-1).astype('float32') #[N, H, W, 1] + masks = paddle.to_tensor(masks).cpu() + masks = paddle.nn.functional.interpolate( + masks, size, data_format='NHWC') #[N, H', W', 1] + masks = masks[:, :, :, 0] > 0.5 + masks = masks.astype('int32') + masks = masks.numpy() + target['masks'] = masks + + return rescaled_image, target + + +def pad(image, target, padding): + padded_image = T.pad(image, (0, 0, padding[0], padding[1])) + if target is None: + return padded_image, None + target = target.copy() + #target['size'] = paddle.to_tensor(padded_image.size[::-1]).cpu() + target['size'] = np.array(padded_image.size[::-1], dtype='float32') + if 'masks' in target: 
+ target['masks'] = T.pad(target['masks'], (0, padding[0], 0, padding[1])) + return padded_image, target + + +class RandomCrop(): + def __init__(self, size): + self.size = size + + @staticmethod + def get_param(image, output_size): + def _get_image_size(img): + if F._is_pil_image(img): + return img.size + elif F._is_numpy_image(img): + return img.shape[:2][::-1] + elif F._is_tensor_image(img): + return img.shape[1:][::-1] # chw + else: + raise TypeError("Unexpected type {}".format(type(img))) + + w, h = _get_image_size(image) + th, tw = output_size + if w == tw and h == th: + return 0, 0, h, w + + i = random.randint(0, h - th + 1) + j = random.randint(0, w - tw + 1) + return i, j, th, tw + + def __call__(self, image, target): + region = RandomCrop.get_param(image, self.size) + return crop(image, target, region) + + +class RandomSizeCrop(): + def __init__(self, min_size, max_size): + self.min_size = min_size + self.max_size = max_size + + def __call__(self, image, target): + w = random.randint(self.min_size, min(image.width, self.max_size)) + h = random.randint(self.min_size, min(image.height, self.max_size)) + region = RandomCrop.get_param(image, (h, w)) + return crop(image, target, region) + + +class CenterCrop(): + def __init__(self, size): + self.size = size + + def __call__(self, image, target): + image_width, image_height = image.size + crop_height, crop_width = self.size + crop_top = int(round((image_height - crop_height) / 2.)) + crop_left = int(round((image_width - crop_width) / 2.)) + return crop(image, target, (crop_top, crop_left, crop_height, crop_width)) + + +class RandomHorizontalFlip(): + def __init__(self, p=0.5): + self.p = p + + def __call__(self, image, target): + if random.random() < self.p: + return hflip(image, target) + return image, target + + +class RandomResize(): + def __init__(self, sizes, max_size=None): + assert isinstance(sizes, (list, tuple)) + self.sizes = sizes + self.max_size = max_size + + def __call__(self, image, target=None): + size = random.choice(self.sizes) + return resize(image, target, size, self.max_size) + + +class RandomPad(): + def __init__(self, max_pad): + self.max_pad = max_pad + + def __call__(self, image, target): + pad_x = random.randint(0, self.max_pad) + pad_y = random.randint(0, self.max_pad) + return pad(image, target, (pad_x, pad_y)) + + +class RandomSelect(): + """ Random select one the transforms to apply with probablity p""" + def __init__(self, transforms1, transforms2, p=0.5): + self.transforms1 = transforms1 + self.transforms2 = transforms2 + self.p = p + + def __call__(self, image, target): + if random.random() > self.p: + return self.transforms1(image, target) + return self.transforms2(image, target) + + +class ToTensor(): + def __call__(self, image, target): + return T.to_tensor(image), target + + +class RandomErasing(): + def __init__(self, *args, **kwargs): + self.eraser = RandomErasing(*args, **kwargs) + + def __call__(self, image, target): + return self.eraser(image), target + + +class Normalize(): + """Normalization for image and labels. 
+ + Specifically, image is normalized with -mean and /std, + boxes are converted to [cx, cy, w, h] format and scaled to + [0, 1] according to image size + """ + + def __init__(self, mean, std, norm_gt=False): + self.mean = mean + self.std = std + self.norm_gt = norm_gt + + def __call__(self, image, target=None): + image = T.functional.normalize(image, mean=self.mean, std=self.std) + if target is None: + return image, None + + if not self.norm_gt: + return image, target + + target = target.copy() + h, w = image.shape[-2:] + if 'boxes' in target and target['boxes'].shape[0] != 0: + boxes = target['boxes'] + boxes = box_xyxy_to_cxcywh_numpy(boxes) + #boxes = boxes / paddle.to_tensor([w, h, w, h], dtype='float32').cpu() + boxes = boxes / np.array([w, h, w, h], dtype='float32') + target['boxes'] = boxes + + return image, target + + +class Compose(): + def __init__(self, transforms): + self.transforms = transforms + + def __call__(self, image, target): + for t in self.transforms: + image, target = t(image, target) + return image, target + + def __repr__(self): + format_string = self.__class__.__name__ + "(" + for t in self.transforms: + format_string += '\n' + format_string += ' {0}'.format(t) + format_string += '\n)' + return format_string + + + + + + + + + + + + diff --git a/object_detection/Swin/utils.py b/object_detection/Swin/utils.py new file mode 100644 index 00000000..48d47ee8 --- /dev/null +++ b/object_detection/Swin/utils.py @@ -0,0 +1,225 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Utilities""" + +import copy +import pickle +import numpy as np +import paddle +import paddle.distributed as dist +from paddle.optimizer.lr import LRScheduler + + +class AverageMeter(): + """ Meter for monitoring losses""" + def __init__(self): + self.avg = 0 + self.sum = 0 + self.cnt = 0 + self.reset() + + def reset(self): + """reset all values to zeros""" + self.avg = 0 + self.sum = 0 + self.cnt = 0 + + def update(self, val, n=1): + """update avg by val and n, where val is the avg of n values""" + self.sum += val * n + self.cnt += n + self.avg = self.sum / self.cnt + + +def _max_by_axis(the_list): + maxes = the_list[0] + for sublist in the_list[1:]: + for idx, item in enumerate(sublist): + maxes[idx] = max(maxes[idx], item) + return maxes + + +class NestedTensor(): + """Each NestedTensor has .tensor and .mask attributes, which are paddle.Tensors""" + def __init__(self, tensors, mask): + self.tensors = tensors + self.mask = mask + + def decompose(self): + return self.tensors, self.mask + + def __repr__(self): + return str(self.tensors) + + +def nested_tensor_from_tensor_list(tensor_list, size_divisibility): + """make the batch handle different image sizes + + This method take a list of tensors with different sizes, + then max size is selected as the final batch size, + smaller samples are padded with zeros(bottom-right), + and corresponding masks are generated. 
+ + """ + max_size = _max_by_axis([list(img.shape) for img in tensor_list]) + + if size_divisibility > 1: + stride = size_divisibility + max_size[1] = (max_size[1] + (stride -1)) // stride * stride + max_size[2] = (max_size[2] + (stride -1)) // stride * stride + + batch_shape = [len(tensor_list)] + max_size # len is the num of images in this batch + b, c, h, w = batch_shape + dtype = tensor_list[0].dtype + data_tensor = paddle.zeros(batch_shape, dtype=dtype) + mask = paddle.ones((b, h, w), dtype='int32') + # zip has broadcast for tensor and mask + #print('===== inside nested_tensor_from_tensor_list') + # zip cannot used in paddle, which will create a new tensor. in pytorch it works well + #for img, pad_img, m in zip(tensor_list, tensor, mask): + # pad_img[: img.shape[0], : img.shape[1], : img.shape[2]] = img + # m[: img.shape[0], :img.shape[1]] = 0 + for idx in range(b): + s0 = tensor_list[idx].shape[0] + s1 = tensor_list[idx].shape[1] + s2 = tensor_list[idx].shape[2] + # direct set value raise error in current env, we use numpy to bypass + #data_tensor[idx, : s0, : s1, : s2] = tensor_list[idx].cpu().numpy() + data_tensor[idx, : s0, : s1, : s2] = tensor_list[idx] + mask[idx, : s1, : s2] = 0 + return NestedTensor(data_tensor, mask) + + +def reduce_dict(input_dict, average=True): + """Impl all_reduce for dict of tensors in DDP""" + world_size = dist.get_world_size() + if world_size < 2: + return input_dict + with paddle.no_grad(): + names = [] + values = [] + for k in sorted(input_dict.keys()): + names.append(k) + values.append(input_dict[k]) + values = paddle.stack(values, axis=0) + dist.all_reduce(values) + if average: + values /= world_size + reduced_dict = {k: v for k, v in zip(names, values)} + return reduced_dict + + +@paddle.no_grad() +def accuracy(output, target, topk=(1,)): + if target.numel() == 0: + return [paddle.zeros([])] + maxk = max(topk) + batch_size = target.size(0) + + _, pred = output.topk(maxk, 1, True, True) + pred = pred.t() + correct = pred.eq(target.reshape(1, -1).expand_as(pred)) + + res = [] + for k in topk: + correct_k = correct[:k].reshape(-1).astype('float32').sum(0) + res.append(correct_k.mul_(100.0 / batch_size)) + return res + + +class WarmupCosineScheduler(LRScheduler): + """Warmup Cosine Scheduler + + First apply linear warmup, then apply cosine decay schedule. + Linearly increase learning rate from "warmup_start_lr" to "start_lr" over "warmup_epochs" + Cosinely decrease learning rate from "start_lr" to "end_lr" over remaining + "total_epochs - warmup_epochs" + + Attributes: + learning_rate: the starting learning rate (without warmup), not used here! 
+ warmup_start_lr: warmup starting learning rate + start_lr: the starting learning rate (without warmup) + end_lr: the ending learning rate after whole loop + warmup_epochs: # of epochs for warmup + total_epochs: # of total epochs (include warmup) + """ + def __init__(self, + learning_rate, + warmup_start_lr, + start_lr, + end_lr, + warmup_epochs, + total_epochs, + cycles=0.5, + last_epoch=-1, + verbose=False): + """init WarmupCosineScheduler """ + self.warmup_epochs = warmup_epochs + self.total_epochs = total_epochs + self.warmup_start_lr = warmup_start_lr + self.start_lr = start_lr + self.end_lr = end_lr + self.cycles = cycles + super(WarmupCosineScheduler, self).__init__(learning_rate, last_epoch, verbose) + + def get_lr(self): + """ return lr value """ + if self.last_epoch < self.warmup_epochs: + val = (self.start_lr - self.warmup_start_lr) * float( + self.last_epoch)/float(self.warmup_epochs) + self.warmup_start_lr + return val + + progress = float(self.last_epoch - self.warmup_epochs) / float( + max(1, self.total_epochs - self.warmup_epochs)) + val = max(0.0, 0.5 * (1. + math.cos(math.pi * float(self.cycles) * 2.0 * progress))) + val = max(0.0, val * (self.start_lr - self.end_lr) + self.end_lr) + return val + + +def all_gather(data): + """ run all_gather on any picklable data (do not requires tensors) + Args: + data: picklable object + Returns: + data_list: list of data gathered from each rank + """ + world_size = dist.get_world_size() + if world_size == 1: + return [data] + + buffer = pickle.dumps(data) #write data into Bytes and stores in buffer + np_buffer = np.frombuffer(buffer, dtype=np.int8) + tensor = paddle.to_tensor(np_buffer, dtype='int32') # uint8 doese not have many ops in paddle + + # obtain Tensor size of each rank + local_size = paddle.to_tensor([tensor.shape[0]]) + size_list = [] + dist.all_gather(size_list, local_size) + max_size = max(size_list) + + # receiving tensors from all ranks, + # all_gather does not support different shape, so we use padding + tensor_list = [] + if local_size != max_size: + padding = paddle.empty(shape=(max_size - local_size, ), dtype='int32') + tensor = paddle.concat((tensor, padding), axis=0) + dist.all_gather(tensor_list, tensor) + + data_list = [] + for size, tensor in zip(size_list, tensor_list): + buffer = tensor.astype('uint8').cpu().numpy().tobytes()[:size] + data_list.append(pickle.loads(buffer)) + + return data_list diff --git a/object_detection/det_heads/det_utils/box_utils.py b/object_detection/det_heads/det_utils/box_utils.py new file mode 100644 index 00000000..4d97829f --- /dev/null +++ b/object_detection/det_heads/det_utils/box_utils.py @@ -0,0 +1,325 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
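For reference, a minimal usage sketch of the `WarmupCosineScheduler` added in `utils.py` above. The optimizer and the dummy model are illustrative stand-ins, and the schedule values simply mirror the `TRAIN` settings printed in the evaluation log earlier in this patch; note that `get_lr` calls `math.cos`/`math.pi`, so `utils.py` also needs `import math`.

```python
import paddle
from utils import WarmupCosineScheduler  # scheduler defined above (utils.py must also import math)

scheduler = WarmupCosineScheduler(learning_rate=0.0001,   # not used by the schedule itself
                                  warmup_start_lr=0.0,
                                  start_lr=0.0001,
                                  end_lr=0.0,
                                  warmup_epochs=20,
                                  total_epochs=300)
model = paddle.nn.Linear(10, 10)  # stand-in for the detector
optimizer = paddle.optimizer.SGD(learning_rate=scheduler,
                                 parameters=model.parameters())

for epoch in range(3):
    # ... run one training epoch with `optimizer` ...
    scheduler.step()              # advance the schedule once per epoch
    print(epoch, scheduler.get_lr())
```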
+ + +import math + +import paddle +from paddle.fluid.framework import in_dygraph_mode +from paddle.fluid import core +from paddle.fluid.layer_helper import LayerHelper + +def bbox2delta(src_boxes, tgt_boxes, weights=[1.0, 1.0, 1.0, 1.0]): + ''' + The function is used to compute two tensor boxes difference among (x, y, w, h). + + Args: + src_boxes (tensor): shape [N, 4]. + tgt_boxes (tensor): shape [N, 4]. + weights (list[float]): balance the dx, dy, dw, dh. + + Returns: + deltas (tensor): shape[N, 4]. + ''' + src_w = src_boxes[:, 2] - src_boxes[:, 0] + src_h = src_boxes[:, 3] - src_boxes[:, 1] + src_ctr_x = src_boxes[:, 0] + 0.5 * src_w + src_ctr_y = src_boxes[:, 1] + 0.5 * src_h + + tgt_w = tgt_boxes[:, 2] - tgt_boxes[:, 0] + tgt_h = tgt_boxes[:, 3] - tgt_boxes[:, 1] + tgt_ctr_x = tgt_boxes[:, 0] + 0.5 * tgt_w + tgt_ctr_y = tgt_boxes[:, 1] + 0.5 * tgt_h + + wx, wy, ww, wh = weights + dx = wx * (tgt_ctr_x - src_ctr_x) / src_w + dy = wy * (tgt_ctr_y - src_ctr_y) / src_h + dw = ww * paddle.log(tgt_w / src_w) + dh = wh * paddle.log(tgt_h / src_h) + + deltas = paddle.stack((dx, dy, dw, dh), axis=1) + return deltas + + +def delta2bbox(deltas, boxes, weights=[1.0, 1.0, 1.0, 1.0]): + ''' + The inverse process of bbox2delta. + ''' + clip_scale = math.log(1000.0 / 16) + + widths = boxes[:, 2] - boxes[:, 0] + heights = boxes[:, 3] - boxes[:, 1] + ctr_x = boxes[:, 0] + 0.5 * widths + ctr_y = boxes[:, 1] + 0.5 * heights + + wx, wy, ww, wh = weights + dx = deltas[:, 0::4] / wx + dy = deltas[:, 1::4] / wy + dw = deltas[:, 2::4] / ww + dh = deltas[:, 3::4] / wh + # Prevent sending too large values into paddle.exp() + dw = paddle.clip(dw, max=clip_scale) + dh = paddle.clip(dh, max=clip_scale) + + pred_ctr_x = dx * widths.unsqueeze(1) + ctr_x.unsqueeze(1) + pred_ctr_y = dy * heights.unsqueeze(1) + ctr_y.unsqueeze(1) + pred_w = paddle.exp(dw) * widths.unsqueeze(1) + pred_h = paddle.exp(dh) * heights.unsqueeze(1) + + pred_boxes = [] + pred_boxes.append(pred_ctr_x - 0.5 * pred_w) + pred_boxes.append(pred_ctr_y - 0.5 * pred_h) + pred_boxes.append(pred_ctr_x + 0.5 * pred_w) + pred_boxes.append(pred_ctr_y + 0.5 * pred_h) + pred_boxes = paddle.stack(pred_boxes, axis=-1) + + return pred_boxes + + +def boxes_area(boxes): + ''' + Compute boxes area. + + Args: + boxes (tensor): shape [M, 4] | [N, M, 4]. + + Returns: + areas (tensor): shape [M] | [N, M]. + ''' + assert boxes.shape[-1] == 4 + if boxes.dim() == 2: + boxes_wh = boxes[:, 2:] - boxes[:, :2] + return (boxes_wh[:, 0] * boxes_wh[:, 1]).clip(min=0) + + elif boxes.dim() == 3: + boxes_wh = boxes[:, :, 2:] - boxes[:, :, :2] + return (boxes_wh[:, :, 0] * boxes_wh[:, :, 1]).clip(min=0) + + else: + raise ValueError("The dim of boxes must be 2 or 3!") + + +def boxes_iou(boxes1, boxes2, mode='a'): + ''' + Compute the ious of two boxes tensor and the coordinate format of boxes is xyxy. + + Args: + boxes1 (tensor): when mode == 'a': shape [M, 4]; when mode == 'b': shape [M, 4] + boxes2 (tensor): when mode == 'a': shape [R, 4]; when mode == 'b': shape [M, 4] + mode (string | 'a' or 'b'): when mode == 'a': compute one to many; + when mode == 'b': compute one to one. 
+ + Returns: + ious (tensor): when mode == 'a': shape [M, R]; when mode == 'b': shape [M] + ''' + area1 = boxes_area(boxes1) + area2 = boxes_area(boxes2) + + if mode == 'a': + lt = paddle.maximum(boxes1.unsqueeze(-2)[:, :, :2], boxes2.unsqueeze(0)[:, :, :2]) + rb = paddle.minimum(boxes1.unsqueeze(-2)[:, :, 2:], boxes2.unsqueeze(0)[:, :, 2:]) + + inter_wh = (rb - lt).astype("float32").clip(min=0) + inter_area = inter_wh[:, :, 0] * inter_wh[:, :, 1] + + union_area = area1.unsqueeze(-1) + area2 - inter_area + 1e-6 + + ious = paddle.where(inter_area > 0, + inter_area / union_area, + paddle.zeros_like(inter_area, dtype="float32")) + + return ious, union_area + + elif mode == 'b': + assert boxes1.shape[0] == boxes2.shape[0] + + lt = paddle.maximum(boxes1[:, :2], boxes2[:, :2]) + rb = paddle.minimum(boxes1[:, 2:], boxes2[:, 2:]) + + inter_wh = (rb - lt).astype("float32").clip(min=0) + inter_area = inter_wh[:, 0] * inter_wh[:, 1] + + union_area = area1 + area2 - inter_area + 1e-6 + + ious = paddle.where(inter_area > 0, + inter_area / union_area, + paddle.zeros_like(inter_area, dtype="float32")) + + return ious, union_area + + else: + raise ValueError("Only support mode 'a' or 'b'") + + +def batch_iou(boxes1, boxes2, mode='a'): + ''' + Compute the ious of two boxes tensor and the coordinate format of boxes is xyxy. + + Args: + boxes1 (tensor): when mode == 'a': shape [N, M, 4]; when mode == 'b': shape [N, M, 4] + boxes2 (tensor): when mode == 'a': shape [N, R, 4]; when mode == 'b': shape [N, M, 4] + mode (string | 'a' or 'b'): when mode == 'a': compute one to many; + when mode == 'b': compute one to one + + Returns: + ious (tensor): when mode == 'a': shape [N, M, R]; when mode == 'b': shape [N, M] + ''' + area1 = boxes_area(boxes1) + area2 = boxes_area(boxes2) + + if mode == 'a': + lt = paddle.maximum(boxes1.unsqueeze(-2)[:, :, :, :2], boxes2.unsqueeze(1)[:, :, :, :2]) + rb = paddle.minimum(boxes1.unsqueeze(-2)[:, :, :, 2:], boxes2.unsqueeze(1)[:, :, :, 2:]) + + inter_wh = (rb - lt).astype("float32").clip(min=0) + inter_area = inter_wh[:, :, :, 0] * inter_wh[:, :, :, 1] + + union_area = area1.unsqueeze(-1) + area2.unsqueeze(-2) - inter_area + 1e-6 + + ious = paddle.where(inter_area > 0, + inter_area / union_area, + paddle.zeros_like(inter_area, dtype="float32")) + + return ious, union_area + + elif mode == 'b': + assert boxes1.shape[0] == boxes2.shape[0] + + lt = paddle.maximum(boxes1[:, :, :2], boxes2[:, :, :2]) + rb = paddle.minimum(boxes1[:, :, 2:], boxes2[:, :, 2:]) + + inter_wh = (rb - lt).astype("float32").clip(min=0) + inter_area = inter_wh[:, :, 0] * inter_wh[:, :, 1] + + union_area = area1 + area2 - inter_area + 1e-6 + + ious = paddle.where(inter_area > 0, + inter_area / union_area, + paddle.zeros_like(inter_area, dtype="float32")) + + return ious, union_area + else: + raise ValueError("Only support mode 'a' or 'b'") + + +def nonempty_bbox(boxes, min_size=0, return_mask=False): + w = boxes[:, 2] - boxes[:, 0] + h = boxes[:, 3] - boxes[:, 1] + mask = paddle.logical_and(h > min_size, w > min_size) + if return_mask: + return mask + keep = paddle.nonzero(mask).flatten() + return keep + + +def multiclass_nms(bboxes, + scores, + score_threshold, + keep_top_k, + nms_top_k=-1, + nms_threshold=0.3, + normalized=True, + nms_eta=1., + background_label=-1, + return_index=False, + return_rois_num=True, + rois_num=None, + name=None): + """ + This operator is to do multi-class non maximum suppression (NMS) on + boxes and scores. 
+ In the NMS step, this operator greedily selects a subset of detection bounding + boxes that have high scores larger than score_threshold, if providing this + threshold, then selects the largest nms_top_k confidences scores if nms_top_k + is larger than -1. Then this operator pruns away boxes that have high IOU + (intersection over union) overlap with already selected boxes by adaptive + threshold NMS based on parameters of nms_threshold and nms_eta. + Aftern NMS step, at most keep_top_k number of total bboxes are to be kept + per image if keep_top_k is larger than -1. + Args: + bboxes (tensor): Two types of bboxes are supported: + 1. (tensor) A 3-D Tensor with shape + [N, M, 4 or 8 16 24 32] represents the + predicted locations of M bounding bboxes, + N is the batch size. Each bounding box has four + coordinate values and the layout is + [xmin, ymin, xmax, ymax], when box size equals to 4. + 2. (tensor) A 3-D Tensor with shape [M, C, 4] + M is the number of bounding boxes, C is the + class number + scores (tensor): Two types of scores are supported: + 1. (tensor) A 3-D Tensor with shape [N, C, M] + represents the predicted confidence predictions. + N is the batch size, C is the class number, M is + number of bounding boxes. For each category there + are total M scores which corresponding M bounding + boxes. Please note, M is equal to the 2nd dimension + of BBoxes. + 2. (LoDTensor) A 2-D LoDTensor with shape [M, C]. + M is the number of bbox, C is the class number. + In this case, input BBoxes should be the second + case with shape [M, C, 4]. + background_label (int): The index of background label, the background + label will be ignored. If set to -1, then all + categories will be considered. Default: 0 + score_threshold (float): Threshold to filter out bounding boxes with + low confidence score. If not provided, + consider all boxes. + nms_top_k (int): Maximum number of detections to be kept according to + the confidences after the filtering detections based + on score_threshold. + nms_threshold (float): The threshold to be used in NMS. Default: 0.3 + nms_eta (float): The threshold to be used in NMS. Default: 1.0 + keep_top_k (int): Number of total bboxes to be kept per image after NMS + step. -1 means keeping all bboxes after NMS step. + normalized (bool): Whether detections are normalized. Default: True + return_index(bool): Whether return selected index. Default: False + rois_num(Tensor): 1-D Tensor contains the number of RoIs in each image. + The shape is [B] and data type is int32. B is the number of images. + If it is not None then return a list of 1-D Tensor. Each element + is the output RoIs' number of each image on the corresponding level + and the shape is [B]. None by default. + name(str): Name of the multiclass nms op. Default: None. + + Returns: + A tuple with two Variables: (Out, Index) if return_index is True, + otherwise, a tuple with one Variable(Out) is returned. + Out: A 2-D LoDTensor with shape [No, 6] represents the detections. + Each row has 6 values: [label, confidence, xmin, ymin, xmax, ymax] + or A 2-D LoDTensor with shape [No, 10] represents the detections. + Each row has 10 values: [label, confidence, x1, y1, x2, y2, x3, y3, + x4, y4]. No is the total number of detections. + If all images have not detected results, all elements in LoD will be + 0, and output tensor is empty (None). + Index: Only return when return_index is True. A 2-D LoDTensor with + shape [No, 1] represents the selected index which type is Integer. 
+ The index is the absolute value cross batches. No is the same number + as Out. If the index is used to gather other attribute such as age, + one needs to reshape the input(N, M, 1) to (N * M, 1) as first, where + N is the batch size and M is the number of boxes. + """ + helper = LayerHelper('multiclass_nms3', **locals()) + + if in_dygraph_mode(): + attrs = ('background_label', background_label, 'score_threshold', + score_threshold, 'nms_top_k', nms_top_k, 'nms_threshold', + nms_threshold, 'keep_top_k', keep_top_k, 'nms_eta', nms_eta, + 'normalized', normalized) + + output, index, nms_rois_num = core.ops.multiclass_nms3(bboxes, scores, + rois_num, *attrs) + if not return_index: + index = None + + return output, nms_rois_num, index \ No newline at end of file diff --git a/object_detection/det_heads/det_utils/generator_utils.py b/object_detection/det_heads/det_utils/generator_utils.py new file mode 100644 index 00000000..092c620a --- /dev/null +++ b/object_detection/det_heads/det_utils/generator_utils.py @@ -0,0 +1,500 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import math + +import paddle +import paddle.nn as nn +from paddle.fluid.framework import Variable, in_dygraph_mode +from paddle.fluid import core + +class AnchorGenerator(nn.Layer): + """ + Compute anchors in the standard ways described in + "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks". + + Attributes: + anchor_size (list[list[float]] | list[float]): + If ``anchor_size`` is list[list[float]], ``anchor_size[i]`` is the list of anchor sizes + (i.e. sqrt of anchor area) to use for the i-th feature map. + If ``anchor_size`` is list[float], ``anchor_size`` is used for all feature maps. + Anchor anchor_size are given in absolute lengths in units of + the input image; they do not dynamically scale if the input image size changes. + aspect_ratios (list[list[float]] or list[float]): list of aspect ratios + (i.e. height / width) to use for anchors. Same "broadcast" rule for `sizes` applies. + strides (list[int]): stride of each input feature. + offset (float): Relative offset between the center of the first anchor and the top-left + corner of the image. Value has to be in [0, 1). + Recommend to use 0.5, which means half stride. + """ + + def __init__(self, + anchor_sizes = [[32], [64], [128], [256], [512]], + aspect_ratios = [0.5, 1.0, 2.0], + strides = [4, 8, 16, 32, 64], + offset = 0.5): + super(AnchorGenerator, self).__init__() + + self.anchor_sizes = anchor_sizes + self.aspect_ratios = aspect_ratios + self.strides = strides + self.offset = offset + self.base_anchors = self._compute_anchors() + + assert 0. <= self.offset <= 1.0 + + def generate_anchors(self, + sizes = [32, 64, 128, 256, 512], + aspect_ratios = [0.5, 1.0, 2.0]): + """ + Generate a tensor storing canonical anchor boxes, which are all anchor + boxes of different sizes and aspect_ratios centered at (0, 0). 
+ We can later build the set of anchors for a full feature map by + shifting and tiling these tensors (see `meth:_grid_anchors`). + Args: + sizes (list[float] | tuple[float]): + aspect_ratios (list[float] | tuple[float]]): + Returns: + Tensor of shape (len(sizes) * len(aspect_ratios), 4) storing anchor boxes + in xyxy format. + """ + anchors = [] + + for size in sizes: + area = size ** 2.0 + for ratio in aspect_ratios: + w = math.sqrt(area / ratio) + h = ratio * w + x0, y0, x1, y1 = -w / 2.0, -h / 2.0, w / 2.0, h / 2.0 + anchors.append([x0, y0, x1, y1]) + + return paddle.to_tensor(anchors, dtype='float32') + + def _broadcast_params(self, params, num_features): + if not isinstance(params[0], (list, tuple)): + return [params] * num_features + if len(params) == 1: + return params * num_features + return params + + def _compute_anchors(self): + sizes = self._broadcast_params(self.anchor_sizes, len(self.strides)) + aspect_ratios = self._broadcast_params(self.aspect_ratios, len(self.strides)) + + base_anchors = [self.generate_anchors(s, a) for s, a in zip(sizes, aspect_ratios)] + + [self.register_buffer(t.name, t, persistable=False) for t in base_anchors] + + return base_anchors + + def _grid_anchors(self, grid_sizes): + anchors = [] + + for grid_size, stride, base_anchor in zip(grid_sizes, self.strides, self.base_anchors): + grid_h, grid_w = grid_size + + grid_x = paddle.arange( + self.offset * stride, grid_w * stride, step = stride, dtype='float32' + ) + grid_y = paddle.arange( + self.offset * stride, grid_h * stride, step = stride, dtype='float32' + ) + + grid_y, grid_x = paddle.meshgrid(grid_y, grid_x) + grid_x = grid_x.reshape([-1]) + grid_y = grid_y.reshape([-1]) + + grid_coord = paddle.stack([grid_x, grid_y, grid_x, grid_y], axis=1) + + anchors.append((grid_coord.unsqueeze(1) + base_anchor.unsqueeze(0)).reshape([-1, 4])) + + return anchors + + def forward(self, feats): + grid_sizes = [feat.shape[-2:] for feat in feats] + anchor_over_all_feat_maps = self._grid_anchors(grid_sizes) + + return anchor_over_all_feat_maps + + @property + def num_anchors(self): + return [len(num_a) for num_a in self.base_anchors][0] + +# feats = [] +# h, w = 800., 800 +# for i in range(4): +# feats.append(paddle.rand([4, 256, h / (2 ** (i + 2)), w / (2 ** (i + 2))])) + +# anchorgenerator = AnchorGenerator() +# res = anchorgenerator(feats) +# print(anchorgenerator.num_anchors) +# print(res) +def generate_proposals(scores, + bbox_deltas, + im_shape, + anchors, + variances, + pre_nms_top_n=6000, + post_nms_top_n=1000, + nms_thresh=0.5, + min_size=0.1, + eta=1.0, + pixel_offset=False, + return_rois_num=False, + name=None): + """ + **Generate proposal Faster-RCNN** + This operation proposes RoIs according to each box with their + probability to be a foreground object and + the box can be calculated by anchors. Bbox_deltais and scores + to be an object are the output of RPN. Final proposals + could be used to train detection net. + For generating proposals, this operation performs following steps: + 1. Transposes and resizes scores and bbox_deltas in size of + (H*W*A, 1) and (H*W*A, 4) + 2. Calculate box locations as proposals candidates. + 3. Clip boxes to image + 4. Remove predicted boxes with small area. + 5. Apply NMS to get final proposals as output. + + Args: + scores (tensor): A 4-D Tensor with shape [N, A, H, W] represents + the probability for each box to be an object. + N is batch size, A is number of anchors, H and W are height and + width of the feature map. The data type must be float32. 
+ bbox_deltas (tensor): A 4-D Tensor with shape [N, 4*A, H, W] + represents the difference between predicted box location and + anchor location. The data type must be float32. + im_shape (tensor): A 2-D Tensor with shape [N, 2] represents H, W, the + origin image size or input size. The data type can be float32 or + float64. + anchors (tensor): A 4-D Tensor represents the anchors with a layout + of [H, W, A, 4] or [H * W * A, 4]. H and W are height and width of the feature map, + num_anchors is the box count of each position. Each anchor is + in (xmin, ymin, xmax, ymax) format an unnormalized. The data type must be float32. + variances (tensor): A 4-D Tensor. The expanded variances of anchors with a layout of + [H, W, num_priors, 4]. Each variance is in (xcenter, ycenter, w, h) format. + The data type must be float32. + pre_nms_top_n (float): Number of total bboxes to be kept per image before NMS. + The data type must be float32. `6000` by default. + post_nms_top_n (float): Number of total bboxes to be kept per image after NMS. The data type must be float32. + `1000` by default. + nms_thresh (float): Threshold in NMS. The data type must be float32. `0.5` by default. + min_size (float): Remove predicted boxes with either height or + width < min_size. The data type must be float32. `0.1` by default. + eta (float): Apply in adaptive NMS, if adaptive `threshold > 0.5`, + `adaptive_threshold = adaptive_threshold * eta` in each iteration. + return_rois_num (bool): When setting True, it will return a 1D Tensor with shape [N, ] that includes Rois's + num of each image in one batch. The N is the image's num. For example, the tensor has values [4,5] that represents + the first image has 4 Rois, the second image has 5 Rois. It only used in rcnn model. + 'False' by default. + name(str, optional): For detailed information, please refer + to :ref:`api_guide_Name`. Usually name is no need to set and + None by default. + Returns: + tuple: + A tuple with format ``(rpn_rois, rpn_roi_probs)``. + - **rpn_rois**: The generated RoIs. 2-D Tensor with shape ``[N, 4]`` while ``N`` is the number of RoIs. + The data type is the same as ``scores``. + - **rpn_roi_probs**: The scores of generated RoIs. 2-D Tensor with shape ``[N, 1]`` while ``N`` is the number of RoIs. + The data type is the same as ``scores``. + """ + assert in_dygraph_mode() + assert return_rois_num, "return_rois_num should be True in dygraph mode." + attrs = ('pre_nms_topN', pre_nms_top_n, 'post_nms_topN', post_nms_top_n, + 'nms_thresh', nms_thresh, 'min_size', min_size, 'eta', eta, + 'pixel_offset', pixel_offset) + rpn_rois, rpn_roi_probs, rpn_rois_num = core.ops.generate_proposals_v2( + scores, bbox_deltas, im_shape, anchors, variances, *attrs) + + return rpn_rois, rpn_roi_probs, rpn_rois_num + + +class ProposalGenerator(object): + """ + For each feature map, select the `pre_nms_topk` highest scoring proposals, + apply NMS, clip proposals, and remove small boxes. Return the `post_nms_topk` + highest scoring proposals among all the feature maps for each image. + + Attributes: + pre_nms_top_n (int): number of top k scoring proposals to keep before applying NMS. + When RPN is run on multiple feature maps (as in FPN) this number is per + feature map.Default 6000 + post_nms_top_n (int): number of top k scoring proposals to keep after applying NMS. + When RPN is run on multiple feature maps (as in FPN) this number is total, + over all feature maps.Default 1000 + nms_thresh (float): Threshold in NMS. 
default 0.5 + min_size (float): minimum proposal box side length in pixels (absolute units + wrt input images). + eta (float): Apply in adaptive NMS, if adaptive `threshold > 0.5`, + `adaptive_threshold = adaptive_threshold * eta` in each iteration. + default 1. + topk_after_collect (bool): whether to adopt topk after batch + collection. If topk_after_collect is true, box filter will not be + used after NMS at each image in proposal generation. default false + """ + + def __init__(self, + pre_nms_top_n = 6000, + post_nms_top_n = 1000, + nms_thresh = .5, + min_size = .1, + eta = 1., + topk_after_collect = False): + super(ProposalGenerator, self).__init__() + self.pre_nms_top_n = pre_nms_top_n + self.post_nms_top_n = post_nms_top_n + self.nms_thresh = nms_thresh + self.min_size = min_size + self.eta = eta + self.topk_after_collect = topk_after_collect + + def __call__(self, scores, bbox_deltas, anchors, imgs_shape): + top_n = self.pre_nms_top_n if self.topk_after_collect else self.post_nms_top_n + variances = paddle.ones_like(anchors) + rpn_rois, rpn_rois_prob, rpn_rois_num = generate_proposals( + scores, + bbox_deltas, + imgs_shape, + anchors, + variances, + pre_nms_top_n=self.pre_nms_top_n, + post_nms_top_n=top_n, + nms_thresh=self.nms_thresh, + min_size=self.min_size, + eta=self.eta, + return_rois_num=True + ) + + return rpn_rois, rpn_rois_prob, rpn_rois_num, self.post_nms_top_n + + +def roi_align(input, + rois, + output_size, + spatial_scale=1.0, + sampling_ratio=-1, + rois_num=None, + aligned=True): + """ + Region of interest align (also known as RoI align) is to perform + bilinear interpolation on inputs of nonuniform sizes to obtain + fixed-size feature maps (e.g. 7*7). + + Args: + input (Tensor): Input feature, 4D-Tensor with the shape of [N,C,H,W], + where N is the batch size, C is the input channel, H is Height, W is weight. + The data type is float32 or float64. + rois (Tensor): ROIs (Regions of Interest) to pool over.It should be + a 2-D Tensor or 2-D LoDTensor of shape (num_rois, 4), the lod level is 1. + The data type is float32 or float64. Given as [[x1, y1, x2, y2], ...], + (x1, y1) is the top left coordinates, and (x2, y2) is the bottom right coordinates. + output_size (list[int, int] | tuple[int, int]): The pooled output size(h, w), data type is int32. + spatial_scale (list[float32], optional): Multiplicative spatial scale factor to translate ROI coords + from their input scale to the scale used when pooling. Default: 1.0 + sampling_ratio(int32, optional): number of sampling points in the interpolation grid. + If <=0, then grid points are adaptive to roi_width and pooled_w, likewise for height. Default: -1 + rois_num (Tensor): The number of RoIs in each image. Default: None + name(str, optional): For detailed information, please refer + to :ref:`api_guide_Name`. Usually name is no need to set and + None by default. + + Returns: + Tensor: + Output: The output of ROIAlignOp is a 4-D tensor with shape (num_rois, channels, pooled_h, pooled_w). + The data type is float32 or float64. + """ + + if isinstance(output_size, int): + output_size = (output_size, output_size) + + pooled_height, pooled_width = output_size + + if in_dygraph_mode(): + assert rois_num is not None, "rois_num should not be None in dygraph mode." 
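+        # Dygraph fast path: call the underlying C++ RoIAlign op directly. Each RoI
+        # is bilinearly sampled into a fixed (pooled_height, pooled_width) grid;
+        # `aligned=True` selects the half-pixel-corrected variant of RoIAlign.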
+ align_out = core.ops.roi_align( + input, rois, rois_num, "pooled_height", pooled_height, + "pooled_width", pooled_width, "spatial_scale", spatial_scale, + "sampling_ratio", sampling_ratio, "aligned", aligned) + + return align_out + + +def distribute_fpn_proposals(fpn_rois, + min_level, + max_level, + refer_level, + refer_scale, + pixel_offset=False, + rois_num=None): + """ + + **This op only takes LoDTensor as input.** In Feature Pyramid Networks + (FPN) models, it is needed to distribute all proposals into different FPN + level, with respect to scale of the proposals, the referring scale and the + referring level. Besides, to restore the order of proposals, we return an + array which indicates the original index of rois in current proposals. + + Args: + fpn_rois(tensor): 2-D Tensor with shape [N, 4] and data type is + float32 or float64. The input fpn_rois. + min_level(int32): The lowest level of FPN layer where the proposals come + from. + max_level(int32): The highest level of FPN layer where the proposals + come from. + refer_level(int32): The referring level of FPN layer with specified scale. + refer_scale(int32): The referring scale of FPN layer with specified level. + rois_num(tensor): 1-D Tensor contains the number of RoIs in each image. + The shape is [B] and data type is int32. B is the number of images. + If it is not None then return a list of 1-D Tensor. Each element + is the output RoIs' number of each image on the corresponding level + and the shape is [B]. None by default. + + Returns: + Tuple: + multi_rois(list[tensor]) : A list of 2-D LoDTensor with shape [M, 4] + and data type of float32 and float64. The length is + max_level-min_level+1. The proposals in each FPN level. + restore_ind(tensor): A 2-D Tensor with shape [N, 1], N is + the number of total rois. The data type is int32. It is + used to restore the order of fpn_rois. + rois_num_per_level(list(tensor)): A list of 1-D Tensor and each Tensor is + the RoIs' number in each image on the corresponding level. The shape + is [B] and data type of int32. B is the number of images. + + """ + num_lvl = max_level - min_level + 1 + + if in_dygraph_mode(): + assert rois_num is not None, "rois_num should not be None in dygraph mode." + attrs = ('min_level', min_level, 'max_level', max_level, 'refer_level', + refer_level, 'refer_scale', refer_scale, 'pixel_offset', + pixel_offset) + multi_rois, restore_ind, rois_num_per_level = core.ops.distribute_fpn_proposals( + fpn_rois, rois_num, num_lvl, num_lvl, *attrs) + + return multi_rois, restore_ind, rois_num_per_level + + +class RoIAlign(object): + ''' + Region of interest feature map pooler that supports pooling from + one or more feature maps. + ''' + def __init__( + self, + output_size, + scales, + sampling_ratio, + canonical_box_size=224, + canonical_level=4, + min_level=0, + max_level=3, + aligned=True + ): + ''' + Attributes: + output_size (int): output size of the pooled region. + scales (list[float]): The scale for each low-level pooling op relative to + the input image. For a feature map with stride s relative to the input + image, scale is defined as 1/s. The stride must be power of 2. + When there are multiple scales, they must form a pyramid, i.e. they must be + a monotically decreasing geometric sequence with a factor of 1/2. + sampling_ratio (int): The `sampling_ratio` parameter for the ROIAlign op. + canonical_box_size (int): A canonical box size in pixels (sqrt(box area)). 
The default + is heuristically defined as 224 pixels in the FPN paper (based on ImageNet + pre-training). + canonical_level (int): The feature map level index from which a canonically-sized box + should be placed. The default is defined as level 4 (stride=16) in the FPN paper, + i.e., a box of size 224x224 will be placed on the feature with stride=16. + The box placement for all boxes will be determined from their sizes w.r.t + canonical_box_size. For example, a box whose area is 4x that of a canonical box + should be used to pool features from feature level ``canonical_level+1``. + Note that the actual input feature maps given to this module may not have + sufficiently many levels for the input boxes. If the boxes are too large or too + small for the input feature maps, the closest level will be used. + start_level (int): The start level of FPN layer to extract RoI feature, default 0. + end_level (int): The end level of FPN layer to extract RoI feature, default 3. + aligned (bool): Whether to add offset to rois' coord in roi_align. default True. + ''' + super(RoIAlign, self).__init__() + self.output_size = output_size + self.scales = scales + self.sampling_ratio = sampling_ratio + self.canonical_box_size = canonical_box_size + self.canonical_level = canonical_level + self.min_level = min_level + self.max_level = max_level + self.aligned = aligned + + def __call__(self, feats, rois, rois_num): + ''' + Args: + feats (list[tensor]): features from fpn. + rois (list[tensor]): proposals from rpn. + rois_num (list[int]): the number of each img's proposals. + + Returns: + roi_features (tensor): A tensor of shape (M, C, output_size, output_size) + where M is the total number of boxes aggregated over all N batch images + and C is the number of channels in `x`. + ''' + if isinstance(rois_num, list): + rois_num = paddle.to_tensor(rois_num).astype("int32") + rois = paddle.concat(rois) + + if len(feats) == 1: + roi_features = roi_align( + feats[self.min_level], + rois, + self.output_size, + self.scales[0], + self.sampling_ratio, + rois_num=rois_num, + aligned=self.aligned + ) + + else: + rois_per_level, original_ind, rois_num_per_level = distribute_fpn_proposals( + rois, + self.min_level + 2, + self.max_level + 2, + self.canonical_level, + self.canonical_box_size, + rois_num=rois_num + ) + + roi_features_per_level = [] + + for l in range(self.min_level, self.max_level + 1): + roi_feats = roi_align( + feats[l], + rois_per_level[l], + self.output_size, + self.scales[l], + self.sampling_ratio, + rois_num=rois_num_per_level[l], + aligned = self.aligned + ) + + roi_features_per_level.append(roi_feats) + + roi_features = paddle.gather( + paddle.concat(roi_features_per_level), + original_ind + ) + + return roi_features + diff --git a/object_detection/det_heads/det_utils/target_assign.py b/object_detection/det_heads/det_utils/target_assign.py new file mode 100644 index 00000000..7969ab62 --- /dev/null +++ b/object_detection/det_heads/det_utils/target_assign.py @@ -0,0 +1,304 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + + +import paddle +from .box_utils import boxes_iou, bbox2delta + +def anchor_target_matcher(match_quality_matrix, + positive_thresh, + negative_thresh, + allow_low_quality_matches, + low_thresh = -float("inf")): + ''' + This class assigns to each predicted "element" (e.g., a box) a ground-truth + element. Each predicted element will have exactly zero or one matches; each + ground-truth element may be matched to zero or more predicted elements. + + Args: + match_quality_matrix (tensor): an MxN tensor, containing the pairwise quality + between M ground-truth elements and N predicted elements. + positive_thresh (float): the positive class threshold of iou between anchors and gt. + negative_thresh (float): the negative class threshold of iou between anchors and gt. + allow_low_quality_matches (bool): if True, produce additional matches + for predictions with maximum match quality lower than high_threshold. + + Returns: + matches (tensor): a vector of length M, where matches[i] is a matched + ground-truth index in [0, M). + match_labels (tensor): a vector of length M, where pred_labels[i] indicates + whether a prediction is a true or false positive or ignored. + + ''' + # matches is 1 x M, the index of anchors matching gt + matched_vals, matches = paddle.topk(match_quality_matrix, k = 1, axis = 0) + match_labels = paddle.full(matches.shape, -1, dtype = "int32") + neg_idx = paddle.logical_and(matched_vals > low_thresh, + matched_vals < negative_thresh) + + match_labels = paddle.where(matched_vals >= positive_thresh, + paddle.ones_like(match_labels), + match_labels) + match_labels = paddle.where(neg_idx, + paddle.zeros_like(match_labels), + match_labels) + + # highest_quality_foreach_gt is N x 1 + # For each gt, find the prediction with which it has highest quality + if allow_low_quality_matches: + highest_quality_foreach_gt = match_quality_matrix.max(axis=1, keepdim=True) + pred_inds_with_highest_quality = paddle.logical_and( + match_quality_matrix > 0, match_quality_matrix == highest_quality_foreach_gt).cast('int32').sum( + 0, keepdim=True) + match_labels = paddle.where(pred_inds_with_highest_quality > 0, + paddle.ones_like(match_labels), + match_labels) + + matches = matches.flatten() + match_labels = match_labels.flatten() + + return matches, match_labels + + +# reference: https://github.com/facebookresearch/detectron2/blob/master/detectron2/modeling/sampling.py +def subsample_labels(labels, + num_samples, + positive_fraction, + bg_label=0): + """ + Return `num_samples` (or fewer, if not enough found) + random samples from `labels` which is a mixture of positives & negatives. + It will try to return as many positives as possible without + exceeding `positive_fraction * num_samples`, and then try to + fill the remaining slots with negatives. + + Args: + labels (tensor): shape (N, ) label vector with values: + * -1: ignore + * bg_label: background ("negative") class + * otherwise: one or more foreground ("positive") classes + num_samples (int): The total number of labels with value >= 0 to return. + Values that are not sampled will be filled with -1 (ignore). + positive_fraction (float): The number of subsampled labels with values > 0 + is `min(num_positives, int(positive_fraction * num_samples))`. The number + of negatives sampled is `min(num_negatives, num_samples - num_positives_sampled)`. + In order words, if there are not enough positives, the sample is filled with + negatives. 
If there are also not enough negatives, then as many elements are + sampled as is possible. + bg_label (int): label index of background ("negative") class. + + Returns: + pos_idx, neg_idx (tensor): + 1D vector of indices. The total length of both is `num_samples` or fewer. + """ + positive = paddle.nonzero(paddle.logical_and(labels != -1, labels != bg_label)) + negative = paddle.nonzero(labels == bg_label) + + num_pos = int(num_samples * positive_fraction) + # protect against not enough positive examples + num_pos = min(positive.numel(), num_pos) + num_neg = num_samples - num_pos + # protect against not enough negative examples + num_neg = min(negative.numel(), num_neg) + + if num_pos == 0 and num_neg == 0: + pos_idx = paddle.zeros([0], dtype='int32') + neg_idx = paddle.zeros([0], dtype='int32') + return pos_idx, neg_idx + + # randomly select positive and negative examples + negative = negative.cast('int32').flatten() + neg_perm = paddle.randperm(negative.numel(), dtype='int32')[:int(num_neg)] + neg_idx = paddle.gather(negative, neg_perm) + + if num_pos == 0: + pos_idx = paddle.zeros([0], dtype='int32') + return pos_idx, neg_idx + + positive = positive.cast('int32').flatten() + pos_perm = paddle.randperm(positive.numel(), dtype='int32')[:int(num_pos)] + pos_idx = paddle.gather(positive, pos_perm) + + return pos_idx, neg_idx + + +def anchor_target_assign(anchors, + gt_boxes, + positive_thresh, + negative_thresh, + batch_size_per_image, + positive_fraction, + allow_low_quality_matches=False, + is_crowd=None, + weights=[1., 1., 1., 1.]): + ''' + Args: + anchors (tensor): shape [-1, 4] the sum of muti-level anchors. + gt_boxes (list): gt_boxes[i] is the i-th img's gt_boxes. + positive_thresh (float): the positive class threshold of iou between anchors and gt. + negative_thresh (float): the negative class threshold of iou between anchors and gt. + batch_size_per_image (int): number of anchors per image to sample for training. + positive_fraction (float): fraction of foreground anchors to sample for training. + allow_low_quality_matches (bool): if True, produce additional matches + for predictions with maximum match quality lower than high_threshold. + is_crowd (list | None): is_crowd[i] is is_crowd label of the i-th img's gt_boxes. + weights (list): more detail please see bbox2delta. + + Returns: + tgt_labels (list[tensor]): tgt_labels[i].shape is [Ni], the label(positive or negative) of anchors. + tgt_bboxes (list[tensor]): tgt_bboxes[i].shape is [Ni, 4], the matched gt_boxes. + tgt_deltas (list[tensor]): tgt_deltas[i].shape is [Ni, 4], the deltas between anchors and gt_boxes. 
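+
+    Example (illustrative sketch with tiny hand-made boxes):
+        anchors = paddle.to_tensor([[0., 0., 10., 10.], [20., 20., 30., 30.]])
+        gt_boxes = [paddle.to_tensor([[0., 0., 9., 9.]])]
+        labels, boxes, deltas = anchor_target_assign(
+            anchors, gt_boxes,
+            positive_thresh=0.7, negative_thresh=0.3,
+            batch_size_per_image=256, positive_fraction=0.5)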
+ ''' + tgt_labels = [] + tgt_bboxes = [] + tgt_deltas = [] + + low_thresh = -float("inf") + for i in range(len(gt_boxes)): + gt_bbox = gt_boxes[i] + n_gt = gt_bbox.shape[0] + + if n_gt == 0 or is_crowd is None: + n_is_crowd = 0 + else: + is_crowd_i = is_crowd[i] + n_is_crowd = paddle.nonzero(is_crowd_i).shape[0] + + match_quality_matrix, _ = boxes_iou(gt_bbox, anchors) + assert match_quality_matrix.dim() == 2 + + # ignore the iou between anchor and crowded ground-truth + if n_is_crowd > 0: + n_a = anchors.shape[0] + ones = paddle.ones([n_a]) + mask = is_crowd_i * ones + match_quality_matrix = match_quality_matrix * (1 - mask) - mask + low_thresh = -1 + # match_quality_matrix is N (gt) x M (predicted) + # assert (match_quality_matrix >= 0).all() + if match_quality_matrix.shape[0] == 0 or n_gt == n_is_crowd: + matches = paddle.full((match_quality_matrix.shape[1], ), 0, dtype='int64') + match_labels = paddle.full((match_quality_matrix.shape[1], ), 0, dtype='int32') + else: + matches, match_labels = anchor_target_matcher(match_quality_matrix, + positive_thresh, + negative_thresh, + allow_low_quality_matches, + low_thresh) + + pos_idx, neg_idx = subsample_labels(match_labels, + batch_size_per_image, + positive_fraction) + + # Fill with the ignore label (-1), then set positive and negative labels + labels = paddle.full(match_labels.shape, -1, dtype='int32') + if neg_idx.shape[0] > 0: + labels = paddle.scatter(labels, neg_idx, paddle.zeros_like(neg_idx)) + if pos_idx.shape[0] > 0: + labels = paddle.scatter(labels, pos_idx, paddle.ones_like(pos_idx)) + + if n_gt == 0: + matched_gt_boxes = paddle.zeros([0, 4]) + tgt_delta = paddle.zeros([0, 4]) + else: + matched_gt_boxes = paddle.gather(gt_bbox, matches) + tgt_delta = bbox2delta(anchors, matched_gt_boxes, weights) + matched_gt_boxes.stop_gradient = True + tgt_delta.stop_gradient = True + + labels.stop_gradient = True + tgt_labels.append(labels) + tgt_bboxes.append(matched_gt_boxes) + tgt_deltas.append(tgt_delta) + + return tgt_labels, tgt_bboxes, tgt_deltas + + +def roi_target_assign(proposals, + gt_boxes, + gt_classes, + num_classes, + positive_thresh, + negative_thresh, + batch_size_per_image, + positive_fraction, + allow_low_quality_matches=False): + ''' + It performs box matching between "roi" and "target",and assigns training labels + to the proposals. + + Args: + proposals (list[tensor]): the batch RoIs from rpn_head. + gt_boxes (list[tensor]): gt_boxes[i] is the i'th img's gt_boxes. + gt_classes (list[tensor]): gt_classes[i] is the i'th img's gt_classes. + num_classes (int): the number of class. + + Returns: + proposals_info (dict): a dict contains the information of proposals. 
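+            The dict holds per-image lists under the keys "num_fg", "proposals",
+            "num_proposals", "gt_boxes" and "gt_classes".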
+ ''' + + proposals_info = {} + num_fg_samples = [] + proposals_samples = [] + num_proposals = [] + gt_boxes_samples = [] + gt_cls_samples = [] + + for proposals_single_img, bbox_single_img, label_single_img in zip(proposals, gt_boxes, gt_classes): + match_quality_matrix, _ = boxes_iou(bbox_single_img, proposals_single_img) + matched_idxs, matched_labels = anchor_target_matcher(match_quality_matrix, + positive_thresh, + negative_thresh, + allow_low_quality_matches) + + if label_single_img.numel() > 0: + label_single_img = label_single_img.squeeze() + label_single_img = paddle.gather(label_single_img, matched_idxs) + label_single_img = paddle.where(matched_labels == 0, + paddle.full_like(label_single_img, num_classes), + label_single_img) + + label_single_img = paddle.where(matched_labels == -1, + paddle.full_like(label_single_img, -1), + label_single_img) + else: + label_single_img = paddle.zeros_like(matched_idxs) + num_classes + sample_gt_box = paddle.zeros_like(bbox_single_img) + + sampled_fg_idxs, sampled_bg_idxs = subsample_labels(label_single_img, + batch_size_per_image, + positive_fraction, + num_classes) + + sampled_idxs = paddle.concat([sampled_fg_idxs, sampled_bg_idxs]) + sample_proposal = paddle.gather(proposals_single_img, sampled_idxs) + sample_gt_cls = paddle.gather(label_single_img, sampled_idxs) + + if label_single_img.numel() > 0: + sample_box_idx = paddle.gather(matched_idxs, sampled_idxs) + sample_gt_box = paddle.gather(bbox_single_img, sample_box_idx) + + num_fg_samples.append(sampled_fg_idxs.shape[0]) + proposals_samples.append(sample_proposal) + num_proposals.append(sampled_idxs.shape[0]) + gt_boxes_samples.append(sample_gt_box) + gt_cls_samples.append(sample_gt_cls) + + proposals_info["num_fg"] = num_fg_samples + proposals_info["proposals"] = proposals_samples + proposals_info["num_proposals"] = num_proposals + proposals_info["gt_boxes"] = gt_boxes_samples + proposals_info["gt_classes"] = gt_cls_samples + + return proposals_info diff --git a/object_detection/det_heads/maskrcnn_head/config.py b/object_detection/det_heads/maskrcnn_head/config.py new file mode 100644 index 00000000..5293c9ec --- /dev/null +++ b/object_detection/det_heads/maskrcnn_head/config.py @@ -0,0 +1,51 @@ +import sys +import numpy as np +import paddle +from yacs.config import CfgNode as CN + +config = CN() +config.FPN = CN() +config.RPN = CN() +config.ROI = CN() +config.ROI.BOX_HEAD = CN() + +config.FPN.OUT_CHANNELS = 256 +config.RPN.ANCHOR_SIZE = [[32], [64], [128], [256], [512]] +config.RPN.ASPECT_RATIOS = [0.5, 1.0, 2.0] +config.RPN.STRIDES = [4, 8, 16, 32, 64] +config.RPN.OFFSET = 0.0 +config.RPN.PRE_NMS_TOP_N_TRAIN = 2000 +config.RPN.POST_NMS_TOP_N_TRAIN = 1000 +config.RPN.PRE_NMS_TOP_N_TEST = 1000 +config.RPN.POST_NMS_TOP_N_TEST = 1000 +config.RPN.NMS_THRESH = 0.7 +config.RPN.MIN_SIZE = 0.0 +config.RPN.TOPK_AFTER_COLLECT = True +config.RPN.POSITIVE_THRESH = 0.7 +config.RPN.NEGATIVE_THRESH = 0.3 +config.RPN.BATCH_SIZE_PER_IMG = 256 +config.RPN.POSITIVE_FRACTION = 0.5 +config.RPN.LOW_QUALITY_MATCHES = True + +config.ROI.SCORE_THRESH_INFER = 0.05 +config.ROI.NMS_THRESH_INFER = 0.5 +config.ROI.NMS_KEEP_TOPK_INFER =100 +config.ROI.NUM_ClASSES = 80 +config.ROI.POSITIVE_THRESH = 0.5 +config.ROI.NEGATIVE_THRESH = 0.5 +config.ROI.BATCH_SIZE_PER_IMG = 512 +config.ROI.POSITIVE_FRACTION = 0.25 +config.ROI.LOW_QUALITY_MATCHES = True +config.ROI.BOX_HEAD.REG_WEIGHTS = [10.0, 10.0, 5.0, 5.0] +config.ROI.BOX_HEAD.NUM_CONV = 0 +config.ROI.BOX_HEAD.CONV_DIM = 256 +config.ROI.BOX_HEAD.NUM_FC = 2 
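+# NUM_CONV = 0 and NUM_FC = 2 give the standard Faster R-CNN box head: two
+# fully-connected layers (each FC_DIM wide) applied to the pooled RoI features.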
+config.ROI.BOX_HEAD.FC_DIM = 1024 +config.ROI.SCALES = [1./4., 1./8., 1./16., 1./32., 1./64.] +config.ROI.ALIGN_OUTPUT_SIZE = 7 +config.ROI.SAMPLING_RATIO = 0 +config.ROI.CANONICAL_BOX_SIZE = 224 +config.ROI.CANONICAL_LEVEL = 4 +config.ROI.MIN_LEVEL = 0 +config.ROI.MAX_LEVEL = 3 +config.ROI.ALIGNED = True diff --git a/object_detection/det_heads/maskrcnn_head/roi_head.py b/object_detection/det_heads/maskrcnn_head/roi_head.py new file mode 100644 index 00000000..179b4075 --- /dev/null +++ b/object_detection/det_heads/maskrcnn_head/roi_head.py @@ -0,0 +1,308 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import numpy as np + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from paddle.nn.initializer import XavierNormal, XavierUniform, Normal + +from det_utils.target_assign import roi_target_assign +from det_utils.generator_utils import RoIAlign +from det_utils.box_utils import bbox2delta, delta2bbox, multiclass_nms + + +class BoxHead(nn.Layer): + """ + A head with several 3x3 conv layers (each followed by norm & relu), then + several fc layers (each followed by relu) and followed by two linear layers + for predicting Fast R-CNN outputs. + """ + + def __init__( + self, + num_classes, + in_channels, + output_size, + num_conv, + conv_dim, + num_fc, + fc_dim, + ): + ''' + Attributes: + num_classes (int): the number of class. + in_channels (int): the channels of inputs. + output_size (int): the size of output from pooler. + num_conv (int): the number of conv. + conv_dim (int): the output channels of each conv. + num_fc (int): the number of fc. + fc_dim (int): the output channels of each fc. 
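+
+        Example (illustrative, using the defaults from config.py):
+            head = BoxHead(num_classes=80, in_channels=256, output_size=7,
+                           num_conv=0, conv_dim=256, num_fc=2, fc_dim=1024)
+            scores, deltas = head(paddle.rand([512, 256, 7, 7]))
+            # scores: [512, 81], deltas: [512, 320]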
+ ''' + + super(BoxHead, self).__init__() + conv_dims = [conv_dim] * num_conv + fc_dims = [fc_dim] * num_fc + self.forward_net = nn.Sequential() + + for i, channel in enumerate(conv_dims): + conv = nn.Conv2D( + in_channels=in_channels, + out_channels=channel, + kernel_size=3, + padding=1, + weight_attr=paddle.ParamAttr(initializer=XavierNormal(fan_in=0.0)), + bias_attr=True + ) + + self.forward_net.add_sublayer("conv{}".format(i), conv) + self.forward_net.add_sublayer("act_c{}".format(i), nn.ReLU()) + in_channels = channel + + in_dim = output_size * output_size *in_channels + for i, out_dim in enumerate(fc_dims): + if i == 0: + self.forward_net.add_sublayer("flatten", nn.Flatten()) + + fc = nn.Linear(in_dim, + out_dim, + weight_attr=paddle.ParamAttr(initializer=XavierUniform(fan_in=in_dim, fan_out=in_dim))) + + self.forward_net.add_sublayer("linear{}".format(i), fc) + self.forward_net.add_sublayer("act_f{}".format(i), nn.ReLU()) + in_dim = out_dim + + self.cls_fc = nn.Linear(in_dim, + num_classes + 1, + weight_attr=paddle.ParamAttr(initializer=Normal(mean=0., std=0.01))) + + self.reg_fc = nn.Linear(in_dim, + num_classes * 4, + weight_attr=paddle.ParamAttr(initializer=Normal(mean=0., std=0.001))) + + def forward(self, inputs): + feats = self.forward_net(inputs) + pred_scores = self.cls_fc(feats) + pred_deltas = self.reg_fc(feats) + + return [pred_scores, pred_deltas] + + +class RoIHead(nn.Layer): + ''' + RoIHead will match proposals from RPNHead with gt (when training), + crop the regions and extract per-region features using proposals, + and make per-region predictions. + ''' + def __init__(self, config): + super(RoIHead, self).__init__() + self.config = config + + self.pooler = RoIAlign( + output_size=config.ROI.ALIGN_OUTPUT_SIZE, + scales=config.ROI.SCALES, + sampling_ratio=config.ROI.SAMPLING_RATIO, + canonical_box_size=config.ROI.CANONICAL_BOX_SIZE, + canonical_level=config.ROI.CANONICAL_LEVEL, + min_level=config.ROI.MIN_LEVEL, + max_level=config.ROI.MAX_LEVEL, + aligned=config.ROI.ALIGNED + ) + + self.predictor = BoxHead( + num_classes=config.ROI.NUM_ClASSES, + in_channels=config.FPN.OUT_CHANNELS, + output_size=config.ROI.ALIGN_OUTPUT_SIZE, + num_conv=config.ROI.BOX_HEAD.NUM_CONV, + conv_dim=config.ROI.BOX_HEAD.CONV_DIM, + num_fc=config.ROI.BOX_HEAD.NUM_FC, + fc_dim=config.ROI.BOX_HEAD.FC_DIM + ) + + def _det_forward(self, feats, proposals_info): + roi = proposals_info["proposals"] + rois_num = paddle.to_tensor(proposals_info["num_proposals"]).astype("int32") + roi_feats = self.pooler(feats, roi, rois_num) + predictions = self.predictor(roi_feats) + + return predictions + + def _get_loss(self, preds, proposals_info): + ''' + Args: + preds (list[tensor]): + pred_scores (tensor) shape is (num_proposals, num_cls + 1), The pred class score. + pred_deltas (tensor) shape is (num_proposals, num_cls * 4), The pred location. 
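+
+        Returns:
+            losses (dict): "loss_cls" is the softmax cross-entropy over all sampled
+                proposals; "loss_reg" is the L1 loss on the foreground box deltas,
+                normalised by the total number of sampled proposals.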
+ ''' + pred_scores, pred_deltas = preds + n_s = pred_deltas.shape[0] + + proposals = proposals_info["proposals"] + gt_classes = paddle.concat(proposals_info["gt_classes"]).reshape([-1]) + gt_boxes = paddle.concat(proposals_info["gt_boxes"]) + + if len(proposals) == 0: + proposals = paddle.zeros(shape=[n_s, 4], dtype="float32") + tgt_scores = paddle.full(shape=[n_s,], fill_value=-1, dtype="float32") + tgt_boxes = paddle.zeros(shape=[n_s, 4], dtype="float32") + else: + proposals = paddle.concat(proposals) + tgt_scores = gt_classes.reshape([-1, 1]) + tgt_boxes = gt_boxes.reshape([-1, 4]) + + losses = { + "loss_cls": F.cross_entropy(pred_scores, tgt_scores.astype("int64"), reduction='mean') + } + + fg_idx = paddle.nonzero( + paddle.logical_and(gt_classes >= 0, gt_classes < self.config.ROI.NUM_ClASSES) + ).flatten() + + fg_cls_base = paddle.gather(gt_classes, fg_idx) + fg_cls_start = paddle.arange(0, self.config.ROI.NUM_ClASSES * fg_idx.shape[0], self.config.ROI.NUM_ClASSES) + fg_cls_idx = fg_cls_base + fg_cls_start + + fg_idx.stop_gradient = True + tgt_boxes.stop_gradient = True + proposals.stop_gradient = True + tgt_scores.stop_gradient = True + fg_cls_base.stop_gradient = True + fg_cls_start.stop_gradient = True + + pred_deltas = pred_deltas.reshape([-1, self.config.ROI.NUM_ClASSES, 4]) + pred_deltas = paddle.gather(pred_deltas, fg_idx, axis=0).reshape([-1, 4]) + + pred_deltas = paddle.gather(pred_deltas, fg_cls_idx) + + tgt_boxes = paddle.gather(tgt_boxes, fg_idx) + proposals = paddle.gather(proposals, fg_idx) + + tgt_deltas = bbox2delta(proposals, tgt_boxes, self.config.ROI.BOX_HEAD.REG_WEIGHTS) + + loss_reg = F.l1_loss(pred_deltas, tgt_deltas, reduction="sum") / max(gt_classes.numel(), 1.0) + + losses["loss_reg"] = loss_reg + + return losses + + def _inference(self, preds, proposals_info, inputs): + num_proposals = proposals_info["num_proposals"] + proposals = proposals_info["proposals"] + proposals = paddle.concat(proposals) + + if not len(num_proposals): + return None + + pred_scores, pred_deltas = preds + + # pred_bbox shape [num_proposals_all, num_classes, 4] + pred_bbox = delta2bbox(pred_deltas, + proposals, + self.config.ROI.BOX_HEAD.REG_WEIGHTS) + + pred_bbox_list = paddle.split(pred_bbox, num_proposals) + pred_bbox_list = paddle.split(pred_bbox, num_proposals) + pred_scores = F.softmax(pred_scores) + pred_scores_list = paddle.split(pred_scores, num_proposals) + + post_pred = [] + for i in range(len(pred_bbox_list)): + num_p = num_proposals[i] + img_pred_boxes = pred_bbox_list[i] + img_pred_scores = pred_scores_list[i] + img_hw = inputs["imgs_shape"][i] + img_scale_factor = inputs["scale_factor_wh"][i] + + img_pred_boxes[:, :, 0::2] = paddle.clip( + img_pred_boxes[:, :, 0::2], min=0, max=img_hw[1] + ) / img_scale_factor[0] + + img_pred_boxes[:, :, 1::2] = paddle.clip( + img_pred_boxes[:, :, 1::2], min=0, max=img_hw[0] + ) / img_scale_factor[1] + + + output = multiclass_nms(bboxes=img_pred_boxes, + scores=img_pred_scores[:, :-1], + score_threshold=self.config.ROI.SCORE_THRESH_INFER, + keep_top_k=self.config.ROI.NMS_KEEP_TOPK_INFER, + nms_threshold=self.config.ROI.NMS_THRESH_INFER, + background_label=self.config.ROI.NUM_ClASSES, + rois_num=paddle.to_tensor([num_p]).astype("int32")) + + if output[1][0] == 0: + post_pred.append([]) + continue + + post_label = output[0][:, 0:1] + post_score = output[0][:, 1:2] + post_boxes = output[0][:, 2:] + + boxes_w = post_boxes[:, 2] - post_boxes[:, 0] + boxes_h = post_boxes[:, 3] - post_boxes[:, 1] + + keep = 
paddle.nonzero(paddle.logical_and(boxes_w > 0., boxes_h > 0.)).flatten() + + post_label = paddle.gather(post_label, keep) + post_score = paddle.gather(post_score, keep) + post_boxes = paddle.gather(post_boxes, keep) + + final_output = paddle.concat([post_label, post_score, post_boxes], axis=-1) + post_pred.append(final_output) + + return post_pred + + def forward(self, feats, proposals, inputs): + ''' + Args: + feats (list[tensor]): the outputs of fpn. + proposals (list[tensor]): list[i] denotes the proposals of the i'th imgs + from rpn head. + inputs (dict): the gt info, eg. gt_boxes, gt_classes, imgs_wh and so on. + + Returns: + losses (dict) | outputs (list[tensor]): + losses contains cls_losses and reg_losses. + the shape of outputs[i] is [M, 6], M is the number of final preds, + Each row has 6 values: [label, score, xmin, ymin, xmax, ymax] + ''' + + if self.training: + proposals_info = roi_target_assign( + proposals, + inputs["gt_boxes"], + inputs["gt_classes"], + self.config.ROI.NUM_ClASSES, + self.config.ROI.POSITIVE_THRESH, + self.config.ROI.NEGATIVE_THRESH, + self.config.ROI.BATCH_SIZE_PER_IMG, + self.config.ROI.POSITIVE_FRACTION, + self.config.ROI.LOW_QUALITY_MATCHES + ) + + predictions = self._det_forward(feats, proposals_info) + losses = self._get_loss(predictions, proposals_info) + + return losses + + else: + proposals_info = {"num_proposals": [len(proposal) for proposal in proposals]} + proposals_info["proposals"] = proposals + + predictions = self._det_forward(feats, proposals_info) + outputs = self._inference(predictions, proposals_info, inputs) + + return outputs \ No newline at end of file diff --git a/object_detection/det_heads/maskrcnn_head/rpn_head.py b/object_detection/det_heads/maskrcnn_head/rpn_head.py new file mode 100644 index 00000000..9ac650a9 --- /dev/null +++ b/object_detection/det_heads/maskrcnn_head/rpn_head.py @@ -0,0 +1,237 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from paddle.nn.initializer import Normal + +import sys +sys.path.append("PPViT-od_head/object_detection/ODHead") +from det_utils.generator_utils import AnchorGenerator, ProposalGenerator +from det_utils.target_assign import anchor_target_assign + + +class RPNHead(nn.Layer): + """ + Region Proposal Network uses a 3x3 conv to produce a shared hidden state from which one 1x1 conv + predicts objectness logits for each anchor and a second 1x1 conv predicts bounding-box deltas. + + Attributes: + anchor_generator (class): the generator of anchor. + train_proposal (class): configure of proposals generation at the stage of training. + test_proposal (class): configure of proposals generation at the stage of prediction. + in_channels (int): channel of input feature maps which can be derived by from_config. 
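+
+    Example (illustrative sketch; `config` follows maskrcnn_head/config.py and
+    `inputs` provides "imgs_shape" plus, during training, "gt_boxes"):
+        rpn = RPNHead(config)
+        rois, rois_num, rpn_losses = rpn(fpn_feats, inputs)  # rpn_losses is None at inference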
+ """ + def __init__(self, config): + super(RPNHead, self).__init__() + self.anchor_generator = AnchorGenerator(anchor_sizes=config.RPN.ANCHOR_SIZE, + aspect_ratios=config.RPN.ASPECT_RATIOS, + strides=config.RPN.STRIDES, + offset=config.RPN.OFFSET) + self.train_proposal = ProposalGenerator(pre_nms_top_n=config.RPN.PRE_NMS_TOP_N_TRAIN, + post_nms_top_n=config.RPN.POST_NMS_TOP_N_TRAIN, + nms_thresh=config.RPN.NMS_THRESH, + min_size=config.RPN.MIN_SIZE, + topk_after_collect=config.RPN.TOPK_AFTER_COLLECT) + self.test_proposal = ProposalGenerator(pre_nms_top_n=config.RPN.PRE_NMS_TOP_N_TEST, + post_nms_top_n=config.RPN.POST_NMS_TOP_N_TEST, + nms_thresh=config.RPN.NMS_THRESH, + min_size=config.RPN.MIN_SIZE, + topk_after_collect=config.RPN.TOPK_AFTER_COLLECT) + + self.num_anchors = self.anchor_generator.num_anchors + + num_channels = config.FPN.OUT_CHANNELS + self.conv = nn.Conv2D(num_channels, + num_channels, + kernel_size=3, + padding=1, + weight_attr=paddle.ParamAttr(initializer=Normal(mean=0., std=0.01))) + + self.objectness_logits = nn.Conv2D(num_channels, + self.num_anchors, + kernel_size=1, + padding=0, + weight_attr=paddle.ParamAttr(initializer=Normal(mean=0., std=0.01))) + + self.anchor_deltas = nn.Conv2D(num_channels, + self.num_anchors * 4, + kernel_size=1, + padding=0, + weight_attr=paddle.ParamAttr(initializer=Normal(mean=0., std=0.01))) + + self.config = config + + def predict(self, feats): + ''' + Predict the logits of each feature and the deltas of the anchors in each feature. + + Args: + feats (list[tensor]): Mutil-level feature from fpn. + + Returns: + pred_objectness_logits (list[tensor]): A list of L elements.Element i is a tensor of shape (N, A, Hi, Wi) representing + the predicted objectness logits for all anchors. A is the number of cell anchors. + pred_anchor_deltas (list[tensor]): A list of L elements. Element i is a tensor of shape (N, A * 4, Hi, Wi) + representing the predicted "deltas" used to transform anchors to proposals. + ''' + + pred_objectness_logits = [] + pred_anchor_deltas = [] + for feat in feats: + out = F.relu(self.conv(feat)) + pred_objectness_logits.append(self.objectness_logits(out)) + pred_anchor_deltas.append(self.anchor_deltas(out)) + + return pred_objectness_logits, pred_anchor_deltas + + def _get_proposals(self, scores, bbox_deltas, anchors, inputs): + ''' + Args: + scores (list[tensor]): the prediction logits of the mutil-level features. + scores[i].shape is [N, A, Hi, Wi] + bbox_deltas (list[tensor]): the prediction anchor deltas of the mutil-level features. + bbox_deltas[i].shape is [N, 4 * A, Hi, Wi] + anchors (list[tensor]): the prediction anchor of the mutil-level features. + anchors[i].shape is [Hi * Wi * A, 4] + inputs (dict): ground truth info. 
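+
+        Returns:
+            batch_proposal_rois (list[tensor]): per-image proposals kept after NMS
+                and the final top-k selection.
+            batch_proposal_rois_num (list[int]): the number of kept proposals per image.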
+ ''' + proposal_gen = self.train_proposal if self.training else self.test_proposal + + imgs_shape = inputs["imgs_shape"] + if isinstance(imgs_shape, list): + imgs_shape = paddle.stack(imgs_shape).astype("float32") + + batch_size = len(imgs_shape) + + batch_proposal_rois = [] + batch_proposal_rois_num = [] + for i in range(batch_size): + single_img_rois_list = [] + single_img_prob_list = [] + + for level_scores, level_deltas, level_anchors in zip(scores, bbox_deltas, anchors): + level_rois, level_rois_prob, _, post_nms_top_n = proposal_gen( + scores = level_scores[i:i+1], + bbox_deltas = level_deltas[i:i+1], + anchors = level_anchors, + imgs_shape = imgs_shape[i:i+1] + ) + if level_rois.shape[0] > 0: + single_img_rois_list.append(level_rois) + single_img_prob_list.append(level_rois_prob) + + if len(single_img_rois_list) == 0: + single_img_rois = paddle.zeros(shape=[0, 4]).astype("float32") + else: + single_img_rois = paddle.concat(single_img_rois_list) + single_img_prob = paddle.concat(single_img_prob_list).flatten() + + if single_img_prob.shape[0] > post_nms_top_n: + single_img_topk_prob, topk_inds = paddle.topk(single_img_prob, post_nms_top_n) + single_img_topk_rois = paddle.gather(single_img_rois, topk_inds) + else: + single_img_topk_rois = single_img_rois + + batch_proposal_rois.append(single_img_topk_rois) + batch_proposal_rois_num.append(single_img_topk_rois.shape[0]) + + return batch_proposal_rois, batch_proposal_rois_num + + def _get_losses(self, pred_logits, pred_loc, anchors, inputs): + anchors = paddle.concat(anchors) + gt_boxes = inputs["gt_boxes"] + is_crowd = inputs.get("is_crowd", None) + + tgt_scores, tgt_bboxes, tgt_deltas = anchor_target_assign( + anchors, + gt_boxes, + positive_thresh = self.config.RPN.POSITIVE_THRESH, + negative_thresh = self.config.RPN.NEGATIVE_THRESH, + batch_size_per_image = self.config.RPN.BATCH_SIZE_PER_IMG, + positive_fraction = self.config.RPN.POSITIVE_FRACTION, + allow_low_quality_matches = self.config.RPN.LOW_QUALITY_MATCHES, + is_crowd = is_crowd + ) + + # reshape to [N, Hi * Wi * A, 1] for compute loss + pred_scores = [ + s.transpose([0, 2, 3, 1]).reshape([s.shape[0], -1, 1]) for s in pred_logits + ] + + pred_deltas = [ + d.transpose([0, 2, 3, 1]).reshape([d.shape[0], -1, 4]) for d in pred_loc + ] + + pred_scores = paddle.concat(pred_scores, axis = 1).reshape([-1]) + pred_deltas = paddle.concat(pred_deltas, axis = 1).reshape([-1, 4]) + + tgt_scores = paddle.concat(tgt_scores).astype("float32") + tgt_deltas = paddle.concat(tgt_deltas).astype("float32") + tgt_scores.stop_gradient = True + tgt_deltas.stop_gradient = True + + pos_idx = paddle.nonzero(tgt_scores == 1) + valid_idx = paddle.nonzero(tgt_scores >= 0) + + if valid_idx.shape[0] == 0: + loss_rpn_cls = paddle.zeros([1]).astype("float32") + else: + pred_scores = paddle.gather(pred_scores, valid_idx) + tgt_scores = paddle.gather(tgt_scores, valid_idx).astype("float32") + tgt_scores.stop_gradient = True + loss_rpn_cls = F.binary_cross_entropy_with_logits( + logit=pred_scores, + label=tgt_scores, + reduction="sum" + ) + + if pos_idx.shape[0] == 0: + loss_rpn_reg = paddle.zeros([1]).astype("float32") + else: + pred_deltas = paddle.gather(pred_deltas, pos_idx) + tgt_deltas = paddle.gather(tgt_deltas, pos_idx) + loss_rpn_reg = paddle.abs(pred_deltas - tgt_deltas).sum() + + norm = self.config.RPN.BATCH_SIZE_PER_IMG * len(gt_boxes) + + return { + 'loss_rpn_cls': loss_rpn_cls / norm, + 'loss_rpn_reg': loss_rpn_reg / norm + } + + def forward(self, feats, inputs): + ''' + Args: + feats 
(list[tensor]): Mutil-level feature from fpn. + inputs (dict): ground truth info. + + Returns: + rois (list[tensor]): rois[i] is proposals of the i'th img. + rois_num (list[int]): rois[i] is number of the i'th img's proposals. + losses_dict (dict | None): when training is dict contains loss_rpn_cls and loss_rpn_reg. + ''' + pred_objectness_logits, pred_anchor_deltas = self.predict(feats) + anchors = self.anchor_generator(feats) + + rois, rois_num = self._get_proposals(pred_objectness_logits, pred_anchor_deltas, anchors, inputs) + + if self.training: + losses_dict = self._get_losses(pred_objectness_logits, pred_anchor_deltas, anchors, inputs) + + return rois, rois_num, losses_dict + else: + return rois, rois_num, None diff --git a/object_detection/det_heads/retinaNet_head/config.py b/object_detection/det_heads/retinaNet_head/config.py new file mode 100644 index 00000000..8799956c --- /dev/null +++ b/object_detection/det_heads/retinaNet_head/config.py @@ -0,0 +1,27 @@ +import numpy as np +import paddle +from yacs.config import CfgNode as CN + +config = CN() +config.RETINANET = CN() + +config.RETINANET.NUM_CONVS = 4 +config.RETINANET.INPUT_CHANNELS = 256 +config.RETINANET.NORM = "" +config.RETINANET.PRIOR_PROB = 0.01 +config.RETINANET.NUM_CLASSES = 80 +config.RETINANET.FOCAL_LOSS_ALPHA = 0.25 +config.RETINANET.FOCAL_LOSS_GAMMA = 2 +config.RETINANET.SMOOTHL1_LOSS_DELTA = 0 +config.RETINANET.POSITIVE_THRESH = 0.5 +config.RETINANET.NEGATIVE_THRESH = 0.4 +config.RETINANET.ALLOW_LOW_QUALITY = True +config.RETINANET.WEIGHTS = [1.0, 1.0, 1.0, 1.0] +config.RETINANET.SCORE_THRESH = 0.05 +config.RETINANET.KEEP_TOPK = 100 +config.RETINANET.NMS_TOPK = 1000 +config.RETINANET.NMS_THRESH = 0.5 +config.RETINANET.ANCHOR_SIZE = [[x, x * 2**(1.0/3), x * 2**(2.0/3)] for x in [32, 64, 128, 256, 512 ]] +config.RETINANET.ASPECT_RATIOS = [0.5, 1.0, 2.0] +config.RETINANET.STRIDES = [8.0, 16.0, 32.0, 64.0, 128.0] +config.RETINANET.OFFSET = 0 \ No newline at end of file diff --git a/object_detection/det_heads/retinaNet_head/post_process.py b/object_detection/det_heads/retinaNet_head/post_process.py new file mode 100644 index 00000000..79a5def8 --- /dev/null +++ b/object_detection/det_heads/retinaNet_head/post_process.py @@ -0,0 +1,121 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import paddle.nn.functional as F + +from det_utils.box_utils import nonempty_bbox, delta2bbox, multiclass_nms + +class RetinaNetPostProcess(object): + ''' + This class used to post_process the RetianNet-Head's output. 
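+    It decodes the per-level box deltas with their anchors, rescales the boxes to
+    the original image size, applies multi-class NMS and keeps at most
+    `keep_top_k` detections per image (see `__call__` for the exact tensor shapes).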
+ ''' + def __init__(self, + score_threshold, + keep_top_k, + nms_top_k, + nms_threshold, + bbox_reg_weights=[1.0, 1.0, 1.0, 1.0]): + super(RetinaNetPostProcess, self).__init__() + self.score_threshold=score_threshold + self.keep_topk=keep_top_k + self.topk_candidates=nms_top_k + self.num_thresh=nms_threshold + self.bbox_reg_weights = bbox_reg_weights + + def _process_single_level_pred(self, box_lvl, score_lvl, anchors, scale_factor_wh, img_whwh): + if isinstance(scale_factor_wh, list): + scale_factor_wh = paddle.concat(scale_factor_wh) + if isinstance(img_whwh, list): + img_whwh = paddle.concat(img_whwh) + + score_lvl = paddle.transpose(score_lvl, [0, 2, 1]) + score_lvl = F.sigmoid(score_lvl) + + batch_lvl = [] + for i in range(len(img_whwh)): + box_lvl_i = delta2bbox(box_lvl[i], + anchors, + self.bbox_reg_weights).reshape(anchors.shape) + + box_lvl_i[:, 0::2] = paddle.clip( + box_lvl_i[:, 0::2], min=0, max=img_whwh[i][0] + ) / scale_factor_wh[i][0] + box_lvl_i[:, 1::2] = paddle.clip( + box_lvl_i[:, 1::2], min=0, max=img_whwh[i][1] + ) / scale_factor_wh[i][1] + + batch_lvl.append(box_lvl_i) + + box_lvl = paddle.stack(batch_lvl) + + return box_lvl, score_lvl + + def __call__(self, pred_scores_list, pred_boxes_list, anchors, scale_factor_wh, img_whwh): + """ + Args: + pred_scores_list (list[Tensor]): tensor of shape (batch_size, R, num_classes). + The tensor predicts the classification probability for each proposal. + pred_boxes_list (list[Tensor]): tensors of shape (batch_size, R, 4). + The tensor predicts anchor's delta + anchors (list[Tensor]): mutil-level anchors. + scale_factor_wh (Tensor): tensors of shape [batch_size, 2] the scalor of per img + img_whwh (Tensor): tensors of shape [batch_size, 4] + Returns: + bbox_pred (Tensor): tensors of shape [num_boxes, 6] Each row has 6 values: + [label, confidence, xmin, ymin, xmax, ymax] + bbox_num (Tensor): tensors of shape [batch_size] the number of RoIs in each image. + """ + assert len(pred_boxes_list[0]) == len(scale_factor_wh) == len(img_whwh) + assert len(pred_boxes_list) == len(anchors) + + mutil_level_bbox = [] + mutil_level_score = [] + + for i in range(len(pred_boxes_list)): + lvl_res_b, lvl_res_s = self._process_single_level_pred( + pred_boxes_list[i], + pred_scores_list[i], + anchors[i], + scale_factor_wh, + img_whwh) + + mutil_level_bbox.append(lvl_res_b) + mutil_level_score.append(lvl_res_s) + + pred_boxes = paddle.concat(mutil_level_bbox, axis=1) # [N, \sum_{i=0}^{n} (Hi * Wi), 4] + pred_scores = paddle.concat(mutil_level_score, axis=2) + + assert pred_boxes.shape[1] == pred_scores.shape[2] + + bbox_pred, bbox_num, _ = multiclass_nms( + pred_boxes, + pred_scores, + score_threshold=self.score_threshold, + keep_top_k=self.keep_topk, + nms_top_k=self.topk_candidates, + nms_threshold=self.num_thresh, + ) + + pred_label = bbox_pred[:, 0:1] + pred_score = bbox_pred[:, 1:2] + pred_bbox = bbox_pred[:, 2:] + keep_mask = nonempty_bbox(pred_bbox, return_mask=True) + keep_mask = paddle.unsqueeze(keep_mask, [1]) + pred_label = paddle.where(keep_mask, pred_label, + paddle.ones_like(pred_label) * -1) + + pred_result = paddle.concat([pred_label, pred_score, pred_bbox], axis=1) + + return pred_result, bbox_num diff --git a/object_detection/det_heads/retinaNet_head/retinanet_head.py b/object_detection/det_heads/retinaNet_head/retinanet_head.py new file mode 100644 index 00000000..2230323f --- /dev/null +++ b/object_detection/det_heads/retinaNet_head/retinanet_head.py @@ -0,0 +1,166 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import math + +import paddle +import paddle.nn as nn + +from paddle.nn.initializer import Normal, Constant + +from retinanet_loss import RetinaNetLoss +from post_process import RetinaNetPostProcess +from det_utils.generator_utils import AnchorGenerator + +class RetinaNetHead(nn.Layer): + ''' + The head used in RetinaNet for object classification and box regression. + It has two subnets for the two tasks, with a common structure but separate parameters. + ''' + def __init__(self, config): + ''' + Args: + input_shape (List[ShapeSpec]): input shape. + num_classes (int): number of classes. Used to label background proposals. + num_anchors (int): number of generated anchors. + conv_dims (List[int]): dimensions for each convolution layer. + norm (str or callable): + Normalization for conv layers except for the two output layers. + See :func:`detectron2.layers.get_norm` for supported types. + loss_func (class): the class is used to compute loss. + prior_prob (float): Prior weight for computing bias. + ''' + super(RetinaNetHead, self).__init__() + + num_convs = config.RETINANET.NUM_CONVS + input_channels = config.RETINANET.INPUT_CHANNELS + norm = config.RETINANET.NORM + prior_prob = config.RETINANET.PRIOR_PROB + + self.num_classes = config.RETINANET.NUM_CLASSES + self.get_loss = RetinaNetLoss( + focal_loss_alpha=config.RETINANET.FOCAL_LOSS_ALPHA, + focal_loss_gamma=config.RETINANET.FOCAL_LOSS_GAMMA, + smoothl1_loss_delta=config.RETINANET.SMOOTHL1_LOSS_DELTA, + positive_thresh=config.RETINANET.POSITIVE_THRESH, + negative_thresh=config.RETINANET.NEGATIVE_THRESH, + allow_low_quality=config.RETINANET.ALLOW_LOW_QUALITY, + num_classes=config.RETINANET.NUM_CLASSES, + weights=config.RETINANET.WEIGHTS + ) + self.postprocess = RetinaNetPostProcess( + score_threshold=config.RETINANET.SCORE_THRESH, + keep_top_k=config.RETINANET.KEEP_TOPK, + nms_top_k=config.RETINANET.NMS_TOPK, + nms_threshold=config.RETINANET.NMS_THRESH, + bbox_reg_weights=config.RETINANET.WEIGHTS + ) + self.anchor_generator = AnchorGenerator(anchor_sizes=config.RETINANET.ANCHOR_SIZE, + aspect_ratios=config.RETINANET.ASPECT_RATIOS, + strides=config.RETINANET.STRIDES, + offset=config.RETINANET.OFFSET) + + num_anchors = self.anchor_generator.num_anchors + conv_dims = [input_channels] * num_convs + + cls_net = [] + reg_net = [] + + for in_channels, out_channels in zip( + [input_channels] + list(conv_dims), conv_dims + ): + cls_net.append( + nn.Conv2D(in_channels, out_channels, kernel_size=3, stride=1, padding=1, + weight_attr=paddle.ParamAttr(initializer=Normal(mean=0., std=0.01))) + ) + if norm == "bn": + cls_net.append(nn.BatchNorm2D(out_channels)) + cls_net.append(nn.ReLU()) + + reg_net.append( + nn.Conv2D(in_channels, out_channels, kernel_size=3, stride=1, padding=1, + weight_attr=paddle.ParamAttr(initializer=Normal(mean=0., std=0.01))) + ) + if norm == "bn": + reg_net.append(nn.BatchNorm2D(out_channels)) + reg_net.append(nn.ReLU()) + + self.cls_net = nn.Sequential(*cls_net) + self.reg_net = 
nn.Sequential(*reg_net) + + bias_value = -math.log((1 - prior_prob) / prior_prob) + self.cls_score = nn.Conv2D( + conv_dims[-1], num_anchors * self.num_classes, kernel_size=3, stride=1, padding=1, + weight_attr=paddle.ParamAttr(initializer=Normal(mean=0., std=0.01)), + bias_attr=paddle.ParamAttr(initializer=Constant(bias_value)) + ) + self.bbox_pred = nn.Conv2D( + conv_dims[-1], num_anchors * 4, kernel_size=3, stride=1, padding=1, + weight_attr=paddle.ParamAttr(initializer=Normal(mean=0., std=0.01)) + ) + + def forward(self, feats, inputs): + ''' + Returns: + loss_dict (dict) | pred_result(tensor), bbox_num(tensor): + loss_dict: contains cls_losses and reg_losses. + pred_result: the shape is [M, 6], M is the number of final preds, + Each row has 6 values: [label, score, xmin, ymin, xmax, ymax] + bbox_num: the shape is [N], N is the num of batch_size, + bbox_num[i] means the i'th img have bbox_num[i] boxes. + ''' + anchors = self.anchor_generator(feats) + + pred_scores = [] + pred_boxes = [] + + for feat in feats: + pred_scores.append(self.cls_score(self.cls_net(feat))) + pred_boxes.append(self.bbox_pred(self.reg_net(feat))) + + pred_scores_list = [ + transpose_to_bs_hwa_k(s, self.num_classes) for s in pred_scores + ] + pred_boxes_list = [ + transpose_to_bs_hwa_k(s, 4) for s in pred_boxes + ] + + if self.training: + anchors = paddle.concat(anchors) + loss_dict = self.get_loss(anchors, [pred_scores_list, pred_boxes_list], inputs) + + return loss_dict + + else: + img_whwh = paddle.concat([inputs["imgs_shape"][:, 1:2], + inputs["imgs_shape"][:, 0:1]], axis=-1) + pred_result, bbox_num = self.postprocess( + pred_scores_list, + pred_boxes_list, + anchors, + inputs["scale_factor_wh"], + img_whwh + ) + + return pred_result, bbox_num + + +def transpose_to_bs_hwa_k(tensor, k): + assert tensor.dim() == 4 + bs, _, h, w = tensor.shape + tensor = tensor.reshape([bs, -1, k, h, w]) + tensor = tensor.transpose([0, 3, 4, 1, 2]) + + return tensor.reshape([bs, -1, k]) diff --git a/object_detection/det_heads/retinaNet_head/retinanet_loss.py b/object_detection/det_heads/retinaNet_head/retinanet_loss.py new file mode 100644 index 00000000..53cf722b --- /dev/null +++ b/object_detection/det_heads/retinaNet_head/retinanet_loss.py @@ -0,0 +1,142 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
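The loss module that follows relies on Paddle's built-in `F.sigmoid_focal_loss`. As a rough reference for what that call computes, here is a minimal sketch of the focal-loss formula FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t) on invented toy tensors; it is an illustration only, not part of the repository code:

```python
# Minimal illustration of the sigmoid focal loss used by RetinaNetLoss below.
# Shapes and values are invented for the example.
import paddle
import paddle.nn.functional as F

logits = paddle.to_tensor([[2.0, -1.0], [0.5, 0.3]])    # [num_samples, num_classes]
labels = paddle.to_tensor([[1.0, 0.0], [0.0, 1.0]])     # one-hot targets (float32)
alpha, gamma = 0.25, 2.0

p = F.sigmoid(logits)
p_t = p * labels + (1.0 - p) * (1.0 - labels)           # probability of the true class
alpha_t = alpha * labels + (1.0 - alpha) * (1.0 - labels)
manual = (-alpha_t * (1.0 - p_t) ** gamma * paddle.log(p_t)).sum()

builtin = F.sigmoid_focal_loss(logits, labels, alpha=alpha, gamma=gamma, reduction='sum')
print(float(manual), float(builtin))                    # the two sums should match closely
```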
+ + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + +import sys +sys.path.append("PPViT-od_head/object_detection/head") +from det_utils.box_utils import bbox2delta, boxes_iou +from det_utils.target_assign import anchor_target_matcher + +class RetinaNetLoss(nn.Layer): + def __init__( + self, + focal_loss_alpha, + focal_loss_gamma, + smoothl1_loss_delta, + positive_thresh, + negative_thresh, + allow_low_quality=True, + num_classes=80, + weights=[1.0, 1.0, 1.0, 1.0] + ): + super(RetinaNetLoss, self).__init__() + + self.num_classes = num_classes + self.focal_loss_alpha = focal_loss_alpha + self.focal_loss_gamma = focal_loss_gamma + self.smoothl1_loss_delta = smoothl1_loss_delta + self.positive_thresh = positive_thresh + self.negative_thresh = negative_thresh + self.allow_low_quality = allow_low_quality + self.weights = weights + + self.loss_normalizer = 100 + self.loss_normalizer_momentum = 0.9 + + def label_anchors(self, anchors, gt): + batch_gt_box = gt["gt_boxes"] + batch_gt_class = gt["gt_classes"] + + gt_labels_list = [] + gt_boxes_list = [] + + for i in range(len(batch_gt_box)): + gt_boxes = batch_gt_box[i] + gt_classes = batch_gt_class[i].flatten() + + match_quality_matrix, _ = boxes_iou(gt_boxes, anchors) + matches_idxs, match_labels = anchor_target_matcher( + match_quality_matrix, + self.positive_thresh, + self.negative_thresh, + self.allow_low_quality, + low_thresh = -float("inf") + ) + + if len(gt_boxes) > 0: + matched_boxes_i = paddle.gather(gt_boxes, matches_idxs) + matched_classes_i = paddle.gather(gt_classes, matches_idxs) + matched_classes_i = paddle.where(match_labels == 0, + paddle.full_like(matched_classes_i, self.num_classes), + matched_classes_i) + matched_classes_i = paddle.where(match_labels == -1, + paddle.full_like(matched_classes_i, -1), + matched_classes_i) + else: + matched_boxes_i = paddle.zeros_like(anchors) + matched_classes_i = paddle.zeros_like(matches_idxs) + self.num_classes + + gt_boxes_list.append(matched_boxes_i) + gt_labels_list.append(matched_classes_i) + + return gt_boxes_list, gt_labels_list + + def forward(self, anchors, preds, inputs): + + pred_scores_list, pred_boxes_list = preds + + p_s = paddle.concat(pred_scores_list, axis=1) + p_b = paddle.concat(pred_boxes_list, axis=1) # [N, R, 4] + + gt_boxes, gt_classes = self.label_anchors(anchors, inputs) + gt_labels = paddle.stack(gt_classes).reshape([-1]) # [N * R] + + valid_idx = paddle.nonzero(gt_labels >= 0) + pos_mask = paddle.logical_and(gt_labels >= 0, gt_labels != self.num_classes) + pos_idx = paddle.nonzero(pos_mask).flatten() + num_pos = pos_idx.shape[0] + + self.loss_normalizer = self.loss_normalizer_momentum * self.loss_normalizer + ( + 1 - self.loss_normalizer_momentum + ) * max(num_pos, 1) + + p_s = paddle.reshape(p_s, [-1, self.num_classes]) + pred_logits = paddle.gather(p_s, valid_idx) + + gt_labels = F.one_hot(paddle.gather(gt_labels, valid_idx), num_classes=self.num_classes + 1)[ + :, :-1 + ] + + gt_labels.stop_gradient = True + + cls_loss = F.sigmoid_focal_loss(pred_logits, + gt_labels, + alpha=self.focal_loss_alpha, + gamma=self.focal_loss_gamma, + reduction='sum') + + gt_deltas_list = [ + bbox2delta(anchors, gt_boxes[i], self.weights) for i in range(len(gt_boxes)) + ] + + gt_deltas = paddle.concat(gt_deltas_list) + gt_deltas = paddle.gather(gt_deltas, pos_idx) + gt_deltas.stop_gradient = True + + p_b = paddle.reshape(p_b, [-1, 4]) + pred_deltas = paddle.gather(p_b, pos_idx) + + if self.smoothl1_loss_delta > 0: + reg_loss = F.smooth_l1_loss(pred_deltas, 
gt_deltas, reduction="sum", delta=self.smoothl1_loss_delta) + else: + reg_loss = F.l1_loss(pred_deltas, gt_deltas, reduction="sum") + + return { + "cls_loss": cls_loss / self.loss_normalizer, + "reg_loss": reg_loss / self.loss_normalizer + } diff --git a/object_detection/det_necks/fpn.py b/object_detection/det_necks/fpn.py new file mode 100644 index 00000000..30579c66 --- /dev/null +++ b/object_detection/det_necks/fpn.py @@ -0,0 +1,183 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import math + +import paddle +import paddle.nn as nn +from paddle.nn.initializer import XavierUniform +import paddle.nn.functional as F + +class ConvNorm(nn.Layer): + def __init__(self, + in_channels, + out_channels, + kernel_size, + stride=1, + padding=0, + dilation=1, + groups=1, + padding_mode='zeros', + weight_attr=None, + bias_attr=None, + norm=""): + super(ConvNorm, self).__init__() + + use_bias = None if norm == "" else False + + self.conv = nn.Conv2D( + in_channels=in_channels, + out_channels=out_channels, + kernel_size=kernel_size, + stride=stride, + padding=padding, + dilation=dilation, + groups=groups, + padding_mode=padding_mode, + weight_attr=weight_attr, + bias_attr=use_bias + ) + + if norm == "bn": + self.norm = nn.BatchNorm2D(out_channels) + else: + self.norm = None + + def forward(self, x): + out = self.conv(x) + + if self.norm is not None: + out = self.norm(out) + + return out + + +class FPN(nn.Layer): + def __init__( + self, + in_channels, + out_channel, + strides, + fuse_type="sum", + use_c5=True, + top_block=None, + norm="" + ): + super(FPN, self).__init__() + + assert len(strides) == len(in_channels) + + self.fuse_type = fuse_type + self.top_block = top_block + self.use_c5 = use_c5 + + lateral_convs = [] + output_convs = [] + + name_idx = [int(math.log2(s)) for s in strides] + + for idx, in_channel in enumerate(in_channels): + lateral_conv = ConvNorm( + in_channels=in_channel, + out_channels=out_channel, + kernel_size=1, + weight_attr=paddle.ParamAttr(initializer=XavierUniform(fan_out=in_channel)), + norm=norm + ) + output_conv = ConvNorm( + in_channels=out_channel, + out_channels=out_channel, + kernel_size=3, + padding=1, + weight_attr=paddle.ParamAttr(initializer=XavierUniform(fan_out=9*out_channel)), + norm=norm + ) + self.add_sublayer("fpn_lateral{}".format(name_idx[idx]), lateral_conv) + self.add_sublayer("fpn_output{}".format(name_idx[idx]), output_conv) + + lateral_convs.append(lateral_conv) + output_convs.append(output_conv) + + self.lateral_convs = lateral_convs[::-1] + self.output_convs = output_convs[::-1] + + def forward(self, feats): + res = [] + lateral_out = self.lateral_convs[0](feats[-1]) + res.append(self.output_convs[0](lateral_out)) + + for idx, (lateral_conv, output_conv) in enumerate( + zip(self.lateral_convs, self.output_convs) + ): + if idx > 0: # not include lateral_convs[0] + top2down_feat = F.interpolate(lateral_out, scale_factor=2.0, mode="nearest") + prev_out = lateral_conv(feats[-1-idx]) + lateral_out 
= prev_out + top2down_feat + if self.fuse_type == "avg": + lateral_out /= 2 + res.insert(0, output_conv(lateral_out)) + + if self.top_block is not None: + if self.use_c5: + top_block_out = self.top_block(feats[-1]) + else: + top_block_out = self.top_block(res[-1]) + + res.extend(top_block_out) + + return res + + +class LastLevelMaxPool(nn.Layer): + """ + This module is used in the original FPN to generate a downsampled + P6 feature from P5. + """ + + def __init__(self): + super().__init__() + + def forward(self, x): + return [F.max_pool2d(x, kernel_size=1, stride=2)] + + +class TopFeatP6P7(nn.Layer): + """ + This module is used in RetinaNet to generate extra layers, P6 and P7 from + C5 feature. + """ + def __init__(self, in_channel, out_channel): + super(TopFeatP6P7, self).__init__() # required before registering sublayers on nn.Layer + + self.p6 = nn.Conv2D( + in_channels=in_channel, + out_channels=out_channel, + kernel_size=3, + stride=2, + padding=1, + weight_attr=paddle.ParamAttr(initializer=XavierUniform(fan_out=9*in_channel)) + ) + self.p7 = nn.Conv2D( + in_channels=in_channel, + out_channels=out_channel, + kernel_size=3, + stride=2, + padding=1, + weight_attr=paddle.ParamAttr(initializer=XavierUniform(fan_out=9*out_channel)) + ) + + def forward(self, feat): + p6 = self.p6(feat) + p7 = self.p7(F.relu(p6)) + + return [p6, p7] \ No newline at end of file diff --git a/semantic_segmentation/README.md b/semantic_segmentation/README.md index a54bd5d2..adc6d8f9 100644 --- a/semantic_segmentation/README.md +++ b/semantic_segmentation/README.md @@ -1,3 +1,4 @@ +English | [简体中文](./README_cn.md) # Semantic segmentation toolkit based on Visual Transformers @@ -119,14 +120,14 @@ Trans10K_cls12 #### Single-scale testing on single GPU ```shell CUDA_VISIBLE_DEVICES=0 python3 val.py \ - --config ./configs/SETR/SETR_MLA_Large_480x480_80k_pascal_context_bs_8.yaml \ + --config ./configs/setr/SETR_MLA_Large_480x480_80k_pascal_context_bs_8.yaml \ --model_path ./pretrain_models/setr/SETR_MLA_pascal_context_b8_80k.pdparams ``` #### Multi-scale testing on single GPU ```shell CUDA_VISIBLE_DEVICES=0,1 python3 val.py \ - --config ./configs/SETR/SETR_MLA_Large_480x480_80k_pascal_context_bs_8.yaml \ + --config ./configs/setr/SETR_MLA_Large_480x480_80k_pascal_context_bs_8.yaml \ --model_path ./pretrain_models/setr/SETR_MLA_pascal_context_b8_80k.pdparams \ --multi_scales True ``` @@ -134,14 +135,14 @@ CUDA_VISIBLE_DEVICES=0,1 python3 val.py \ #### Single-scale testing on multi GPU ```shell CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -u -m paddle.distributed.launch val.py \ - --config ./configs/SETR/SETR_MLA_Large_480x480_80k_pascal_context_bs_8.yaml \ + --config ./configs/setr/SETR_MLA_Large_480x480_80k_pascal_context_bs_8.yaml \ --model_path ./pretrain_models/setr/SETR_MLA_pascal_context_b8_80k.pdparams ``` #### Multi-scale testing on multi GPU ```shell CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -u -m paddle.distributed.launch val.py \ - --config ./configs/SETR/SETR_MLA_Large_480x480_80k_pascal_context_bs_8.yaml \ + --config ./configs/setr/SETR_MLA_Large_480x480_80k_pascal_context_bs_8.yaml \ --model_path ./pretrain_models/setr/SETR_MLA_pascal_context_b8_80k.pdparams \ --multi_scales True ``` @@ -156,7 +157,7 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -u -m paddle.distributed.launch val.py \ ```shell CUDA_VISIBLE_DEVICES=0 python3 train.py \ - --config ./configs/SETR/SETR_MLA_Large_480x480_80k_pascal_context_bs_8.yaml + --config ./configs/setr/SETR_MLA_Large_480x480_80k_pascal_context_bs_8.yaml ``` > Note: > - The training options such as lr, image size, model layers, etc., can be changed in the `.yaml` file set in
`-cfg`. All the available settings can be found in `./config.py` @@ -165,7 +166,7 @@ CUDA_VISIBLE_DEVICES=0 python3 train.py \ ```shell CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -u -m paddle.distributed.launch train.py \ - --config ./configs/SETR/SETR_MLA_Large_480x480_80k_pascal_context_bs_8.yaml + --config ./configs/setr/SETR_MLA_Large_480x480_80k_pascal_context_bs_8.yaml ``` > Note: diff --git a/semantic_segmentation/README_cn.md b/semantic_segmentation/README_cn.md new file mode 100644 index 00000000..d406667f --- /dev/null +++ b/semantic_segmentation/README_cn.md @@ -0,0 +1,178 @@ +简体中文 | [English](./README.md) + +# 基于 Visual Transformers 的语义分割工具 + +语义分割旨在将图像中的每个像素分类到指定的语义类别,包括objects(例如,自行车、汽车、人)和stuff(例如,道路、长凳、天空). + +
+ +
+ +## 环境配置 +此代码在以下配置下开发: + +Hardware: 1/2/4/8 GPU for training and testing +Software: Centos 6.10, CUDA=10.2 Python=3.8, Paddle=2.1.0 + +## 安装 +1. 创建conda虚拟环境并激活环境. + +```shell +conda create -n paddlevit python=3.8 +conda activate ppvit +``` + +2. 按照官方说明安装PaddlePaddle: +```shell +conda install paddlepaddle-gpu==2.1.0 cudatoolkit=10.2 --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ +``` + +3. 安装 PaddleViT +```shell +git clone https://github.com/BR-IDL/PaddleViT.git +cd PaddleViT/semantic_segmentation +pip3 install -r requirements.txt +``` + +## Demo +我们提供了一个demo脚本[demo.py](./demo/demo.py),对单张图像进行推理操作,你可以将输入图像放在 `./demo/img`. +```shell +cd demo +CUDA_VISIBLE_DEVICES=0 python3 demo.py \ + --config ${CONFIG_FILE} \ + --model_path ${MODEL_PATH} \ + --pretrained_backbone ${PRETRAINED_BACKBONE} \ + --img_dir ${IMAGE_DIRECTORY} \ + --results_dir ${RESULT_DIRECTRORY} +``` +举例如下: +```shell +cd demo +CUDA_VISIBLE_DEVICES=0 python3 demo.py \ + --config ../configs/setr/SETR_PUP_Large_768x768_80k_cityscapes_bs_8.yaml \ + --model_path ../pretrain_models/setr/SETR_PUP_cityscapes_b8_80k.pdparams \ + --pretrained_backbone ../pretrain_models/backbones/vit_large_patch16_224.pdparams \ + --img_dir ./img/ \ + --results_dir ./results/ +``` + + +## Quick start: 训练并验证模型 + +### 1. 准备数据 +#### Pascal-Context 数据集 +下载Pascal-Context 数据集. "pascal_context/SegmentationClassContext" 是通过运行脚本 [voc2010_to_pascalcontext.py](tools/voc2010_to_pascalcontext.py)生成的. +具体来说,从http://host.robots.ox.ac.uk/pascal/VOC/voc2010/VOCtrainval_03-May-2010.tar 下载PASCAL VOC2010 ,从https://codalabuser.blob.core.windows.net/public/trainval_merged.json 下载注释文件. 它应该具有以下基本结构: +``` +pascal_context +|-- Annotations +|-- ImageSets +|-- JPEGImages +|-- SegmentationClass +|-- SegmentationClassContext +|-- SegmentationObject +|-- trainval_merged.json +|-- voc2010_to_pascalcontext.py +``` +#### ADE20K 数据集 +从http://sceneparsing.csail.mit.edu/ 下载ADE20K 数据集. 它应该具有以下基本结构: +``` +ADEChallengeData2016 +|-- annotations +| |-- training +| `-- validation +|-- images +| |-- training +| `-- validation +|-- objectInfo150.txt +`-- sceneCategories.txt +``` +### Cityscapes 数据集 +从https://www.cityscapes-dataset.com/ 下载Cityscapes数据集. **labelTrainIds.png** 用于cityscapes training, 由[convert_cityscapes.py](tools/convert_cityscapes.py)生成. 它应该具有以下基本结构: +``` +cityscapes +|-- gtFine +| |-- test +| |-- train +| `-- val +|-- leftImg8bit +| |-- test +| |-- train +| `-- val +``` +### Trans10kV2 数据集 +从 [Google Drive](https://drive.google.com/file/d/1YzAAMY8xfL9BMTIDU-nFC3dcGbSIBPu5/view?usp=sharing)或者[Baidu Drive](https://pan.baidu.com/s/1P-2l-Q2brbnwRd2kXi--Dg)(code: oqms)下载 Trans10kV2 数据集。 +它应该具有以下基本结构: + +``` +Trans10K_cls12 +|-- test +| |-- images +| `-- masks_12 +|-- train +| |-- images +| `-- masks_12 +|-- validation +| |-- images +| `-- masks_12 +``` + +### 2. 
测试 +#### 在单GPU上进行单尺度测试 +```shell +CUDA_VISIBLE_DEVICES=0 python3 val.py \ + --config ./configs/setr/SETR_MLA_Large_480x480_80k_pascal_context_bs_8.yaml \ + --model_path ./pretrain_models/setr/SETR_MLA_pascal_context_b8_80k.pdparams +``` + +#### 在单GPU上进行多尺度测试 +```shell +CUDA_VISIBLE_DEVICES=0,1 python3 val.py \ + --config ./configs/setr/SETR_MLA_Large_480x480_80k_pascal_context_bs_8.yaml \ + --model_path ./pretrain_models/setr/SETR_MLA_pascal_context_b8_80k.pdparams \ + --multi_scales True +``` + +#### 在多GPU上进行单尺度测试 +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -u -m paddle.distributed.launch val.py \ + --config ./configs/setr/SETR_MLA_Large_480x480_80k_pascal_context_bs_8.yaml \ + --model_path ./pretrain_models/setr/SETR_MLA_pascal_context_b8_80k.pdparams +``` + +#### 在多GPU上进行多尺度测试 +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -u -m paddle.distributed.launch val.py \ + --config ./configs/setr/SETR_MLA_Large_480x480_80k_pascal_context_bs_8.yaml \ + --model_path ./pretrain_models/setr/SETR_MLA_pascal_context_b8_80k.pdparams \ + --multi_scales True +``` + +> 注意: +> +> - that the `-model_path` 选项以预训练权重文件的路径作为输入 (分割模型, e.g., setr) + + +### 3. 训练 +#### 单GPU训练 + +```shell +CUDA_VISIBLE_DEVICES=0 python3 train.py \ + --config ./configs/setr/SETR_MLA_Large_480x480_80k_pascal_context_bs_8.yaml +``` +> 注意: +> - 可以在`-cfg`中设置的 `.yaml`文件中更改lr,图像尺寸,模型层等训练选项。所有可用的设置均在`./config.py`可以找到。 + +#### 多GPU训练 + +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -u -m paddle.distributed.launch train.py \ + --config ./configs/setr/SETR_MLA_Large_480x480_80k_pascal_context_bs_8.yaml + +``` +> 注意: +> - 可以在`-cfg`中设置的 `.yaml`文件中更改lr,图像尺寸,模型层等训练选项。所有可用的设置均在`./config.py`可以找到。 + + +## Contact +如果您有任何问题, 请在我们的Github上创建一个[issue](https://github.com/BR-IDL/PaddleViT/issues). diff --git a/semantic_segmentation/config.py b/semantic_segmentation/config.py index 753dcb38..5758697b 100644 --- a/semantic_segmentation/config.py +++ b/semantic_segmentation/config.py @@ -62,9 +62,20 @@ _C.MODEL.TRANS.STRIDES = [4, 2, 2, 2] _C.MODEL.TRANS.SR_RATIOS = [8, 4, 2, 1] +## special settings for CSwin Transformer +_C.MODEL.TRANS.SPLIT_SIZES = None + +## special settings for Focal Transformer +_C.MODEL.TRANS.FOCAL_STAGES = None +_C.MODEL.TRANS.FOCAL_LEVELS = None +_C.MODEL.TRANS.FOCAL_WINDOWS = None +_C.MODEL.TRANS.EXPAND_STAGES = None +_C.MODEL.TRANS.EXPAND_SIZES = None +_C.MODEL.TRANS.USE_CONV_EMBED = True + # MLA Decoder setting _C.MODEL.MLA = CN() -#_C.MODEL.MLA.MLA_INDEX = [2, 5, 8, 11] # Base: [2, 5, 8, 11]; Large: [5, 11, 17, 23] +#_C.MODEL.MLA.MLA_INDEX = [2, 5, 8, 11] # Base: [2, 5, 8, 11]; Large: [5, 11, 17, 23] _C.MODEL.MLA.MLA_CHANNELS = 256 _C.MODEL.MLA.MLAHEAD_CHANNELS=128 _C.MODEL.MLA.AUXIHEAD = False @@ -100,6 +111,8 @@ _C.MODEL.AUX = CN() _C.MODEL.AUX.AUXIHEAD = True _C.MODEL.AUX.AUXHEAD_ALIGN_CORNERS = False +_C.MODEL.AUX.LOSS = True +_C.MODEL.AUX.AUX_WEIGHT = 0.4 # Auxilary FCN Head _C.MODEL.AUXFCN = CN() @@ -124,36 +137,38 @@ # training settings _C.TRAIN = CN() +_C.TRAIN.LOSS = "MixSoftmaxCrossEntropyLoss" +_C.TRAIN.WEIGHTS = [1, 0.4, 0.4, 0.4, 0.4] _C.TRAIN.USE_GPU = True _C.TRAIN.LAST_EPOCH = 0 _C.TRAIN.BASE_LR = 0.001 #0.003 for pretrain # 0.03 for finetune _C.TRAIN.END_LR = 1e-4 _C.TRAIN.DECODER_LR_COEF = 1.0 -_C.TRAIN.GRAD_CLIP = 1.0 _C.TRAIN.ITERS = 80000 -_C.TRAIN.WEIGHT_DECAY = 0.0 # 0.0 for finetune -_C.TRAIN.POWER=0.9 +_C.TRAIN.POWER = 0.9 _C.TRAIN.DECAY_STEPS= 80000 _C.TRAIN.APEX = False +_C.TRAIN.IGNORE_INDEX = 255 _C.TRAIN.LR_SCHEDULER = CN() _C.TRAIN.LR_SCHEDULER.NAME = 'PolynomialDecay' 
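One practical consequence of moving `GRAD_CLIP` and `WEIGHT_DECAY` under `TRAIN.OPTIMIZER` in this file is that YAML configs still using the old top-level `TRAIN` keys would no longer merge, since yacs rejects keys that are absent from the defaults; this is why the per-model YAML files further below are updated in the same way. A small sketch of that behaviour, with made-up values:

```python
# Sketch of the yacs merge behaviour behind the GRAD_CLIP / WEIGHT_DECAY move.
# The config tree and values below are illustrative only.
from yacs.config import CfgNode as CN

_C = CN()
_C.TRAIN = CN()
_C.TRAIN.OPTIMIZER = CN()
_C.TRAIN.OPTIMIZER.NAME = 'SGD'
_C.TRAIN.OPTIMIZER.WEIGHT_DECAY = 0.0      # previously _C.TRAIN.WEIGHT_DECAY

stale = CN()
stale.TRAIN = CN()
stale.TRAIN.WEIGHT_DECAY = 0.01            # old-style key, no longer defined in the defaults

try:
    _C.merge_from_other_cfg(stale)
except KeyError as err:
    print("rejected:", err)                # yacs refuses config keys missing from the defaults
```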
-_C.TRAIN.LR_SCHEDULER.MILESTONES = "30, 60, 90" # only used in StepLRScheduler -_C.TRAIN.LR_SCHEDULER.DECAY_EPOCHS = 30 # only used in StepLRScheduler -_C.TRAIN.LR_SCHEDULER.DECAY_RATE = 0.1 # only used in StepLRScheduler -_C.TRAIN.LR_SCHEDULER.POWER = 0.9 # only used in PolynomialDecay -_C.TRAIN.LR_SCHEDULER.GAMMA = 0.1 -_C.TRAIN.LR_SCHEDULER.OHEM = False # whether to use ohem -_C.TRAIN.LR_SCHEDULER.AUX = False # whether to use aux loss -_C.TRAIN.LR_SCHEDULER.AUX_WEIGHT = 0.4 # aux loss weight -_C.TRAIN.LR_SCHEDULER.LOSS_NAME = '' # loss name -_C.TRAIN.LR_SCHEDULER.DECODER_LR_FACTOR = 10.0 # decoder lr x10 +_C.TRAIN.LR_SCHEDULER.WARM_UP_STEPS = 0 +_C.TRAIN.LR_SCHEDULER.WARM_UP_LR_INIT = 0.0 +_C.TRAIN.LR_SCHEDULER.MILESTONES = [30, 60, 90] +_C.TRAIN.LR_SCHEDULER.POWER = 0.9 # learning rate scheduler for WarmupPolyLR +_C.TRAIN.LR_SCHEDULER.GAMMA = 0.1 # learning rate scheduler for WarmupMultiStepLR _C.TRAIN.OPTIMIZER = CN() _C.TRAIN.OPTIMIZER.NAME = 'SGD' _C.TRAIN.OPTIMIZER.EPS = 1e-8 _C.TRAIN.OPTIMIZER.BETAS = (0.9, 0.999) # for adamW _C.TRAIN.OPTIMIZER.MOMENTUM = 0.9 +_C.TRAIN.OPTIMIZER.NESTEROV = False +_C.TRAIN.OPTIMIZER.WEIGHT_DECAY = 0.0 +_C.TRAIN.OPTIMIZER.CENTERTED = False +_C.TRAIN.OPTIMIZER.RHO = 0.95 +_C.TRAIN.OPTIMIZER.GRAD_CLIP = None + # Trans2Seg settings _C.MODEL.TRANS2SEG = CN() @@ -178,7 +193,7 @@ # misc _C.SAVE_DIR = "./output" -_C.KEEP_CHECKPOINT_MAX = 3 +_C.KEEP_CHECKPOINT_MAX = 10 _C.TAG = "default" _C.SAVE_FREQ_CHECKPOINT = 1000 # freq to save chpt _C.LOGGING_INFO_FREQ = 50 # freq to logging info diff --git a/semantic_segmentation/configs/dpt/DPT_Large_480x480_160k_ade20k_bs_16.yaml b/semantic_segmentation/configs/dpt/DPT_Large_480x480_160k_ade20k_bs_16.yaml index a5c4a3fe..ada0fa0b 100644 --- a/semantic_segmentation/configs/dpt/DPT_Large_480x480_160k_ade20k_bs_16.yaml +++ b/semantic_segmentation/configs/dpt/DPT_Large_480x480_160k_ade20k_bs_16.yaml @@ -29,14 +29,14 @@ TRAIN: BASE_LR: 0.001 END_LR: 1e-4 DECODER_LR_COEF: 10.0 - GRAD_CLIP: 1.0 ITERS: 160000 - WEIGHT_DECAY: 0.0 POWER: 0.9 DECAY_STEPS: 160000 LR_SCHEDULER: NAME: 'PolynomialDecay' OPTIMIZER: + GRAD_CLIP: 1.0 + WEIGHT_DECAY: 0.0 NAME: 'SGD' MOMENTUM: 0.9 VAL: diff --git a/semantic_segmentation/configs/segformer/segformer_mit-b0_256x256_20k_vaihingen.yaml b/semantic_segmentation/configs/segformer/segformer_mit-b0_256x256_20k_vaihingen.yaml index 906e0df0..d87a5519 100644 --- a/semantic_segmentation/configs/segformer/segformer_mit-b0_256x256_20k_vaihingen.yaml +++ b/semantic_segmentation/configs/segformer/segformer_mit-b0_256x256_20k_vaihingen.yaml @@ -36,14 +36,14 @@ TRAIN: BASE_LR: 0.00006 END_LR: 0.0 DECODER_LR_COEF: 10.0 - GRAD_CLIP: 1.0 ITERS: 2000 - WEIGHT_DECAY: 0.01 POWER: 1.0 DECAY_STEPS: 2000 LR_SCHEDULER: NAME: 'PolynomialDecay' OPTIMIZER: + WEIGHT_DECAY: 0.01 + GRAD_CLIP: 1.0 NAME: 'AdamW' MOMENTUM: 0.9 VAL: diff --git a/semantic_segmentation/configs/segformer/segformer_mit-b0_512x512_160k_ade20k.yaml b/semantic_segmentation/configs/segformer/segformer_mit-b0_512x512_160k_ade20k.yaml index b1c2af23..f1ee8a08 100644 --- a/semantic_segmentation/configs/segformer/segformer_mit-b0_512x512_160k_ade20k.yaml +++ b/semantic_segmentation/configs/segformer/segformer_mit-b0_512x512_160k_ade20k.yaml @@ -36,14 +36,14 @@ TRAIN: BASE_LR: 0.00006 END_LR: 0.0 DECODER_LR_COEF: 10.0 - GRAD_CLIP: 1.0 ITERS: 2000 - WEIGHT_DECAY: 0.01 POWER: 1.0 DECAY_STEPS: 2000 LR_SCHEDULER: NAME: 'PolynomialDecay' OPTIMIZER: + WEIGHT_DECAY: 0.01 + GRAD_CLIP: 1.0 NAME: 'AdamW' MOMENTUM: 0.9 VAL: diff --git 
a/semantic_segmentation/configs/segformer/segformer_mit-b1_512x512_160k_ade20k.yaml b/semantic_segmentation/configs/segformer/segformer_mit-b1_512x512_160k_ade20k.yaml index 7857425a..40a75ab5 100644 --- a/semantic_segmentation/configs/segformer/segformer_mit-b1_512x512_160k_ade20k.yaml +++ b/semantic_segmentation/configs/segformer/segformer_mit-b1_512x512_160k_ade20k.yaml @@ -36,14 +36,14 @@ TRAIN: BASE_LR: 0.00006 END_LR: 0.0 DECODER_LR_COEF: 10.0 - GRAD_CLIP: 1.0 ITERS: 2000 - WEIGHT_DECAY: 0.01 POWER: 1.0 DECAY_STEPS: 2000 LR_SCHEDULER: NAME: 'PolynomialDecay' OPTIMIZER: + WEIGHT_DECAY: 0.01 + GRAD_CLIP: 1.0 NAME: 'AdamW' MOMENTUM: 0.9 VAL: diff --git a/semantic_segmentation/configs/segformer/segformer_mit-b2_512x512_160k_ade20k.yaml b/semantic_segmentation/configs/segformer/segformer_mit-b2_512x512_160k_ade20k.yaml index 82638ea0..89e28a84 100644 --- a/semantic_segmentation/configs/segformer/segformer_mit-b2_512x512_160k_ade20k.yaml +++ b/semantic_segmentation/configs/segformer/segformer_mit-b2_512x512_160k_ade20k.yaml @@ -36,14 +36,14 @@ TRAIN: BASE_LR: 0.00006 END_LR: 0.0 DECODER_LR_COEF: 10.0 - GRAD_CLIP: 1.0 ITERS: 2000 - WEIGHT_DECAY: 0.01 POWER: 1.0 DECAY_STEPS: 2000 LR_SCHEDULER: NAME: 'PolynomialDecay' OPTIMIZER: + WEIGHT_DECAY: 0.01 + GRAD_CLIP: 1.0 NAME: 'AdamW' MOMENTUM: 0.9 VAL: diff --git a/semantic_segmentation/configs/segformer/segformer_mit-b3_512x512_160k_ade20k.yaml b/semantic_segmentation/configs/segformer/segformer_mit-b3_512x512_160k_ade20k.yaml index 892fb0b9..8c99196a 100644 --- a/semantic_segmentation/configs/segformer/segformer_mit-b3_512x512_160k_ade20k.yaml +++ b/semantic_segmentation/configs/segformer/segformer_mit-b3_512x512_160k_ade20k.yaml @@ -36,14 +36,15 @@ TRAIN: BASE_LR: 0.00006 END_LR: 0.0 DECODER_LR_COEF: 10.0 - GRAD_CLIP: 1.0 ITERS: 2000 - WEIGHT_DECAY: 0.01 + POWER: 1.0 DECAY_STEPS: 2000 LR_SCHEDULER: NAME: 'PolynomialDecay' OPTIMIZER: + WEIGHT_DECAY: 0.01 + GRAD_CLIP: 1.0 NAME: 'AdamW' MOMENTUM: 0.9 VAL: diff --git a/semantic_segmentation/configs/segformer/segformer_mit-b4_512x512_160k_ade20k.yaml b/semantic_segmentation/configs/segformer/segformer_mit-b4_512x512_160k_ade20k.yaml index 7984534c..589808ba 100644 --- a/semantic_segmentation/configs/segformer/segformer_mit-b4_512x512_160k_ade20k.yaml +++ b/semantic_segmentation/configs/segformer/segformer_mit-b4_512x512_160k_ade20k.yaml @@ -36,14 +36,14 @@ TRAIN: BASE_LR: 0.00006 END_LR: 0.0 DECODER_LR_COEF: 10.0 - GRAD_CLIP: 1.0 ITERS: 2000 - WEIGHT_DECAY: 0.01 POWER: 1.0 DECAY_STEPS: 2000 LR_SCHEDULER: NAME: 'PolynomialDecay' OPTIMIZER: + GRAD_CLIP: 1.0 + WEIGHT_DECAY: 0.01 NAME: 'AdamW' MOMENTUM: 0.9 VAL: diff --git a/semantic_segmentation/configs/segformer/segformer_mit-b5_512x512_160k_ade20k.yaml b/semantic_segmentation/configs/segformer/segformer_mit-b5_512x512_160k_ade20k.yaml index fde4eed4..e28ad389 100644 --- a/semantic_segmentation/configs/segformer/segformer_mit-b5_512x512_160k_ade20k.yaml +++ b/semantic_segmentation/configs/segformer/segformer_mit-b5_512x512_160k_ade20k.yaml @@ -36,14 +36,14 @@ TRAIN: BASE_LR: 0.00006 END_LR: 0.0 DECODER_LR_COEF: 10.0 - GRAD_CLIP: 1.0 ITERS: 2000 - WEIGHT_DECAY: 0.01 POWER: 1.0 DECAY_STEPS: 2000 LR_SCHEDULER: NAME: 'PolynomialDecay' OPTIMIZER: + GRAD_CLIP: 1.0 + WEIGHT_DECAY: 0.01 NAME: 'AdamW' MOMENTUM: 0.9 VAL: diff --git a/semantic_segmentation/configs/segmenter/segmenter_Large_480x480_160k_pascal_content_bs_16.yaml b/semantic_segmentation/configs/segmenter/segmenter_Large_480x480_160k_pascal_content_bs_16.yaml index c55da3e6..c9604bb0 100644 --- 
a/semantic_segmentation/configs/segmenter/segmenter_Large_480x480_160k_pascal_content_bs_16.yaml +++ b/semantic_segmentation/configs/segmenter/segmenter_Large_480x480_160k_pascal_content_bs_16.yaml @@ -27,14 +27,14 @@ TRAIN: BASE_LR: 0.001 END_LR: 1e-4 DECODER_LR_COEF: 10.0 - GRAD_CLIP: 1.0 ITERS: 160000 - WEIGHT_DECAY: 0.0 POWER: 0.9 DECAY_STEPS: 160000 LR_SCHEDULER: NAME: 'PolynomialDecay' OPTIMIZER: + GRAD_CLIP: 1.0 + WEIGHT_DECAY: 0.0 NAME: 'SGD' MOMENTUM: 0.9 VAL: diff --git a/semantic_segmentation/configs/segmenter/segmenter_Large_512x512_160k_ade20k_bs_16.yaml b/semantic_segmentation/configs/segmenter/segmenter_Large_512x512_160k_ade20k_bs_16.yaml index df677dd8..44d35aca 100644 --- a/semantic_segmentation/configs/segmenter/segmenter_Large_512x512_160k_ade20k_bs_16.yaml +++ b/semantic_segmentation/configs/segmenter/segmenter_Large_512x512_160k_ade20k_bs_16.yaml @@ -27,14 +27,14 @@ TRAIN: BASE_LR: 0.001 END_LR: 1e-4 DECODER_LR_COEF: 10.0 - GRAD_CLIP: 1.0 ITERS: 160000 - WEIGHT_DECAY: 0.0 POWER: 0.9 DECAY_STEPS: 160000 LR_SCHEDULER: NAME: 'PolynomialDecay' OPTIMIZER: + GRAD_CLIP: 1.0 + WEIGHT_DECAY: 0.0 NAME: 'SGD' MOMENTUM: 0.9 VAL: diff --git a/semantic_segmentation/configs/segmenter/segmenter_base_512x512_160k_ade20k_bs_16.yaml b/semantic_segmentation/configs/segmenter/segmenter_base_512x512_160k_ade20k_bs_16.yaml index 8548c9f1..2e6a885c 100644 --- a/semantic_segmentation/configs/segmenter/segmenter_base_512x512_160k_ade20k_bs_16.yaml +++ b/semantic_segmentation/configs/segmenter/segmenter_base_512x512_160k_ade20k_bs_16.yaml @@ -27,14 +27,14 @@ TRAIN: BASE_LR: 0.001 END_LR: 1e-4 DECODER_LR_COEF: 10.0 - GRAD_CLIP: 1.0 ITERS: 160000 - WEIGHT_DECAY: 0.0 POWER: 0.9 DECAY_STEPS: 160000 LR_SCHEDULER: NAME: 'PolynomialDecay' OPTIMIZER: + GRAD_CLIP: 1.0 + WEIGHT_DECAY: 0.0 NAME: 'SGD' MOMENTUM: 0.9 VAL: diff --git a/semantic_segmentation/configs/segmenter/segmenter_base_distilled_512x512_160k_ade20k_bs_16.yaml b/semantic_segmentation/configs/segmenter/segmenter_base_distilled_512x512_160k_ade20k_bs_16.yaml index 8aed0d1b..f8c0d473 100644 --- a/semantic_segmentation/configs/segmenter/segmenter_base_distilled_512x512_160k_ade20k_bs_16.yaml +++ b/semantic_segmentation/configs/segmenter/segmenter_base_distilled_512x512_160k_ade20k_bs_16.yaml @@ -27,14 +27,14 @@ TRAIN: BASE_LR: 0.001 END_LR: 1e-4 DECODER_LR_COEF: 10.0 - GRAD_CLIP: 1.0 ITERS: 160000 - WEIGHT_DECAY: 0.0 POWER: 0.9 DECAY_STEPS: 160000 LR_SCHEDULER: NAME: 'PolynomialDecay' OPTIMIZER: + WEIGHT_DECAY: 0.0 + GRAD_CLIP: 1.0 NAME: 'SGD' MOMENTUM: 0.9 VAL: diff --git a/semantic_segmentation/configs/segmenter/segmenter_base_distilled_linear_512x512_160k_ade20k_bs_16.yaml b/semantic_segmentation/configs/segmenter/segmenter_base_distilled_linear_512x512_160k_ade20k_bs_16.yaml index bb22d05d..bd9fd626 100644 --- a/semantic_segmentation/configs/segmenter/segmenter_base_distilled_linear_512x512_160k_ade20k_bs_16.yaml +++ b/semantic_segmentation/configs/segmenter/segmenter_base_distilled_linear_512x512_160k_ade20k_bs_16.yaml @@ -27,14 +27,14 @@ TRAIN: BASE_LR: 0.001 END_LR: 1e-4 DECODER_LR_COEF: 10.0 - GRAD_CLIP: 1.0 ITERS: 160000 - WEIGHT_DECAY: 0.0 POWER: 0.9 DECAY_STEPS: 160000 LR_SCHEDULER: NAME: 'PolynomialDecay' OPTIMIZER: + GRAD_CLIP: 1.0 + WEIGHT_DECAY: 0.0 NAME: 'SGD' MOMENTUM: 0.9 VAL: diff --git a/semantic_segmentation/configs/segmenter/segmenter_base_linear_256x256_20k_vaihingen_bs_16.yaml b/semantic_segmentation/configs/segmenter/segmenter_base_linear_256x256_20k_vaihingen_bs_16.yaml index 8d4cc65e..2b349a4a 100644 --- 
a/semantic_segmentation/configs/segmenter/segmenter_base_linear_256x256_20k_vaihingen_bs_16.yaml +++ b/semantic_segmentation/configs/segmenter/segmenter_base_linear_256x256_20k_vaihingen_bs_16.yaml @@ -27,14 +27,14 @@ TRAIN: BASE_LR: 0.00006 END_LR: 0.0 DECODER_LR_COEF: 10.0 - GRAD_CLIP: 1.0 ITERS: 20000 - WEIGHT_DECAY: 0.01 POWER: 1.0 DECAY_STEPS: 20000 LR_SCHEDULER: NAME: 'PolynomialDecay' OPTIMIZER: + GRAD_CLIP: 1.0 + WEIGHT_DECAY: 0.01 NAME: 'AdamW' MOMENTUM: 0.9 VAL: diff --git a/semantic_segmentation/configs/segmenter/segmenter_small_512x512_160k_ade20k_bs_16.yaml b/semantic_segmentation/configs/segmenter/segmenter_small_512x512_160k_ade20k_bs_16.yaml index 3edea0b4..5205ba1f 100644 --- a/semantic_segmentation/configs/segmenter/segmenter_small_512x512_160k_ade20k_bs_16.yaml +++ b/semantic_segmentation/configs/segmenter/segmenter_small_512x512_160k_ade20k_bs_16.yaml @@ -27,14 +27,14 @@ TRAIN: BASE_LR: 0.001 END_LR: 1e-4 DECODER_LR_COEF: 10.0 - GRAD_CLIP: 1.0 ITERS: 160000 - WEIGHT_DECAY: 0.0 POWER: 0.9 DECAY_STEPS: 160000 LR_SCHEDULER: NAME: 'PolynomialDecay' OPTIMIZER: + WEIGHT_DECAY: 0.0 + GRAD_CLIP: 1.0 NAME: 'SGD' MOMENTUM: 0.9 VAL: diff --git a/semantic_segmentation/configs/segmenter/segmenter_tiny_512x512_160k_ade20k_bs_16.yaml b/semantic_segmentation/configs/segmenter/segmenter_tiny_512x512_160k_ade20k_bs_16.yaml index fe9556be..bf5d6b96 100644 --- a/semantic_segmentation/configs/segmenter/segmenter_tiny_512x512_160k_ade20k_bs_16.yaml +++ b/semantic_segmentation/configs/segmenter/segmenter_tiny_512x512_160k_ade20k_bs_16.yaml @@ -27,14 +27,14 @@ TRAIN: BASE_LR: 0.001 END_LR: 1e-4 DECODER_LR_COEF: 10.0 - GRAD_CLIP: 1.0 ITERS: 160000 - WEIGHT_DECAY: 0.0 POWER: 0.9 DECAY_STEPS: 160000 LR_SCHEDULER: NAME: 'PolynomialDecay' OPTIMIZER: + WEIGHT_DECAY: 0.0 + GRAD_CLIP: 1.0 NAME: 'SGD' MOMENTUM: 0.9 VAL: diff --git a/semantic_segmentation/configs/setr/SETR_MLA_Large_480x480_80k_pascal_context_bs_16.yaml b/semantic_segmentation/configs/setr/SETR_MLA_Large_480x480_80k_pascal_context_bs_16.yaml index ad8912a2..13eb32bf 100644 --- a/semantic_segmentation/configs/setr/SETR_MLA_Large_480x480_80k_pascal_context_bs_16.yaml +++ b/semantic_segmentation/configs/setr/SETR_MLA_Large_480x480_80k_pascal_context_bs_16.yaml @@ -30,14 +30,14 @@ TRAIN: BASE_LR: 0.001 END_LR: 1e-4 DECODER_LR_COEF: 10.0 - GRAD_CLIP: 1.0 ITERS: 80000 - WEIGHT_DECAY: 0.0 POWER: 0.9 DECAY_STEPS: 80000 LR_SCHEDULER: NAME: 'PolynomialDecay' OPTIMIZER: + GRAD_CLIP: 1.0 + WEIGHT_DECAY: 0.0 NAME: 'SGD' MOMENTUM: 0.9 VAL: diff --git a/semantic_segmentation/configs/setr/SETR_MLA_Large_480x480_80k_pascal_context_bs_8.yaml b/semantic_segmentation/configs/setr/SETR_MLA_Large_480x480_80k_pascal_context_bs_8.yaml index c4e05d03..2a7ff607 100644 --- a/semantic_segmentation/configs/setr/SETR_MLA_Large_480x480_80k_pascal_context_bs_8.yaml +++ b/semantic_segmentation/configs/setr/SETR_MLA_Large_480x480_80k_pascal_context_bs_8.yaml @@ -30,14 +30,14 @@ TRAIN: BASE_LR: 0.001 END_LR: 1e-4 DECODER_LR_COEF: 10.0 - GRAD_CLIP: 1.0 ITERS: 80000 - WEIGHT_DECAY: 0.0 POWER: 0.9 DECAY_STEPS: 80000 LR_SCHEDULER: NAME: 'PolynomialDecay' OPTIMIZER: + GRAD_CLIP: 1.0 + WEIGHT_DECAY: 0.0 NAME: 'SGD' MOMENTUM: 0.9 VAL: diff --git a/semantic_segmentation/configs/setr/SETR_MLA_Large_512x512_160k_ade20k_bs_8.yaml b/semantic_segmentation/configs/setr/SETR_MLA_Large_512x512_160k_ade20k_bs_8.yaml index 194c81f6..5b1664a1 100644 --- a/semantic_segmentation/configs/setr/SETR_MLA_Large_512x512_160k_ade20k_bs_8.yaml +++ 
b/semantic_segmentation/configs/setr/SETR_MLA_Large_512x512_160k_ade20k_bs_8.yaml @@ -30,14 +30,14 @@ TRAIN: BASE_LR: 0.01 END_LR: 1e-4 DECODER_LR_COEF: 10.0 - GRAD_CLIP: 1.0 ITERS: 80000 - WEIGHT_DECAY: 0.0 POWER: 0.9 DECAY_STEPS: 80000 LR_SCHEDULER: NAME: 'PolynomialDecay' OPTIMIZER: + WEIGHT_DECAY: 0.0 + GRAD_CLIP: 1.0 NAME: 'SGD' MOMENTUM: 0.9 VAL: diff --git a/semantic_segmentation/configs/setr/SETR_MLA_Large_768x768_40k_cityscapes_bs_8.yaml b/semantic_segmentation/configs/setr/SETR_MLA_Large_768x768_40k_cityscapes_bs_8.yaml index e1053f2f..948b6686 100644 --- a/semantic_segmentation/configs/setr/SETR_MLA_Large_768x768_40k_cityscapes_bs_8.yaml +++ b/semantic_segmentation/configs/setr/SETR_MLA_Large_768x768_40k_cityscapes_bs_8.yaml @@ -30,14 +30,14 @@ TRAIN: BASE_LR: 0.01 END_LR: 1e-4 DECODER_LR_COEF: 10.0 - GRAD_CLIP: 1.0 ITERS: 40000 - WEIGHT_DECAY: 0.0 POWER: 0.9 DECAY_STEPS: 40000 LR_SCHEDULER: NAME: 'PolynomialDecay' OPTIMIZER: + WEIGHT_DECAY: 0.0 + GRAD_CLIP: 1.0 NAME: 'SGD' MOMENTUM: 0.9 VAL: diff --git a/semantic_segmentation/configs/setr/SETR_MLA_Large_768x768_80k_cityscapes_bs_8.yaml b/semantic_segmentation/configs/setr/SETR_MLA_Large_768x768_80k_cityscapes_bs_8.yaml index 896bb73d..fe1ce250 100644 --- a/semantic_segmentation/configs/setr/SETR_MLA_Large_768x768_80k_cityscapes_bs_8.yaml +++ b/semantic_segmentation/configs/setr/SETR_MLA_Large_768x768_80k_cityscapes_bs_8.yaml @@ -30,14 +30,14 @@ TRAIN: BASE_LR: 0.01 END_LR: 1e-4 DECODER_LR_COEF: 10.0 - GRAD_CLIP: 1.0 ITERS: 80000 - WEIGHT_DECAY: 0.0 POWER: 0.9 DECAY_STEPS: 80000 LR_SCHEDULER: NAME: 'PolynomialDecay' OPTIMIZER: + WEIGHT_DECAY: 0.0 + GRAD_CLIP: 1.0 NAME: 'SGD' MOMENTUM: 0.9 VAL: diff --git a/semantic_segmentation/configs/setr/SETR_Naive_Large_480x480_80k_pascal_context_bs_16.yaml b/semantic_segmentation/configs/setr/SETR_Naive_Large_480x480_80k_pascal_context_bs_16.yaml index af4b61e5..7dc8fab0 100644 --- a/semantic_segmentation/configs/setr/SETR_Naive_Large_480x480_80k_pascal_context_bs_16.yaml +++ b/semantic_segmentation/configs/setr/SETR_Naive_Large_480x480_80k_pascal_context_bs_16.yaml @@ -37,14 +37,14 @@ TRAIN: BASE_LR: 0.001 END_LR: 1e-4 DECODER_LR_COEF: 10.0 - GRAD_CLIP: 1.0 ITERS: 80000 - WEIGHT_DECAY: 0.0 POWER: 0.9 DECAY_STEPS: 80000 LR_SCHEDULER: NAME: 'PolynomialDecay' OPTIMIZER: + GRAD_CLIP: 1.0 + WEIGHT_DECAY: 0.0 NAME: 'SGD' MOMENTUM: 0.9 VAL: diff --git a/semantic_segmentation/configs/setr/SETR_Naive_Large_512x512_160k_ade20k_bs_16.yaml b/semantic_segmentation/configs/setr/SETR_Naive_Large_512x512_160k_ade20k_bs_16.yaml index b8ff0992..20c4abbf 100644 --- a/semantic_segmentation/configs/setr/SETR_Naive_Large_512x512_160k_ade20k_bs_16.yaml +++ b/semantic_segmentation/configs/setr/SETR_Naive_Large_512x512_160k_ade20k_bs_16.yaml @@ -37,14 +37,14 @@ TRAIN: BASE_LR: 0.01 END_LR: 1e-4 DECODER_LR_COEF: 10.0 - GRAD_CLIP: 1.0 ITERS: 160000 - WEIGHT_DECAY: 0.0 POWER: 0.9 DECAY_STEPS: 160000 LR_SCHEDULER: NAME: 'PolynomialDecay' OPTIMIZER: + GRAD_CLIP: 1.0 + WEIGHT_DECAY: 0.0 NAME: 'SGD' MOMENTUM: 0.9 VAL: diff --git a/semantic_segmentation/configs/setr/SETR_Naive_Large_768x768_40k_cityscapes_bs_8.yaml b/semantic_segmentation/configs/setr/SETR_Naive_Large_768x768_40k_cityscapes_bs_8.yaml index e40f4ddb..06d1cc56 100644 --- a/semantic_segmentation/configs/setr/SETR_Naive_Large_768x768_40k_cityscapes_bs_8.yaml +++ b/semantic_segmentation/configs/setr/SETR_Naive_Large_768x768_40k_cityscapes_bs_8.yaml @@ -37,14 +37,14 @@ TRAIN: BASE_LR: 0.01 END_LR: 1e-4 DECODER_LR_COEF: 10.0 - GRAD_CLIP: 1.0 ITERS: 40000 - 
WEIGHT_DECAY: 0.0 POWER: 0.9 DECAY_STEPS: 40000 LR_SCHEDULER: NAME: 'PolynomialDecay' OPTIMIZER: + WEIGHT_DECAY: 0.0 + GRAD_CLIP: 1.0 NAME: 'SGD' MOMENTUM: 0.9 VAL: diff --git a/semantic_segmentation/configs/setr/SETR_Naive_Large_768x768_80k_cityscapes_bs_8.yaml b/semantic_segmentation/configs/setr/SETR_Naive_Large_768x768_80k_cityscapes_bs_8.yaml index cac07aaa..821487b0 100644 --- a/semantic_segmentation/configs/setr/SETR_Naive_Large_768x768_80k_cityscapes_bs_8.yaml +++ b/semantic_segmentation/configs/setr/SETR_Naive_Large_768x768_80k_cityscapes_bs_8.yaml @@ -37,14 +37,14 @@ TRAIN: BASE_LR: 0.01 END_LR: 1e-4 DECODER_LR_COEF: 10.0 - GRAD_CLIP: 1.0 ITERS: 80000 - WEIGHT_DECAY: 0.0 POWER: 0.9 DECAY_STEPS: 80000 LR_SCHEDULER: NAME: 'PolynomialDecay' OPTIMIZER: + GRAD_CLIP: 1.0 + WEIGHT_DECAY: 0.0 NAME: 'SGD' MOMENTUM: 0.9 VAL: diff --git a/semantic_segmentation/configs/setr/SETR_PUP_Large_480x480_80k_pascal_context_bs_16.yaml b/semantic_segmentation/configs/setr/SETR_PUP_Large_480x480_80k_pascal_context_bs_16.yaml index c93b19a2..e10a90c0 100644 --- a/semantic_segmentation/configs/setr/SETR_PUP_Large_480x480_80k_pascal_context_bs_16.yaml +++ b/semantic_segmentation/configs/setr/SETR_PUP_Large_480x480_80k_pascal_context_bs_16.yaml @@ -37,14 +37,14 @@ TRAIN: BASE_LR: 0.001 END_LR: 1e-4 DECODER_LR_COEF: 10.0 - GRAD_CLIP: 1.0 ITERS: 80000 - WEIGHT_DECAY: 0.0 POWER: 0.9 DECAY_STEPS: 80000 LR_SCHEDULER: NAME: 'PolynomialDecay' OPTIMIZER: + WEIGHT_DECAY: 0.0 + GRAD_CLIP: 1.0 NAME: 'SGD' MOMENTUM: 0.9 VAL: diff --git a/semantic_segmentation/configs/setr/SETR_PUP_Large_512x512_160k_ade20k_bs_16.yaml b/semantic_segmentation/configs/setr/SETR_PUP_Large_512x512_160k_ade20k_bs_16.yaml index a910641f..6de3703b 100644 --- a/semantic_segmentation/configs/setr/SETR_PUP_Large_512x512_160k_ade20k_bs_16.yaml +++ b/semantic_segmentation/configs/setr/SETR_PUP_Large_512x512_160k_ade20k_bs_16.yaml @@ -1,6 +1,6 @@ DATA: - BATCH_SIZE: 2 # per GPU [total bs is set to 8 or 16] - BATCH_SIZE_VAL: 1 # per GPU + BATCH_SIZE: 1 # per GPU [total bs is set to 8 or 16] + BATCH_SIZE_VAL: 2 # per GPU DATASET: 'ADE20K' # dataset name DATA_PATH: '/home/ssd3/wutianyi/datasets/ADEChallengeData2016' CROP_SIZE: (512,512) @@ -37,14 +37,14 @@ TRAIN: BASE_LR: 0.01 END_LR: 1e-4 DECODER_LR_COEF: 10.0 - GRAD_CLIP: 1.0 ITERS: 160000 - WEIGHT_DECAY: 0.0 POWER: 0.9 DECAY_STEPS: 160000 LR_SCHEDULER: NAME: 'PolynomialDecay' OPTIMIZER: + GRAD_CLIP: 1.0 + WEIGHT_DECAY: 0.0 NAME: 'SGD' MOMENTUM: 0.9 VAL: diff --git a/semantic_segmentation/configs/setr/SETR_PUP_Large_768x768_40k_cityscapes_bs_8.yaml b/semantic_segmentation/configs/setr/SETR_PUP_Large_768x768_40k_cityscapes_bs_8.yaml index 8c684060..ce2fe5dc 100644 --- a/semantic_segmentation/configs/setr/SETR_PUP_Large_768x768_40k_cityscapes_bs_8.yaml +++ b/semantic_segmentation/configs/setr/SETR_PUP_Large_768x768_40k_cityscapes_bs_8.yaml @@ -37,14 +37,14 @@ TRAIN: BASE_LR: 0.01 END_LR: 1e-4 DECODER_LR_COEF: 10.0 - GRAD_CLIP: 1.0 ITERS: 40000 - WEIGHT_DECAY: 0.0 POWER: 0.9 DECAY_STEPS: 40000 LR_SCHEDULER: NAME: 'PolynomialDecay' OPTIMIZER: + WEIGHT_DECAY: 0.0 + GRAD_CLIP: 1.0 NAME: 'SGD' MOMENTUM: 0.9 VAL: diff --git a/semantic_segmentation/configs/setr/SETR_PUP_Large_768x768_80k_cityscapes_bs_8.yaml b/semantic_segmentation/configs/setr/SETR_PUP_Large_768x768_80k_cityscapes_bs_8.yaml index f6d5d83c..6e388399 100644 --- a/semantic_segmentation/configs/setr/SETR_PUP_Large_768x768_80k_cityscapes_bs_8.yaml +++ b/semantic_segmentation/configs/setr/SETR_PUP_Large_768x768_80k_cityscapes_bs_8.yaml @@ -37,14 
+37,14 @@ TRAIN: BASE_LR: 0.01 END_LR: 1e-4 DECODER_LR_COEF: 10.0 - GRAD_CLIP: 1.0 ITERS: 80000 - WEIGHT_DECAY: 0.0 POWER: 0.9 DECAY_STEPS: 80000 LR_SCHEDULER: NAME: 'PolynomialDecay' OPTIMIZER: + WEIGHT_DECAY: 0.0 + GRAD_CLIP: 1.0 NAME: 'SGD' MOMENTUM: 0.9 VAL: diff --git a/semantic_segmentation/configs/trans2seg/Trans2Seg_medium_512x512_80k_trans10kv2_bs_16.yaml b/semantic_segmentation/configs/trans2seg/Trans2Seg_medium_512x512_16k_trans10kv2_bs_16.yaml similarity index 53% rename from semantic_segmentation/configs/trans2seg/Trans2Seg_medium_512x512_80k_trans10kv2_bs_16.yaml rename to semantic_segmentation/configs/trans2seg/Trans2Seg_medium_512x512_16k_trans10kv2_bs_16.yaml index f4b1d7f3..f37b8bcf 100644 --- a/semantic_segmentation/configs/trans2seg/Trans2Seg_medium_512x512_80k_trans10kv2_bs_16.yaml +++ b/semantic_segmentation/configs/trans2seg/Trans2Seg_medium_512x512_16k_trans10kv2_bs_16.yaml @@ -2,16 +2,20 @@ DATA: DATASET: "Trans10kV2" BATCH_SIZE: 16 BATCH_SIZE_VAL: 1 - DATA_PATH: 'E:/Trans10K_cls12' + DATA_PATH: "E:/Trans10K_cls12" CROP_SIZE: (512, 512) NUM_CLASSES: 12 TRAIN: BASE_LR: 0.0001 - ITERS: 80000 + END_LR: 0.0 + ITERS: 16000 + IGNORE_INDEX: -1 + LOSS: "MixSoftmaxCrossEntropyLoss" LR_SCHEDULER: - NAME: "PolynomialDecay" + NAME: "WarmupPolyLR" OPTIMIZER: NAME: 'ADAM' + WEIGHT_DECAY: 1e-4 VAL: MULTI_SCALES_VAL: False IMAGE_BASE_SIZE: 512 @@ -20,12 +24,16 @@ MODEL: NAME: "Trans2Seg" ENCODER: TYPE: "resnet50c" - MULTI_GRID: - MULTI_DILATION: + MULTI_GRID: + MULTI_DILATION: TRANS2SEG: EMBED_DIM: 256 DEPTH: 4 NUM_HEADS: 8 MLP_RATIO: 3. HID_DIM: 64 -SAVE_DIR: "./output/trans10kv2/Trans2Seg_medium_512x512_80k_trans10kv2_bs_16" \ No newline at end of file + AUX: + AUXIHEAD: False + AUXHEAD_ALIGN_CORNERS: False + PRETRAINED: "E:/resnet50c.pdparams" +SAVE_DIR: "./output/trans10kv2/Trans2Seg_medium_512x512_16k_trans10kv2_bs_16" diff --git a/semantic_segmentation/configs/upernet_cswin/README.md b/semantic_segmentation/configs/upernet_cswin/README.md new file mode 100644 index 00000000..3114d85a --- /dev/null +++ b/semantic_segmentation/configs/upernet_cswin/README.md @@ -0,0 +1,29 @@ +# Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, [arxiv](https://arxiv.org/pdf/2103.14030.pdf) +# CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows, [arxiv](https://arxiv.org/pdf/2107.00652.pdf) + +## Framework +drawing + +## Model Zoo ## +### ADE20K ### +|Model | Backbone | Batch_size | Iteration | mIoU (ss) | mIoU (ms+flip) | Backbone_checkpoint | Model_checkpoint | ConfigFile | +|-----------|-----------|------------|-----------|-----------|----------------|-----------------------------------------------|-----------------------------------------------------------------------|------------| +| UperNet | CSwin_Tiny | 16 | 160k | 49.46 | |[baidu](https://pan.baidu.com/s/1ol_gykZjgAFbJ3PkqQ2j0Q)(l1cp) | [baidu](https://pan.baidu.com/s/1gLePNLybtrax9yCQ2fcIPg)(y1eq) | [config](seman}tic_segmentation/configs/upernet_cswin/upernet_cswin_tiny_patch4_512x512_160k_ade20k.yaml) | +| UperNet | CSwin_Small | 16 | 160k | 50.88 | | [baidu](https://pan.baidu.com/s/1mSd_JdNS4DtyVNYxqVobBw)(6vwk) | [baidu](https://pan.baidu.com/s/1a_vhHoib0-BcRwTnnSVGWA)(fz2e) | [config](semantic_segmentation/configs/upernet_cswin/upernet_cswin_small_patch4_512x512_160k_ade20k.yaml) | +| UperNet | CSwin_Base | 16 | 160k | 50.64 | | [baidu](https://pan.baidu.com/s/1suO0jX_Tw56CVm3UhByOWg)(0ys7) | [baidu](https://pan.baidu.com/s/1Ym-RUooqizgUDEm5jWyrhA)(83w3) | 
[config](semantic_segmentation/configs/upernet_cswin/upernet_cswin_base_patch4_512x512_160k_ade20k.yaml) | +## Reference +``` +@article{dong2021cswin, + title={Cswin transformer: A general vision transformer backbone with cross-shaped windows}, + author={Dong, Xiaoyi and Bao, Jianmin and Chen, Dongdong and Zhang, Weiming and Yu, Nenghai and Yuan, Lu and Chen, Dong and Guo, Baining}, + journal={arXiv preprint arXiv:2107.00652}, + year={2021} +} +@inproceedings{xiao2018unified, + title={Unified perceptual parsing for scene understanding}, + author={Xiao, Tete and Liu, Yingcheng and Zhou, Bolei and Jiang, Yuning and Sun, Jian}, + booktitle={Proceedings of the European Conference on Computer Vision (ECCV)}, + pages={418--434}, + year={2018} +} +``` diff --git a/semantic_segmentation/configs/upernet_cswin/upernet_cswin_base_patch4_512x512_160k_ade20k.yaml b/semantic_segmentation/configs/upernet_cswin/upernet_cswin_base_patch4_512x512_160k_ade20k.yaml new file mode 100644 index 00000000..46c67ebe --- /dev/null +++ b/semantic_segmentation/configs/upernet_cswin/upernet_cswin_base_patch4_512x512_160k_ade20k.yaml @@ -0,0 +1,64 @@ +DATA: + BATCH_SIZE: 1 # per GPU [total bs is set to 8 or 16] + BATCH_SIZE_VAL: 1 # per GPU + DATASET: 'ADE20K' # dataset name + DATA_PATH: '/home/ssd3/wutianyi/datasets/ADEChallengeData2016' + CROP_SIZE: (512,512) # input_size (training) + NUM_CLASSES: 150 +MODEL: + NAME: 'UperNet_CSwin' + DROPOUT: 0.0 # dropout rate for linear projection + ATTENTION_DROPOUT: 0.0 # dropout rate for attention + DROP_PATH: 0.2 + ENCODER: + TYPE: 'CSwinTransformer' + OUT_INDICES: [0, 1, 2, 3] # stage_i + PRETRAINED: './pretrain_models/backbones/cswin_base_224.pdparams' + DECODER_TYPE: 'UperHead' + UPERHEAD: + IN_CHANNELS: [96, 192, 384, 768] + IN_INDEX: [0, 1, 2, 3] + POOL_SCALES: [1, 2, 3, 6] + CHANNELS: 512 + DROP_RATIO: 0.1 + ALIGN_CORNERS: False + TRANS: + PATCH_SIZE: 4 + IN_CHANNELS: 3 + HIDDEN_SIZE: 96 # 64(tiny, small), 96(base), 144(large) + EMBED_DIM: 96 + STAGE_DEPTHS: [2, 4, 32, 2] + NUM_HEADS: [4, 8, 16, 32] + SPLIT_SIZES: [1, 2, 7, 7] # cswin + MLP_RATIO: 4 + QKV_BIAS: True + QK_SCALE: None + APE: False # absolute positional embeddings + PATCH_NORM: True + AUX: + AUXIHEAD: True + AUXFCN: + IN_CHANNELS: 384 # channel of the 1/16 resolution features + UP_RATIO: 16 + +TRAIN: + BASE_LR: 0.00006 + END_LR: 1e-4 + DECODER_LR_COEF: 10.0 + ITERS: 160000 + POWER: 0.9 + DECAY_STEPS: 160000 + LR_SCHEDULER: + NAME: 'PolynomialDecay' + OPTIMIZER: + WEIGHT_DECAY: 0.0 + GRAD_CLIP: 1.0 + NAME: 'SGD' + MOMENTUM: 0.9 +VAL: + MULTI_SCALES_VAL: False + SCALE_RATIOS: [0.5, 0.75, 1.0] + IMAGE_BASE_SIZE: 512 + CROP_SIZE: [512,512] + STRIDE_SIZE: [341,341] +SAVE_DIR: "./output/UperNet_cswin_base_patch4_512x512_160k_ade20k" diff --git a/semantic_segmentation/configs/upernet_cswin/upernet_cswin_small_patch4_512x512_160k_ade20k.yaml b/semantic_segmentation/configs/upernet_cswin/upernet_cswin_small_patch4_512x512_160k_ade20k.yaml new file mode 100644 index 00000000..3ad38105 --- /dev/null +++ b/semantic_segmentation/configs/upernet_cswin/upernet_cswin_small_patch4_512x512_160k_ade20k.yaml @@ -0,0 +1,64 @@ +DATA: + BATCH_SIZE: 1 # per GPU [total bs is set to 8 or 16] + BATCH_SIZE_VAL: 1 # per GPU + DATASET: 'ADE20K' # dataset name + DATA_PATH: '/home/ssd3/wutianyi/datasets/ADEChallengeData2016' + CROP_SIZE: (512,512) # input_size (training) + NUM_CLASSES: 150 +MODEL: + NAME: 'UperNet_CSwin' + DROPOUT: 0.0 # dropout rate for linear projection + ATTENTION_DROPOUT: 0.0 # dropout rate for attention + DROP_PATH: 
0.2 + ENCODER: + TYPE: 'CSwinTransformer' + OUT_INDICES: [0, 1, 2, 3] # stage_i + PRETRAINED: './pretrain_models/backbones/cswin_small_224.pdparams' + DECODER_TYPE: 'UperHead' + UPERHEAD: + IN_CHANNELS: [64, 128, 256, 512] + IN_INDEX: [0, 1, 2, 3] + POOL_SCALES: [1, 2, 3, 6] + CHANNELS: 512 + DROP_RATIO: 0.1 + ALIGN_CORNERS: False + TRANS: + PATCH_SIZE: 4 + IN_CHANNELS: 3 + HIDDEN_SIZE: 64 # 64(tiny, small), 96(base), 144(large) + EMBED_DIM: 64 + STAGE_DEPTHS: [2, 4, 32, 2] + NUM_HEADS: [2, 4, 8, 16] + SPLIT_SIZES: [1, 2, 7, 7] # cswin + MLP_RATIO: 4 + QKV_BIAS: True + QK_SCALE: None + APE: False # absolute positional embeddings + PATCH_NORM: True + AUX: + AUXIHEAD: True + AUXFCN: + IN_CHANNELS: 256 # channel of the 1/16 resolution features + UP_RATIO: 16 + +TRAIN: + BASE_LR: 0.00006 + END_LR: 1e-4 + DECODER_LR_COEF: 10.0 + ITERS: 160000 + POWER: 0.9 + DECAY_STEPS: 160000 + LR_SCHEDULER: + NAME: 'PolynomialDecay' + OPTIMIZER: + WEIGHT_DECAY: 0.0 + GRAD_CLIP: 1.0 + NAME: 'SGD' + MOMENTUM: 0.9 +VAL: + MULTI_SCALES_VAL: False + SCALE_RATIOS: [0.5, 0.75, 1.0] + IMAGE_BASE_SIZE: 512 + CROP_SIZE: [512,512] + STRIDE_SIZE: [341,341] +SAVE_DIR: "./output/UperNet_cswin_small_patch4_512x512_160k_ade20k" diff --git a/semantic_segmentation/configs/upernet_cswin/upernet_cswin_tiny_patch4_512x512_160k_ade20k.yaml b/semantic_segmentation/configs/upernet_cswin/upernet_cswin_tiny_patch4_512x512_160k_ade20k.yaml new file mode 100644 index 00000000..8feb6d54 --- /dev/null +++ b/semantic_segmentation/configs/upernet_cswin/upernet_cswin_tiny_patch4_512x512_160k_ade20k.yaml @@ -0,0 +1,64 @@ +DATA: + BATCH_SIZE: 8 # per GPU [total bs is set to 8 or 16] + BATCH_SIZE_VAL: 1 # per GPU + DATASET: 'ADE20K' # dataset name + DATA_PATH: '/home/ssd3/wutianyi/datasets/ADEChallengeData2016' + CROP_SIZE: (512,512) # input_size (training) + NUM_CLASSES: 150 +MODEL: + NAME: 'UperNet_CSwin' + DROPOUT: 0.0 # dropout rate for linear projection + ATTENTION_DROPOUT: 0.0 # dropout rate for attention + DROP_PATH: 0.2 + ENCODER: + TYPE: 'CSwinTransformer' + OUT_INDICES: [0, 1, 2, 3] # stage_i + PRETRAINED: './pretrain_models/backbones/cswin_tiny_224.pdparams' + DECODER_TYPE: 'UperHead' + UPERHEAD: + IN_CHANNELS: [64, 128, 256, 512] + IN_INDEX: [0, 1, 2, 3] + POOL_SCALES: [1, 2, 3, 6] + CHANNELS: 512 + DROP_RATIO: 0.1 + ALIGN_CORNERS: False + TRANS: + PATCH_SIZE: 4 + IN_CHANNELS: 3 + HIDDEN_SIZE: 64 # 64(tiny, small), 96(base), 144(large) + EMBED_DIM: 64 + STAGE_DEPTHS: [1, 2, 21, 1] + NUM_HEADS: [2, 4, 8, 16] + SPLIT_SIZES: [1, 2, 7, 7] # cswin + MLP_RATIO: 4 + QKV_BIAS: True + QK_SCALE: None + APE: False # absolute positional embeddings + PATCH_NORM: True + AUX: + AUXIHEAD: True + AUXFCN: + IN_CHANNELS: 256 # channel of the 1/16 resolution features + UP_RATIO: 16 + +TRAIN: + BASE_LR: 0.00006 + END_LR: 1e-4 + DECODER_LR_COEF: 10.0 + ITERS: 160000 + POWER: 0.9 + DECAY_STEPS: 160000 + LR_SCHEDULER: + NAME: 'PolynomialDecay' + OPTIMIZER: + WEIGHT_DECAY: 0.0 + GRAD_CLIP: 1.0 + NAME: 'SGD' + MOMENTUM: 0.9 +VAL: + MULTI_SCALES_VAL: False + SCALE_RATIOS: [0.5, 0.75, 1.0] + IMAGE_BASE_SIZE: 512 + CROP_SIZE: [512,512] + STRIDE_SIZE: [341,341] +SAVE_DIR: "./output/UperNet_cswin_tiny_patch4_512x512_160k_ade20k" diff --git a/semantic_segmentation/configs/upernet_focal/upernet_focal_base_patch4_windown7_512x512_160k_ade20k.yaml b/semantic_segmentation/configs/upernet_focal/upernet_focal_base_patch4_windown7_512x512_160k_ade20k.yaml new file mode 100644 index 00000000..579badc5 --- /dev/null +++ 
b/semantic_segmentation/configs/upernet_focal/upernet_focal_base_patch4_windown7_512x512_160k_ade20k.yaml @@ -0,0 +1,69 @@ +DATA: + BATCH_SIZE: 4 # per GPU [total bs is set to 8 or 16] + BATCH_SIZE_VAL: 1 # per GPU + DATASET: 'ADE20K' # dataset name + DATA_PATH: 'E:/ADEChallengeData2016' + CROP_SIZE: (512,512) # input_size (training) + NUM_CLASSES: 150 +MODEL: + NAME: 'UperNet_Focal' + ENCODER: + TYPE: 'FocalTransformer' + OUT_INDICES: [0, 1, 2, 3] # stage_i + PRETRAINED: None + DECODER_TYPE: 'UperHead' + UPERHEAD: + IN_CHANNELS: [128, 256, 512, 1024] + IN_INDEX: [0, 1, 2, 3] + POOL_SCALES: [1, 2, 3, 6] + CHANNELS: 512 + DROP_RATIO: 0.1 + ALIGN_CORNERS: False + TRANS: + PATCH_SIZE: 4 + WINDOW_SIZE: 7 + IN_CHANNELS: 3 + HIDDEN_SIZE: 128 + EMBED_DIM: 128 + STAGE_DEPTHS: [2, 2, 18, 2] + NUM_HEADS: [4, 8, 16, 32] + FOCAL_STAGES: [0, 1, 2, 3] + FOCAL_LEVELS: [2, 2, 2, 2] + FOCAL_WINDOWS: [7, 5, 3, 1] + EXPAND_STAGES: [0, 1, 2, 3] + EXPAND_SIZES: [3, 3, 3, 3] + USE_CONV_EMBED: True + MLP_RATIO: 4 + QKV_BIAS: True + QK_SCALE: None + APE: False # absolute positional embeddings + PATCH_NORM: True + AUX: + AUXIHEAD: True + AUXFCN: + IN_CHANNELS: 512 + UP_RATIO: 16 + +TRAIN: + BASE_LR: 0.00006 + END_LR: 1e-4 + DECODER_LR_COEF: 10.0 + ITERS: 160000 + POWER: 0.9 + DECAY_STEPS: 160000 + LR_SCHEDULER: + NAME: 'PolynomialDecay' + OPTIMIZER: + NAME: 'SGD' + MOMENTUM: 0.9 + GRAD_CLIP: 1.0 + WEIGHT_DECAY: 0.0 +VAL: + MULTI_SCALES_VAL: False + SCALE_RATIOS: [0.5, 0.75, 1.0] + IMAGE_BASE_SIZE: 576 + KEEP_ORI_SIZE: False + RESCALE_FROM_ORI: False + CROP_SIZE: [512,512] + STRIDE_SIZE: [341,341] +SAVE_DIR: "./output/UperNet_swin_base_patch4_windown7_512x512_160k_ade20k" diff --git a/semantic_segmentation/configs/upernet_focal/upernet_focal_base_useconv_patch4_windown7_512x512_160k_ade20k.yaml b/semantic_segmentation/configs/upernet_focal/upernet_focal_base_useconv_patch4_windown7_512x512_160k_ade20k.yaml new file mode 100644 index 00000000..930be9f8 --- /dev/null +++ b/semantic_segmentation/configs/upernet_focal/upernet_focal_base_useconv_patch4_windown7_512x512_160k_ade20k.yaml @@ -0,0 +1,68 @@ +DATA: + BATCH_SIZE: 4 # per GPU [total bs is set to 8 or 16] + BATCH_SIZE_VAL: 1 # per GPU + DATASET: 'ADE20K' # dataset name + DATA_PATH: 'E:/ADEChallengeData2016' + CROP_SIZE: (512,512) # input_size (training) + NUM_CLASSES: 150 +MODEL: + NAME: 'UperNet_Focal' + ENCODER: + TYPE: 'FocalTransformer' + OUT_INDICES: [0, 1, 2, 3] # stage_i + PRETRAINED: None + DECODER_TYPE: 'UperHead' + UPERHEAD: + IN_CHANNELS: [128, 256, 512, 1024] + IN_INDEX: [0, 1, 2, 3] + POOL_SCALES: [1, 2, 3, 6] + CHANNELS: 512 + DROP_RATIO: 0.1 + ALIGN_CORNERS: False + TRANS: + PATCH_SIZE: 4 + WINDOW_SIZE: 7 + IN_CHANNELS: 3 + HIDDEN_SIZE: 128 + EMBED_DIM: 128 + STAGE_DEPTHS: [2, 2, 18, 2] + NUM_HEADS: [4, 8, 16, 32] + FOCAL_STAGES: [0, 1, 2, 3] + FOCAL_LEVELS: [2, 2, 2, 2] + FOCAL_WINDOWS: [7, 5, 3, 1] + EXPAND_STAGES: [0, 1, 2, 3] + EXPAND_SIZES: [3, 3, 3, 3] + MLP_RATIO: 4 + QKV_BIAS: True + QK_SCALE: None + APE: False # absolute positional embeddings + PATCH_NORM: True + AUX: + AUXIHEAD: True + AUXFCN: + IN_CHANNELS: 512 + UP_RATIO: 16 + +TRAIN: + BASE_LR: 0.00006 + END_LR: 1e-4 + DECODER_LR_COEF: 10.0 + ITERS: 160000 + POWER: 0.9 + DECAY_STEPS: 160000 + LR_SCHEDULER: + NAME: 'PolynomialDecay' + OPTIMIZER: + NAME: 'SGD' + MOMENTUM: 0.9 + GRAD_CLIP: 1.0 + WEIGHT_DECAY: 0.0 +VAL: + MULTI_SCALES_VAL: False + SCALE_RATIOS: [0.5, 0.75, 1.0] + IMAGE_BASE_SIZE: 576 + KEEP_ORI_SIZE: False + RESCALE_FROM_ORI: False + CROP_SIZE: [512,512] + STRIDE_SIZE: 
[341,341] +SAVE_DIR: "./output/UperNet_swin_base_patch4_windown7_512x512_160k_ade20k" diff --git a/semantic_segmentation/configs/upernet_swin/upernet_swin_base_patch4_windown7_512x512_160k_ade20k.yaml b/semantic_segmentation/configs/upernet_swin/upernet_swin_base_patch4_windown7_512x512_160k_ade20k.yaml index 9558b060..1b4da15c 100644 --- a/semantic_segmentation/configs/upernet_swin/upernet_swin_base_patch4_windown7_512x512_160k_ade20k.yaml +++ b/semantic_segmentation/configs/upernet_swin/upernet_swin_base_patch4_windown7_512x512_160k_ade20k.yaml @@ -42,14 +42,14 @@ TRAIN: BASE_LR: 0.00006 END_LR: 1e-4 DECODER_LR_COEF: 10.0 - GRAD_CLIP: 1.0 ITERS: 160000 - WEIGHT_DECAY: 0.0 POWER: 0.9 DECAY_STEPS: 160000 LR_SCHEDULER: NAME: 'PolynomialDecay' OPTIMIZER: + WEIGHT_DECAY: 0.0 + GRAD_CLIP: 1.0 NAME: 'SGD' MOMENTUM: 0.9 VAL: diff --git a/semantic_segmentation/configs/upernet_swin/upernet_swin_small_patch4_windown7_512x512_160k_ade20k.yaml b/semantic_segmentation/configs/upernet_swin/upernet_swin_small_patch4_windown7_512x512_160k_ade20k.yaml index 8fe22c9d..0138ec8a 100644 --- a/semantic_segmentation/configs/upernet_swin/upernet_swin_small_patch4_windown7_512x512_160k_ade20k.yaml +++ b/semantic_segmentation/configs/upernet_swin/upernet_swin_small_patch4_windown7_512x512_160k_ade20k.yaml @@ -42,20 +42,20 @@ TRAIN: BASE_LR: 0.00006 END_LR: 1e-4 DECODER_LR_COEF: 10.0 - GRAD_CLIP: 1.0 ITERS: 160000 - WEIGHT_DECAY: 0.0 POWER: 0.9 DECAY_STEPS: 160000 LR_SCHEDULER: NAME: 'PolynomialDecay' OPTIMIZER: + WEIGHT_DECAY: 0.0 + GRAD_CLIP: 1.0 NAME: 'SGD' MOMENTUM: 0.9 VAL: MULTI_SCALES_VAL: False SCALE_RATIOS: [0.5, 0.75, 1.0] - IMAGE_BASE_SIZE: 576 + IMAGE_BASE_SIZE: 512 CROP_SIZE: [512,512] STRIDE_SIZE: [341,341] SAVE_DIR: "./output/UperNet_swin_small_patch4_windown7_512x512_160k_ade20k" diff --git a/semantic_segmentation/configs/upernet_swin/upernet_swin_tiny_patch4_windown7_512x512_160k_ade20k.yaml b/semantic_segmentation/configs/upernet_swin/upernet_swin_tiny_patch4_windown7_512x512_160k_ade20k.yaml index 5097418a..72d449b2 100644 --- a/semantic_segmentation/configs/upernet_swin/upernet_swin_tiny_patch4_windown7_512x512_160k_ade20k.yaml +++ b/semantic_segmentation/configs/upernet_swin/upernet_swin_tiny_patch4_windown7_512x512_160k_ade20k.yaml @@ -1,6 +1,6 @@ DATA: BATCH_SIZE: 2 # per GPU [total bs is set to 8 or 16] - BATCH_SIZE_VAL: 1 # per GPU + BATCH_SIZE_VAL: 4 # per GPU DATASET: 'ADE20K' # dataset name DATA_PATH: '/home/ssd3/wutianyi/datasets/ADEChallengeData2016' CROP_SIZE: (512,512) # input_size (training) @@ -42,14 +42,14 @@ TRAIN: BASE_LR: 0.00006 END_LR: 1e-4 DECODER_LR_COEF: 10.0 - GRAD_CLIP: 1.0 ITERS: 160000 - WEIGHT_DECAY: 0.0 POWER: 0.9 DECAY_STEPS: 160000 LR_SCHEDULER: NAME: 'PolynomialDecay' OPTIMIZER: + WEIGHT_DECAY: 0.0 + GRAD_CLIP: 1.0 NAME: 'SGD' MOMENTUM: 0.9 VAL: diff --git a/semantic_segmentation/figure/upernet_cswin_framework.png b/semantic_segmentation/figure/upernet_cswin_framework.png new file mode 100644 index 00000000..04bd3714 Binary files /dev/null and b/semantic_segmentation/figure/upernet_cswin_framework.png differ diff --git a/semantic_segmentation/requirements.txt b/semantic_segmentation/requirements.txt index 8f5fba8c..d8ecc18f 100644 --- a/semantic_segmentation/requirements.txt +++ b/semantic_segmentation/requirements.txt @@ -1,6 +1,5 @@ cityscapesScripts==2.2.0 -detail==4.0 numpy==1.20.3 -opencv-python==4.5.2.52 +opencv-python==4.4.0 scipy==1.6.3 yacs==0.1.8 diff --git a/semantic_segmentation/src/api/infer.py b/semantic_segmentation/src/api/infer.py index 
99415b89..09c683c3 100644 --- a/semantic_segmentation/src/api/infer.py +++ b/semantic_segmentation/src/api/infer.py @@ -1,3 +1,17 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + import numpy as np import math import cv2 @@ -5,7 +19,7 @@ import paddle import paddle.nn.functional as F -def slide_inference(model, img, crop_size, stride_size, num_classes): +def slide_inference(model, imgs, crop_size, stride_size, num_classes): """ Inference by sliding-window with overlap, the overlap is equal to stride. @@ -20,31 +34,47 @@ def slide_inference(model, img, crop_size, stride_size, num_classes): final_logit (Tensor): The logit of input image, whose size is equal to the size of img (not the orginal size). """ - h_img, w_img = img.shape[-2:] + batch_size = len(imgs) + h_img = [img.shape[-2] for img in imgs] + w_img = [img.shape[-1] for img in imgs] + max_h, max_w = max(h_img), max(w_img) w_crop, h_crop = crop_size w_stride, h_stride = stride_size - # calculate the crop nums - rows = max(h_img - h_crop + h_stride -1, 0) // h_stride + 1 - cols = max(w_img - w_crop + w_stride -1, 0) // w_stride + 1 - count = np.zeros([1, 1, h_img, w_img]) - final_logit = paddle.zeros([1, num_classes, h_img, w_img], dtype='float32') + rows = max(max_h - h_crop + h_stride -1, 0) // h_stride + 1 + cols = max(max_w - w_crop + w_stride -1, 0) // w_stride + 1 + count = paddle.zeros([batch_size, 1, max_h, max_w]) + final_logit = paddle.zeros([batch_size, num_classes, max_h, max_w]) for r in range(rows): for c in range(cols): - h1 = r * h_stride - w1 = c * w_stride - h2 = min(h1 + h_crop, h_img) - w2 = min(w1 + w_crop, w_img) - h1 = max(h2 - h_crop, 0) - w1 = max(w2 - w_crop, 0) - img_crop = img[:, :, h1:h2, w1:w2] - logits = model(img_crop) - logit = logits[0] - final_logit += F.pad(logit, [w1, w_img - w2, h1, h_img - h2]) - count[:, :, h1:h2, w1:w2] += 1 - final_logit = final_logit.numpy() / count - final_logit = paddle.to_tensor(final_logit) - return final_logit - + batch_list = [] + loc_list = [] + for i, img in enumerate(imgs): + h1 = r * h_stride + w1 = c * w_stride + if h1 >= img.shape[-2] or w1 >= img.shape[-1]: + continue + h2 = min(h1 + h_crop, img.shape[-2]) + w2 = min(w1 + w_crop, img.shape[-1]) + h1 = max(h2 - h_crop, 0) + w1 = max(w2 - w_crop, 0) + loc_list.append((i, h1, w1, h2, w2)) + batch_list.append(img[:, h1:h2, w1:w2].unsqueeze(0)) + if not batch_list: + continue + batch_data = paddle.concat(batch_list, 0) + logits = model(batch_data)[0] + for i in range(batch_data.shape[0]): + idx, h1, w1, h2, w2 = loc_list[i] + logit = logits[i] + final_logit[idx, :, h1:h2, w1:w2] += logit[:,:,:] + count[idx, :, h1:h2, w1:w2] += 1 + final_logit_list = [] + for i in range(batch_size): + h, w = imgs[i].shape[-2:] + logit = final_logit[i:i+1, :, :h, :w] + count_single = count[i:i+1, :, :h, :w] + final_logit_list.append(logit / count_single) + return final_logit_list def ss_inference(model, img, @@ -79,6 +109,15 @@ def ss_inference(model, h, w) is returned. 
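+
+        Note: when `is_slide` is False, `img` must be a list/tuple holding
+        exactly one image tensor (batch size 1), since inputs of different
+        shapes cannot be stacked into one batch; when `is_slide` is True and
+        `ori_shape` is given, a list with one prediction per input image is
+        returned instead of a single tensor.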
""" if not is_slide: + if not isinstance(img, collections.abc.Sequence): + raise TypeError("The type of img must be one of " + "collections.abc.Sequence, e.g. list, tuple. But received {}" + .format(type(img))) + if len(img) == 1: + img = img[0] + else: + raise ValueError("Considering the different shapes of inputs," + "batch_size should be set to 1 while is_slide is False") logits = model(img) if not isinstance(logits, collections.abc.Sequence): raise TypeError("The type of logits must be one of " @@ -99,14 +138,18 @@ def ss_inference(model, h, w = new_h, new_w img = F.interpolate(img, (h, w), mode='bilinear') #print("rescale, img.shape: ({}, {})".format(h,w)) - logit = slide_inference(model, img, crop_size, stride_size, num_classes) + logit_list = slide_inference(model, img, crop_size, stride_size, num_classes) if ori_shape is not None: # resize to original shape - logit = F.interpolate(logit, ori_shape, mode='bilinear', align_corners=False) - logit = F.softmax(logit, axis=1) - pred = paddle.argmax(logit, axis=1, keepdim=True, dtype='int32') - return pred + pred_list = [] + for i, logit in enumerate(logit_list): + shape = ori_shape[i] + logit = F.interpolate(logit, shape, mode='bilinear', align_corners=False) + logit = F.softmax(logit, axis=1) + pred = paddle.argmax(logit, axis=1, keepdim=True, dtype='int32') + pred_list.append(pred) + return pred_list else: return logit diff --git a/semantic_segmentation/src/datasets/ade.py b/semantic_segmentation/src/datasets/ade.py index f5d505a8..f30dfa25 100644 --- a/semantic_segmentation/src/datasets/ade.py +++ b/semantic_segmentation/src/datasets/ade.py @@ -1,3 +1,17 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + import os import numpy as np from PIL import Image diff --git a/semantic_segmentation/src/datasets/cityscapes.py b/semantic_segmentation/src/datasets/cityscapes.py index 70c6ff06..baa260c7 100644 --- a/semantic_segmentation/src/datasets/cityscapes.py +++ b/semantic_segmentation/src/datasets/cityscapes.py @@ -1,3 +1,17 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + import os import glob from src.datasets import Dataset diff --git a/semantic_segmentation/src/datasets/cocostuff.py b/semantic_segmentation/src/datasets/cocostuff.py index 927178b2..72710e54 100644 --- a/semantic_segmentation/src/datasets/cocostuff.py +++ b/semantic_segmentation/src/datasets/cocostuff.py @@ -1,3 +1,17 @@ +# Copyright (c) 2021 PPViT Authors. 
All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + import os import glob from src.datasets import Dataset diff --git a/semantic_segmentation/src/datasets/dataset.py b/semantic_segmentation/src/datasets/dataset.py index 5062c88d..1d1f0f61 100644 --- a/semantic_segmentation/src/datasets/dataset.py +++ b/semantic_segmentation/src/datasets/dataset.py @@ -1,3 +1,17 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + import os import paddle import numpy as np diff --git a/semantic_segmentation/src/datasets/pascal_context.py b/semantic_segmentation/src/datasets/pascal_context.py index f6c2e6e3..2034e1c4 100644 --- a/semantic_segmentation/src/datasets/pascal_context.py +++ b/semantic_segmentation/src/datasets/pascal_context.py @@ -1,3 +1,17 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + import os from PIL import Image from src.datasets import Dataset diff --git a/semantic_segmentation/src/datasets/trans10k_v2.py b/semantic_segmentation/src/datasets/trans10k_v2.py index 97b25514..35c371ae 100644 --- a/semantic_segmentation/src/datasets/trans10k_v2.py +++ b/semantic_segmentation/src/datasets/trans10k_v2.py @@ -1,3 +1,17 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
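+
+# Illustrative sketch, not part of this dataset module: it refers back to
+# `slide_inference` in semantic_segmentation/src/api/infer.py (changed earlier
+# in this diff). The crop grid there clamps the last window to the image
+# border, which is why each crop origin is recomputed as max(h2 - h_crop, 0).
+# A worked example of that arithmetic, with sizes chosen only for illustration:
+#
+#   h_img, h_crop, h_stride = 600, 512, 341
+#   rows = max(h_img - h_crop + h_stride - 1, 0) // h_stride + 1   # -> 2
+#   # r = 0: h1 = 0,   h2 = min(0 + 512, 600)   = 512 -> crop rows (0, 512)
+#   # r = 1: h1 = 341, h2 = min(341 + 512, 600) = 600, h1 = max(600 - 512, 0) = 88
+#   #        -> crop rows (88, 600); overlapping pixels are accumulated in
+#   #        `count`, and `final_logit` is divided by it at the end.
+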
+ import os import glob from src.datasets import Dataset diff --git a/semantic_segmentation/src/datasets/vaihingen.py b/semantic_segmentation/src/datasets/vaihingen.py index a08bce8e..2725d28c 100644 --- a/semantic_segmentation/src/datasets/vaihingen.py +++ b/semantic_segmentation/src/datasets/vaihingen.py @@ -1,3 +1,17 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + import os import numpy as np from PIL import Image diff --git a/semantic_segmentation/src/models/backbones/__init__.py b/semantic_segmentation/src/models/backbones/__init__.py index a072010e..f1e6a538 100644 --- a/semantic_segmentation/src/models/backbones/__init__.py +++ b/semantic_segmentation/src/models/backbones/__init__.py @@ -1,6 +1,8 @@ from .vit_mla import ViT_MLA from .vit import VisualTransformer from .swin_transformer import SwinTransformer +from .cswin_transformer import CSwinTransformer +from .focal_transformer import FocalTransformer from .deit import Deit from .resnet import * -from .trans2seg_transformer import * \ No newline at end of file +from .trans2seg_transformer import * diff --git a/semantic_segmentation/src/models/backbones/cswin_transformer.py b/semantic_segmentation/src/models/backbones/cswin_transformer.py new file mode 100644 index 00000000..c7c42993 --- /dev/null +++ b/semantic_segmentation/src/models/backbones/cswin_transformer.py @@ -0,0 +1,555 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Implement Transformer Class for CSwin +""" + +import copy +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + + +class Identity(nn.Layer): + """ Identity layer + + The output of this layer is the input without any change. + Use this layer to avoid if condition in some forward methods + + """ + def __init__(self): + super().__init__() + + def forward(self, x): + return x + + +class DropPath(nn.Layer): + """DropPath class""" + def __init__(self, drop_prob=None): + super(DropPath, self).__init__() + self.drop_prob = drop_prob + + def drop_path(self, inputs): + """drop path op + Args: + input: tensor with arbitrary shape + drop_prob: float number of drop path probability, default: 0.0 + training: bool, if current mode is training, default: False + Returns: + output: output tensor after drop path + """ + # if prob is 0 or eval mode, return original input + if self.drop_prob == 0. 
or not self.training: + return inputs + keep_prob = 1 - self.drop_prob + keep_prob = paddle.to_tensor(keep_prob, dtype='float32') + shape = (inputs.shape[0], ) + (1, ) * (inputs.ndim - 1) # shape=(N, 1, 1, 1) + random_tensor = keep_prob + paddle.rand(shape, dtype=inputs.dtype) + random_tensor = random_tensor.floor() # mask + output = inputs.divide(keep_prob) * random_tensor # divide is to keep same output expectation + return output + + def forward(self, inputs): + return self.drop_path(inputs) + +class PatchEmbedding(nn.Layer): + """CSwin Patch Embedding + This patch embedding has a 7x7 conv + layernorm, the output tensor + is reshaped to [Batch, H*W, embed_dim]. Note that the patch is applied + by a conv with overlap (using patch_stride). + + Args: + patch_stride: int, patch stride size, default: 4 + in_channels: int, number of channels of input image, default: 3 + embed_dim: int, output feature dimension, default: 96 + """ + def __init__(self, patch_stride=4, in_channels=3, embed_dim=96): + super().__init__() + self.patch_embed = nn.Conv2D(in_channels=in_channels, + out_channels=embed_dim, + kernel_size=7, + stride=patch_stride, + padding=2) + self.norm = nn.LayerNorm(embed_dim) + + def forward(self, x): + x = self.patch_embed(x) # [batch, embed_dim, h, w], h = w = image_size / 4 + x = x.flatten(start_axis=2, stop_axis=-1) # [batch, embed_dim, h*w] + x = x.transpose([0, 2, 1]) # [batch, h*w, embed_dim] + x = self.norm(x) + return x + + +class Mlp(nn.Layer): + """ MLP module + + Impl using nn.Linear and activation is GELU, dropout is applied. + Ops: fc -> act -> dropout -> fc -> dropout + + Attributes: + fc1: nn.Linear + fc2: nn.Linear + act: GELU + dropout1: dropout after fc1 + dropout2: dropout after fc2 + """ + def __init__(self, in_features, hidden_features, dropout): + super().__init__() + w_attr_1, b_attr_1 = self._init_weights() + self.fc1 = nn.Linear(in_features, + hidden_features, + weight_attr=w_attr_1, + bias_attr=b_attr_1) + + w_attr_2, b_attr_2 = self._init_weights() + self.fc2 = nn.Linear(hidden_features, + in_features, + weight_attr=w_attr_2, + bias_attr=b_attr_2) + self.act = nn.GELU() + self.dropout = nn.Dropout(dropout) + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.XavierUniform()) + bias_attr = paddle.ParamAttr(initializer=paddle.nn.initializer.Normal(std=1e-6)) + return weight_attr, bias_attr + + def forward(self, x): + x = self.fc1(x) + x = self.act(x) + x = self.dropout(x) + x = self.fc2(x) + x = self.dropout(x) + return x + + +def img2windows(img, h_split, w_split): + """Convert input tensor into split stripes + + Args: + img: tensor, image tensor with shape [B, C, H, W] + h_split: int, splits width in height direction + w_split: int, splits width in width direction + Returns: + out: tensor, splitted image + """ + B, C, H, W = img.shape + out = img.reshape([B, C, H // h_split, h_split, W // w_split, w_split]) + out = out.transpose([0, 2, 4, 3, 5, 1]) # [B, H//h_split, W//w_split, h_split, w_split, C] + out = out.reshape([-1, h_split * w_split, C]) # [B, H//h_split, W//w_split, h_split*w_split, C] + return out + + +def windows2img(img_splits, h_split, w_split, img_h, img_w): + """Convert splitted stripes back + + Args: + img_splits: tensor, image tensor with shape [B, C, H, W] + h_split: int, splits width in height direction + w_split: int, splits width in width direction + img_h: int, original tensor height + img_w: int, original tensor width + Returns: + img: tensor, original tensor + """ + #print("h_split={}, 
w_split={}, img_h={}, img_w={}".format(h_split, w_split, img_h, img_w)) + B = int(img_splits.shape[0] / (img_h / h_split * img_w / w_split)) + #print("img_splits.shape:", img_splits.shape) + img = img_splits.reshape([B, img_h // h_split, img_w // w_split, h_split, w_split, -1]) + img = img.transpose([0, 1, 3, 2, 4, 5]) #[B,img_h//h_split, h_split, img_w//w_split, w_split,C] + img = img.reshape([B, img_h, img_w, -1]) # [B, img_h, img_w, C] + return img + + +class LePEAttention(nn.Layer): + """Cross Shaped Window self-attention with Locally enhanced positional encoding""" + def __init__(self, + dim, + h_split=7, + w_split=7, + num_heads=8, + attention_dropout=0., + dropout=0., + qk_scale=None): + super().__init__() + self.dim = dim + self.num_heads = num_heads + self.dim_head = dim // num_heads + self.scale = qk_scale or self.dim_head ** -0.5 + self.h_split = h_split + self.w_split = w_split + + self.get_v = nn.Conv2D(in_channels=dim, + out_channels=dim, + kernel_size=3, + stride=1, + padding=1, + groups=dim) + + self.softmax = nn.Softmax(axis=-1) + self.attn_dropout = nn.Dropout(attention_dropout) + + def im2cswin(self, x): + B, HW, C = x.shape + H = W = int(np.sqrt(HW)) + x = x.transpose([0, 2, 1]) # [B, C, H*W] + x = x.reshape([B, C, H, W]) # [B, C, H, W] + x = img2windows(x, self.h_split, self.w_split) + x = x.reshape([-1, self.h_split * self.w_split, self.num_heads, self.dim_head]) + x = x.transpose([0, 2, 1, 3]) + return x + + def get_lepe(self, x, func): + """Locally Enhanced Positional Encoding (LePE) + This module applies a depthwise conv on V and returns the lepe + Args: + x: tensor, the input tensor V + func: nn.Layer, a depth wise conv of kernel 3 stride 1 and padding 1 + """ + B, HW, C = x.shape + H = W = int(np.sqrt(HW)) + h_split = self.h_split + w_split = self.w_split + + x = x.transpose([0, 2, 1]) # [B, C, H*W] + x = x.reshape([B, C, H, W]) # [B, C, H, W] + x = x.reshape([B, C, H // h_split, h_split, W // w_split, w_split]) + x = x.transpose([0, 2, 4, 1, 3, 5]) # [B, H//h_split, W//w_split, C, h_split, w_split] + x = x.reshape([-1, C, h_split, w_split]) # [B*(H//h_split)*(W//w_split), C, h_split, w_split] + + lepe = func(x) # depth wise conv does not change shape + #lepe = lepe.reshape([-1, self.num_heads, C // self.num_heads, h_split * w_split]) + lepe = lepe.reshape([-1, self.num_heads, self.dim_head, h_split * w_split]) + lepe = lepe.transpose([0, 1, 3, 2]) # [B, num_heads, h_spllit*w_split, dim_head] + + x = x.reshape([-1, self.num_heads, self.dim_head, h_split * w_split]) + x = x.transpose([0, 1, 3, 2]) # [B, num_heads, h_split*wsplit, dim_head] + return x, lepe + + def forward(self, q, k, v): + B, HW, C = q.shape + H = W = int(np.sqrt(HW)) + q = self.im2cswin(q) + k = self.im2cswin(k) + v, lepe = self.get_lepe(v, self.get_v) + + q = q * self.scale + attn = paddle.matmul(q, k, transpose_y=True) + attn = self.softmax(attn) + attn = self.attn_dropout(attn) + + z = paddle.matmul(attn, v) + z = z + lepe + z = z.transpose([0, 2, 1, 3]) + z = z.reshape([-1, self.h_split * self.w_split, C]) + + z = windows2img(z, self.h_split, self.w_split, H, W) + z = z.reshape([B, -1, C]) + return z + + +class CSwinBlock(nn.Layer): + """CSwin Block + + CSwin block contains a LePE attention modual, a linear projection, + a mlp layer, and related norms layers. In the first 3 stages, the + LePE attention moduals used 2 branches, where horizontal and + vertical split stripes are used for self attention and a concat + op is applied to combine the outputs. 
The last stage does not + have branche in LePE attention. + + Args: + dim: int, input feature dimension + input_resolution: int, input feature spatial size. + num_heads: int, num of attention heads in current stage + split_size: int, the split size in current stage + mlp_ratio: float, mlp ratio, mlp_hidden_dim = mlp_ratio * mlp_in_dim, default: 4. + qkv_bias: bool, if set True, qkv projection will have bias, default: True + qk_scale: float, if set, replace the orig qk_scale (dim_head ** -0.5), default: None + dropout: float, dropout rate for linear projection, default: 0 + attention_dropout: float, dropout rate for attention, default: 0 + droppath: float, drop path rate, default: 0 + split_heads: bool, if True, split heads is applied (True for 1,2,3 stages), default: True + """ + def __init__(self, + dim, + input_resolution, + num_heads, + split_size=7, + mlp_ratio=4., + qkv_bias=False, + qk_scale=None, + attention_dropout=0., + dropout=0., + droppath=0., + split_heads=True): + super().__init__() + self.dim = dim + self.input_resolution = (input_resolution, input_resolution) + self.num_heads = num_heads + self.dim_head = dim // num_heads + self.mlp_ratio = mlp_ratio + self.split_size = split_size + self.norm1 = nn.LayerNorm(dim) + self.qkv = nn.Linear(dim, dim * 3, bias_attr=qkv_bias) + self.attns = nn.LayerList() + self.split_heads = split_heads + + + num_branches = 2 if split_heads else 1 + pad_r, pad_b = self.get_pad_rb() + if split_heads: # first 3 stages + #splits = [self.input_resolution[0], self.split_size] # horizantal splits + splits = [self.input_resolution[0] + pad_b, self.split_size] # horizantal splits + else: # last stage + #splits = [self.input_resolution[0], self.input_resolution[0]] + splits = [self.input_resolution[0] + pad_b, self.input_resolution[1] + pad_r] + for _ in range(num_branches): + attn = LePEAttention(dim=dim//num_branches, + h_split=splits[0], + w_split=splits[1], + num_heads=num_heads//num_branches, + qk_scale=qk_scale, + attention_dropout=attention_dropout, + dropout=dropout) + self.attns.append(copy.deepcopy(attn)) + # switch splits from horizantal to vertical + # NOTE: may need to change for different H and W + splits[0], splits[1] = splits[1], splits[0] + + self.proj = nn.Linear(dim, dim) + self.drop_path = DropPath(droppath) if droppath > 0. 
else Identity() + self.norm2 = nn.LayerNorm(dim) + self.mlp = Mlp(in_features=dim, + hidden_features=int(dim * mlp_ratio), + dropout=dropout) + + def chunk_qkv(self, x, chunks=1, axis=-1): + x = x.chunk(chunks, axis=axis) + return x + + def get_pad_rb(self,): + H, W = self.input_resolution + pad_r = (self.split_size - H % self.split_size) % self.split_size + pad_b = (self.split_size - W % self.split_size) % self.split_size + return pad_r, pad_b + + def forward(self, x): + H, W = self.input_resolution + B, HW, C = x.shape + # cswin attention + h = x + x = self.norm1(x) + + # pad feature maps to multiples of windown size + # add these codes for semantic segmentation or other downstream tasks (Rosun) + s = int(np.sqrt(HW)) + x = x.reshape([B, s, s, C]) + pad_l = pad_t = 0 + #pad_r = (self.split_size - s % self.split_size) % self.split_size + #pad_b = (self.split_size - s % self.split_size) % self.split_size + pad_r, pad_b = self.get_pad_rb() + x = x.transpose([0, 3, 1, 2]) # (B,C,H,W) + x = F.pad(x, [pad_l, pad_r, pad_t, pad_b]) + x = x.transpose([0, 2, 3, 1]) # (B,Hp,Wp,C) + _, Hp, Wp, _ = x.shape + x = x.reshape([B, Hp*Wp, C]) + + qkv = self.qkv(x).chunk(3, axis=-1) # qkv is a tuple of [q, k, v] + chunks = 2 if self.split_heads else 1 + # qkv[0].shape = [B, H * W, embd_dim] + q, k, v = map(self.chunk_qkv, qkv, (chunks,) * 3) # map requries list/tuple inputs + if self.split_heads: # first 3 stages + h_attn = self.attns[0](q[0], k[0], v[0]) + w_attn = self.attns[1](q[1], k[1], v[1]) + attn = paddle.concat([h_attn, w_attn], axis=2) + else: # last stage + attn = self.attns[0](q[0], k[0], v[0]) + attn = self.proj(attn) + # remove padding (Rosun) + if pad_r > 0 or pad_b > 0: + attn = attn.reshape([B,Hp,Wp,C]) + attn = attn[:, :H, :W, :] + attn = attn.reshape([B, H*W, C]) + + attn = self.drop_path(attn) + x = h + attn + # mlp + residual + h = x + x = self.norm2(x) + x = self.mlp(x) + x = self.drop_path(x) + x = h + x + return x + + +class MergeBlock(nn.Layer): + def __init__(self, dim_in, dim_out): + super().__init__() + self.conv = nn.Conv2D(in_channels=dim_in, + out_channels=dim_out, + kernel_size=3, + stride=2, + padding=1) + self.norm = nn.LayerNorm(dim_out) + + def forward(self, x): + B, HW, C = x.shape + H = W = int(np.sqrt(HW)) + x = x.transpose([0, 2, 1]) # [B, C, HW] + x = x.reshape([B, C, H, W]) # [B, C, H, W] + x = self.conv(x) + new_shape = [x.shape[0], x.shape[1], -1] # [B, C', H*W] + x = x.reshape(new_shape) # [B, C', H*W] + x = x.transpose([0, 2, 1]) # [B, H*W, C'] + x = self.norm(x) + return x + + +class CSwinStage(nn.Layer): + """ CSwin Stage, each stage contains multi blocks + + CSwin has 4 stages, the first 3 stages are using head split. The last + stage does not have head split. There is a merge block between each + 2 stages. + + Args: + dim: int, input feature dimension + depth: int, number of blocks in current stage + num_heads: int, num of attention heads in current stage + split_size: int, the split size in current stage + mlp_ratio: float, mlp ratio, mlp_hidden_dim = mlp_ratio * mlp_in_dim, default: 4. 
+ qkv_bias: bool, if set True, qkv projection will have bias, default: True + qk_scale: float, if set, replace the orig qk_scale (dim_head ** -0.5), default: None + dropout: float, dropout rate for linear projection, default: 0 + attention_dropout: float, dropout rate for attention, default: 0 + droppath: float, drop path rate, default: 0 + last_stage: bool, if current stage is the last stage, default: False + """ + def __init__(self, + dim, + input_resolution, + depth, + num_heads, + split_size, + mlp_ratio=4., + qkv_bias=True, + qk_scale=None, + dropout=0., + attention_dropout=0., + droppath=0., + last_stage=False): + super().__init__() + self.blocks = nn.LayerList() + for i in range(depth): + block = CSwinBlock(dim=dim, + input_resolution=input_resolution, + num_heads=num_heads, + split_size=split_size, + mlp_ratio=mlp_ratio, + qkv_bias=qkv_bias, + qk_scale=qk_scale, + attention_dropout=attention_dropout, + dropout=dropout, + droppath=droppath[i] if isinstance(droppath, list) else droppath, + split_heads=not last_stage) + self.blocks.append(copy.deepcopy(block)) + # last stage does not need merge layer + self.merge = MergeBlock(dim_in=dim, dim_out=dim * 2) if not last_stage else Identity() + + def forward(self, x): + for block in self.blocks: + x = block(x) + x_down = self.merge(x) + return x, x_down + + +class CSwinTransformer(nn.Layer): + """CSwin Transformer class + Args: + image_size: int, input image size, default: 224 + patch_stride: int, stride for patch embedding, default: 4 + in_channels: int, num of channels of input image, default: 3 + num_classes: int, num of classes, default: 1000 + embed_dim: int, embedding dim (patch embed out dim), default: 96 + depths: list/tuple(int), number of blocks in each stage, default: [2, 4, 32, 2] + splits: list/tuple(int), the split number in each stage, default: [1, 2, 7, 7] + num_heads: list/tuple(int), num of attention heads in each stage, default: [4, 8, 16, 32] + mlp_ratio: float, mlp ratio, mlp_hidden_dim = mlp_ratio * mlp_in_dim, default: 4. 
+ qkv_bias: bool, if set True, qkv projection will have bias, default: True + qk_scale: float, if set, replace the orig qk_scale (dim_head ** -0.5), default: None + dropout: float, dropout rate for linear projection, default: 0 + attention_dropout: float, dropout rate for attention, default: 0 + droppath: float, drop path rate, default: 0 + """ + def __init__(self, config): + super(CSwinTransformer, self).__init__() + image_size = config.DATA.CROP_SIZE + patch_stride = config.MODEL.TRANS.PATCH_SIZE + in_channels = config.MODEL.TRANS.IN_CHANNELS + num_classes = config.DATA.NUM_CLASSES + embed_dim = config.MODEL.TRANS.EMBED_DIM + depths = config.MODEL.TRANS.STAGE_DEPTHS + splits = config.MODEL.TRANS.SPLIT_SIZES + num_heads = config.MODEL.TRANS.NUM_HEADS + mlp_ratio = config.MODEL.TRANS.MLP_RATIO + qkv_bias = config.MODEL.TRANS.QKV_BIAS + qk_scale = config.MODEL.TRANS.QK_SCALE + dropout = config.MODEL.DROPOUT + attention_dropout = config.MODEL.ATTENTION_DROPOUT + droppath = config.MODEL.DROP_PATH + self.out_indices = config.MODEL.ENCODER.OUT_INDICES + + # token embedding + self.patch_embedding = PatchEmbedding(patch_stride=patch_stride, + in_channels=in_channels, + embed_dim=embed_dim) + # drop path decay by stage + depth_decay = [x.item() for x in paddle.linspace(0, droppath, sum(depths))] + dim = embed_dim + resolution = image_size[0] // 4 + self.stages = nn.LayerList() + num_stages = len(depths) + # construct CSwin stages: each stage has multiple blocks + for stage_idx in range(num_stages): + stage = CSwinStage(dim=dim, + input_resolution=resolution, + depth=depths[stage_idx], + num_heads=num_heads[stage_idx], + split_size=splits[stage_idx], + mlp_ratio=mlp_ratio, + qkv_bias=qkv_bias, + qk_scale=qk_scale, + dropout=dropout, + attention_dropout=attention_dropout, + droppath=depth_decay[ + sum(depths[:stage_idx]):sum(depths[:stage_idx+1])], + last_stage=stage_idx == num_stages-1) + self.stages.append(stage) + if stage_idx != num_stages - 1: + dim = dim * 2 + resolution = resolution // 2 + + def forward(self, x): + x = self.patch_embedding(x) + outs = [] + for idx in range(len(self.stages)): + x_stage, x = self.stages[idx](x) + if idx in self.out_indices: + outs.append(x_stage) + return outs diff --git a/semantic_segmentation/src/models/backbones/focal_transformer.py b/semantic_segmentation/src/models/backbones/focal_transformer.py new file mode 100644 index 00000000..a9329ac6 --- /dev/null +++ b/semantic_segmentation/src/models/backbones/focal_transformer.py @@ -0,0 +1,973 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
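+
+"""
+Implement Transformer Class for Focal Transformer
+"""
+
+# Illustrative sketch (an assumption for readability, not code from this repo):
+# like the CSwin backbone above, the encoder in this file is built from the YAML
+# config (MODEL.ENCODER.TYPE: 'FocalTransformer') and returns one feature map per
+# stage listed in MODEL.ENCODER.OUT_INDICES; the UperHead decoder then consumes
+# them via MODEL.UPERHEAD.IN_CHANNELS / IN_INDEX. Roughly:
+#
+#   encoder = FocalTransformer(config)   # constructor signature assumed here
+#   feats = encoder(images)              # 4 maps at roughly 1/4, 1/8, 1/16, 1/32 scale
+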
+ +import math +import numpy as np +import paddle +from paddle import nn +from paddle.nn import functional as F +from .swin_transformer import Identity, DropPath, Mlp, windows_partition, windows_reverse +import sys + + +def window_partition_noreshape(x, window_size): + r"""window_partition_noreshape + Args: + x: (B, H, W, C) + window_size (int): window size + Returns: + windows: (B, num_windows_h, num_windows_w, window_size, window_size, C) + """ + B, H, W, C = x.shape + x = x.reshape((B, H // window_size, window_size, W // window_size, window_size, C)) + windows = x.transpose((0, 1, 3, 2, 4, 5)) + return windows + + +def get_relative_position_index(q_windows, k_windows): + r""" + Args: + q_windows: tuple (query_window_height, query_window_width) + k_windows: tuple (key_window_height, key_window_width) + Returns: + relative_position_index: + query_window_height*query_window_width, key_window_height*key_window_width + """ + # get pair-wise relative position index for each token inside the window + coords_h_q = paddle.arange(q_windows[0]) + coords_w_q = paddle.arange(q_windows[1]) + coords_q = paddle.stack(paddle.meshgrid([coords_h_q, coords_w_q])) # 2, Wh_q, Ww_q + + coords_h_k = paddle.arange(k_windows[0]) + coords_w_k = paddle.arange(k_windows[1]) + coords_k = paddle.stack(paddle.meshgrid([coords_h_k, coords_w_k])) # 2, Wh, Ww + + coords_flatten_q = paddle.flatten(coords_q, 1) # 2, Wh_q*Ww_q + coords_flatten_k = paddle.flatten(coords_k, 1) # 2, Wh_k*Ww_k + + coords_flatten_q = paddle.unsqueeze(coords_flatten_q, axis=-1) # 2, Wh_q*Ww_q, 1 + coords_flatten_k = paddle.unsqueeze(coords_flatten_k, axis=-2) # 2, 1, Ww_k*Ww_k + + relative_coords = coords_flatten_q - coords_flatten_k # 2, Wh_q*Ww_q, Wh_k*Ww_k + relative_coords = relative_coords.transpose((1, 2, 0)) # Wh_q*Ww_q, Wh_k*Ww_k, 2 + relative_coords[:, :, 0] += k_windows[0] - 1 # shift to start from 0 + relative_coords[:, :, 1] += k_windows[1] - 1 + relative_coords[:, :, 0] *= (q_windows[1] + k_windows[1]) - 1 + relative_position_index = relative_coords.sum(-1) # Wh_q*Ww_q, Wh_k*Ww_k + return relative_position_index + + +class WindowAttention(nn.Layer): + r""" Window based multi-head self attention (W-MSA) module with relative position bias. + Args: + dim (int): Number of input channels. + expand_size (int): The expand size at focal level 1. + window_size (tuple[int]): The height and width of the window. + focal_window (int): Focal region size. + focal_level (int): Focal attention level. + num_heads (int): Number of attention heads. + qkv_bias (bool, optional): If True, add a learnable bias to query, key, value. + Default: True + qk_scale (float | None, optional): Override default qk scale of head_dim ** -0.5 if set + attn_drop (float, optional): Dropout ratio of attention weight. Default: 0.0 + proj_drop (float, optional): Dropout ratio of output. Default: 0.0 + pool_method (str): window pooling method. 
Default: none + """ + def __init__(self, dim, expand_size, window_size, focal_window, + focal_level, num_heads, qkv_bias=True, qk_scale=None, + attn_drop=0., proj_drop=0., pool_method="none"): + super().__init__() + self.dim = dim + self.expand_size = expand_size + self.window_size = window_size # Wh, Ww + self.pool_method = pool_method + self.num_heads = num_heads + head_dim = dim // num_heads + self.scale = qk_scale or head_dim ** -0.5 + self.focal_level = focal_level + self.focal_window = focal_window + + weight_attr, bias_attr = self._init_weights() + + # define a parameter table of relative position bias for each window + self.relative_position_bias_table = paddle.create_parameter( + shape=((2 * window_size[0] - 1) * (2 * window_size[1] - 1), num_heads), + dtype=np.float32, is_bias=True) # 2*Wh-1 * 2*Ww-1, nH + + # get pair-wise relative position index for each token inside the window + coords_h = paddle.arange(self.window_size[0]) + coords_w = paddle.arange(self.window_size[1]) + coords = paddle.stack(paddle.meshgrid([coords_h, coords_w])) # 2, Wh, Ww + coords_flatten = paddle.flatten(coords, 1) # 2, Wh*Ww + + coords_flatten_l = paddle.unsqueeze(coords_flatten, axis=-1) # 2, Wh*Ww, 1 + coords_flatten_r = paddle.unsqueeze(coords_flatten, axis=-2) # 2, 1, Wh*Ww + relative_coords = coords_flatten_l - coords_flatten_r # 2, Wh*Ww, Wh*Ww + + relative_coords = relative_coords.transpose((1, 2, 0)) # Wh*Ww, Wh*Ww, 2 + relative_coords[:, :, 0] += self.window_size[0] - 1 # shift to start from 0 + relative_coords[:, :, 1] += self.window_size[1] - 1 + relative_coords[:, :, 0] *= 2 * self.window_size[1] - 1 + relative_position_index = relative_coords.sum(-1) # Wh*Ww, Wh*Ww + self.register_buffer("relative_position_index", relative_position_index) + + if self.expand_size > 0 and focal_level > 0: + # define a parameter table of position bias between window + # and its fine-grained surroundings + self.window_size_of_key = self.window_size[0] * \ + self.window_size[1] if self.expand_size == 0 else \ + (4 * self.window_size[0] * self.window_size[1] - 4 * \ + (self.window_size[0] - self.expand_size) * \ + (self.window_size[0] - self.expand_size)) + + self.relative_position_bias_table_to_neighbors = paddle.create_parameter( + shape=(1, num_heads, + self.window_size[0] * self.window_size[1], self.window_size_of_key), + dtype=np.float32, is_bias=True, + attr=nn.initializer.TruncatedNormal(std=.02)) # Wh*Ww, nH, nSurrounding + + # get mask for rolled k and rolled v + mask_tl = paddle.ones((self.window_size[0], self.window_size[1])) + mask_tl[:-self.expand_size, :-self.expand_size] = 0 + mask_tr = paddle.ones((self.window_size[0], self.window_size[1])) + mask_tr[:-self.expand_size, self.expand_size:] = 0 + mask_bl = paddle.ones((self.window_size[0], self.window_size[1])) + mask_bl[self.expand_size:, :-self.expand_size] = 0 + mask_br = paddle.ones((self.window_size[0], self.window_size[1])) + mask_br[self.expand_size:, self.expand_size:] = 0 + mask_rolled = paddle.stack((mask_tl, mask_tr, mask_bl, mask_br), 0).flatten(0) + self.register_buffer("valid_ind_rolled", paddle.flatten(mask_rolled.nonzero())) + + if pool_method != "none" and focal_level > 1: + self.relative_position_bias_table_to_windows = nn.ParameterList() + self.unfolds = nn.LayerList() + + # build relative position bias between local patch and pooled windows + for k in range(focal_level-1): + stride = 2**k + kernel_size = 2*(self.focal_window // 2) + 2**k + (2**k-1) + # define unfolding operations + self.unfolds.append( + nn.Unfold( + 
kernel_sizes=[kernel_size, kernel_size], + strides=stride, paddings=kernel_size // 2) + ) + + # define relative position bias table + relative_position_bias_table_to_windows = paddle.create_parameter( + shape=(self.num_heads, + (self.window_size[0] + self.focal_window + 2**k - 2) * \ + (self.window_size[1] + self.focal_window + 2**k - 2), ), + dtype=np.float32, is_bias=True, + attr=nn.initializer.TruncatedNormal(std=.02)) # Wh*Ww, nH, nSurrounding + self.relative_position_bias_table_to_windows.append( + relative_position_bias_table_to_windows) + + # define relative position bias index + relative_position_index_k = get_relative_position_index(self.window_size, + (self.focal_window + 2**k - 1, + self.focal_window + 2**k - 1)) + self.register_buffer("relative_position_index_{}".format(k), + relative_position_index_k) + + # define unfolding index for focal_level > 0 + if k > 0: + mask = paddle.zeros(kernel_size, kernel_size) + mask[(2**k)-1:, (2**k)-1:] = 1 + self.register_buffer("valid_ind_unfold_{}".format(k), + paddle.flatten(mask.flatten(0).nonzero())) + + self.qkv = nn.Linear(dim, dim * 3, weight_attr=weight_attr, + bias_attr=bias_attr if qkv_bias else False) + self.attn_drop = nn.Dropout(attn_drop) + self.proj = nn.Linear(dim, dim, weight_attr=weight_attr, bias_attr=bias_attr) + self.proj_drop = nn.Dropout(proj_drop) + self.softmax = nn.Softmax(axis=-1) + + def forward(self, x_all, mask_all=None): + """ + Args: + x_all (list[Tensors]): input features at different granularity + mask_all (list[Tensors/None]): masks for input features at different granularity + """ + x = x_all[0] + + B, nH, nW, C = x.shape + qkv = self.qkv(x).reshape((B, nH, nW, 3, C)).transpose((3, 0, 1, 2, 4)) + q, k, v = qkv[0], qkv[1], qkv[2] # B, nH, nW, C + + + # partition q map + q_windows = windows_partition(q, self.window_size[0]).reshape( + (-1, self.window_size[0] * self.window_size[0], + self.num_heads, C // self.num_heads)).transpose((0, 2, 1, 3)) + k_windows = windows_partition(k, self.window_size[0]).reshape( + (-1, self.window_size[0] * self.window_size[0], + self.num_heads, C // self.num_heads)).transpose((0, 2, 1, 3)) + v_windows = windows_partition(v, self.window_size[0]).reshape( + (-1, self.window_size[0] * self.window_size[0], + self.num_heads, C // self.num_heads)).transpose((0, 2, 1, 3)) + + if self.expand_size > 0 and self.focal_level > 0: + k_tl = paddle.roll(k, shifts=(-self.expand_size, -self.expand_size), axis=(1, 2)) + v_tl = paddle.roll(v, shifts=(-self.expand_size, -self.expand_size), axis=(1, 2)) + + k_tr = paddle.roll(k, shifts=(-self.expand_size, self.expand_size), axis=(1, 2)) + v_tr = paddle.roll(v, shifts=(-self.expand_size, self.expand_size), axis=(1, 2)) + + k_bl = paddle.roll(k, shifts=(self.expand_size, -self.expand_size), axis=(1, 2)) + v_bl = paddle.roll(v, shifts=(self.expand_size, -self.expand_size), axis=(1, 2)) + + k_br = paddle.roll(k, shifts=(self.expand_size, self.expand_size), axis=(1, 2)) + v_br = paddle.roll(v, shifts=(self.expand_size, self.expand_size), axis=(1, 2)) + + + k_tl_windows = windows_partition(k_tl, self.window_size[0]).reshape( + (-1, self.window_size[0] * self.window_size[0], self.num_heads, C // self.num_heads)) + k_tr_windows = windows_partition(k_tr, self.window_size[0]).reshape( + (-1, self.window_size[0] * self.window_size[0], self.num_heads, C // self.num_heads)) + k_bl_windows = windows_partition(k_bl, self.window_size[0]).reshape( + (-1, self.window_size[0] * self.window_size[0], self.num_heads, C // self.num_heads)) + k_br_windows = 
windows_partition(k_br, self.window_size[0]).reshape( + (-1, self.window_size[0] * self.window_size[0], self.num_heads, C // self.num_heads)) + + v_tl_windows = windows_partition(v_tl, self.window_size[0]).reshape( + (-1, self.window_size[0] * self.window_size[0], self.num_heads, C // self.num_heads)) + v_tr_windows = windows_partition(v_tr, self.window_size[0]).reshape( + (-1, self.window_size[0] * self.window_size[0], self.num_heads, C // self.num_heads)) + v_bl_windows = windows_partition(v_bl, self.window_size[0]).reshape( + (-1, self.window_size[0] * self.window_size[0], self.num_heads, C // self.num_heads)) + v_br_windows = windows_partition(v_br, self.window_size[0]).reshape( + (-1, self.window_size[0] * self.window_size[0], self.num_heads, C // self.num_heads)) + + k_rolled = paddle.concat((k_tl_windows, k_tr_windows, + k_bl_windows, k_br_windows), 1).transpose((0, 2, 1, 3)) + v_rolled = paddle.concat((v_tl_windows, v_tr_windows, + v_bl_windows, v_br_windows), 1).transpose((0, 2, 1, 3)) + + # mask out tokens in current window + k_rolled = paddle.gather(k_rolled, self.valid_ind_rolled.flatten(), axis=2) + v_rolled = paddle.gather(v_rolled, self.valid_ind_rolled.flatten(), axis=2) + k_rolled = paddle.concat((k_windows, k_rolled), 2) + v_rolled = paddle.concat((v_windows, v_rolled), 2) + else: + k_rolled = k_windows + v_rolled = v_windows + + if self.pool_method != "none" and self.focal_level > 1: + k_pooled = [] + v_pooled = [] + for k in range(self.focal_level-1): + stride = 2**k + x_window_pooled = x_all[k+1] # B, nWh, nWw, C + nWh, nWw = x_window_pooled.shape[1:3] + + # generate mask for pooled windows + mask = paddle.ones(shape=(nWh, nWw)).astype(x_window_pooled.dtype) + unfolded_mask = self.unfolds[k](mask.unsqueeze(0).unsqueeze(1)).reshape(( + 1, 1, self.unfolds[k].kernel_sizes[0], + self.unfolds[k].kernel_sizes[1], -1)).transpose((0, 4, 2, 3, 1)).\ + reshape((nWh*nWw // stride // stride, -1, 1)) + + if k > 0: + valid_ind_unfold_k = getattr(self, "valid_ind_unfold_{}".format(k)) + unfolded_mask = unfolded_mask[:, valid_ind_unfold_k] + + x_window_masks = unfolded_mask.flatten(1).unsqueeze(0) + # from numpy to paddle + x_window_masks = x_window_masks.numpy() + x_window_masks[x_window_masks==0] = -100.0 + x_window_masks[x_window_masks>0] = 0.0 + x_window_masks = paddle.to_tensor(x_window_masks.astype(np.float32)) + mask_all[k+1] = x_window_masks + + # generate k and v for pooled windows + qkv_pooled = self.qkv(x_window_pooled).reshape((B, nWh, nWw, 3, C)).transpose( + (3, 0, 4, 1, 2)) + k_pooled_k, v_pooled_k = qkv_pooled[1], qkv_pooled[2] # B, C, nWh, nWw + + # (B x (nH*nW)) x nHeads x (unfold_wsize x unfold_wsize) x head_dim + k_pooled_k = self.unfolds[k](k_pooled_k).reshape(( + B, C, self.unfolds[k].kernel_sizes[0], + self.unfolds[k].kernel_sizes[1], -1)).transpose( + (0, 4, 2, 3, 1)).reshape((-1, + self.unfolds[k].kernel_sizes[0]*self.unfolds[k].kernel_sizes[1], + self.num_heads, C // self.num_heads)).transpose((0, 2, 1, 3)) + v_pooled_k = self.unfolds[k](v_pooled_k).reshape(( + B, C, self.unfolds[k].kernel_sizes[0], + self.unfolds[k].kernel_sizes[1], -1)).transpose( + (0, 4, 2, 3, 1)).reshape((-1, + self.unfolds[k].kernel_sizes[0]*self.unfolds[k].kernel_sizes[1], + self.num_heads, C // self.num_heads)).transpose((0, 2, 1, 3)) + + if k > 0: + k_pooled_k = k_pooled_k[:, :, valid_ind_unfold_k] + v_pooled_k = v_pooled_k[:, :, valid_ind_unfold_k] + + k_pooled += [k_pooled_k] + v_pooled += [v_pooled_k] + k_all = paddle.concat([k_rolled] + k_pooled, 2) + v_all = 
paddle.concat([v_rolled] + v_pooled, 2) + else: + k_all = k_rolled + v_all = v_rolled + + N = k_all.shape[-2] + q_windows = q_windows * self.scale + # B*nW, nHead, window_size*window_size, focal_window_size*focal_window_size + attn = (paddle.mm(q_windows, k_all.transpose((0, 1, 3, 2)))) + + window_area = self.window_size[0] * self.window_size[1] + window_area_rolled = k_rolled.shape[2] + + # add relative position bias for tokens inside window + # Wh*Ww,Wh*Ww,nH + relative_position_bias = self.relative_position_bias_table[ + self.relative_position_index.flatten()].reshape(( + self.window_size[0] * self.window_size[1], + self.window_size[0] * self.window_size[1], -1)) + # nH, Wh*Ww, Wh*Ww + relative_position_bias = relative_position_bias.transpose((2, 0, 1)) + attn[:, :, :window_area, :window_area] = attn[:, :, :window_area, :window_area] + \ + relative_position_bias.unsqueeze(0) + + # add relative position bias for patches inside a window + if self.expand_size > 0 and self.focal_level > 0: + attn[:, :, :window_area, window_area:window_area_rolled] = attn[:, :, :window_area, + window_area:window_area_rolled] + self.relative_position_bias_table_to_neighbors + + if self.pool_method != "none" and self.focal_level > 1: + # add relative position bias for different windows in an image + offset = window_area_rolled + for k in range(self.focal_level-1): + # add relative position bias + relative_position_index_k = getattr(self, 'relative_position_index_{}'.format(k)) + relative_position_bias_to_windows = self.relative_position_bias_table_to_windows[k] + relative_position_bias_to_windows = paddle.gather( + relative_position_bias_to_windows, relative_position_index_k.flatten(), + axis=1).reshape((-1, self.window_size[0] * self.window_size[1], + (self.focal_window+2**k-1)**2, + )) # nH, NWh*NWw,focal_region*focal_region + attn[:, :, :window_area, offset:(offset + (self.focal_window+2**k-1)**2)] = \ + attn[:, :, :window_area, offset:(offset + (self.focal_window+2**k-1)**2)] + \ + relative_position_bias_to_windows.unsqueeze(0) + # add attentional mask + if mask_all[k+1] is not None: + attn[:, :, :window_area, offset:(offset + (self.focal_window+2**k-1)**2)] = \ + attn[:, :, :window_area, offset:(offset + \ + (self.focal_window+2**k-1)**2)] + \ + paddle.stack([mask_all[k+1].unsqueeze(-2).unsqueeze(-2)] * \ + (attn.shape[0] // mask_all[k+1].shape[1]), axis=0).\ + reshape((-1, 1, 1, mask_all[k+1].shape[-1])) + offset += (self.focal_window+2**k-1)**2 + + if mask_all[0] is not None: + nW = mask_all[0].shape[0] + attn = attn.reshape((attn.shape[0] // nW, nW, self.num_heads, window_area, N)) + attn[:, :, :, :, :window_area] = attn[:, :, :, :, :window_area] + \ + mask_all[0].unsqueeze(0).unsqueeze(2) + attn = attn.reshape((-1, self.num_heads, window_area, N)) + attn = self.softmax(attn) + else: + attn = self.softmax(attn) + + attn = self.attn_drop(attn) + x = paddle.mm(attn, v_all).transpose((0, 2, 1, 3)).reshape( + (attn.shape[0], window_area, C)) + x = self.proj(x) + x = self.proj_drop(x) + return x + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(0)) + return weight_attr, bias_attr + + +class FocalTransformerBlock(nn.Layer): + r""" Focal Transformer Block. + Args: + dim (int): Number of input channels. + input_resolution (tuple[int]): Input resulotion. + num_heads (int): Number of attention heads. + window_size (int): Window size. 
+ expand_size (int): expand size at first focal level (finest level). + shift_size (int): Shift size for SW-MSA. + mlp_ratio (float): Ratio of mlp hidden dim to embedding dim. + qkv_bias (bool, optional): If True, add a learnable bias to query, key, value. + Default: True + qk_scale (float | None, optional): Override default qk scale of head_dim ** -0.5 if set. + drop (float, optional): Dropout rate. Default: 0.0 + attn_drop (float, optional): Attention dropout rate. Default: 0.0 + drop_path (float, optional): Stochastic depth rate. Default: 0.0 + act_layer (nn.Module, optional): Activation layer. Default: nn.GELU + norm_layer (nn.Module, optional): Normalization layer. Default: nn.LayerNorm + pool_method (str): window pooling method. Default: none, options: [none|fc|conv] + focal_level (int): number of focal levels. Default: 1. + focal_window (int): region size of focal attention. Default: 1 + use_layerscale (bool): whether use layer scale for training stability. Default: False + layerscale_value (float): scaling value for layer scale. Default: 1e-4 + """ + def __init__(self, dim, input_resolution, num_heads, window_size=7, expand_size=0, + shift_size=0, mlp_ratio=4., qkv_bias=True, qk_scale=None, drop=0., + attn_drop=0., drop_path=0., act_layer=nn.GELU, norm_layer=nn.LayerNorm, + pool_method="none", focal_level=1, focal_window=1, use_layerscale=False, + layerscale_value=1e-4): + super(FocalTransformerBlock, self).__init__() + self.dim = dim + self.input_resolution = input_resolution + self.num_heads = num_heads + self.window_size = window_size + self.shift_size = shift_size + self.expand_size = expand_size + self.mlp_ratio = mlp_ratio + self.pool_method = pool_method + self.focal_level = focal_level + self.focal_window = focal_window + self.use_layerscale = use_layerscale + + weight_attr, bias_attr = self._init_weights() + + if min(self.input_resolution) <= self.window_size: + # if window size is larger than input resolution, we don't partition windows + self.expand_size = 0 + self.shift_size = 0 + self.window_size = min(self.input_resolution) + assert 0 <= self.shift_size < self.window_size, "shift_size must in 0-window_size" + + self.window_size_glo = self.window_size + + self.pool_layers = nn.LayerList() + if self.pool_method != "none": + for k in range(self.focal_level-1): + window_size_glo = math.floor(self.window_size_glo / (2 ** k)) + if self.pool_method == "fc": + self.pool_layers.append(nn.Linear(window_size_glo * window_size_glo, 1, + weight_attr=weight_attr, bias_attr=bias_attr)) + self.pool_layers[len(self.pool_layers)-1].weight.set_value( + paddle.full_like(self.pool_layers[len(self.pool_layers)-1].weight, + 1./(window_size_glo * window_size_glo)) + ) + self.pool_layers[len(self.pool_layers)-1].bias.set_value( + paddle.full_like(self.pool_layers[len(self.pool_layers)-1].bias, 0) + ) + + elif self.pool_method == "conv": + self.pool_layers.append(nn.Conv2D(dim, dim, + kernel_size=window_size_glo, + stride=window_size_glo, groups=dim)) + + self.norm1 = norm_layer(dim, + weight_attr=paddle.ParamAttr(initializer=nn.initializer.Constant(1.0)), + bias_attr=bias_attr) + + self.attn = WindowAttention( + dim, expand_size=self.expand_size, + window_size=(self.window_size, self.window_size), + focal_window=focal_window, focal_level=focal_level, + num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale, + attn_drop=attn_drop,proj_drop=drop, pool_method=pool_method) + + self.drop_path = DropPath(drop_path) if drop_path > 0. 
else Identity() + self.norm2 = norm_layer(dim, + weight_attr=paddle.ParamAttr(initializer=nn.initializer.Constant(1.0)), + bias_attr=bias_attr) + mlp_hidden_dim = int(dim * mlp_ratio) + self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, dropout=drop) + + if self.shift_size > 0: + # calculate attention mask for SW-MSA + H, W = self.input_resolution + img_mask = paddle.zeros((1, H, W, 1)) # 1 H W 1 + h_slices = (slice(0, -self.window_size), + slice(-self.window_size, -self.shift_size), + slice(-self.shift_size, None)) + w_slices = (slice(0, -self.window_size), + slice(-self.window_size, -self.shift_size), + slice(-self.shift_size, None)) + cnt = 0 + for h in h_slices: + for w in w_slices: + img_mask[:, h, w, :] = cnt + cnt += 1 + + # nW, window_size, window_size, 1 + mask_windows = windows_partition(img_mask, self.window_size) + mask_windows = mask_windows.reshape((-1, self.window_size * self.window_size)) + attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2) + # from numpy to paddle + attn_mask = attn_mask.numpy() + attn_mask[attn_mask!=0] = -100.0 + attn_mask[attn_mask==0] = 0.0 + attn_mask = paddle.to_tensor(attn_mask.astype(np.float32)) + else: + attn_mask = None + self.register_buffer("attn_mask", attn_mask) + + if self.use_layerscale: + self.gamma_1 = paddle.create_parameter(layerscale_value * paddle.ones((dim))) + self.gamma_2 = paddle.create_parameter(layerscale_value * paddle.ones((dim))) + + def forward(self, x): + H, W = self.input_resolution + B, L, C = x.shape + assert L == H * W, "input feature has wrong size" + + shortcut = x + x = self.norm1(x) + x = x.reshape((B, H, W, C)) + + # pad feature maps to multiples of window size + pad_l = pad_t = 0 + pad_r = (self.window_size - W % self.window_size) % self.window_size + pad_b = (self.window_size - H % self.window_size) % self.window_size + if pad_r > 0 or pad_b > 0: + x = F.pad(x, [0, 0, pad_l, pad_r, pad_t, pad_b, 0, 0]) + + B, H, W, C = x.shape + + if self.shift_size > 0: + shifted_x = paddle.roll(x, shifts=(-self.shift_size, -self.shift_size), axis=(1, 2)) + else: + shifted_x = x + + x_windows_all = [shifted_x] + x_window_masks_all = [self.attn_mask] + + if self.focal_level > 1 and self.pool_method != "none": + # if we add coarser granularity and the pool method is not none + for k in range(self.focal_level-1): + window_size_glo = math.floor(self.window_size_glo / (2 ** k)) + pooled_h = math.ceil(H / self.window_size) * (2 ** k) + pooled_w = math.ceil(W / self.window_size) * (2 ** k) + H_pool = pooled_h * window_size_glo + W_pool = pooled_w * window_size_glo + + x_level_k = shifted_x + # trim or pad shifted_x depending on the required size + if H > H_pool: + trim_t = (H - H_pool) // 2 + trim_b = H - H_pool - trim_t + x_level_k = x_level_k[:, trim_t:-trim_b] + elif H < H_pool: + pad_t = (H_pool - H) // 2 + pad_b = H_pool - H - pad_t + x_level_k = F.pad(x_level_k, [0, 0, 0, 0, pad_t, pad_b, 0, 0]) + + if W > W_pool: + trim_l = (W - W_pool) // 2 + trim_r = W - W_pool - trim_l + x_level_k = x_level_k[:, :, trim_l:-trim_r] + elif W < W_pool: + pad_l = (W_pool - W) // 2 + pad_r = W_pool - W - pad_l + x_level_k = F.pad(x_level_k, [0, 0, pad_l, pad_r, 0, 0, 0, 0]) + + # B, nw, nw, window_size, window_size, C + x_windows_noreshape = window_partition_noreshape(x_level_k, window_size_glo) + nWh, nWw = x_windows_noreshape.shape[1:3] + + if self.pool_method == "mean": + # B, nWh, nWw, C + x_windows_pooled = x_windows_noreshape.mean([3, 4]) + elif self.pool_method == "max": + # B, nWh, nWw, C + x_windows_pooled = 
x_windows_noreshape.max(-2)[0].max(-2)[0].reshape( + (B, nWh, nWw, C)) + elif self.pool_method == "fc": + # B, nWh, nWw, C, wsize**2 + x_windows_noreshape = x_windows_noreshape.reshape((B, nWh, nWw, + window_size_glo*window_size_glo, C)).transpose( + (0, 1, 2, 4, 3)) + # B, nWh, nWw, C + x_windows_pooled = self.pool_layers[k](x_windows_noreshape).flatten(-2) + elif self.pool_method == "conv": + # B * nw * nw, C, wsize, wsize + x_windows_noreshape = x_windows_noreshape.reshape((-1, + window_size_glo, window_size_glo, C)).transpose( + (0, 3, 1, 2)) + # B, nWh, nWw, C + x_windows_pooled = self.pool_layers[k](x_windows_noreshape).reshape( + (B, nWh, nWw, C)) + + x_windows_all += [x_windows_pooled] + x_window_masks_all += [None] + + # nW*B, window_size*window_size, C + attn_windows = self.attn(x_windows_all, mask_all=x_window_masks_all) + attn_windows = attn_windows[:, :self.window_size ** 2] + + x = self.merge_windows_and_ffn(attn_windows, shortcut, B, C, H, W) + + return x + + + def merge_windows_and_ffn(self, attn_windows, shortcut, B, C, H, W): + attn_windows = attn_windows.reshape((-1, self.window_size, self.window_size, C)) + shifted_x = windows_reverse(attn_windows, self.window_size, H, W) # B H' W' C + + # reverse cyclic shift + x = self.reverse_cyclic_shift(shifted_x) + x = x[:, :self.input_resolution[0], :self.input_resolution[1]].reshape((B, -1, C)) + + # FFN + x = self.ffn(x, shortcut) + + return x + + + def reverse_cyclic_shift(self, shifted_x): + if self.shift_size > 0: + x = paddle.roll(shifted_x, shifts=(self.shift_size, self.shift_size), axis=(1, 2)) + else: + x = shifted_x + return x + + + def ffn(self, x, shortcut): + x = shortcut + self.drop_path(x if (not self.use_layerscale) else (self.gamma_1 * x)) + x = x + self.drop_path(self.mlp(self.norm2(x)) if (not self.use_layerscale) else ( + self.gamma_2 * self.mlp(self.norm2(x)))) + return x + + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(0)) + return weight_attr, bias_attr + + +class BasicLayer(nn.Layer): + """ A basic Focal Transformer layer for one stage. + Args: + dim (int): Number of input channels. + input_resolution (tuple[int]): Input resolution. + depth (int): Number of blocks. + num_heads (int): Number of attention heads. + window_size (int): Local window size. + expand_size (int): expand size for focal level 1. + expand_layer (str): expand layer. Default: all + mlp_ratio (float): Ratio of mlp hidden dim to embedding dim. Default: 4.0. + qkv_bias (bool, optional): If True, add a learnable bias to query, key, value. + Default: True + qk_scale (float | None, optional): Override default qk scale of head_dim ** -0.5 if set. + drop (float, optional): Dropout rate. Default: 0.0 + attn_drop (float, optional): Attention dropout rate. Default: 0.0 + drop_path (float | tuple[float], optional): Stochastic depth rate. Default: 0.0 + norm_layer (nn.Module, optional): Normalization layer. Default: nn.LayerNorm + pool_method (str): Window pooling method. Default: none. + focal_level (int): Number of focal levels. Default: 1. + focal_window (int): region size at each focal level. Default: 1. + use_conv_embed (bool): whether use overlapped convolutional patch embedding layer. + Default: False + use_shift (bool): Whether use window shift as in Swin Transformer. Default: False + use_pre_norm (bool): Whether use pre-norm before patch embedding projection for stability. 
+ Default: False + downsample (nn.Module | None, optional): Downsample layer at the end of the layer. + Default: None + use_layerscale (bool): Whether use layer scale for stability. Default: False. + layerscale_value (float): Layerscale value. Default: 1e-4. + """ + def __init__(self, dim, input_resolution, depth, num_heads, window_size, + expand_size, expand_layer="all", mlp_ratio=4., qkv_bias=True, + qk_scale=None, drop=0., attn_drop=0., drop_path=0., norm_layer=nn.LayerNorm, + pool_method="none", focal_level=1, focal_window=1, use_conv_embed=False, + use_shift=False, use_pre_norm=False,downsample=None, use_layerscale=False, + layerscale_value=1e-4): + + super(BasicLayer, self).__init__() + self.dim = dim + self.input_resolution = input_resolution + self.depth = depth + + if expand_layer == "even": + expand_factor = 0 + elif expand_layer == "odd": + expand_factor = 1 + elif expand_layer == "all": + expand_factor = -1 + + # build blocks + self.blocks = nn.LayerList([ + FocalTransformerBlock(dim=dim, input_resolution=input_resolution, + num_heads=num_heads, window_size=window_size, + shift_size=(0 if (i % 2 == 0) else window_size // 2) if use_shift else 0, + expand_size=0 if (i % 2 == expand_factor) else expand_size, + mlp_ratio=mlp_ratio, + qkv_bias=qkv_bias, qk_scale=qk_scale, + drop=drop, + attn_drop=attn_drop, + drop_path=drop_path[i] if isinstance(drop_path, list) else drop_path, + norm_layer=norm_layer, + pool_method=pool_method, + focal_level=focal_level, + focal_window=focal_window, + use_layerscale=use_layerscale, + layerscale_value=layerscale_value) + for i in range(depth)]) + + # patch merging layer + if downsample is not None: + self.downsample = downsample( + img_size=input_resolution, patch_size=2, in_chans=dim, embed_dim=2*dim, + use_conv_embed=use_conv_embed, norm_layer=norm_layer, use_pre_norm=use_pre_norm, + is_stem=False + ) + else: + self.downsample = None + + def forward(self, x): + for blk in self.blocks: + x = blk(x) + if self.downsample is not None: + # x = x.reshape((x.shape[0], self.input_resolution[0], self.input_resolution[1], -1)).transpose((0, 3, 1, 2)) + x_down = self.downsample(x.reshape((x.shape[0], self.input_resolution[0], self.input_resolution[1], -1)).transpose((0, 3, 1, 2))) + return [x, x_down] + return [x, x] + + +class PatchEmbed(nn.Layer): + r""" Image to Patch Embedding + Args: + img_size (int): Image size. Default: 224. + patch_size (int): Patch token size. Default: 4. + in_chans (int): Number of input image channels. Default: 3. + embed_dim (int): Number of linear projection output channels. Default: 96. + use_conv_embed (bool): Wherther use overlapped convolutional embedding layer. + Default: False. + norm_layer (nn.Module, optional): Normalization layer. Default: None + use_pre_norm (bool): Whether use pre-normalization before projection. Default: False + is_stem (bool): Whether current patch embedding is stem. 
Default: False + """ + + def __init__(self, img_size=(224, 224), patch_size=4, in_chans=3, embed_dim=96, + use_conv_embed=False, norm_layer=None, use_pre_norm=False, is_stem=False): + super().__init__() + patch_size = (patch_size, patch_size) + patches_resolution = [img_size[0] // patch_size[0], img_size[1] // patch_size[1]] + self.img_size = img_size + self.patch_size = patch_size + self.patches_resolution = patches_resolution + self.num_patches = patches_resolution[0] * patches_resolution[1] + + self.in_chans = in_chans + self.embed_dim = embed_dim + self.use_pre_norm = use_pre_norm + + weight_attr, bias_attr = self._init_weights() + + if use_conv_embed: + # if we choose to use conv embedding, + # then we treat the stem and non-stem differently + if is_stem: + kernel_size = 7 + padding = 2 + stride = 4 + else: + kernel_size = 3 + padding = 1 + stride = 2 + self.proj = nn.Conv2D(in_chans, embed_dim, kernel_size=kernel_size, + stride=stride, padding=padding) + else: + self.proj = nn.Conv2D(in_chans, embed_dim, + kernel_size=patch_size, stride=patch_size) + + + if self.use_pre_norm: + if norm_layer is not None: + self.pre_norm = nn.GroupNorm(1, in_chans) + else: + self.pre_norm = None + + if norm_layer is not None: + self.norm = norm_layer(embed_dim, + weight_attr=paddle.ParamAttr(initializer=nn.initializer.Constant(1.0)), + bias_attr=bias_attr) + else: + self.norm = None + + def forward(self, x): + B, C, H, W = x.shape + + assert H == self.img_size[0] and W == self.img_size[1], \ + f"Input image size ({H}*{W}) doesn't match model ({self.img_size[0]}*{self.img_size[1]})." + + if self.use_pre_norm: + x = self.pre_norm(x) + + x = self.proj(x).flatten(2).transpose((0, 2, 1)) # B Ph*Pw C + if self.norm is not None: + x = self.norm(x) + return x + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(0)) + return weight_attr, bias_attr + + +class FocalTransformer(nn.Layer): + r"""Focal Transformer:Focal Self-attention for Local-Global Interactions in Vision Transformer + Args: + img_size (int | tuple(int)): Input image size. Default 224 + patch_size (int | tuple(int)): Patch size. Default: 4 + in_chans (int): Number of input image channels. Default: 3 + num_classes (int): Number of classes for classification head. Default: 1000 + embed_dim (int): Patch embedding dimension. Default: 96 + depths (tuple(int)): Depth of each Focal Transformer layer. + num_heads (tuple(int)): Number of attention heads in different layers. + window_size (int): Window size. Default: 7 + mlp_ratio (float): Ratio of mlp hidden dim to embedding dim. Default: 4 + qkv_bias (bool): If True, add a learnable bias to query, key, value. Default: True + qk_scale (float): Override default qk scale of head_dim ** -0.5 if set. Default: None + drop_rate (float): Dropout rate. Default: 0 + attn_drop_rate (float): Attention dropout rate. Default: 0 + drop_path_rate (float): Stochastic depth rate. Default: 0.1 + norm_layer (nn.Module): Normalization layer. Default: nn.LayerNorm. + ape (bool): If True, add absolute position embedding to + the patch embedding. Default: False + patch_norm (bool): If True, add normalization after patch embedding. Default: True + use_shift (bool): Whether to use window shift proposed by Swin Transformer. + We observe that using shift or not does not make difference to + our Focal Transformer.Default: False + focal_stages (list): Which stages to perform focal attention. 
+ Default: [0, 1, 2, 3], means all stages + focal_levels (list): How many focal levels at all stages. + Note that this excludes the finest-grain level. Default: [1, 1, 1, 1] + focal_windows (list): The focal window size at all stages. Default: [7, 5, 3, 1] + expand_stages (list): Which stages to expand the finest grain window. + Default: [0, 1, 2, 3], means all stages + expand_sizes (list): The expand size for the finest grain level. Default: [3, 3, 3, 3] + expand_layer (str): Which layers we want to expand the window for the finest grain leve. + This can save computational and memory cost + without the loss of performance. Default: "all" + use_conv_embed (bool): Whether use convolutional embedding. + We noted that using convolutional embedding + usually improve the performance, + but we do not use it by default. Default: False + use_layerscale (bool): Whether use layerscale proposed in CaiT. Default: False + layerscale_value (float): Value for layer scale. Default: 1e-4 + use_pre_norm (bool): Whether use pre-norm in patch merging/embedding layer to + control the feature magtigute. Default: False + """ + def __init__(self, config): + super().__init__() + + + self.focal_stages = config.MODEL.TRANS.FOCAL_STAGES # [0, 1, 2, 3] + self.focal_levels = config.MODEL.TRANS.FOCAL_LEVELS # [1, 1, 1, 1] + self.focal_windows = config.MODEL.TRANS.FOCAL_WINDOWS #[7, 5, 3, 1], + self.expand_stages = config.MODEL.TRANS.EXPAND_STAGES # [0, 1, 2, 3] + self.expand_sizes = config.MODEL.TRANS.EXPAND_SIZES # [3, 3, 3, 3] + self.window_size = config.MODEL.TRANS.WINDOW_SIZE + self.num_heads = config.MODEL.TRANS.NUM_HEADS + self.depths = config.MODEL.TRANS.STAGE_DEPTHS + self.num_classes = config.DATA.NUM_CLASSES + self.num_layers = len(self.depths) + self.embed_dim = config.MODEL.TRANS.EMBED_DIM + self.ape = False + self.patch_norm = True + self.num_features = int(self.embed_dim * 2 ** (self.num_layers - 1)) + self.mlp_ratio = config.MODEL.TRANS.MLP_RATIO + self.qkv_bias = config.MODEL.TRANS.QKV_BIAS + self.qk_scale = config.MODEL.TRANS.QK_SCALE + self.drop_rate = config.MODEL.DROPOUT + self.attn_drop_rate = config.MODEL.ATTENTION_DROPOUT + self.drop_path_rate = config.MODEL.DROP_PATH + self.out_indices = config.MODEL.ENCODER.OUT_INDICES + self.use_conv_embed = config.MODEL.TRANS.USE_CONV_EMBED + + weight_attr, bias_attr = self._init_weights() + + # split image into patches using either non-overlapped embedding + # or overlapped embedding + self.patch_embed = PatchEmbed( + img_size=config.DATA.CROP_SIZE, + patch_size=config.MODEL.TRANS.PATCH_SIZE, + in_chans=config.MODEL.TRANS.IN_CHANNELS, + embed_dim=self.embed_dim, + use_conv_embed=self.use_conv_embed, is_stem=True, + norm_layer=nn.LayerNorm if self.patch_norm else None) + + num_patches = self.patch_embed.num_patches + patches_resolution = self.patch_embed.patches_resolution + self.patches_resolution = patches_resolution + + # absolute position embedding + if self.ape: + self.absolute_pos_embed = paddle.create_parameter(shape=(1, num_patches, self.embed_dim), + dtype=np.float32, is_bias=True, + attr=nn.initializer.TruncatedNormal(std=.02)) + + self.pos_drop = nn.Dropout(p=self.drop_rate) + + # stochastic depth + # stochastic depth decay rule + dpr = [x.numpy().item() for x in paddle.linspace(0, self.drop_path_rate, sum(self.depths))] + + # build layers + self.layers = nn.LayerList() + for i_layer in range(self.num_layers): + layer = BasicLayer(dim=int(self.embed_dim * 2 ** i_layer), + input_resolution=(patches_resolution[0] // (2 ** i_layer), + 
patches_resolution[1] // (2 ** i_layer)), + depth=self.depths[i_layer], + num_heads=self.num_heads[i_layer], + window_size=self.window_size, + mlp_ratio=self.mlp_ratio, + qkv_bias=self.qkv_bias, + qk_scale=self.qk_scale, + drop=self.drop_rate, + attn_drop=self.attn_drop_rate, + drop_path=dpr[sum(self.depths[:i_layer]):sum(self.depths[:i_layer + 1])], + norm_layer=nn.LayerNorm, + pool_method="fc" if i_layer in self.focal_stages else "none", + downsample=PatchEmbed if (i_layer < self.num_layers - 1) else None, + focal_level=self.focal_levels[i_layer], + focal_window=self.focal_windows[i_layer], + expand_size=self.expand_sizes[i_layer], + expand_layer="all", + use_conv_embed=self.use_conv_embed, + use_shift=False, + use_pre_norm=False, + use_layerscale=False, + layerscale_value=1e-4) + self.layers.append(layer) + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(0)) + return weight_attr, bias_attr + + def no_weight_decay(self): + return {'absolute_pos_embed'} + + def no_weight_decay_keywords(self): + return {'relative_position_bias_table', + 'relative_position_bias_table_to_neighbors', + 'relative_position_bias_table_to_windows'} + + + def forward(self, x): + outs = [] + x = self.patch_embed(x) + if self.ape: + x = x + self.absolute_pos_embed + x = self.pos_drop(x) + for idx in range(len(self.layers)): + x_out, x = self.layers[idx](x) + if idx in self.out_indices: + outs.append(x_out) + return outs diff --git a/semantic_segmentation/src/models/backbones/mix_transformer.py b/semantic_segmentation/src/models/backbones/mix_transformer.py index 81d0a70a..3bde8119 100644 --- a/semantic_segmentation/src/models/backbones/mix_transformer.py +++ b/semantic_segmentation/src/models/backbones/mix_transformer.py @@ -1,3 +1,17 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + """ Implement Mix Transformer of Segformer Segformer: https://arxiv.org/abs/2105.15203 diff --git a/semantic_segmentation/src/models/backbones/swin_transformer.py b/semantic_segmentation/src/models/backbones/swin_transformer.py index 48cccb30..a172776f 100644 --- a/semantic_segmentation/src/models/backbones/swin_transformer.py +++ b/semantic_segmentation/src/models/backbones/swin_transformer.py @@ -1,3 +1,17 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ """ Implement Transformer Class for Swin Transformer """ diff --git a/semantic_segmentation/src/models/backbones/vit.py b/semantic_segmentation/src/models/backbones/vit.py index 6094f013..a4d00805 100644 --- a/semantic_segmentation/src/models/backbones/vit.py +++ b/semantic_segmentation/src/models/backbones/vit.py @@ -1,3 +1,17 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + """ Implement Transformer Class for ViT """ diff --git a/semantic_segmentation/src/models/backbones/vit_mla.py b/semantic_segmentation/src/models/backbones/vit_mla.py index 1dcff07b..d1c0adb9 100644 --- a/semantic_segmentation/src/models/backbones/vit_mla.py +++ b/semantic_segmentation/src/models/backbones/vit_mla.py @@ -1,3 +1,17 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + """ Implement Transformer Class for ViT_MLA """ diff --git a/semantic_segmentation/src/models/decoders/dpt_head.py b/semantic_segmentation/src/models/decoders/dpt_head.py index 2f102de4..10ed7e0c 100644 --- a/semantic_segmentation/src/models/decoders/dpt_head.py +++ b/semantic_segmentation/src/models/decoders/dpt_head.py @@ -1,3 +1,17 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + import copy import paddle import paddle.nn as nn diff --git a/semantic_segmentation/src/models/decoders/fcn_head.py b/semantic_segmentation/src/models/decoders/fcn_head.py index bd9cf0bc..b4c03630 100644 --- a/semantic_segmentation/src/models/decoders/fcn_head.py +++ b/semantic_segmentation/src/models/decoders/fcn_head.py @@ -1,3 +1,17 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + import paddle import paddle.nn as nn import paddle.nn.functional as F @@ -57,7 +71,7 @@ def forward(self, x): up_resolution = [ self.up_ratio*item for item in x.shape[2:]] output = self.convs(x) if self.concat_input: - output = slef.conv_cat(paddle.concat([x, output], axis=1)) + output = self.conv_cat(paddle.concat([x, output], axis=1)) if self.dropout is not None: output = self.dropout(output) output = self.conv_seg(output) diff --git a/semantic_segmentation/src/models/decoders/psp_head.py b/semantic_segmentation/src/models/decoders/psp_head.py index d12ff5ae..3b91e353 100644 --- a/semantic_segmentation/src/models/decoders/psp_head.py +++ b/semantic_segmentation/src/models/decoders/psp_head.py @@ -1,3 +1,17 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + import paddle import paddle.nn as nn import paddle.nn.functional as F diff --git a/semantic_segmentation/src/models/decoders/segformer_head.py b/semantic_segmentation/src/models/decoders/segformer_head.py index ec2d2655..2c012db4 100644 --- a/semantic_segmentation/src/models/decoders/segformer_head.py +++ b/semantic_segmentation/src/models/decoders/segformer_head.py @@ -1,3 +1,17 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + """ Implement The all MLP Head of Segformer Segformer: https://arxiv.org/abs/2105.15203 diff --git a/semantic_segmentation/src/models/decoders/segmentor_head.py b/semantic_segmentation/src/models/decoders/segmentor_head.py index b7d79805..d4f27216 100644 --- a/semantic_segmentation/src/models/decoders/segmentor_head.py +++ b/semantic_segmentation/src/models/decoders/segmentor_head.py @@ -1,3 +1,17 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + import copy import paddle import paddle.nn as nn @@ -88,4 +102,4 @@ def forward(self, x): masks = masks.reshape((masks.shape[0], H, W, masks.shape[-1])) masks = masks.transpose((0, 3, 1, 2)) - return masks \ No newline at end of file + return masks diff --git a/semantic_segmentation/src/models/decoders/trans2seg_head.py b/semantic_segmentation/src/models/decoders/trans2seg_head.py index 229d8ed2..cca1ecf6 100644 --- a/semantic_segmentation/src/models/decoders/trans2seg_head.py +++ b/semantic_segmentation/src/models/decoders/trans2seg_head.py @@ -1,3 +1,16 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. import paddle import paddle.nn as nn diff --git a/semantic_segmentation/src/models/decoders/uper_head.py b/semantic_segmentation/src/models/decoders/uper_head.py index 716864c0..191c4b0c 100644 --- a/semantic_segmentation/src/models/decoders/uper_head.py +++ b/semantic_segmentation/src/models/decoders/uper_head.py @@ -1,3 +1,17 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + import paddle import paddle.nn as nn import paddle.nn.functional as F diff --git a/semantic_segmentation/src/models/decoders/vit_mla_auxi_head.py b/semantic_segmentation/src/models/decoders/vit_mla_auxi_head.py index bffaf9e8..a6056b3a 100644 --- a/semantic_segmentation/src/models/decoders/vit_mla_auxi_head.py +++ b/semantic_segmentation/src/models/decoders/vit_mla_auxi_head.py @@ -1,3 +1,17 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ import paddle import paddle.nn as nn import paddle.nn.functional as F diff --git a/semantic_segmentation/src/models/decoders/vit_mla_head.py b/semantic_segmentation/src/models/decoders/vit_mla_head.py index a9750d6d..114a3330 100644 --- a/semantic_segmentation/src/models/decoders/vit_mla_head.py +++ b/semantic_segmentation/src/models/decoders/vit_mla_head.py @@ -1,3 +1,17 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + import paddle import paddle.nn as nn import paddle.nn.functional as F diff --git a/semantic_segmentation/src/models/decoders/vit_up_head.py b/semantic_segmentation/src/models/decoders/vit_up_head.py index c8fe077a..6426e819 100644 --- a/semantic_segmentation/src/models/decoders/vit_up_head.py +++ b/semantic_segmentation/src/models/decoders/vit_up_head.py @@ -1,3 +1,17 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + import paddle import paddle.nn as nn import paddle.nn.functional as F diff --git a/semantic_segmentation/src/models/dpt.py b/semantic_segmentation/src/models/dpt.py index 6f5dc7bb..66297e4e 100644 --- a/semantic_segmentation/src/models/dpt.py +++ b/semantic_segmentation/src/models/dpt.py @@ -1,3 +1,17 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + """ This module implements DPT Vision Transformers for Dense Prediction @@ -24,4 +38,4 @@ def forward(self, inputs): def init__decoder_lr_coef(self, coef): for param in self.head.parameters(): - param.optimize_attr['learning_rate'] = coef \ No newline at end of file + param.optimize_attr['learning_rate'] = coef diff --git a/semantic_segmentation/src/models/focal.py b/semantic_segmentation/src/models/focal.py new file mode 100644 index 00000000..12899418 --- /dev/null +++ b/semantic_segmentation/src/models/focal.py @@ -0,0 +1,1143 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import math +import numpy as np +import paddle +from paddle import nn +from paddle.nn import functional as F + +class DropPath(nn.Layer): + r"""DropPath class""" + def __init__(self, drop_prob=None): + super(DropPath, self).__init__() + self.drop_prob = drop_prob + + def drop_path(self, inputs): + """drop path op + Args: + input: tensor with arbitrary shape + drop_prob: float number of drop path probability, default: 0.0 + training: bool, set if current mode is training, default: False + Returns: + output: output tensor after drop path + """ + # if prob is 0 or eval mode, return original input + if self.drop_prob == 0. or not self.training: + return inputs + keep_prob = 1 - self.drop_prob + keep_prob = paddle.to_tensor(keep_prob, dtype='float32') + shape = (inputs.shape[0], ) + (1, ) * (inputs.ndim - 1) # shape=(N, 1, 1, 1) + random_tensor = keep_prob + paddle.rand(shape, dtype=inputs.dtype) + random_tensor = random_tensor.floor() # mask + # divide is to keep same output expectation + output = inputs.divide(keep_prob) * random_tensor + return output + + def forward(self, inputs): + return self.drop_path(inputs) + + +class Identity(nn.Layer): + r""" Identity layer + The output of this layer is the input without any change. 
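The `DropPath` layer above implements stochastic depth: during training, each sample's residual branch is dropped with probability `drop_prob`, and surviving samples are rescaled by `1 / keep_prob` so the expected output matches the input. Below is a minimal standalone sketch of the same idea (the function name and shapes are chosen only for this demo, not part of the patch):

```python
import paddle

def drop_path_demo(x, drop_prob=0.2, training=True):
    # identity in eval mode or when nothing is dropped
    if drop_prob == 0. or not training:
        return x
    keep_prob = 1.0 - drop_prob
    # one Bernoulli draw per sample, broadcast over the remaining dims
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    mask = paddle.floor(keep_prob + paddle.rand(shape, dtype=x.dtype))
    # rescale kept samples so the expectation is unchanged
    return x / keep_prob * mask

x = paddle.ones([4, 3, 8, 8])
y = drop_path_demo(x, drop_prob=0.5)
print(y.shape)  # [4, 3, 8, 8]; roughly half of the samples are zeroed out
```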
+ Use this layer to avoid using 'if' condition in forward methods + """ + def __init__(self): + super().__init__() + + def forward(self, x): + return x + + +class Mlp(nn.Layer): + r""" MLP module + """ + def __init__(self, in_features, hidden_features=None, + out_features=None, act_layer=nn.GELU, drop=0.): + super().__init__() + out_features = out_features or in_features + hidden_features = hidden_features or in_features + + weight_attr, bias_attr = self._init_weights() + + self.fc1 = nn.Linear(in_features, hidden_features, + weight_attr=weight_attr, bias_attr=bias_attr) + self.act = act_layer() + self.fc2 = nn.Linear(hidden_features, out_features, + weight_attr=weight_attr, bias_attr=bias_attr) + self.drop = nn.Dropout(drop) + + def forward(self, x): + x = self.fc1(x) + x = self.act(x) + x = self.drop(x) + x = self.fc2(x) + x = self.drop(x) + return x + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(0)) + return weight_attr, bias_attr + + +def window_partition(x, window_size): + r"""window_partition + Args: + x: (B, H, W, C) + window_size (int): window size + Returns: + windows: (num_windows*B, window_size, window_size, C) + """ + B, H, W, C = x.shape + x = x.reshape((B, H // window_size, window_size, W // window_size, window_size, C)) + windows = x.transpose((0, 1, 3, 2, 4, 5)).reshape((-1, window_size, window_size, C)) + return windows + + +def window_partition_noreshape(x, window_size): + r"""window_partition_noreshape + Args: + x: (B, H, W, C) + window_size (int): window size + Returns: + windows: (B, num_windows_h, num_windows_w, window_size, window_size, C) + """ + B, H, W, C = x.shape + x = x.reshape((B, H // window_size, window_size, W // window_size, window_size, C)) + windows = x.transpose((0, 1, 3, 2, 4, 5)) + return windows + + +def window_reverse(windows, window_size, H, W): + r"""window_reverse + Args: + windows: (num_windows*B, window_size, window_size, C) + window_size (int): Window size + H (int): Height of image + W (int): Width of image + Returns: + x: (B, H, W, C) + """ + B = int(windows.shape[0] / (H * W / window_size / window_size)) + x = windows.reshape((B, H // window_size, W // window_size, window_size, window_size, -1)) + x = x.transpose((0, 1, 3, 2, 4, 5)).reshape((B, H, W, -1)) + return x + + +def get_relative_position_index(q_windows, k_windows): + r""" + Args: + q_windows: tuple (query_window_height, query_window_width) + k_windows: tuple (key_window_height, key_window_width) + Returns: + relative_position_index: + query_window_height*query_window_width, key_window_height*key_window_width + """ + # get pair-wise relative position index for each token inside the window + coords_h_q = paddle.arange(q_windows[0]) + coords_w_q = paddle.arange(q_windows[1]) + coords_q = paddle.stack(paddle.meshgrid([coords_h_q, coords_w_q])) # 2, Wh_q, Ww_q + + coords_h_k = paddle.arange(k_windows[0]) + coords_w_k = paddle.arange(k_windows[1]) + coords_k = paddle.stack(paddle.meshgrid([coords_h_k, coords_w_k])) # 2, Wh, Ww + + coords_flatten_q = paddle.flatten(coords_q, 1) # 2, Wh_q*Ww_q + coords_flatten_k = paddle.flatten(coords_k, 1) # 2, Wh_k*Ww_k + + coords_flatten_q = paddle.unsqueeze(coords_flatten_q, axis=-1) # 2, Wh_q*Ww_q, 1 + coords_flatten_k = paddle.unsqueeze(coords_flatten_k, axis=-2) # 2, 1, Ww_k*Ww_k + + relative_coords = coords_flatten_q - coords_flatten_k # 2, Wh_q*Ww_q, Wh_k*Ww_k + relative_coords = relative_coords.transpose((1, 2, 
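`window_partition` and `window_reverse` are exact inverses: one slices a (B, H, W, C) map into non-overlapping windows, the other stitches them back. A small shape check, assuming H and W are multiples of the window size (variable names are illustrative only):

```python
import paddle

B, H, W, C, ws = 2, 8, 8, 16, 4
x = paddle.randn([B, H, W, C])

# (B, H, W, C) -> (num_windows*B, ws, ws, C)
win = x.reshape((B, H // ws, ws, W // ws, ws, C)) \
       .transpose((0, 1, 3, 2, 4, 5)) \
       .reshape((-1, ws, ws, C))
print(win.shape)  # [8, 4, 4, 16]  (2 * (8/4) * (8/4) windows)

# inverse: (num_windows*B, ws, ws, C) -> (B, H, W, C)
x_back = win.reshape((B, H // ws, W // ws, ws, ws, C)) \
            .transpose((0, 1, 3, 2, 4, 5)) \
            .reshape((B, H, W, C))
print(bool(paddle.all(x_back == x)))  # True: the round trip is lossless
```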
0)) # Wh_q*Ww_q, Wh_k*Ww_k, 2 + relative_coords[:, :, 0] += k_windows[0] - 1 # shift to start from 0 + relative_coords[:, :, 1] += k_windows[1] - 1 + relative_coords[:, :, 0] *= (q_windows[1] + k_windows[1]) - 1 + relative_position_index = relative_coords.sum(-1) # Wh_q*Ww_q, Wh_k*Ww_k + return relative_position_index + + +class WindowAttention(nn.Layer): + r""" Window based multi-head self attention (W-MSA) module with relative position bias. + Args: + dim (int): Number of input channels. + expand_size (int): The expand size at focal level 1. + window_size (tuple[int]): The height and width of the window. + focal_window (int): Focal region size. + focal_level (int): Focal attention level. + num_heads (int): Number of attention heads. + qkv_bias (bool, optional): If True, add a learnable bias to query, key, value. + Default: True + qk_scale (float | None, optional): Override default qk scale of head_dim ** -0.5 if set + attn_drop (float, optional): Dropout ratio of attention weight. Default: 0.0 + proj_drop (float, optional): Dropout ratio of output. Default: 0.0 + pool_method (str): window pooling method. Default: none + """ + def __init__(self, dim, expand_size, window_size, focal_window, + focal_level, num_heads, qkv_bias=True, qk_scale=None, + attn_drop=0., proj_drop=0., pool_method="none"): + super().__init__() + self.dim = dim + self.expand_size = expand_size + self.window_size = window_size # Wh, Ww + self.pool_method = pool_method + self.num_heads = num_heads + head_dim = dim // num_heads + self.scale = qk_scale or head_dim ** -0.5 + self.focal_level = focal_level + self.focal_window = focal_window + + weight_attr, bias_attr = self._init_weights() + + # define a parameter table of relative position bias for each window + self.relative_position_bias_table = paddle.create_parameter( + shape=((2 * window_size[0] - 1) * (2 * window_size[1] - 1), num_heads), + dtype=np.float32, is_bias=True) # 2*Wh-1 * 2*Ww-1, nH + + # get pair-wise relative position index for each token inside the window + coords_h = paddle.arange(self.window_size[0]) + coords_w = paddle.arange(self.window_size[1]) + coords = paddle.stack(paddle.meshgrid([coords_h, coords_w])) # 2, Wh, Ww + coords_flatten = paddle.flatten(coords, 1) # 2, Wh*Ww + + coords_flatten_l = paddle.unsqueeze(coords_flatten, axis=-1) # 2, Wh*Ww, 1 + coords_flatten_r = paddle.unsqueeze(coords_flatten, axis=-2) # 2, 1, Wh*Ww + relative_coords = coords_flatten_l - coords_flatten_r # 2, Wh*Ww, Wh*Ww + + relative_coords = relative_coords.transpose((1, 2, 0)) # Wh*Ww, Wh*Ww, 2 + relative_coords[:, :, 0] += self.window_size[0] - 1 # shift to start from 0 + relative_coords[:, :, 1] += self.window_size[1] - 1 + relative_coords[:, :, 0] *= 2 * self.window_size[1] - 1 + relative_position_index = relative_coords.sum(-1) # Wh*Ww, Wh*Ww + self.register_buffer("relative_position_index", relative_position_index) + + if self.expand_size > 0 and focal_level > 0: + # define a parameter table of position bias between window + # and its fine-grained surroundings + self.window_size_of_key = self.window_size[0] * \ + self.window_size[1] if self.expand_size == 0 else \ + (4 * self.window_size[0] * self.window_size[1] - 4 * \ + (self.window_size[0] - self.expand_size) * \ + (self.window_size[0] - self.expand_size)) + + self.relative_position_bias_table_to_neighbors = paddle.create_parameter( + shape=(1, num_heads, + self.window_size[0] * self.window_size[1], self.window_size_of_key), + dtype=np.float32, is_bias=True, + attr=nn.initializer.TruncatedNormal(std=.02)) # 
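The relative position index built above maps every (query, key) pair inside a window to one row of the learnable bias table: the 2-D offset (dy, dx) is shifted to be non-negative and flattened row-major into a single integer. A tiny standalone reproduction for a 2x2 window, where indices fall in [0, 8]:

```python
import paddle

Wh = Ww = 2
coords = paddle.stack(paddle.meshgrid(
    [paddle.arange(Wh), paddle.arange(Ww)]))         # 2, Wh, Ww
coords = paddle.flatten(coords, 1)                    # 2, Wh*Ww
rel = coords.unsqueeze(-1) - coords.unsqueeze(-2)     # 2, Wh*Ww, Wh*Ww
rel = rel.transpose((1, 2, 0))                        # Wh*Ww, Wh*Ww, 2
rel[:, :, 0] += Wh - 1                                # shift dy to [0, 2Wh-2]
rel[:, :, 1] += Ww - 1                                # shift dx to [0, 2Ww-2]
rel[:, :, 0] *= 2 * Ww - 1                            # row-major flattening
index = rel.sum(-1)                                   # Wh*Ww, Wh*Ww
print(index)  # entries in [0, (2*Wh-1)*(2*Ww-1) - 1] == [0, 8]
```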
Wh*Ww, nH, nSurrounding + + # get mask for rolled k and rolled v + mask_tl = paddle.ones((self.window_size[0], self.window_size[1])) + mask_tl[:-self.expand_size, :-self.expand_size] = 0 + mask_tr = paddle.ones((self.window_size[0], self.window_size[1])) + mask_tr[:-self.expand_size, self.expand_size:] = 0 + mask_bl = paddle.ones((self.window_size[0], self.window_size[1])) + mask_bl[self.expand_size:, :-self.expand_size] = 0 + mask_br = paddle.ones((self.window_size[0], self.window_size[1])) + mask_br[self.expand_size:, self.expand_size:] = 0 + mask_rolled = paddle.stack((mask_tl, mask_tr, mask_bl, mask_br), 0).flatten(0) + self.register_buffer("valid_ind_rolled", paddle.flatten(mask_rolled.nonzero())) + + if pool_method != "none" and focal_level > 1: + self.relative_position_bias_table_to_windows = nn.ParameterList() + self.unfolds = nn.LayerList() + + # build relative position bias between local patch and pooled windows + for k in range(focal_level-1): + stride = 2**k + kernel_size = 2*(self.focal_window // 2) + 2**k + (2**k-1) + # define unfolding operations + self.unfolds.append( + nn.Unfold( + kernel_sizes=[kernel_size, kernel_size], + strides=stride, paddings=kernel_size // 2) + ) + + # define relative position bias table + relative_position_bias_table_to_windows = paddle.create_parameter( + shape=(self.num_heads, + (self.window_size[0] + self.focal_window + 2**k - 2) * \ + (self.window_size[1] + self.focal_window + 2**k - 2), ), + dtype=np.float32, is_bias=True, + attr=nn.initializer.TruncatedNormal(std=.02)) # Wh*Ww, nH, nSurrounding + self.relative_position_bias_table_to_windows.append( + relative_position_bias_table_to_windows) + + # define relative position bias index + relative_position_index_k = get_relative_position_index(self.window_size, + (self.focal_window + 2**k - 1, + self.focal_window + 2**k - 1)) + self.register_buffer("relative_position_index_{}".format(k), + relative_position_index_k) + + # define unfolding index for focal_level > 0 + if k > 0: + mask = paddle.zeros(kernel_size, kernel_size) + mask[(2**k)-1:, (2**k)-1:] = 1 + self.register_buffer("valid_ind_unfold_{}".format(k), + paddle.flatten(mask.flatten(0).nonzero())) + + self.qkv = nn.Linear(dim, dim * 3, weight_attr=weight_attr, + bias_attr=bias_attr if qkv_bias else False) + self.attn_drop = nn.Dropout(attn_drop) + self.proj = nn.Linear(dim, dim, weight_attr=weight_attr, bias_attr=bias_attr) + self.proj_drop = nn.Dropout(proj_drop) + self.softmax = nn.Softmax(axis=-1) + + def forward(self, x_all, mask_all=None): + """ + Args: + x_all (list[Tensors]): input features at different granularity + mask_all (list[Tensors/None]): masks for input features at different granularity + """ + x = x_all[0] + + B, nH, nW, C = x.shape + qkv = self.qkv(x).reshape((B, nH, nW, 3, C)).transpose((3, 0, 1, 2, 4)) + q, k, v = qkv[0], qkv[1], qkv[2] # B, nH, nW, C + + + # partition q map + q_windows = window_partition(q, self.window_size[0]).reshape( + (-1, self.window_size[0] * self.window_size[0], + self.num_heads, C // self.num_heads)).transpose((0, 2, 1, 3)) + k_windows = window_partition(k, self.window_size[0]).reshape( + (-1, self.window_size[0] * self.window_size[0], + self.num_heads, C // self.num_heads)).transpose((0, 2, 1, 3)) + v_windows = window_partition(v, self.window_size[0]).reshape( + (-1, self.window_size[0] * self.window_size[0], + self.num_heads, C // self.num_heads)).transpose((0, 2, 1, 3)) + + if self.expand_size > 0 and self.focal_level > 0: + k_tl = paddle.roll(k, shifts=(-self.expand_size, 
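At focal level 1 the attention expands each window's keys/values with tokens from its immediate surroundings; rather than gathering neighbours explicitly, the feature map is rolled by ±expand_size along H and W (four diagonal shifts) and re-partitioned, so each window "sees" its border region. A toy illustration of one of those rolls (values chosen purely for the demo):

```python
import paddle

x = paddle.arange(16, dtype='float32').reshape((1, 4, 4, 1))  # B, H, W, C
expand = 1

# one of the four diagonal shifts used above: each position now holds
# its down-right neighbour (with wrap-around at the border)
x_tl = paddle.roll(x, shifts=(-expand, -expand), axis=(1, 2))

print(x[0, :, :, 0])
print(x_tl[0, :, :, 0])
```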
-self.expand_size), axis=(1, 2)) + v_tl = paddle.roll(v, shifts=(-self.expand_size, -self.expand_size), axis=(1, 2)) + + k_tr = paddle.roll(k, shifts=(-self.expand_size, self.expand_size), axis=(1, 2)) + v_tr = paddle.roll(v, shifts=(-self.expand_size, self.expand_size), axis=(1, 2)) + + k_bl = paddle.roll(k, shifts=(self.expand_size, -self.expand_size), axis=(1, 2)) + v_bl = paddle.roll(v, shifts=(self.expand_size, -self.expand_size), axis=(1, 2)) + + k_br = paddle.roll(k, shifts=(self.expand_size, self.expand_size), axis=(1, 2)) + v_br = paddle.roll(v, shifts=(self.expand_size, self.expand_size), axis=(1, 2)) + + + k_tl_windows = window_partition(k_tl, self.window_size[0]).reshape( + (-1, self.window_size[0] * self.window_size[0], self.num_heads, C // self.num_heads)) + k_tr_windows = window_partition(k_tr, self.window_size[0]).reshape( + (-1, self.window_size[0] * self.window_size[0], self.num_heads, C // self.num_heads)) + k_bl_windows = window_partition(k_bl, self.window_size[0]).reshape( + (-1, self.window_size[0] * self.window_size[0], self.num_heads, C // self.num_heads)) + k_br_windows = window_partition(k_br, self.window_size[0]).reshape( + (-1, self.window_size[0] * self.window_size[0], self.num_heads, C // self.num_heads)) + + v_tl_windows = window_partition(v_tl, self.window_size[0]).reshape( + (-1, self.window_size[0] * self.window_size[0], self.num_heads, C // self.num_heads)) + v_tr_windows = window_partition(v_tr, self.window_size[0]).reshape( + (-1, self.window_size[0] * self.window_size[0], self.num_heads, C // self.num_heads)) + v_bl_windows = window_partition(v_bl, self.window_size[0]).reshape( + (-1, self.window_size[0] * self.window_size[0], self.num_heads, C // self.num_heads)) + v_br_windows = window_partition(v_br, self.window_size[0]).reshape( + (-1, self.window_size[0] * self.window_size[0], self.num_heads, C // self.num_heads)) + + k_rolled = paddle.concat((k_tl_windows, k_tr_windows, + k_bl_windows, k_br_windows), 1).transpose((0, 2, 1, 3)) + v_rolled = paddle.concat((v_tl_windows, v_tr_windows, + v_bl_windows, v_br_windows), 1).transpose((0, 2, 1, 3)) + + # mask out tokens in current window + k_rolled = paddle.gather(k_rolled, self.valid_ind_rolled.flatten(), axis=2) + v_rolled = paddle.gather(v_rolled, self.valid_ind_rolled.flatten(), axis=2) + k_rolled = paddle.concat((k_windows, k_rolled), 2) + v_rolled = paddle.concat((v_windows, v_rolled), 2) + else: + k_rolled = k_windows + v_rolled = v_windows + + if self.pool_method != "none" and self.focal_level > 1: + k_pooled = [] + v_pooled = [] + for k in range(self.focal_level-1): + stride = 2**k + x_window_pooled = x_all[k+1] # B, nWh, nWw, C + nWh, nWw = x_window_pooled.shape[1:3] + + # generate mask for pooled windows + mask = paddle.ones(shape=(nWh, nWw)).astype(x_window_pooled.dtype) + unfolded_mask = self.unfolds[k](mask.unsqueeze(0).unsqueeze(1)).reshape(( + 1, 1, self.unfolds[k].kernel_sizes[0], + self.unfolds[k].kernel_sizes[1], -1)).transpose((0, 4, 2, 3, 1)).\ + reshape((nWh*nWw // stride // stride, -1, 1)) + + if k > 0: + valid_ind_unfold_k = getattr(self, "valid_ind_unfold_{}".format(k)) + unfolded_mask = unfolded_mask[:, valid_ind_unfold_k] + + x_window_masks = unfolded_mask.flatten(1).unsqueeze(0) + # from numpy to paddle + x_window_masks = x_window_masks.numpy() + x_window_masks[x_window_masks==0] = -100.0 + x_window_masks[x_window_masks>0] = 0.0 + x_window_masks = paddle.to_tensor(x_window_masks.astype(np.float32)) + mask_all[k+1] = x_window_masks + + # generate k and v for pooled windows + 
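For the coarser focal levels, keys/values of the pooled window map are gathered with `nn.Unfold`, which copies a (kernel x kernel) neighbourhood around every output position into the channel dimension. A small shape check mirroring the reshape pattern used above (sizes are illustrative only):

```python
import paddle
import paddle.nn as nn

B, C, nWh, nWw = 1, 8, 6, 6
kernel = 3
pooled = paddle.randn([B, C, nWh, nWw])

unfold = nn.Unfold(kernel_sizes=[kernel, kernel], strides=1, paddings=kernel // 2)
cols = unfold(pooled)                      # B, C*kernel*kernel, nWh*nWw
print(cols.shape)                          # [1, 72, 36]

# recover one (kernel x kernel) patch per output window, as the attention code does
patches = cols.reshape((B, C, kernel, kernel, -1)).transpose((0, 4, 2, 3, 1))
print(patches.shape)                       # [1, 36, 3, 3, 8]
```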
qkv_pooled = self.qkv(x_window_pooled).reshape((B, nWh, nWw, 3, C)).transpose( + (3, 0, 4, 1, 2)) + k_pooled_k, v_pooled_k = qkv_pooled[1], qkv_pooled[2] # B, C, nWh, nWw + + # (B x (nH*nW)) x nHeads x (unfold_wsize x unfold_wsize) x head_dim + k_pooled_k = self.unfolds[k](k_pooled_k).reshape(( + B, C, self.unfolds[k].kernel_sizes[0], + self.unfolds[k].kernel_sizes[1], -1)).transpose( + (0, 4, 2, 3, 1)).reshape((-1, + self.unfolds[k].kernel_sizes[0]*self.unfolds[k].kernel_sizes[1], + self.num_heads, C // self.num_heads)).transpose((0, 2, 1, 3)) + v_pooled_k = self.unfolds[k](v_pooled_k).reshape(( + B, C, self.unfolds[k].kernel_sizes[0], + self.unfolds[k].kernel_sizes[1], -1)).transpose( + (0, 4, 2, 3, 1)).reshape((-1, + self.unfolds[k].kernel_sizes[0]*self.unfolds[k].kernel_sizes[1], + self.num_heads, C // self.num_heads)).transpose((0, 2, 1, 3)) + + if k > 0: + k_pooled_k = k_pooled_k[:, :, valid_ind_unfold_k] + v_pooled_k = v_pooled_k[:, :, valid_ind_unfold_k] + + k_pooled += [k_pooled_k] + v_pooled += [v_pooled_k] + k_all = paddle.concat([k_rolled] + k_pooled, 2) + v_all = paddle.concat([v_rolled] + v_pooled, 2) + else: + k_all = k_rolled + v_all = v_rolled + + N = k_all.shape[-2] + q_windows = q_windows * self.scale + # B*nW, nHead, window_size*window_size, focal_window_size*focal_window_size + attn = (paddle.mm(q_windows, k_all.transpose((0, 1, 3, 2)))) + + window_area = self.window_size[0] * self.window_size[1] + window_area_rolled = k_rolled.shape[2] + + # add relative position bias for tokens inside window + # Wh*Ww,Wh*Ww,nH + relative_position_bias = self.relative_position_bias_table[ + self.relative_position_index.flatten()].reshape(( + self.window_size[0] * self.window_size[1], + self.window_size[0] * self.window_size[1], -1)) + # nH, Wh*Ww, Wh*Ww + relative_position_bias = relative_position_bias.transpose((2, 0, 1)) + attn[:, :, :window_area, :window_area] = attn[:, :, :window_area, :window_area] + \ + relative_position_bias.unsqueeze(0) + + # add relative position bias for patches inside a window + if self.expand_size > 0 and self.focal_level > 0: + attn[:, :, :window_area, window_area:window_area_rolled] = attn[:, :, :window_area, + window_area:window_area_rolled] + self.relative_position_bias_table_to_neighbors + + if self.pool_method != "none" and self.focal_level > 1: + # add relative position bias for different windows in an image + offset = window_area_rolled + for k in range(self.focal_level-1): + # add relative position bias + relative_position_index_k = getattr(self, 'relative_position_index_{}'.format(k)) + relative_position_bias_to_windows = self.relative_position_bias_table_to_windows[k] + relative_position_bias_to_windows = paddle.gather( + relative_position_bias_to_windows, relative_position_index_k.flatten(), + axis=1).reshape((-1, self.window_size[0] * self.window_size[1], + (self.focal_window+2**k-1)**2, + )) # nH, NWh*NWw,focal_region*focal_region + attn[:, :, :window_area, offset:(offset + (self.focal_window+2**k-1)**2)] = \ + attn[:, :, :window_area, offset:(offset + (self.focal_window+2**k-1)**2)] + \ + relative_position_bias_to_windows.unsqueeze(0) + # add attentional mask + if mask_all[k+1] is not None: + attn[:, :, :window_area, offset:(offset + (self.focal_window+2**k-1)**2)] = \ + attn[:, :, :window_area, offset:(offset + \ + (self.focal_window+2**k-1)**2)] + \ + paddle.stack([mask_all[k+1].unsqueeze(-2).unsqueeze(-2)] * \ + (attn.shape[0] // mask_all[k+1].shape[1]), axis=0).\ + reshape((-1, 1, 1, mask_all[k+1].shape[-1])) + offset += 
(self.focal_window+2**k-1)**2 + + if mask_all[0] is not None: + nW = mask_all[0].shape[0] + attn = attn.reshape((attn.shape[0] // nW, nW, self.num_heads, window_area, N)) + attn[:, :, :, :, :window_area] = attn[:, :, :, :, :window_area] + \ + mask_all[0].unsqueeze(0).unsqueeze(2) + attn = attn.reshape((-1, self.num_heads, window_area, N)) + attn = self.softmax(attn) + else: + attn = self.softmax(attn) + + attn = self.attn_drop(attn) + x = paddle.mm(attn, v_all).transpose((0, 2, 1, 3)).reshape( + (attn.shape[0], window_area, C)) + x = self.proj(x) + x = self.proj_drop(x) + return x + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(0)) + return weight_attr, bias_attr + + +class FocalTransformerBlock(nn.Layer): + r""" Focal Transformer Block. + Args: + dim (int): Number of input channels. + input_resolution (tuple[int]): Input resulotion. + num_heads (int): Number of attention heads. + window_size (int): Window size. + expand_size (int): expand size at first focal level (finest level). + shift_size (int): Shift size for SW-MSA. + mlp_ratio (float): Ratio of mlp hidden dim to embedding dim. + qkv_bias (bool, optional): If True, add a learnable bias to query, key, value. + Default: True + qk_scale (float | None, optional): Override default qk scale of head_dim ** -0.5 if set. + drop (float, optional): Dropout rate. Default: 0.0 + attn_drop (float, optional): Attention dropout rate. Default: 0.0 + drop_path (float, optional): Stochastic depth rate. Default: 0.0 + act_layer (nn.Module, optional): Activation layer. Default: nn.GELU + norm_layer (nn.Module, optional): Normalization layer. Default: nn.LayerNorm + pool_method (str): window pooling method. Default: none, options: [none|fc|conv] + focal_level (int): number of focal levels. Default: 1. + focal_window (int): region size of focal attention. Default: 1 + use_layerscale (bool): whether use layer scale for training stability. Default: False + layerscale_value (float): scaling value for layer scale. 
Default: 1e-4 + """ + def __init__(self, dim, input_resolution, num_heads, window_size=7, expand_size=0, + shift_size=0, mlp_ratio=4., qkv_bias=True, qk_scale=None, drop=0., + attn_drop=0., drop_path=0., act_layer=nn.GELU, norm_layer=nn.LayerNorm, + pool_method="none", focal_level=1, focal_window=1, use_layerscale=False, + layerscale_value=1e-4): + super(FocalTransformerBlock, self).__init__() + self.dim = dim + self.input_resolution = input_resolution + self.num_heads = num_heads + self.window_size = window_size + self.shift_size = shift_size + self.expand_size = expand_size + self.mlp_ratio = mlp_ratio + self.pool_method = pool_method + self.focal_level = focal_level + self.focal_window = focal_window + self.use_layerscale = use_layerscale + + weight_attr, bias_attr = self._init_weights() + + if min(self.input_resolution) <= self.window_size: + # if window size is larger than input resolution, we don't partition windows + self.expand_size = 0 + self.shift_size = 0 + self.window_size = min(self.input_resolution) + assert 0 <= self.shift_size < self.window_size, "shift_size must in 0-window_size" + + self.window_size_glo = self.window_size + + self.pool_layers = nn.LayerList() + if self.pool_method != "none": + for k in range(self.focal_level-1): + window_size_glo = math.floor(self.window_size_glo / (2 ** k)) + if self.pool_method == "fc": + self.pool_layers.append(nn.Linear(window_size_glo * window_size_glo, 1, + weight_attr=weight_attr, bias_attr=bias_attr)) + self.pool_layers[len(self.pool_layers)-1].weight.set_value( + paddle.full_like(self.pool_layers[len(self.pool_layers)-1].weight, + 1./(window_size_glo * window_size_glo)) + ) + self.pool_layers[len(self.pool_layers)-1].bias.set_value( + paddle.full_like(self.pool_layers[len(self.pool_layers)-1].bias, 0) + ) + + elif self.pool_method == "conv": + self.pool_layers.append(nn.Conv2D(dim, dim, + kernel_size=window_size_glo, + stride=window_size_glo, groups=dim)) + + self.norm1 = norm_layer(dim, + weight_attr=paddle.ParamAttr(initializer=nn.initializer.Constant(1.0)), + bias_attr=bias_attr) + + self.attn = WindowAttention( + dim, expand_size=self.expand_size, + window_size=(self.window_size, self.window_size), + focal_window=focal_window, focal_level=focal_level, + num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale, + attn_drop=attn_drop,proj_drop=drop, pool_method=pool_method) + + self.drop_path = DropPath(drop_path) if drop_path > 0. 
else Identity() + self.norm2 = norm_layer(dim, + weight_attr=paddle.ParamAttr(initializer=nn.initializer.Constant(1.0)), + bias_attr=bias_attr) + mlp_hidden_dim = int(dim * mlp_ratio) + self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, + act_layer=act_layer, drop=drop) + + if self.shift_size > 0: + # calculate attention mask for SW-MSA + H, W = self.input_resolution + img_mask = paddle.zeros((1, H, W, 1)) # 1 H W 1 + h_slices = (slice(0, -self.window_size), + slice(-self.window_size, -self.shift_size), + slice(-self.shift_size, None)) + w_slices = (slice(0, -self.window_size), + slice(-self.window_size, -self.shift_size), + slice(-self.shift_size, None)) + cnt = 0 + for h in h_slices: + for w in w_slices: + img_mask[:, h, w, :] = cnt + cnt += 1 + + # nW, window_size, window_size, 1 + mask_windows = window_partition(img_mask, self.window_size) + mask_windows = mask_windows.reshape((-1, self.window_size * self.window_size)) + attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2) + # from numpy to paddle + attn_mask = attn_mask.numpy() + attn_mask[attn_mask!=0] = -100.0 + attn_mask[attn_mask==0] = 0.0 + attn_mask = paddle.to_tensor(attn_mask.astype(np.float32)) + else: + attn_mask = None + self.register_buffer("attn_mask", attn_mask) + + if self.use_layerscale: + self.gamma_1 = paddle.create_parameter(layerscale_value * paddle.ones((dim))) + self.gamma_2 = paddle.create_parameter(layerscale_value * paddle.ones((dim))) + + def forward(self, x): + H, W = self.input_resolution + B, L, C = x.shape + assert L == H * W, "input feature has wrong size" + + shortcut = x + x = self.norm1(x) + x = x.reshape((B, H, W, C)) + + # pad feature maps to multiples of window size + pad_l = pad_t = 0 + pad_r = (self.window_size - W % self.window_size) % self.window_size + pad_b = (self.window_size - H % self.window_size) % self.window_size + if pad_r > 0 or pad_b > 0: + x = F.pad(x, [0, 0, pad_l, pad_r, pad_t, pad_b, 0, 0]) + + B, H, W, C = x.shape + + if self.shift_size > 0: + shifted_x = paddle.roll(x, shifts=(-self.shift_size, -self.shift_size), axis=(1, 2)) + else: + shifted_x = x + + x_windows_all = [shifted_x] + x_window_masks_all = [self.attn_mask] + + if self.focal_level > 1 and self.pool_method != "none": + # if we add coarser granularity and the pool method is not none + for k in range(self.focal_level-1): + window_size_glo = math.floor(self.window_size_glo / (2 ** k)) + pooled_h = math.ceil(H / self.window_size) * (2 ** k) + pooled_w = math.ceil(W / self.window_size) * (2 ** k) + H_pool = pooled_h * window_size_glo + W_pool = pooled_w * window_size_glo + + x_level_k = shifted_x + # trim or pad shifted_x depending on the required size + if H > H_pool: + trim_t = (H - H_pool) // 2 + trim_b = H - H_pool - trim_t + x_level_k = x_level_k[:, trim_t:-trim_b] + elif H < H_pool: + pad_t = (H_pool - H) // 2 + pad_b = H_pool - H - pad_t + x_level_k = F.pad(x_level_k, [0, 0, 0, 0, pad_t, pad_b, 0, 0]) + + if W > W_pool: + trim_l = (W - W_pool) // 2 + trim_r = W - W_pool - trim_l + x_level_k = x_level_k[:, :, trim_l:-trim_r] + elif W < W_pool: + pad_l = (W_pool - W) // 2 + pad_r = W_pool - W - pad_l + x_level_k = F.pad(x_level_k, [0, 0, pad_l, pad_r, 0, 0, 0, 0]) + + # B, nw, nw, window_size, window_size, C + x_windows_noreshape = window_partition_noreshape(x_level_k, window_size_glo) + nWh, nWw = x_windows_noreshape.shape[1:3] + + if self.pool_method == "mean": + # B, nWh, nWw, C + x_windows_pooled = x_windows_noreshape.mean([3, 4]) + elif self.pool_method == "max": + # B, nWh, nWw, C + 
x_windows_pooled = x_windows_noreshape.max(-2)[0].max(-2)[0].reshape( + (B, nWh, nWw, C)) + elif self.pool_method == "fc": + # B, nWh, nWw, C, wsize**2 + x_windows_noreshape = x_windows_noreshape.reshape((B, nWh, nWw, + window_size_glo*window_size_glo, C)).transpose( + (0, 1, 2, 4, 3)) + # B, nWh, nWw, C + x_windows_pooled = self.pool_layers[k](x_windows_noreshape).flatten(-2) + elif self.pool_method == "conv": + # B * nw * nw, C, wsize, wsize + x_windows_noreshape = x_windows_noreshape.reshape((-1, + window_size_glo, window_size_glo, C)).transpose( + (0, 3, 1, 2)) + # B, nWh, nWw, C + x_windows_pooled = self.pool_layers[k](x_windows_noreshape).reshape( + (B, nWh, nWw, C)) + + x_windows_all += [x_windows_pooled] + x_window_masks_all += [None] + + # nW*B, window_size*window_size, C + attn_windows = self.attn(x_windows_all, mask_all=x_window_masks_all) + attn_windows = attn_windows[:, :self.window_size ** 2] + + x = self.merge_windows_and_ffn(attn_windows, shortcut, B, C, H, W) + + return x + + + def merge_windows_and_ffn(self, attn_windows, shortcut, B, C, H, W): + attn_windows = attn_windows.reshape((-1, self.window_size, self.window_size, C)) + shifted_x = window_reverse(attn_windows, self.window_size, H, W) # B H' W' C + + # reverse cyclic shift + x = self.reverse_cyclic_shift(shifted_x) + x = x[:, :self.input_resolution[0], :self.input_resolution[1]].reshape((B, -1, C)) + + # FFN + x = self.ffn(x, shortcut) + + return x + + + def reverse_cyclic_shift(self, shifted_x): + if self.shift_size > 0: + x = paddle.roll(shifted_x, shifts=(self.shift_size, self.shift_size), axis=(1, 2)) + else: + x = shifted_x + return x + + + def ffn(self, x, shortcut): + x = shortcut + self.drop_path(x if (not self.use_layerscale) else (self.gamma_1 * x)) + x = x + self.drop_path(self.mlp(self.norm2(x)) if (not self.use_layerscale) else ( + self.gamma_2 * self.mlp(self.norm2(x)))) + return x + + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(0)) + return weight_attr, bias_attr + + +class PatchMerging(nn.Layer): + r""" Patch Merging Layer. + Args: + img_size (tuple[int]): Resolution of input feature. + in_chans (int): Number of input channels. + norm_layer (nn.Module, optional): Normalization layer. Default: nn.LayerNorm + """ + def __init__(self, img_size, in_chans=3, norm_layer=nn.LayerNorm, **kwargs): + super().__init__() + self.input_resolution = img_size + self.dim = in_chans + weight_attr, bias_attr = self._init_weights() + self.reduction = nn.Linear(4 * in_chans, 2 * in_chans, bias_attr=False) + self.norm = norm_layer(4 * in_chans, + weight_attr=paddle.ParamAttr(initializer=nn.initializer.Constant(1.0)), + bias_attr=bias_attr) + + def forward(self, x): + """ + x: B, C, H, W + """ + B, C, H, W = x.shape + + x = x.transpose((0, 2, 3, 1)) + + x0 = x[:, 0::2, 0::2, :] # B H/2 W/2 C + x1 = x[:, 1::2, 0::2, :] # B H/2 W/2 C + x2 = x[:, 0::2, 1::2, :] # B H/2 W/2 C + x3 = x[:, 1::2, 1::2, :] # B H/2 W/2 C + x = paddle.concat([x0, x1, x2, x3], -1) # B H/2 W/2 4*C + x = x.reshape((B, -1, 4 * C)) # B H/2*W/2 4*C + + x = self.norm(x) + x = self.reduction(x) + + return x + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(0)) + return weight_attr, bias_attr + + +class BasicLayer(nn.Layer): + """ A basic Focal Transformer layer for one stage. 
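The coarse-level pooling above trims or pads the shifted feature map so it divides evenly into the coarser windows. The standalone sketch below (sizes chosen purely for illustration, not taken from any project config) spells out the per-level arithmetic:

```python
import math

# Illustrative sizes only: a 56x56 feature map, 7x7 local windows, 3 focal levels.
H = W = 56
window_size = 7
focal_level = 3

for k in range(focal_level - 1):
    # coarser levels pool over smaller windows ...
    window_size_glo = math.floor(window_size / (2 ** k))
    # ... but tile the map with proportionally more of them
    pooled_h = math.ceil(H / window_size) * (2 ** k)
    pooled_w = math.ceil(W / window_size) * (2 ** k)
    # the shifted feature map is trimmed/padded to exactly this size before pooling
    H_pool, W_pool = pooled_h * window_size_glo, pooled_w * window_size_glo
    print(f"level {k}: {pooled_h}x{pooled_w} windows of size {window_size_glo} "
          f"-> pooled map {H_pool}x{W_pool}")
```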
+ Args: + dim (int): Number of input channels. + input_resolution (tuple[int]): Input resolution. + depth (int): Number of blocks. + num_heads (int): Number of attention heads. + window_size (int): Local window size. + expand_size (int): expand size for focal level 1. + expand_layer (str): expand layer. Default: all + mlp_ratio (float): Ratio of mlp hidden dim to embedding dim. Default: 4.0. + qkv_bias (bool, optional): If True, add a learnable bias to query, key, value. + Default: True + qk_scale (float | None, optional): Override default qk scale of head_dim ** -0.5 if set. + drop (float, optional): Dropout rate. Default: 0.0 + attn_drop (float, optional): Attention dropout rate. Default: 0.0 + drop_path (float | tuple[float], optional): Stochastic depth rate. Default: 0.0 + norm_layer (nn.Module, optional): Normalization layer. Default: nn.LayerNorm + pool_method (str): Window pooling method. Default: none. + focal_level (int): Number of focal levels. Default: 1. + focal_window (int): region size at each focal level. Default: 1. + use_conv_embed (bool): whether use overlapped convolutional patch embedding layer. + Default: False + use_shift (bool): Whether use window shift as in Swin Transformer. Default: False + use_pre_norm (bool): Whether use pre-norm before patch embedding projection for stability. + Default: False + downsample (nn.Module | None, optional): Downsample layer at the end of the layer. + Default: None + use_layerscale (bool): Whether use layer scale for stability. Default: False. + layerscale_value (float): Layerscale value. Default: 1e-4. + """ + def __init__(self, dim, input_resolution, depth, num_heads, window_size, + expand_size, expand_layer="all", mlp_ratio=4., qkv_bias=True, + qk_scale=None, drop=0., attn_drop=0., drop_path=0., norm_layer=nn.LayerNorm, + pool_method="none", focal_level=1, focal_window=1, use_conv_embed=False, + use_shift=False, use_pre_norm=False,downsample=None, use_layerscale=False, + layerscale_value=1e-4): + + super(BasicLayer, self).__init__() + self.dim = dim + self.input_resolution = input_resolution + self.depth = depth + + if expand_layer == "even": + expand_factor = 0 + elif expand_layer == "odd": + expand_factor = 1 + elif expand_layer == "all": + expand_factor = -1 + + # build blocks + self.blocks = nn.LayerList([ + FocalTransformerBlock(dim=dim, input_resolution=input_resolution, + num_heads=num_heads, window_size=window_size, + shift_size=(0 if (i % 2 == 0) else window_size // 2) if use_shift else 0, + expand_size=0 if (i % 2 == expand_factor) else expand_size, + mlp_ratio=mlp_ratio, + qkv_bias=qkv_bias, qk_scale=qk_scale, + drop=drop, + attn_drop=attn_drop, + drop_path=drop_path[i] if isinstance(drop_path, list) else drop_path, + norm_layer=norm_layer, + pool_method=pool_method, + focal_level=focal_level, + focal_window=focal_window, + use_layerscale=use_layerscale, + layerscale_value=layerscale_value) + for i in range(depth)]) + + # patch merging layer + if downsample is not None: + self.downsample = downsample( + img_size=input_resolution, patch_size=2, in_chans=dim, embed_dim=2*dim, + use_conv_embed=use_conv_embed, norm_layer=norm_layer, use_pre_norm=use_pre_norm, + is_stem=False + ) + else: + self.downsample = None + + def forward(self, x): + for blk in self.blocks: + x = blk(x) + + if self.downsample is not None: + x = x.reshape((x.shape[0], self.input_resolution[0], + self.input_resolution[1], -1)).transpose((0, 3, 1, 2)) + x = self.downsample(x) + return x + + +class PatchEmbed(nn.Layer): + r""" Image to Patch Embedding + 
Args: + img_size (int): Image size. Default: 224. + patch_size (int): Patch token size. Default: 4. + in_chans (int): Number of input image channels. Default: 3. + embed_dim (int): Number of linear projection output channels. Default: 96. + use_conv_embed (bool): Wherther use overlapped convolutional embedding layer. + Default: False. + norm_layer (nn.Module, optional): Normalization layer. Default: None + use_pre_norm (bool): Whether use pre-normalization before projection. Default: False + is_stem (bool): Whether current patch embedding is stem. Default: False + """ + + def __init__(self, img_size=(224, 224), patch_size=4, in_chans=3, embed_dim=96, + use_conv_embed=False, norm_layer=None, use_pre_norm=False, is_stem=False): + super().__init__() + patch_size = (patch_size, patch_size) + patches_resolution = [img_size[0] // patch_size[0], img_size[1] // patch_size[1]] + self.img_size = img_size + self.patch_size = patch_size + self.patches_resolution = patches_resolution + self.num_patches = patches_resolution[0] * patches_resolution[1] + + self.in_chans = in_chans + self.embed_dim = embed_dim + self.use_pre_norm = use_pre_norm + + weight_attr, bias_attr = self._init_weights() + + if use_conv_embed: + # if we choose to use conv embedding, + # then we treat the stem and non-stem differently + if is_stem: + kernel_size = 7 + padding = 2 + stride = 4 + else: + kernel_size = 3 + padding = 1 + stride = 2 + self.proj = nn.Conv2D(in_chans, embed_dim, kernel_size=kernel_size, + stride=stride, padding=padding) + else: + self.proj = nn.Conv2D(in_chans, embed_dim, + kernel_size=patch_size, stride=patch_size) + + + if self.use_pre_norm: + if norm_layer is not None: + self.pre_norm = nn.GroupNorm(1, in_chans) + else: + self.pre_norm = None + + if norm_layer is not None: + self.norm = norm_layer(embed_dim, + weight_attr=paddle.ParamAttr(initializer=nn.initializer.Constant(1.0)), + bias_attr=bias_attr) + else: + self.norm = None + + def forward(self, x): + B, C, H, W = x.shape + + assert H == self.img_size[0] and W == self.img_size[1], \ + f"Input image size ({H}*{W}) doesn't match model ({self.img_size[0]}*{self.img_size[1]})." + + if self.use_pre_norm: + x = self.pre_norm(x) + + x = self.proj(x).flatten(2).transpose((0, 2, 1)) # B Ph*Pw C + if self.norm is not None: + x = self.norm(x) + return x + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(0)) + return weight_attr, bias_attr + + +class FocalTransformer(nn.Layer): + r"""Focal Transformer:Focal Self-attention for Local-Global Interactions in Vision Transformer + Args: + img_size (int | tuple(int)): Input image size. Default 224 + patch_size (int | tuple(int)): Patch size. Default: 4 + in_chans (int): Number of input image channels. Default: 3 + num_classes (int): Number of classes for classification head. Default: 1000 + embed_dim (int): Patch embedding dimension. Default: 96 + depths (tuple(int)): Depth of each Focal Transformer layer. + num_heads (tuple(int)): Number of attention heads in different layers. + window_size (int): Window size. Default: 7 + mlp_ratio (float): Ratio of mlp hidden dim to embedding dim. Default: 4 + qkv_bias (bool): If True, add a learnable bias to query, key, value. Default: True + qk_scale (float): Override default qk scale of head_dim ** -0.5 if set. Default: None + drop_rate (float): Dropout rate. Default: 0 + attn_drop_rate (float): Attention dropout rate. 
Default: 0 + drop_path_rate (float): Stochastic depth rate. Default: 0.1 + norm_layer (nn.Module): Normalization layer. Default: nn.LayerNorm. + ape (bool): If True, add absolute position embedding to + the patch embedding. Default: False + patch_norm (bool): If True, add normalization after patch embedding. Default: True + use_shift (bool): Whether to use window shift proposed by Swin Transformer. + We observe that using shift or not does not make difference to + our Focal Transformer.Default: False + focal_stages (list): Which stages to perform focal attention. + Default: [0, 1, 2, 3], means all stages + focal_levels (list): How many focal levels at all stages. + Note that this excludes the finest-grain level. Default: [1, 1, 1, 1] + focal_windows (list): The focal window size at all stages. Default: [7, 5, 3, 1] + expand_stages (list): Which stages to expand the finest grain window. + Default: [0, 1, 2, 3], means all stages + expand_sizes (list): The expand size for the finest grain level. Default: [3, 3, 3, 3] + expand_layer (str): Which layers we want to expand the window for the finest grain leve. + This can save computational and memory cost + without the loss of performance. Default: "all" + use_conv_embed (bool): Whether use convolutional embedding. + We noted that using convolutional embedding + usually improve the performance, + but we do not use it by default. Default: False + use_layerscale (bool): Whether use layerscale proposed in CaiT. Default: False + layerscale_value (float): Value for layer scale. Default: 1e-4 + use_pre_norm (bool): Whether use pre-norm in patch merging/embedding layer to + control the feature magtigute. Default: False + """ + def __init__(self, + img_size=224, + patch_size=4, + in_chans=3, + num_classes=1000, + embed_dim=96, + depths=[2, 2, 6, 2], + num_heads=[3, 6, 12, 24], + window_size=7, + mlp_ratio=4., + qkv_bias=True, + qk_scale=None, + drop_rate=0., + attn_drop_rate=0., + drop_path_rate=0.1, + norm_layer=nn.LayerNorm, + ape=False, + patch_norm=True, + use_shift=False, + focal_stages=[0, 1, 2, 3], + focal_levels=[1, 1, 1, 1], + focal_windows=[7, 5, 3, 1], + focal_pool="fc", + expand_stages=[0, 1, 2, 3], + expand_sizes=[3, 3, 3, 3], + expand_layer="all", + use_conv_embed=False, + use_layerscale=False, + layerscale_value=1e-4, + use_pre_norm=False, + **kwargs): + super().__init__() + + self.num_classes = num_classes + self.num_layers = len(depths) + self.embed_dim = embed_dim + self.ape = ape + self.patch_norm = patch_norm + self.num_features = int(embed_dim * 2 ** (self.num_layers - 1)) + self.mlp_ratio = mlp_ratio + + weight_attr, bias_attr = self._init_weights() + + # split image into patches using either non-overlapped embedding + # or overlapped embedding + self.patch_embed = PatchEmbed( + img_size=(img_size, img_size), + patch_size=patch_size, + in_chans=in_chans, + embed_dim=embed_dim, + use_conv_embed=use_conv_embed, is_stem=True, + norm_layer=norm_layer if self.patch_norm else None) + + num_patches = self.patch_embed.num_patches + patches_resolution = self.patch_embed.patches_resolution + self.patches_resolution = patches_resolution + + # absolute position embedding + if self.ape: + self.absolute_pos_embed = paddle.create_parameter(shape=(1, num_patches, embed_dim), + dtype=np.float32, is_bias=True, + attr=nn.initializer.TruncatedNormal(std=.02)) + + self.pos_drop = nn.Dropout(p=drop_rate) + + # stochastic depth + # stochastic depth decay rule + dpr = [x.numpy().item() for x in paddle.linspace(0, drop_path_rate, sum(depths))] + + # build 
layers + self.layers = nn.LayerList() + for i_layer in range(self.num_layers): + layer = BasicLayer(dim=int(embed_dim * 2 ** i_layer), + input_resolution=(patches_resolution[0] // (2 ** i_layer), + patches_resolution[1] // (2 ** i_layer)), + depth=depths[i_layer], + num_heads=num_heads[i_layer], + window_size=window_size, + mlp_ratio=self.mlp_ratio, + qkv_bias=qkv_bias, + qk_scale=qk_scale, + drop=drop_rate, + attn_drop=attn_drop_rate, + drop_path=dpr[sum(depths[:i_layer]):sum(depths[:i_layer + 1])], + norm_layer=norm_layer, + pool_method=focal_pool if i_layer in focal_stages else "none", + downsample=PatchEmbed if (i_layer < self.num_layers - 1) else None, + focal_level=focal_levels[i_layer], + focal_window=focal_windows[i_layer], + expand_size=expand_sizes[i_layer], + expand_layer=expand_layer, + use_conv_embed=use_conv_embed, + use_shift=use_shift, + use_pre_norm=use_pre_norm, + use_layerscale=use_layerscale, + layerscale_value=layerscale_value) + self.layers.append(layer) + + self.norm = norm_layer(self.num_features, + weight_attr=paddle.ParamAttr(initializer=nn.initializer.Constant(1.0)), + bias_attr=bias_attr) + self.avgpool = nn.AdaptiveAvgPool1D(1) + self.head = nn.Linear(self.num_features, num_classes, + weight_attr=weight_attr, bias_attr=bias_attr) if num_classes > 0 else Identity() + + + def _init_weights(self): + weight_attr = paddle.ParamAttr(initializer=nn.initializer.TruncatedNormal(std=.02)) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.Constant(0)) + return weight_attr, bias_attr + + def no_weight_decay(self): + return {'absolute_pos_embed'} + + def no_weight_decay_keywords(self): + return {'relative_position_bias_table', + 'relative_position_bias_table_to_neighbors', + 'relative_position_bias_table_to_windows'} + + def forward_features(self, x): + x = self.patch_embed(x) + if self.ape: + x = x + self.absolute_pos_embed + x = self.pos_drop(x) + + for layer in self.layers: + x = layer(x) + return x + + def forward(self, x): + x = self.forward_features(x) + return x diff --git a/semantic_segmentation/src/models/losses/__init__.py b/semantic_segmentation/src/models/losses/__init__.py index 3c4d7943..534dcb88 100644 --- a/semantic_segmentation/src/models/losses/__init__.py +++ b/semantic_segmentation/src/models/losses/__init__.py @@ -1 +1,12 @@ from .cross_entropy_loss import CrossEntropyLoss +from .mix_softmax_cross_entropy_loss import MixSoftmaxCrossEntropyLoss +from .multi_cross_entropy_loss import MultiCrossEntropyLoss + + +def get_loss_function(config): + if config.TRAIN.LOSS == 'CrossEntropyLoss': + return CrossEntropyLoss() + if config.TRAIN.LOSS == 'MixSoftmaxCrossEntropyLoss': + return MixSoftmaxCrossEntropyLoss(config) + if config.TRAIN.LOSS == 'MultiCrossEntropyLoss': + return MultiCrossEntropyLoss(config) diff --git a/semantic_segmentation/src/models/losses/cross_entropy_loss.py b/semantic_segmentation/src/models/losses/cross_entropy_loss.py index c90c069a..60669c1f 100644 --- a/semantic_segmentation/src/models/losses/cross_entropy_loss.py +++ b/semantic_segmentation/src/models/losses/cross_entropy_loss.py @@ -1,3 +1,17 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
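The new `get_loss_function` factory above is driven entirely by the config. A minimal sketch of calling it, using a `SimpleNamespace` with hypothetical values in place of the project's yacs `CfgNode`, and assuming the script is run from `semantic_segmentation/` so that `src` is importable:

```python
from types import SimpleNamespace
from src.models.losses import get_loss_function

# Hypothetical values; the real project reads these from a yacs config file.
config = SimpleNamespace(
    DATA=SimpleNamespace(NUM_CLASSES=60),
    TRAIN=SimpleNamespace(LOSS='MultiCrossEntropyLoss',
                          WEIGHTS=[1, 0.4, 0.4, 0.4, 0.4],
                          IGNORE_INDEX=255),
    MODEL=SimpleNamespace(AUX=SimpleNamespace(LOSS=False, AUX_WEIGHT=0.4)),
)

criterion = get_loss_function(config)
print(type(criterion).__name__)   # MultiCrossEntropyLoss
```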
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + import paddle from paddle import nn import paddle.nn.functional as F diff --git a/semantic_segmentation/src/models/losses/mix_softmax_cross_entropy_loss.py b/semantic_segmentation/src/models/losses/mix_softmax_cross_entropy_loss.py new file mode 100644 index 00000000..f5376ce5 --- /dev/null +++ b/semantic_segmentation/src/models/losses/mix_softmax_cross_entropy_loss.py @@ -0,0 +1,51 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""MixSoftmaxCrossEntropyLoss Implement +""" +import paddle.nn as nn + + +class MixSoftmaxCrossEntropyLoss(nn.CrossEntropyLoss): + """MixSoftmaxCrossEntropyLoss + """ + def __init__(self, config): + self.ignore_index = config.TRAIN.IGNORE_INDEX + self.aux = config.MODEL.AUX.LOSS + self.aux_weight = config.MODEL.AUX.AUX_WEIGHT + super(MixSoftmaxCrossEntropyLoss, self).__init__(ignore_index=self.ignore_index, axis=1) + + def _aux_forward(self, *inputs): + *preds, target = tuple(inputs) + loss = super(MixSoftmaxCrossEntropyLoss, self).forward(preds[0], target) + for i in range(1, len(preds)): + aux_loss = super(MixSoftmaxCrossEntropyLoss, self).forward(preds[i], target) + loss += self.aux_weight * aux_loss + return loss + + def _multiple_forward(self, *inputs): + *preds, target = tuple(inputs) + loss = super(MixSoftmaxCrossEntropyLoss, self).forward(preds[0], target) + for i in range(1, len(preds)): + loss += super(MixSoftmaxCrossEntropyLoss, self).forward(preds[i], target) + return loss + + def forward(self, *inputs): + preds, target = tuple(inputs) + inputs = tuple(list(preds) + [target]) + if self.aux: + return self._aux_forward(*inputs) + if len(preds) > 1: + return self._multiple_forward(*inputs) + return super(MixSoftmaxCrossEntropyLoss, self).forward(*inputs) diff --git a/semantic_segmentation/src/models/losses/multi_cross_entropy_loss.py b/semantic_segmentation/src/models/losses/multi_cross_entropy_loss.py new file mode 100644 index 00000000..a7886723 --- /dev/null +++ b/semantic_segmentation/src/models/losses/multi_cross_entropy_loss.py @@ -0,0 +1,54 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +"""MultiCrossEntropyLoss Implement +""" +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + +def multi_cross_entropy_loss(pred_list, + label, + num_classes=60, + weights=[1, 0.4, 0.4, 0.4, 0.4], + ignore_index=255): + """MultiCrossEntropyLoss Function + """ + label = paddle.reshape(label, [-1, 1]) # (b, h, w) -> (bhw, 1) + label.stop_gradient = True + loss = 0 + for i, pred in enumerate(pred_list): + pred_i = paddle.transpose(pred, perm=[0, 2, 3, 1]) # (b,c,h,w) -> (b,h,w,c) + pred_i = paddle.reshape(pred_i, [-1, num_classes]) # (b,h,w,c) -> (bhw, c) + pred_i = F.softmax(pred_i, axis=1) + loss_i = F.cross_entropy(pred_i, label, ignore_index=ignore_index) + loss += weights[i]*loss_i + return loss + + +class MultiCrossEntropyLoss(nn.Layer): + """MultiCrossEntropyLoss + """ + def __init__(self, config): + super(MultiCrossEntropyLoss, self).__init__() + self.num_classes = config.DATA.NUM_CLASSES + self.weights = config.TRAIN.WEIGHTS + self.ignore_index = config.TRAIN.IGNORE_INDEX + + def forward(self, logit, label): + return multi_cross_entropy_loss(pred_list=logit, + label=label, + num_classes=self.num_classes, + weights=self.weights, + ignore_index=self.ignore_index) diff --git a/semantic_segmentation/src/models/segformer.py b/semantic_segmentation/src/models/segformer.py index a3dfba0e..41679e14 100644 --- a/semantic_segmentation/src/models/segformer.py +++ b/semantic_segmentation/src/models/segformer.py @@ -1,3 +1,17 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + import paddle.nn as nn from .backbones.mix_transformer import MixVisionTransformer @@ -35,4 +49,4 @@ def __init__(self, config): def forward(self, inputs): features = self.backbone(inputs) out = self.decode_head(features) - return out \ No newline at end of file + return out diff --git a/semantic_segmentation/src/models/setr.py b/semantic_segmentation/src/models/setr.py index 9ee31277..2aedef4a 100644 --- a/semantic_segmentation/src/models/setr.py +++ b/semantic_segmentation/src/models/setr.py @@ -1,3 +1,17 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
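A quick shape and usage check for the `multi_cross_entropy_loss` added above; tensor sizes are arbitrary, and the import assumes `semantic_segmentation/` is the working directory:

```python
import paddle
from src.models.losses.multi_cross_entropy_loss import multi_cross_entropy_loss

b, c, h, w = 2, 60, 32, 32
pred_list = [paddle.randn([b, c, h, w]) for _ in range(2)]   # main head + one aux head
label = paddle.randint(0, c, shape=[b, h, w])                # int64 labels

loss = multi_cross_entropy_loss(pred_list, label,
                                num_classes=c,
                                weights=[1, 0.4],            # aux head down-weighted
                                ignore_index=255)
print(float(loss))
```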
+ """ This module implements SETR Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers diff --git a/semantic_segmentation/src/models/solver/__init__.py b/semantic_segmentation/src/models/solver/__init__.py new file mode 100644 index 00000000..9d04d794 --- /dev/null +++ b/semantic_segmentation/src/models/solver/__init__.py @@ -0,0 +1,2 @@ +from .lr_scheduler import get_scheduler +from .optimizer import get_optimizer diff --git a/semantic_segmentation/src/models/solver/lr_scheduler.py b/semantic_segmentation/src/models/solver/lr_scheduler.py new file mode 100644 index 00000000..1b5ebd1f --- /dev/null +++ b/semantic_segmentation/src/models/solver/lr_scheduler.py @@ -0,0 +1,267 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Create Learning Rate Scheduler +""" + +import math +import logging +from typing import List +from bisect import bisect_right +from paddle.optimizer.lr import LRScheduler +import paddle.optimizer.lr as lr_scheduler + + +_logger = logging.getLogger(__name__) + + +class WarmupCosineLR(LRScheduler): + """WarmupCosineLR + + Apply Cosine learning rate with linear warmup + + Attributes: + learning_rate: float, learning rate + max_iters: int, can be total training steps + t_mul: float, hyper for learning rate, default: 1.0 + lr_min: float, minimum learning rate, default: 0.0 + decay_rate: float, decay rate for cosine, + if training steps greater than max_iters, default: 1.0 + warmup_steps: int, warmup steps, default: 0 + warmup_lr_init: float, initial warmup learning rate, default: 0. + last_epoch: int, the index of last epoch. 
Can be set to restart training + default: -1, means initial learning rate + **kwargs + + Examples: + import paddle + import matplotlib.pyplot as plt + + + linear = paddle.nn.Linear(10, 10) + scheduler = WarmupCosineLR(0.0005, 400, 1, 1e-05, 0.9, 40, 1e-06) + sgd = paddle.optimizer.SGD(learning_rate=scheduler, + parameters=linear.parameters()) + lr = [] + for epoch in range(400): + lr.append(sgd.get_lr()) + scheduler.step() + plt.plot(lr) + plt.show() + """ + def __init__(self, + learning_rate: float, + max_iters: int, + t_mul: float = 1., + lr_min: float = 0., + decay_rate: float = 1., + warmup_steps=0, + warmup_lr_init=0.0, + warmup_prefix=False, + cycle_limit=0, + last_epoch: int = -1, + verbose=False): + assert max_iters > 0 + assert lr_min >= 0 + if max_iters == 1 and t_mul == 1 and decay_rate == 1: + _logger.warning("Cosine annealing scheduler will have no effect on the learning " + "rate since max_iters = t_mul = eta_mul = 1.") + self.max_iters = max_iters + self.t_mul = t_mul + self.lr_min = lr_min + self.decay_rate = decay_rate + self.cycle_limit = cycle_limit + self.warmup_steps = warmup_steps + self.warmup_lr_init = warmup_lr_init + self.warmup_prefix = warmup_prefix + if self.warmup_steps: + self.warmup_iters = (learning_rate - warmup_lr_init) / self.warmup_steps + else: + self.warmup_iters = 1 + super(WarmupCosineLR, self).__init__(learning_rate, last_epoch, verbose) + + def get_lr(self): + if self.last_epoch < self.warmup_steps: + lr = self.warmup_lr_init + self.last_epoch * self.warmup_iters + else: + if self.warmup_prefix: + self.last_epoch = self.last_epoch - self.warmup_steps + if self.t_mul != 1: + i = math.floor(math.log(1 - self.last_epoch / self.max_iters * (1 - self.t_mul), self.t_mul)) + t_i = self.t_mul ** i * self.max_iters + t_curr = self.last_epoch - (1 - self.t_mul ** i) / (1 - self.t_mul) * self.max_iters + else: + i = self.last_epoch // self.max_iters + t_i = self.max_iters + t_curr = self.last_epoch - (self.max_iters * i) + + gamma = self.decay_rate ** i + lr_min = self.lr_min * gamma + lr_max_values = self.base_lr * gamma + if self.cycle_limit == 0 or (self.cycle_limit > 0 and i < self.cycle_limit): + lr = lr_min + 0.5 * (lr_max_values - lr_min) * (1 + math.cos(math.pi * t_curr / t_i)) + else: + lr = self.lr_min + return lr + + +class WarmupPolyLR(LRScheduler): + """WarmupPolyLR + + Apply PolynomialDecay learning rate with linear warmup + + Attributes: + learning_rate: float, learning rate + warmup_lr_init: float, initial lrarning rate for warmup, default: 0.0 + max_iters: int, total training steps + power: float, Power of polynomial, default: 0.9. + lr_min: float, minimum learning rate, default: 0.0 + warmup_steps: int, warmup steps, default: 0 + last_epoch: int, the index of last epoch. Can be set to restart training. 
+ default: -1, means initial learning rate + + Examples: + import paddle.nn as nn + from paddle.optimizer import Adam + import matplotlib.pyplot as plt + + + scheduler = WarmupPolyLR(1e-4, + max_iters=200, + power=0.9, + warmup_steps=30) + opt = Adam(parameters=nn.Linear(10, 10).parameters(), learning_rate=scheduler) + lr = [] + for epoch in range(0, 1000): + lr.append(opt.get_lr()) + scheduler.step(epoch) + plt.plot(lr) + plt.show() + """ + def __init__(self, + learning_rate, + warmup_lr_init=0, + max_iters=0, + power=0.9, + warmup_steps=5, + lr_min=0.0, + last_epoch=-1, + verbose=False): + self.base_lr = float(learning_rate) + self.warmup_lr_init = warmup_lr_init + self.max_iters = max_iters + self.power = power + self.warmup_steps = warmup_steps + self.lr_min = lr_min + assert learning_rate > lr_min, _logger.error('learning_rate must >= lr_min:{}'.format(lr_min)) + super(WarmupPolyLR, self).__init__(learning_rate, last_epoch, verbose) + + def get_lr(self): + N = self.max_iters - self.warmup_steps + T = self.last_epoch - self.warmup_steps + if self.last_epoch < self.warmup_steps: + warmup_factor = float(self.last_epoch) / self.warmup_steps + if self.warmup_lr_init + (self.base_lr - self.warmup_lr_init) * warmup_factor <= self.lr_min: + return self.lr_min + return self.warmup_lr_init + (self.base_lr - self.warmup_lr_init) * warmup_factor + factor = pow(1 - T / N, self.power) + if isinstance(self.warmup_lr_init + (self.base_lr - self.warmup_lr_init) * factor, complex): + return self.lr_min + if self.warmup_lr_init + (self.base_lr - self.warmup_lr_init) * factor <= self.lr_min: + return self.lr_min + return self.warmup_lr_init + (self.base_lr - self.warmup_lr_init) * factor + + +class WarmupMultiStepLR(LRScheduler): + """WarmupMultiStepLR + + Apply MultiStep learning rate with linear warmup + + Attributes: + learning_rate: float, learning rate + milestones: (tuple|list), List or tuple of each boundaries. Must be increasing + gamma: float, the Ratio that the learning rate will be reduced, default: 0.1 + warmup_steps: int, warmup steps, default: 0 + last_epoch: int, the index of last epoch. Can be set to restart training. + default: -1, means initial learning rate + + Examples: + import paddle.nn as nn + from paddle.optimizer import Adam + import matplotlib.pyplot as plt + + + scheduler = WarmupMultiStepLR(0.001, + milestones=[50, 150, 200], + gamma=0.1, + warmup_steps=50) + opt = Adam(parameters=nn.Linear(10, 10).parameters(), learning_rate=scheduler) + lr = [] + for epoch in range(0, 500): + lr.append(opt.get_lr()) + scheduler.step(epoch) + plt.plot(lr) + plt.show() + """ + def __init__(self, + learning_rate: float, + milestones: List[int], + gamma: float = 0.1, + warmup_steps: int = 1000, + last_epoch: int = -1, + verbose=False): + if not list(milestones) == sorted(milestones): + raise ValueError( + "Milestones should be a list of" " increasing integers. 
Got {}", milestones + ) + self.milestones = milestones + self.gamma = gamma + self.warmup_steps = warmup_steps + self.base_lr = float(learning_rate) + assert self.warmup_steps <= milestones[0], _logger.error('warmup steps must >= milestones[0]') + super().__init__(learning_rate, last_epoch, verbose) + + def get_lr(self) -> List[float]: + if self.last_epoch <= self.warmup_steps: + warmup_factor = float(self.last_epoch) / self.warmup_steps + return self.base_lr * warmup_factor + return self.base_lr * self.gamma ** bisect_right(self.milestones, self.last_epoch) + + +def get_scheduler(config): + if config.TRAIN.LR_SCHEDULER.NAME == 'PolynomialDecay': + scheduler = lr_scheduler.PolynomialDecay(learning_rate=config.TRAIN.BASE_LR, + decay_steps=config.TRAIN.ITERS, + end_lr=config.TRAIN.END_LR, + power=config.TRAIN.POWER) + elif config.TRAIN.LR_SCHEDULER.NAME == 'WarmupCosineLR': + scheduler = WarmupCosineLR(learning_rate=config.TRAIN.BASE_LR, + max_iters=config.TRAIN.ITERS, + warmup_steps=config.TRAIN.LR_SCHEDULER.WARM_UP_STEPS, + warmup_lr_init=config.TRAIN.LR_SCHEDULER.WARM_UP_LR_INIT, + lr_min=config.TRAIN.END_LR) + elif config.TRAIN.LR_SCHEDULER.NAME == 'WarmupPolyLR': + scheduler = WarmupPolyLR(learning_rate=config.TRAIN.BASE_LR, + max_iters=config.TRAIN.ITERS, + power=config.TRAIN.LR_SCHEDULER.POWER, + warmup_lr_init=config.TRAIN.LR_SCHEDULER.WARM_UP_LR_INIT, + warmup_steps=config.TRAIN.LR_SCHEDULER.WARM_UP_STEPS, + lr_min=config.TRAIN.END_LR) + elif config.TRAIN.LR_SCHEDULER.NAME == 'WarmupMultiStepLR': + scheduler = WarmupMultiStepLR(learning_rate=config.TRAIN.BASE_LR, + milestones=config.TRAIN.LR_SCHEDULER.MILESTONES, + gamma=config.TRAIN.LR_SCHEDULER.GAMMA, + warmup_steps=config.TRAIN.LR_SCHEDULER.WARM_UP_STEPS) + return scheduler diff --git a/semantic_segmentation/src/models/solver/optimizer.py b/semantic_segmentation/src/models/solver/optimizer.py new file mode 100644 index 00000000..b76f00a7 --- /dev/null +++ b/semantic_segmentation/src/models/solver/optimizer.py @@ -0,0 +1,73 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
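The `get_scheduler` factory above dispatches on `config.TRAIN.LR_SCHEDULER.NAME`. A minimal sketch of the `PolynomialDecay` branch, again with a namespace of hypothetical values standing in for the project's `CfgNode`:

```python
from types import SimpleNamespace
from src.models.solver import get_scheduler

# Hypothetical values standing in for the project's CfgNode.
config = SimpleNamespace(TRAIN=SimpleNamespace(
    BASE_LR=0.01, ITERS=1000, END_LR=1e-4, POWER=0.9,
    LR_SCHEDULER=SimpleNamespace(NAME='PolynomialDecay')))

scheduler = get_scheduler(config)    # -> paddle.optimizer.lr.PolynomialDecay
for _ in range(10):
    scheduler.step()
print(scheduler.get_lr())            # decayed from BASE_LR towards END_LR
```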
+ +""" +Create Optimizer +""" +from paddle import optimizer as optim +from paddle.nn import ClipGradByGlobalNorm + +def get_optimizer(model, lr_scheduler, config): + """Get Optimizer for Training + + Attributes: + model: nn.Layer, training model + lr_scheduler: (LRScheduler|float), learning rate scheduler + config: CfgNode, hyper for optimizer + """ + opt_lower = config.TRAIN.OPTIMIZER.NAME.lower() + clip = None + if config.TRAIN.OPTIMIZER.GRAD_CLIP: + clip = ClipGradByGlobalNorm(config.TRAIN.OPTIMIZER.GRAD_CLIP) + + if opt_lower == 'sgd': + optimizer = optim.Momentum(parameters=model.parameters(), + learning_rate=lr_scheduler, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + use_nesterov=config.TRAIN.OPTIMIZER.NESTEROV, + weight_decay=float(config.TRAIN.OPTIMIZER.WEIGHT_DECAY), + grad_clip=clip) + elif opt_lower == 'adam': + optimizer = optim.Adam(parameters=model.parameters(), + learning_rate=lr_scheduler, + epsilon=config.TRAIN.OPTIMIZER.EPS, + weight_decay=float(config.TRAIN.OPTIMIZER.WEIGHT_DECAY)) + elif opt_lower == 'adamw': + optimizer = optim.AdamW(parameters=model.parameters(), + learning_rate=lr_scheduler, + beta1=config.TRAIN.OPTIMIZER.BETAS[0], + beta2=config.TRAIN.OPTIMIZER.BETAS[1], + epsilon=config.TRAIN.OPTIMIZER.EPS, + weight_decay=float(config.TRAIN.OPTIMIZER.WEIGHT_DECAY), + grad_clip=clip) + elif opt_lower == 'adadelta': + optimizer = optim.Adadelta(parameters=model.parameters(), + rho=config.TRAIN.OPTIMIZER.RHO, + learning_rate=lr_scheduler, + epsilon=config.TRAIN.OPTIMIZER.EPS, + weight_decay=float(config.TRAIN.OPTIMIZER.WEIGHT_DECAY), + grad_clip=clip) + elif opt_lower == 'rmsprop': + optimizer = optim.RMSProp(parameters=model.parameters(), + rho=config.TRAIN.OPTIMIZER.RHO, + momentum=config.TRAIN.OPTIMIZER.MOMENTUM, + learning_rate=lr_scheduler, + centered=config.TRAIN.OPTIMIZER.CENTERTED, + epsilon=config.TRAIN.OPTIMIZER.EPS, + weight_decay=float(config.TRAIN.OPTIMIZER.WEIGHT_DECAY), + grad_clip=clip) + else: + raise ValueError("Expected optimizer method in [SGD, Adam, AdamW, Adadelta, RMSProp]," + "but received {}".format(opt_lower)) + return optimizer diff --git a/semantic_segmentation/src/models/trans2seg.py b/semantic_segmentation/src/models/trans2seg.py index 0cfeb6f4..caea796d 100644 --- a/semantic_segmentation/src/models/trans2seg.py +++ b/semantic_segmentation/src/models/trans2seg.py @@ -1,3 +1,17 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
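For reference, the `adamw` branch of `get_optimizer` above reduces to the following plain-Paddle construction; the hyper-parameter values here are illustrative, not the project defaults:

```python
import paddle
from paddle import optimizer as optim
from paddle.nn import ClipGradByGlobalNorm

model = paddle.nn.Linear(16, 4)                      # stand-in for the segmentation model
clip = ClipGradByGlobalNorm(1.0)                     # config.TRAIN.OPTIMIZER.GRAD_CLIP
optimizer = optim.AdamW(parameters=model.parameters(),
                        learning_rate=1e-3,          # or an LRScheduler from get_scheduler()
                        beta1=0.9, beta2=0.999,      # config.TRAIN.OPTIMIZER.BETAS
                        epsilon=1e-8,                # config.TRAIN.OPTIMIZER.EPS
                        weight_decay=0.01,           # config.TRAIN.OPTIMIZER.WEIGHT_DECAY
                        grad_clip=clip)

loss = model(paddle.randn([2, 16])).mean()
loss.backward()
optimizer.step()
optimizer.clear_grad()
```

Note that in Paddle the gradient clip is attached to the optimizer itself, so no manual clipping call is needed before `step()`.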
+ import paddle import paddle.nn as nn import paddle.nn.functional as F @@ -19,7 +33,7 @@ def __init__(self, config): c1_channels = 256 c4_channels = 2048 self.nclass = config.DATA.NUM_CLASSES - self.aux = config.TRAIN.LR_SCHEDULER.AUX + self.aux = config.MODEL.AUX.AUXIHEAD self.backbone = config.MODEL.ENCODER.TYPE.lower() # Create cnn encoder, the input image is fed to CNN to extract features @@ -46,6 +60,7 @@ def __init__(self, config): # for transformer decoder, we specifically define a set of learnable class prototype embeddings as query, # the features from transformer encoder as key self.transformer_decoder = TransformerDecoder( + nclass=config.DATA.NUM_CLASSES, embed_dim=last_channels, depth=vit_params['DEPTH'], num_heads=vit_params['NUM_HEADS'], diff --git a/semantic_segmentation/src/models/upernet.py b/semantic_segmentation/src/models/upernet.py index a99f7eb6..7db7e124 100644 --- a/semantic_segmentation/src/models/upernet.py +++ b/semantic_segmentation/src/models/upernet.py @@ -1,3 +1,17 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + """ This module implements UperNet Unified Perceptual Parsing for Scene Understanding @@ -8,6 +22,8 @@ import paddle import paddle.nn as nn from src.models.backbones import SwinTransformer +from src.models.backbones import CSwinTransformer +from src.models.backbones import FocalTransformer from src.models.decoders import UperHead, FCNHead @@ -25,6 +41,10 @@ def __init__(self, config): super(UperNet, self).__init__() if config.MODEL.ENCODER.TYPE == "SwinTransformer": self.encoder = SwinTransformer(config) + elif config.MODEL.ENCODER.TYPE == "CSwinTransformer": + self.encoder = CSwinTransformer(config) + elif config.MODEL.ENCODER.TYPE == "FocalTransformer": + self.encoder = FocalTransformer(config) self.num_layers = len(config.MODEL.TRANS.STAGE_DEPTHS) self.auxi_head = config.MODEL.AUX.AUXIHEAD self.decoder_type = config.MODEL.DECODER_TYPE diff --git a/semantic_segmentation/src/transforms/__init__.py b/semantic_segmentation/src/transforms/__init__.py index c4f7d86e..a985022a 100644 --- a/semantic_segmentation/src/transforms/__init__.py +++ b/semantic_segmentation/src/transforms/__init__.py @@ -1,2 +1,25 @@ from .transforms import * from . 
import functional + +def get_transforms(config): + if config.DATA.DATASET == "Trans10kV2": + transforms_train = [Resize(target_size=config.DATA.CROP_SIZE), + RandomHorizontalFlip(prob=0.5), + Normalize(mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375])] + elif config.DATA.DATASET == "ADE20K": + transforms_train = [ResizeStepScaling(min_scale_factor=0.5, + max_scale_factor=2.0, + scale_step_size=0.25), + RandomPaddingCrop(crop_size=config.DATA.CROP_SIZE, + img_padding_value=(123.675, 116.28, 103.53), + label_padding_value=255), + RandomHorizontalFlip(prob=0.5), + RandomDistort(brightness_range=0.4, + contrast_range=0.4, + saturation_range=0.4), + Normalize(mean=[123.675, 116.28, 103.53], + std=[58.395, 57.12, 57.375])] + else: + raise NotImplementedError("{} dataset is not supported".format(config.DATA.DATASET)) + return transforms_train diff --git a/semantic_segmentation/src/transforms/functional.py b/semantic_segmentation/src/transforms/functional.py index 5bd0c18d..119eeae8 100644 --- a/semantic_segmentation/src/transforms/functional.py +++ b/semantic_segmentation/src/transforms/functional.py @@ -1,3 +1,17 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + import cv2 import numpy as np from PIL import Image, ImageEnhance diff --git a/semantic_segmentation/src/transforms/transforms.py b/semantic_segmentation/src/transforms/transforms.py index bb1ac4c1..4c058971 100644 --- a/semantic_segmentation/src/transforms/transforms.py +++ b/semantic_segmentation/src/transforms/transforms.py @@ -1,3 +1,17 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + import random import numpy as np import cv2 diff --git a/semantic_segmentation/src/utils/__init__.py b/semantic_segmentation/src/utils/__init__.py index e18e0c0b..ec86cb6a 100644 --- a/semantic_segmentation/src/utils/__init__.py +++ b/semantic_segmentation/src/utils/__init__.py @@ -3,3 +3,5 @@ from .checkpoint import load_entire_model, load_pretrained_model, resume from .timer import TimeAverager, calculate_eta from . import vis +from .multi_batch_collate import multi_val_fn +from .dataloader import get_dataloader diff --git a/semantic_segmentation/src/utils/checkpoint.py b/semantic_segmentation/src/utils/checkpoint.py index 0ca530f8..7e2e43e7 100644 --- a/semantic_segmentation/src/utils/checkpoint.py +++ b/semantic_segmentation/src/utils/checkpoint.py @@ -1,3 +1,17 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + import math import os import paddle.nn.functional as F diff --git a/semantic_segmentation/src/utils/dataloader.py b/semantic_segmentation/src/utils/dataloader.py new file mode 100644 index 00000000..4541663e --- /dev/null +++ b/semantic_segmentation/src/utils/dataloader.py @@ -0,0 +1,75 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +code is heavily based on https://github.com/facebookresearch/maskrcnn-benchmark +""" + +from paddle.io import BatchSampler, DistributedBatchSampler, DataLoader + + +def get_dataloader(dataset, + shuffle=False, + batch_size=16, + drop_last=False, + num_workers=0, + num_iters=None, + start_iter=0): + """ + get iterable data loader, + the lenth is num_iters. + """ + # make num_iters is valid + if num_iters: + assert num_iters > 0 + else: + assert num_iters is None + batch_sampler = DistributedBatchSampler(dataset=dataset, + batch_size=batch_size, + shuffle=shuffle, + drop_last=drop_last) + if num_iters: + batch_sampler = IterationBasedBatchSampler(batch_sampler=batch_sampler, + num_iterations=num_iters, + start_iter=start_iter) + dataloader = DataLoader(dataset=dataset, + batch_sampler=batch_sampler, + num_workers=num_workers) + return dataloader + + +class IterationBasedBatchSampler(BatchSampler): + """ + Wraps a BatchSampler, resampling from it until + a specified number of iterations have been sampled. + """ + def __init__(self, batch_sampler, num_iterations, start_iter=0): + super(IterationBasedBatchSampler).__init__() + self.batch_sampler = batch_sampler + self.num_iterations = num_iterations + self.start_iter = start_iter + + def __iter__(self): + iteration = self.start_iter + while iteration <= self.num_iterations: + if hasattr(self.batch_sampler, "set_epoch"): + self.batch_sampler.set_epoch(iteration) + for batch in self.batch_sampler: + iteration += 1 + if iteration > self.num_iterations: + break + yield batch + + def __len__(self): + return self.num_iterations diff --git a/semantic_segmentation/src/utils/logger.py b/semantic_segmentation/src/utils/logger.py index a3b7e372..4dda841d 100644 --- a/semantic_segmentation/src/utils/logger.py +++ b/semantic_segmentation/src/utils/logger.py @@ -1,3 +1,17 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
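The iteration-based loader above keeps resampling the dataset until exactly `num_iters` batches have been produced. A toy single-card sketch; the dataset is made up for illustration and assumes `semantic_segmentation/` is the working directory:

```python
import numpy as np
from paddle.io import Dataset
from src.utils import get_dataloader

class ToyDataset(Dataset):
    def __len__(self):
        return 10
    def __getitem__(self, idx):
        return np.zeros([3, 8, 8], dtype='float32'), np.array(idx % 2, dtype='int64')

loader = get_dataloader(ToyDataset(), shuffle=True, batch_size=4,
                        num_workers=0, num_iters=25)
n_batches = sum(1 for _ in loader)
print(n_batches)   # 25 -- the 10-sample dataset is resampled until num_iters is reached
```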
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + import sys import time import paddle diff --git a/semantic_segmentation/src/utils/metrics.py b/semantic_segmentation/src/utils/metrics.py index 644989ca..1023ae18 100644 --- a/semantic_segmentation/src/utils/metrics.py +++ b/semantic_segmentation/src/utils/metrics.py @@ -1,3 +1,17 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + import numpy as np import paddle import paddle.nn.functional as F diff --git a/semantic_segmentation/src/utils/multi_batch_collate.py b/semantic_segmentation/src/utils/multi_batch_collate.py new file mode 100644 index 00000000..c22c59d1 --- /dev/null +++ b/semantic_segmentation/src/utils/multi_batch_collate.py @@ -0,0 +1,29 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np + +class multi_val_fn(): + def __init__(self) -> None: + pass + + def __call__(self, datas) -> tuple: + img_list = [] + label_list = [] + + for img, label in datas: + img_list.append(img) + label_list.append(label.astype('int64')) + + return img_list, label_list diff --git a/semantic_segmentation/src/utils/progbar.py b/semantic_segmentation/src/utils/progbar.py index e639bce4..5006d78e 100644 --- a/semantic_segmentation/src/utils/progbar.py +++ b/semantic_segmentation/src/utils/progbar.py @@ -1,3 +1,17 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ import os import sys import time diff --git a/semantic_segmentation/src/utils/timer.py b/semantic_segmentation/src/utils/timer.py index c4b3a7ec..5ba64635 100644 --- a/semantic_segmentation/src/utils/timer.py +++ b/semantic_segmentation/src/utils/timer.py @@ -1,5 +1,18 @@ -import time +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import time class TimeAverager(object): def __init__(self): diff --git a/semantic_segmentation/src/utils/vis.py b/semantic_segmentation/src/utils/vis.py index 2773307b..75acb93a 100644 --- a/semantic_segmentation/src/utils/vis.py +++ b/semantic_segmentation/src/utils/vis.py @@ -1,3 +1,17 @@ +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + import cv2 import numpy as np diff --git a/semantic_segmentation/train.py b/semantic_segmentation/train.py index ccb12959..e568ba93 100644 --- a/semantic_segmentation/train.py +++ b/semantic_segmentation/train.py @@ -1,4 +1,17 @@ -#!/usr/bin/python3 +# Copyright (c) 2021 PPViT Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
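The `multi_val_fn` collate added above returns each validation batch as plain lists, so images of different sizes can travel through one `DataLoader` without being stacked. A small sketch under the same path assumptions as before:

```python
import numpy as np
from paddle.io import Dataset, DataLoader
from src.utils import multi_val_fn

class VarSizeDataset(Dataset):
    def __len__(self):
        return 4
    def __getitem__(self, idx):
        size = 16 + 8 * idx                              # every sample has a different size
        return (np.zeros([3, size, size], dtype='float32'),
                np.zeros([size, size], dtype='int32'))

loader = DataLoader(VarSizeDataset(), batch_size=2, collate_fn=multi_val_fn())
imgs, labels = next(iter(loader))
print(len(imgs), len(labels))                            # 2 2 -- lists, not stacked tensors
```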
+
 import os
 import time
 import random
@@ -12,7 +25,9 @@
 from src.datasets import get_dataset
 from src.models import get_model
 from src.transforms import *
-from src.utils import TimeAverager, calculate_eta, resume
+from src.utils import TimeAverager, calculate_eta, resume, get_dataloader
+from src.models.solver import get_scheduler, get_optimizer
+from src.models.losses import get_loss_function


 def parse_args():
@@ -27,65 +42,6 @@ def parse_args():
     )
     return parser.parse_args()

-def optimizer_setting(model, config):
-    if config.TRAIN.LR_SCHEDULER.NAME == "PolynomialDecay":
-        scheduler = paddle.optimizer.lr.PolynomialDecay(
-            learning_rate=config.TRAIN.BASE_LR,
-            decay_steps=config.TRAIN.ITERS,
-            end_lr=config.TRAIN.END_LR,
-            power=config.TRAIN.POWER,
-            cycle=False,
-            last_epoch=-1,
-            verbose=False)
-    else:
-        raise NotImplementedError(
-            f"Unsupported Scheduler: {config.TRAIN.LR_SCHEDULER}.")
-
-    if config.TRAIN.OPTIMIZER.NAME == "SGD":
-        optimizer = paddle.optimizer.Momentum(
-            parameters=model.parameters(),
-            learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR,
-            weight_decay=config.TRAIN.WEIGHT_DECAY,
-            momentum=config.TRAIN.OPTIMIZER.MOMENTUM)
-    elif config.TRAIN.OPTIMIZER.NAME == "ADAM":
-        optimizer = paddle.optimizer.Adam(
-            parameters=model.parameters(),
-            learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR,
-            epsilon=config.TRAIN.OPTIMIZER.EPS,
-            weight_decay=config.TRAIN.WEIGHT_DECAY)
-    elif config.TRAIN.OPTIMIZER.NAME == "AdamW":
-        if config.TRAIN.GRAD_CLIP:
-            clip = paddle.nn.ClipGradByGlobalNorm(config.TRAIN.GRAD_CLIP)
-        else:
-            clip = None
-        optimizer = paddle.optimizer.AdamW(
-            parameters=model.parameters(),
-            learning_rate=scheduler if scheduler is not None else config.TRAIN.BASE_LR,
-            weight_decay=config.TRAIN.WEIGHT_DECAY,
-            beta1=config.TRAIN.OPTIMIZER.BETAS[0],
-            beta2=config.TRAIN.OPTIMIZER.BETAS[1],
-            epsilon=config.TRAIN.OPTIMIZER.EPS,
-            grad_clip=clip)
-    else:
-        raise NotImplementedError(
-            f"Unsupported Optimizer: {config.TRAIN.OPTIMIZER.NAME}.")
-    return optimizer
-
-def multi_cross_entropy_loss(pred_list,
-                             label,
-                             num_classes=60,
-                             weights=[1, 0.4, 0.4, 0.4, 0.4]):
-    label = paddle.reshape(label, [-1, 1]) # (b, h, w) -> (bhw, 1)
-    label.stop_gradient = True
-    loss = 0
-    for i in range(len(pred_list)):
-        pred_i = paddle.transpose(pred_list[i], perm=[0, 2, 3, 1]) # (b,c,h,w) -> (b,h,w,c)
-        pred_i = paddle.reshape(pred_i, [-1, num_classes]) # (b,h,w,c) -> (bhw, c)
-        pred_i = nn.functional.softmax(pred_i, axis=1)
-        loss_i = nn.functional.cross_entropy(pred_i, label, ignore_index=255)
-        loss += weights[i]*loss_i
-    return loss
-
 def main():
     config = get_config()
     args = parse_args()
@@ -97,35 +53,21 @@ def main():
     model.train()
     nranks = paddle.distributed.ParallelEnv().nranks
     local_rank = paddle.distributed.ParallelEnv().local_rank
+    # build scheduler
+    lr_scheduler = get_scheduler(config)
     # build optimizer
-    optimizer = optimizer_setting(model, config)
+    optimizer = get_optimizer(model, lr_scheduler, config)
+    # build train transforms
+    transforms_train = get_transforms(config)
     # build dataset_train
-    transforms_train = [
-        ResizeStepScaling(min_scale_factor=0.5,
-                          max_scale_factor=2.0,
-                          scale_step_size=0.25),
-        RandomPaddingCrop(crop_size=config.DATA.CROP_SIZE,
-                          img_padding_value=(123.675, 116.28, 103.53),
-                          label_padding_value=255),
-        RandomHorizontalFlip(prob=0.5),
-        RandomDistort(brightness_range=0.4,
-                      contrast_range=0.4,
-                      saturation_range=0.4),
-        Normalize(mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375])
-    ]
     dataset_train = get_dataset(config, data_transform=transforms_train, mode='train')
-    batch_sampler = paddle.io.DistributedBatchSampler(
-        dataset_train,
-        batch_size=config.DATA.BATCH_SIZE,
-        shuffle=True, drop_last=True)
-    train_loader = paddle.io.DataLoader(
-        dataset_train,
-        batch_sampler=batch_sampler,
-        num_workers=config.DATA.NUM_WORKERS,
-        return_list=True,
-    )
-    logger.info("train_loader.len= {}".format(len(train_loader)))
-    start_iter = 0
+    train_loader = get_dataloader(dataset=dataset_train,
+                                  shuffle=True,
+                                  batch_size=config.DATA.BATCH_SIZE,
+                                  num_iters=config.TRAIN.ITERS,
+                                  num_workers=config.DATA.NUM_WORKERS)
+    # build loss function
+    loss_func = get_loss_function(config)
     # TODO(wutianyiRosun@gmail.com): Resume from checkpoints, and update start_iter

     # build workspace for saving checkpoints
@@ -133,6 +75,8 @@
     if os.path.exists(config.SAVE_DIR):
         os.remove(config.SAVE_DIR)
     os.makedirs(config.SAVE_DIR)
+    logger.info("train_loader.len= {}".format(len(train_loader)))
+    start_iter = 0
     if nranks > 1:
         # Initialize parallel environment if not done.
         if not paddle.distributed.parallel.parallel_helper._is_parallel_ctx_initialized():
@@ -143,74 +87,71 @@
         ddp_model = paddle.DataParallel(model)
     avg_loss = 0.0
     avg_loss_list = []
-    iters_per_epoch = len(batch_sampler)
+    iters_per_epoch = len(dataset_train) // config.DATA.BATCH_SIZE
     reader_cost_averager = TimeAverager()
     batch_cost_averager = TimeAverager()
     save_models = deque()
     batch_start = time.time()
     cur_iter = start_iter
     # begin training
-    while cur_iter < config.TRAIN.ITERS:
-        for data in train_loader:
-            cur_iter += 1
-            if cur_iter > config.TRAIN.ITERS:
-                break
-            reader_cost_averager.record(time.time() - batch_start)
-            images = data[0]
-            labels = data[1].astype('int64')
-            if nranks > 1:
-                logits_list = ddp_model(images)
-            else:
-                logits_list = model(images)
-            loss_list = multi_cross_entropy_loss(logits_list, labels, num_classes=config.DATA.NUM_CLASSES)
-            loss = sum(loss_list)
-            loss.backward()
-            optimizer.step()
-            lr = optimizer.get_lr()
-            if isinstance(optimizer._learning_rate,paddle.optimizer.lr.LRScheduler):
-                optimizer._learning_rate.step()
-            model.clear_gradients()
-            avg_loss += loss.numpy()[0]
-            if not avg_loss_list:
-                avg_loss_list = [l.numpy() for l in loss_list]
-            else:
-                for i in range(len(loss_list)):
-                    avg_loss_list[i] += loss_list[i].numpy()
-            batch_cost_averager.record(
-                time.time() - batch_start, num_samples=config.DATA.BATCH_SIZE)
-            if (cur_iter) % config.LOGGING_INFO_FREQ == 0 and local_rank == 0:
-                avg_loss /= config.LOGGING_INFO_FREQ
-                avg_loss_list = [l[0] / config.LOGGING_INFO_FREQ for l in avg_loss_list]
-                remain_iters = config.TRAIN.ITERS - cur_iter
-                avg_train_batch_cost = batch_cost_averager.get_average()
-                avg_train_reader_cost = reader_cost_averager.get_average()
-                eta = calculate_eta(remain_iters, avg_train_batch_cost)
-                logger.info("[TRAIN] epoch: {}, iter: {}/{}, loss: {:.4f}, lr: {:.8f}, batch_cost:\
-                    {:.4f}, reader_cost: {:.5f}, ips: {:.4f} samples/sec | ETA {}".format(
-                    (cur_iter - 1) // iters_per_epoch + 1, cur_iter, config.TRAIN.ITERS, avg_loss,
-                    lr, avg_train_batch_cost, avg_train_reader_cost,
-                    batch_cost_averager.get_ips_average(), eta))
-                avg_loss = 0.0
-                avg_loss_list = []
-                reader_cost_averager.reset()
-                batch_cost_averager.reset()
+    for data in train_loader:
+        cur_iter += 1
+        reader_cost_averager.record(time.time() - batch_start)
+        images = data[0]
+        labels = data[1].astype('int64')
+        if nranks > 1:
+            logits_list = ddp_model(images)
+        else:
+            logits_list = model(images)
+        loss_list = loss_func(logits_list, labels)
+        loss = sum(loss_list)
+        loss.backward()
+        optimizer.step()
+        lr = optimizer.get_lr()
+        if isinstance(optimizer._learning_rate, paddle.optimizer.lr.LRScheduler):
+            optimizer._learning_rate.step()
+        model.clear_gradients()
+        avg_loss += loss.numpy()[0]
+        if not avg_loss_list:
+            avg_loss_list = [l.numpy() for l in loss_list]
+        else:
+            for i in range(len(loss_list)):
+                avg_loss_list[i] += loss_list[i].numpy()
+        batch_cost_averager.record(
+            time.time() - batch_start, num_samples=config.DATA.BATCH_SIZE)
+        if (cur_iter) % config.LOGGING_INFO_FREQ == 0 and local_rank == 0:
+            avg_loss /= config.LOGGING_INFO_FREQ
+            avg_loss_list = [l[0] / config.LOGGING_INFO_FREQ for l in avg_loss_list]
+            remain_iters = config.TRAIN.ITERS - cur_iter
+            avg_train_batch_cost = batch_cost_averager.get_average()
+            avg_train_reader_cost = reader_cost_averager.get_average()
+            eta = calculate_eta(remain_iters, avg_train_batch_cost)
+            logger.info("[TRAIN] epoch: {}, iter: {}/{}, loss: {:.4f}, lr: {:.8f}, batch_cost:\
+                {:.4f}, reader_cost: {:.5f}, ips: {:.4f} samples/sec | ETA {}".format(
+                (cur_iter - 1) // iters_per_epoch + 1, cur_iter, config.TRAIN.ITERS, avg_loss,
+                lr, avg_train_batch_cost, avg_train_reader_cost,
+                batch_cost_averager.get_ips_average(), eta))
+            avg_loss = 0.0
+            avg_loss_list = []
+            reader_cost_averager.reset()
+            batch_cost_averager.reset()

-            if (cur_iter % config.SAVE_FREQ_CHECKPOINT == 0 or cur_iter == config.TRAIN.ITERS) and local_rank == 0:
-                current_save_weigth_file = os.path.join(config.SAVE_DIR,
-                    "iter_{}_model_state.pdparams".format(cur_iter))
-                current_save_opt_file = os.path.join(config.SAVE_DIR,
-                    "iter_{}_opt_state.pdopt".format(cur_iter))
-                paddle.save(model.state_dict(), current_save_weigth_file)
-                paddle.save(optimizer.state_dict(), current_save_opt_file)
-                save_models.append([current_save_weigth_file,
-                                    current_save_opt_file])
-                logger.info("saving the weights of model to {}".format(
-                    current_save_weigth_file))
-                if len(save_models) > config.KEEP_CHECKPOINT_MAX > 0:
-                    files_to_remove = save_models.popleft()
-                    os.remove(files_to_remove[0])
-                    os.remove(files_to_remove[1])
-            batch_start = time.time()
+        if (cur_iter % config.SAVE_FREQ_CHECKPOINT == 0 or cur_iter == config.TRAIN.ITERS) and local_rank == 0:
+            current_save_weight_file = os.path.join(config.SAVE_DIR,
+                "iter_{}_model_state.pdparams".format(cur_iter))
+            current_save_opt_file = os.path.join(config.SAVE_DIR,
+                "iter_{}_opt_state.pdopt".format(cur_iter))
+            paddle.save(model.state_dict(), current_save_weight_file)
+            paddle.save(optimizer.state_dict(), current_save_opt_file)
+            save_models.append([current_save_weight_file,
+                                current_save_opt_file])
+            logger.info("saving the model weights to {}".format(
+                current_save_weight_file))
+            if len(save_models) > config.KEEP_CHECKPOINT_MAX > 0:
+                files_to_remove = save_models.popleft()
+                os.remove(files_to_remove[0])
+                os.remove(files_to_remove[1])
+        batch_start = time.time()
     time.sleep(1.0)

 if __name__ == '__main__':
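train.py now delegates scheduler, optimizer, and loss construction to `get_scheduler`, `get_optimizer`, and `get_loss_function`, whose bodies are not part of this patch. The sketch below only reproduces the wiring that the removed `optimizer_setting` used to do inline (PolynomialDecay schedule driving a Momentum optimizer, stepped once per iteration); the numeric values stand in for the `config.TRAIN.*` fields and are assumptions, not values taken from the repo.

```python
import paddle

# Assumed equivalent of the removed optimizer_setting(); placeholder numbers
# stand in for config.TRAIN.BASE_LR / ITERS / END_LR / POWER.
scheduler = paddle.optimizer.lr.PolynomialDecay(
    learning_rate=0.01,
    decay_steps=160000,
    end_lr=0.0,
    power=0.9,
    cycle=False)

model = paddle.nn.Conv2D(3, 19, 1)  # stand-in for the real segmentation model
optimizer = paddle.optimizer.Momentum(
    parameters=model.parameters(),
    learning_rate=scheduler,
    weight_decay=0.0,
    momentum=0.9)

# One training step, mirroring the loop in train.py: step the optimizer,
# then advance the lr schedule explicitly when a scheduler object is attached.
x = paddle.randn([2, 3, 64, 64])
loss = model(x).mean()
loss.backward()
optimizer.step()
if isinstance(optimizer._learning_rate, paddle.optimizer.lr.LRScheduler):
    optimizer._learning_rate.step()
model.clear_gradients()
```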
diff --git a/semantic_segmentation/val.py b/semantic_segmentation/val.py
index 33b29451..4068d8d1 100644
--- a/semantic_segmentation/val.py
+++ b/semantic_segmentation/val.py
@@ -1,4 +1,17 @@
-#!/usr/bin/python3
+# Copyright (c) 2021 PPViT Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 import time
 import shutil
 import random
@@ -11,6 +24,7 @@
 from src.datasets import get_dataset
 from src.transforms import Resize, Normalize
 from src.models import get_model
+from src.utils import multi_val_fn
 from src.utils import metrics, logger, progbar
 from src.utils import TimeAverager, calculate_eta
 from src.utils import load_entire_model, resume
@@ -72,8 +86,9 @@
     dataset_val = get_dataset(config, data_transform=transforms_val, mode='val')
     batch_sampler = paddle.io.DistributedBatchSampler(dataset_val,
         batch_size=config.DATA.BATCH_SIZE_VAL, shuffle=True, drop_last=True)
+    collate_fn = multi_val_fn()
     loader_val = paddle.io.DataLoader(dataset_val, batch_sampler=batch_sampler,
-        num_workers=config.DATA.NUM_WORKERS, return_list=True)
+        num_workers=config.DATA.NUM_WORKERS, return_list=True, collate_fn=collate_fn)
     total_iters = len(loader_val)
     # build workspace for saving checkpoints
     if not os.path.isdir(config.SAVE_DIR):
@@ -89,12 +104,14 @@
     reader_cost_averager = TimeAverager()
     batch_cost_averager = TimeAverager()
     batch_start = time.time()
+    val_start_time = time.time()
     with paddle.no_grad():
         for iter, (img, label) in enumerate(loader_val):
             reader_cost_averager.record(time.time() - batch_start)
-            label = label.astype('int64')
+            batch_size = len(img)
+            #label = label.astype('int64')
             #print("img.shape: {}, label.shape: {}".format(img.shape, label.shape))
-            ori_shape = label.shape[-2:]
+            ori_shape = [l.shape[-2:] for l in label]
             if args.multi_scales == True:
                 pred = infer.ms_inference(
                     model=model,
@@ -120,34 +137,34 @@
                     crop_size=config.VAL.CROP_SIZE,
                     num_classes=config.DATA.NUM_CLASSES,
                     rescale_from_ori=config.VAL.RESCALE_FROM_ORI)
-
-            intersect_area, pred_area, label_area = metrics.calculate_area(
-                pred,
-                label,
-                dataset_val.num_classes,
-                ignore_index=dataset_val.ignore_index)
-            # Gather from all ranks
-            if nranks > 1:
-                intersect_area_list = []
-                pred_area_list = []
-                label_area_list = []
-                paddle.distributed.all_gather(intersect_area_list, intersect_area)
-                paddle.distributed.all_gather(pred_area_list, pred_area)
-                paddle.distributed.all_gather(label_area_list, label_area)
-                # Some image has been evaluated and should be eliminated in last iter
-                if (iter + 1) * nranks > len(dataset_val):
-                    valid = len(dataset_val) - iter * nranks
-                    intersect_area_list = intersect_area_list[:valid]
-                    pred_area_list = pred_area_list[:valid]
-                    label_area_list = label_area_list[:valid]
-                for i in range(len(intersect_area_list)):
-                    intersect_area_all = intersect_area_all + intersect_area_list[i]
-                    pred_area_all = pred_area_all + pred_area_list[i]
-                    label_area_all = label_area_all + label_area_list[i]
-            else:
-                intersect_area_all = intersect_area_all + intersect_area
-                pred_area_all = pred_area_all + pred_area
-                label_area_all = label_area_all + label_area
+            for i in range(batch_size):
+                intersect_area, pred_area, label_area = metrics.calculate_area(
+                    pred[i],
+                    label[i],
+                    dataset_val.num_classes,
+                    ignore_index=dataset_val.ignore_index)
+                # Gather from all ranks
+                if nranks > 1:
+                    intersect_area_list = []
+                    pred_area_list = []
+                    label_area_list = []
+                    paddle.distributed.all_gather(intersect_area_list, intersect_area)
+                    paddle.distributed.all_gather(pred_area_list, pred_area)
+                    paddle.distributed.all_gather(label_area_list, label_area)
+                    # Some images have already been evaluated and should be excluded in the last iter
+                    if (iter + 1) * nranks > len(dataset_val):
+                        valid = len(dataset_val) - iter * nranks
+                        intersect_area_list = intersect_area_list[:valid]
+                        pred_area_list = pred_area_list[:valid]
+                        label_area_list = label_area_list[:valid]
+                    for j in range(len(intersect_area_list)):
+                        intersect_area_all = intersect_area_all + intersect_area_list[j]
+                        pred_area_all = pred_area_all + pred_area_list[j]
+                        label_area_all = label_area_all + label_area_list[j]
+                else:
+                    intersect_area_all = intersect_area_all + intersect_area
+                    pred_area_all = pred_area_all + pred_area
+                    label_area_all = label_area_all + label_area
             batch_cost_averager.record(time.time() - batch_start, num_samples=len(label))
             batch_cost = batch_cost_averager.get_average()
             reader_cost = reader_cost_averager.get_average()
@@ -156,9 +173,13 @@
             reader_cost_averager.reset()
             batch_cost_averager.reset()
             batch_start = time.time()
+    val_end_time = time.time()
+    val_time_cost = val_end_time - val_start_time
     class_iou, miou = metrics.mean_iou(intersect_area_all, pred_area_all, label_area_all)
     class_acc, acc = metrics.accuracy(intersect_area_all, pred_area_all)
     kappa = metrics.kappa(intersect_area_all, pred_area_all, label_area_all)
+    logger.info("Val_time_cost: {}".format(val_time_cost))
     logger.info("[EVAL] #Images: {} mIoU: {:.4f} Acc: {:.4f} Kappa: {:.4f} ".format(len(dataset_val), miou, acc, kappa))
     logger.info("[EVAL] Class IoU: \n" + str(np.round(class_iou, 4)))
     logger.info("[EVAL] Class Acc: \n" + str(np.round(class_acc, 4)))
+
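For reference, the per-image bookkeeping that the rewritten evaluation loop performs (area histograms accumulated image by image, then reduced to IoU) can be reproduced with plain NumPy. This is a simplified sketch, not the repo's `metrics` module: it skips the distributed all_gather path and treats classes absent from both prediction and label as IoU 0.

```python
import numpy as np

def area_counts(pred, label, num_classes, ignore_index=255):
    """Per-image pixel histograms of intersection / prediction / label."""
    mask = label != ignore_index
    pred, label = pred[mask], label[mask]
    intersect = pred[pred == label]
    return (np.bincount(intersect, minlength=num_classes),
            np.bincount(pred, minlength=num_classes),
            np.bincount(label, minlength=num_classes))

num_classes = 3
intersect_all = np.zeros(num_classes, dtype=np.int64)
pred_all = np.zeros(num_classes, dtype=np.int64)
label_all = np.zeros(num_classes, dtype=np.int64)

# Two "images" of different sizes, as delivered by the list-based val loader.
rng = np.random.default_rng(0)
preds = [rng.integers(0, num_classes, (4, 6)), rng.integers(0, num_classes, (5, 5))]
labels = [rng.integers(0, num_classes, (4, 6)), rng.integers(0, num_classes, (5, 5))]

for p, l in zip(preds, labels):
    i_a, p_a, l_a = area_counts(p, l, num_classes)
    intersect_all += i_a
    pred_all += p_a
    label_all += l_a

union = pred_all + label_all - intersect_all
class_iou = intersect_all / np.maximum(union, 1)
print("per-class IoU:", np.round(class_iou, 4), "mIoU:", class_iou.mean())
```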